Multilevel Feature Representation for Hybrid Transformers-based Emotion Recognition

Document Type

Conference Proceeding

Publication Title

BioSMART 2023 - Proceedings: 5th International Conference on Bio-Engineering for Smart Technologies


Automated Speech Emotion Recognition (SER) and human-computer interaction systems both rely heavily on emotional cues. A global and temporal representation of utterances is crucial to the effectiveness of an SER module. Prior research demonstrates that the temporal information captured by a transformer can significantly improve an SER system's overall recognition rate. Although hybrid models outperform conventional classifiers, all existing hybrid models have limitations; in particular, the relationship between different speech cues and the learning of high-level global and temporal cues with a transformer has not been studied thoroughly. This research therefore proposes an efficient transformer-based hybrid technique for emotion recognition via multilevel feature representation of speech signals. To learn deeper information from global and temporal representations, the proposed method comprises a parallel convolutional encoder, a spatial encoder, and a sequential encoder. The learned cues then pass through the proposed transformer to capture the salient information for a specific emotion in the input sequence. To verify its effectiveness, we evaluated the proposed approach and achieved state-of-the-art (SOTA) results: 75.29% and 88.18% weighted accuracy, and 76.34% and 88.49% unweighted accuracy, on the IEMOCAP and SITB-OSED corpora, respectively.
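The pipeline described above (parallel convolutional encoder, spatial encoder, sequential encoder, then transformer-style attention) could be sketched at the shape level as follows. This is a minimal illustrative sketch, not the authors' implementation: all layer sizes, kernel widths, the four-class output, and every function name are assumptions, and the weights are untrained random values.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def conv_encoder(x, widths=(3, 5, 7)):
    """Parallel 1D conv branches over time; one random filter per
    branch width (weights are illustrative, not learned)."""
    T, d = x.shape
    branches = []
    for k in widths:
        w = rng.standard_normal(k * d) / np.sqrt(k * d)
        windows = np.stack([x[t:t + k].ravel() for t in range(T - k + 1)])
        branches.append(np.maximum(windows @ w, 0.0))  # ReLU
    L = min(len(b) for b in branches)                  # align branch lengths
    return np.stack([b[:L] for b in branches], axis=1)  # (L, n_branches)

def spatial_encoder(h, d_model=16):
    """Pointwise projection standing in for the global/spatial encoder."""
    w = rng.standard_normal((h.shape[1], d_model)) / np.sqrt(h.shape[1])
    return np.maximum(h @ w, 0.0)

def sequential_encoder(h):
    """Simple recurrence standing in for the temporal (sequential) encoder."""
    out, state = np.empty_like(h), np.zeros(h.shape[1])
    for t in range(h.shape[0]):
        state = np.tanh(0.5 * state + 0.5 * h[t])
        out[t] = state
    return out

def self_attention(h):
    """Single-head scaled dot-product attention over the frame sequence
    (query/key/value projections omitted for brevity)."""
    scores = softmax(h @ h.T / np.sqrt(h.shape[1]), axis=-1)
    return scores @ h

def recognize(x, n_emotions=4):
    h = conv_encoder(x)
    h = spatial_encoder(h)
    h = sequential_encoder(h)
    h = self_attention(h)
    pooled = h.mean(axis=0)                        # utterance-level summary
    w_out = rng.standard_normal((h.shape[1], n_emotions))
    return softmax(pooled @ w_out)                 # emotion probabilities

x = rng.standard_normal((40, 13))                  # e.g. 40 frames of MFCCs
probs = recognize(x)
print(probs.shape)
```

The sketch only demonstrates how frame-level features could flow through the multilevel encoders into an attention stage and an utterance-level emotion distribution; a real system would use trained convolutional, recurrent, and transformer layers.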



Publication Date



Emotion Recognition, Human-Computer Interaction, Hybrid Transformer, Multilevel Feature Representation, Speech Signal
