MSER: Multimodal speech emotion recognition using cross-attention with deep fusion
Document Type
Article
Publication Title
Expert Systems with Applications
Abstract
In human–computer interaction (HCI), and especially in speech signal processing, emotion recognition is one of the most important and challenging tasks due to multi-modality and limited data availability. Real-world applications increasingly require intelligent systems that can efficiently process and understand a speaker's emotional state and enhance the analytical abilities that support communication through a human–machine interface (HMI). Designing a reliable and robust Multimodal Speech Emotion Recognition (MSER) system that efficiently recognizes emotions from multiple modalities, such as speech and text, is therefore necessary. This paper proposes a novel MSER model with a deep feature fusion technique using a multi-headed cross-attention mechanism. The proposed model utilizes audio and text cues to predict the emotion label. Our model processes the raw speech signal and text with CNNs and feeds the outputs to corresponding encoders for discriminative and semantic feature extraction. The cross-attention mechanism is applied to both feature sets to enhance the interaction between text and audio cues in a crosswise fashion and to extract the information most relevant for emotion recognition. Finally, the proposed deep feature fusion scheme combines the region-wise weights from both encoders, enabling interaction among different layers and paths. We evaluate the proposed system on the IEMOCAP and MELD datasets through extensive experiments, obtaining state-of-the-art (SOTA) results and a 4.5% improvement in recognition rate. Our model secures a significant improvement over SOTA methods, demonstrating the robustness and effectiveness of the proposed MSER model.
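For readers who want a concrete picture of the cross-attention fusion described in the abstract, the following is a minimal PyTorch sketch. It is not the authors' released code: the feature dimension, the mean-pooling fusion, the module names, and the 7-class output head (matching MELD's emotion set) are all illustrative assumptions standing in for the paper's deep fusion scheme.

```python
# Minimal sketch (assumed, not the authors' implementation) of
# multi-headed cross-attention between audio and text features.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_classes=7):
        super().__init__()
        # Text queries attend over audio keys/values, and vice versa,
        # so each modality is re-weighted by the other.
        self.text_to_audio = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.audio_to_text = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, audio_feats, text_feats):
        # audio_feats: (batch, T_audio, dim) from the speech CNN encoder
        # text_feats:  (batch, T_text, dim)  from the text encoder
        a_attended, _ = self.text_to_audio(text_feats, audio_feats, audio_feats)
        t_attended, _ = self.audio_to_text(audio_feats, text_feats, text_feats)
        # The paper's deep fusion is approximated here by temporal
        # mean pooling followed by concatenation of both streams.
        fused = torch.cat([a_attended.mean(dim=1), t_attended.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Usage with random tensors standing in for encoder outputs:
model = CrossAttentionFusion()
logits = model(torch.randn(2, 100, 256), torch.randn(2, 30, 256))
print(logits.shape)  # torch.Size([2, 7])
```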
DOI
10.1016/j.eswa.2023.122946
Publication Date
7-1-2024
Keywords
Affective computing, Auto-encoders, Deep fusion, Deep learning, Multimodal speech emotion recognition, Speech and text processing
Recommended Citation
M. Khan, W. Gueaieb, A. El Saddik, and S. Kwon, "MSER: Multimodal speech emotion recognition using cross-attention with deep fusion," Expert Systems with Applications, vol. 245, July 2024. doi:10.1016/j.eswa.2023.122946
Additional Links
https://doi.org/10.1016/j.eswa.2023.122946
Comments
IR Deposit conditions:
OA version (pathway b) Accepted version
24 months embargo
License: CC BY-NC-ND
Must link to publisher version with DOI