Document Type

Article

Publication Title

Computer Systems Science and Engineering

Abstract

Speech signals play an essential role in communication and provide an efficient way to exchange information between humans and machines. Speech Emotion Recognition (SER) is one of the critical sources for human evaluation and is applicable in many real-world domains such as healthcare, call centers, robotics, safety, and virtual reality. This work develops a novel emotion recognition system based on a Temporal Convolutional Network (TCN), which uses a spatial-temporal convolution network to recognize the speaker's emotional state from speech signals. The authors designed a TCN core block to capture long-term dependencies in speech signals and then feed these temporal cues to a dense network, which fuses the spatial features and extracts global information for final classification. The proposed network automatically extracts valid sequential cues from speech signals and outperforms both state-of-the-art (SOTA) methods and traditional machine learning algorithms in recognition rate. The final unweighted accuracies of 80.84% and 92.31% on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) and Berlin Emotional Database (EMO-DB) datasets, respectively, indicate the robustness and efficiency of the designed model.
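
The abstract does not specify the layer configuration, so the following is only a minimal sketch of the general idea behind a TCN block: stacked dilated causal 1-D convolutions, whose receptive field grows with the dilation rates so that later layers can see long-range context. The function names, the kernel, and the dilation rates (1, 2, 4) are illustrative assumptions, not the authors' architecture.

```python
def dilated_causal_conv1d(x, kernel, dilation):
    """Causal 1-D convolution: y[t] = sum_k kernel[k] * x[t - k * dilation].

    Samples before t = 0 are treated as zero, so y[t] never depends on
    future inputs.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def tcn_stack(x, kernel, dilations):
    """Apply one dilated causal convolution + ReLU per dilation rate.

    Hypothetical simplification: one shared kernel, a single channel,
    and no residual connections.
    """
    h = list(x)
    for d in dilations:
        h = [max(0.0, v) for v in dilated_causal_conv1d(h, kernel, d)]
    return h


def receptive_field(kernel_size, dilations):
    """Number of past time steps (including the current one) visible at the output."""
    return 1 + (kernel_size - 1) * sum(dilations)


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 9
    print(tcn_stack(impulse, [1.0, 1.0], [1, 2, 4]))
    print(receptive_field(2, [1, 2, 4]))
```

With a kernel of size 2 and dilations 1, 2, 4, three layers already cover 8 time steps; doubling the dilation per layer is a common way to reach long contexts with few layers. In a full model, the sequence of features produced by such a stack would be passed to dense layers for classification.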

First Page

3355

Last Page

3369

DOI

10.32604/csse.2023.037373

Publication Date

4-3-2023

Keywords

Affective computing, deep learning, emotion recognition, speech signal, temporal convolutional network

Comments

Open Access, archived thanks to Computer Systems Science and Engineering

License: CC BY 4.0

Uploaded: June 19, 2024
