AAD-Net: Advanced end-to-end signal processing system for human emotion detection & recognition using attention-based deep echo state network

Publication Title

Knowledge-Based Systems


Speech signals are the most convenient means of communication between human beings and an essential medium for Human–Computer Interaction (HCI) to exchange emotions and information. Recognizing emotions from speech signals is a challenging task due to the sparse nature of emotional data and features. In this article, we propose a Deep Echo State Network (DeepESN) system for emotion recognition that uses a dilated convolutional neural network and a multi-headed attention mechanism. To reduce model complexity, we incorporate a DeepESN that exploits reservoir computing for higher-dimensional mapping. We also use fine-tuned Sparse Random Projection (SRP) to reduce dimensionality, adopt an early-fusion strategy to fuse the extracted cues, and pass the joint feature vector through a classification layer to recognize emotions. The proposed model is evaluated on two public speech corpora, EMO-DB and RAVDESS, under both speaker-dependent and speaker-independent protocols. The results show that the proposed system achieves high recognition rates: 91.14% and 85.57% on EMO-DB, and 82.01% and 77.02% on RAVDESS, for the speaker-dependent and speaker-independent experiments, respectively. The proposed system outperforms the State-of-The-Art (SOTA) while requiring less computational time.
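The SRP dimensionality reduction and early-fusion steps mentioned in the abstract can be sketched as follows. This is only a minimal illustration: the feature dimensions, variable names, and projection size are hypothetical placeholders, not values from the paper.

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.default_rng(0)

# Hypothetical extracted cues for 32 utterances (dims are placeholders,
# standing in for the dilated-CNN and attention feature streams).
cnn_feats = rng.standard_normal((32, 1024))
attn_feats = rng.standard_normal((32, 512))

# Sparse Random Projection: map the high-dimensional CNN cues to a
# lower-dimensional space while approximately preserving distances.
srp = SparseRandomProjection(n_components=128, random_state=0)
cnn_reduced = srp.fit_transform(cnn_feats)

# Early fusion: concatenate the cues into one joint feature vector
# that would then be passed to a classification layer.
joint = np.concatenate([cnn_reduced, attn_feats], axis=1)
print(joint.shape)  # (32, 640)
```

The joint vector here is simply the concatenation of the two streams; in the paper's pipeline this fused representation is what the classifier consumes.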




Keywords: Affective computing, Attention mechanism, Audio speech signals, Convolution neural network, Echo state networks, Emotion recognition, Human–computer interaction


IR Deposit conditions:

OA version (pathway c) Accepted version

24-month embargo

License: CC BY-NC-ND 4.0

Must link to publisher version with DOI