Acoustic Feature Excitation-and-Aggregation Network Based on Multi-Task Learning for Speech Emotion Recognition

DOI: 10.3390/electronics14050844 Publication Date: 2025-02-21T12:53:06Z
ABSTRACT
In recent years, substantial research has focused on emotion recognition using multi-stream speech representations. In existing multi-stream speech emotion recognition (SER) approaches, effectively extracting and fusing speech features is crucial. To overcome the bottleneck in SER caused by fusing inter-feature information, including challenges such as modeling complex feature relations and inefficient fusion methods, this paper proposes an SER framework based on multi-task learning, named AFEA-Net. The framework consists of a speech emotion alignment learning (SEAL) module, an acoustic feature excitation-and-aggregation (AFEA) mechanism, and a continuity learning strategy. First, SEAL aligns emotion information between WavLM and Fbank features. Then, we design the acoustic feature excitation-and-aggregation mechanism to adaptively calibrate and merge the two features. Furthermore, we introduce a continuity learning strategy to explore the distinctiveness and complementarity of dual-stream features from intra- and inter-speech perspectives. Experimental results on the publicly available IEMOCAP and RAVDESS emotion datasets show that our proposed approach outperforms state-of-the-art SER approaches. Specifically, we achieve 75.1% WA, 75.3% UAR, 76.0% precision, and 75.4% F1-score on IEMOCAP, and 80.3%, 80.6%, 80.8%, and 80.4% on RAVDESS, respectively.
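The abstract does not specify how the excitation-and-aggregation fusion is implemented, so the following is only a minimal PyTorch sketch of one plausible reading: a squeeze-and-excitation-style gating of each stream (e.g., projected WavLM and Fbank features) followed by a learned aggregation. All module names, layer sizes, and design choices here are assumptions for illustration, not the authors' implementation.

```python
# Hedged sketch of a dual-stream excitation-and-aggregation fusion block.
# Assumes both streams have already been projected to a common feature size.
import torch
import torch.nn as nn


class DualStreamExcitationFusion(nn.Module):
    """Adaptively re-weight (excite) two aligned feature streams and aggregate them."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        # Bottleneck gates producing per-channel excitation weights for each stream.
        self.excite_a = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )
        self.excite_b = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(inplace=True),
            nn.Linear(dim // reduction, dim), nn.Sigmoid(),
        )
        # Learns how much each calibrated stream contributes to the fused output.
        self.aggregate = nn.Linear(2 * dim, dim)

    def forward(self, feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
        # feat_a, feat_b: (batch, time, dim) frame-level features from the two streams.
        pooled_a = feat_a.mean(dim=1)                              # squeeze: temporal average
        pooled_b = feat_b.mean(dim=1)
        gated_a = feat_a * self.excite_a(pooled_a).unsqueeze(1)    # excite stream A
        gated_b = feat_b * self.excite_b(pooled_b).unsqueeze(1)    # excite stream B
        return self.aggregate(torch.cat([gated_a, gated_b], dim=-1))  # aggregate


if __name__ == "__main__":
    fusion = DualStreamExcitationFusion(dim=256)
    wavlm_like = torch.randn(2, 100, 256)   # stand-in for projected WavLM features
    fbank_like = torch.randn(2, 100, 256)   # stand-in for projected Fbank features
    print(fusion(wavlm_like, fbank_like).shape)  # torch.Size([2, 100, 256])
```

In this sketch the excitation step plays the role of per-stream calibration and the final linear layer the role of aggregation; the paper's actual AFEA mechanism may differ in both structure and training objectives.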