Multimodal Emotion Recognition from Raw Audio with Sinc-convolution

DOI: 10.48550/arxiv.2402.11954 Publication Date: 2024-02-19
ABSTRACT
Speech Emotion Recognition (SER) remains a complex task for computers, with average recall rates usually around 70% on the most realistic datasets. Most SER systems use hand-crafted features extracted from the audio signal, such as energy, zero-crossing rate, spectral information, prosodic features, and mel-frequency cepstral coefficients (MFCCs). More recently, training neural networks directly on the raw waveform has become an emerging trend; this approach is advantageous because it eliminates the feature-extraction pipeline. Learning in the time domain has shown good results on tasks such as speech recognition and speaker verification. In this paper, we utilize a Sinc-convolution layer, an efficient architecture for preprocessing raw audio, to extract acoustic features from the speech signal, followed by a long short-term memory (LSTM) network. We also incorporate linguistic features and append a dialogical emotion decoding (DED) strategy. Our approach achieves a weighted accuracy of 85.1% on the four-class Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset.
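The abstract does not include implementation details, but the core of a Sinc-convolution front end (as popularized by SincNet) is a bank of band-pass FIR filters whose only learnable parameters are the low and high cutoff frequencies. A minimal NumPy sketch of how such filters are constructed, with illustrative function and parameter names not taken from the paper:

```python
import numpy as np

def sinc_bandpass_filters(low_hz, high_hz, kernel_size=251, sample_rate=16000):
    """Build SincNet-style band-pass filters as the difference of two
    low-pass sinc kernels, tapered by a Hamming window.

    low_hz, high_hz: per-filter cutoff frequencies in Hz (the learnable
    parameters in an actual Sinc-convolution layer).
    Returns an array of shape (num_filters, kernel_size).
    """
    # centered sample indices: -(K-1)/2 ... +(K-1)/2
    n = np.arange(kernel_size) - (kernel_size - 1) / 2
    window = np.hamming(kernel_size)  # reduces spectral leakage
    filters = []
    for f1, f2 in zip(low_hz, high_hz):
        # normalized cutoffs in cycles per sample
        f1n, f2n = f1 / sample_rate, f2 / sample_rate
        # band-pass = low-pass(f2) - low-pass(f1); np.sinc(x) = sin(pi x)/(pi x)
        bp = 2 * f2n * np.sinc(2 * f2n * n) - 2 * f1n * np.sinc(2 * f1n * n)
        filters.append(bp * window)
    return np.array(filters)

# Two example bands; in a trained layer these cutoffs would be optimized.
bank = sinc_bandpass_filters(low_hz=[100.0, 300.0], high_hz=[400.0, 800.0])
```

Each filter can then be convolved with the raw waveform (e.g. `np.convolve(signal, bank[i], mode="same")`), and the resulting frame-level features fed to the LSTM described above. Because only two scalars per filter are learned, this front end is far more compact than a generic first convolutional layer.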