Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis
DOI: 10.1145/3607865.3613184
Publication Date: 2023-10-17
ABSTRACT
In this paper, we present our solutions to the MER-SEMI subchallenge of the Multimodal Emotion Recognition Challenge (MER 2023). This subchallenge focuses on predicting discrete emotions for a small subset of unlabeled videos within the context of semi-supervised learning. Participants are provided with a combination of labeled videos and a large amount of unlabeled videos. Our preliminary experiments demonstrate that the task is primarily driven by the video and audio modalities, while the text modality plays a relatively weaker role in emotion prediction. To address the challenge, we propose the Video-Audio Transformer (VAT), which takes raw video and audio signals as inputs and extracts multimodal representations. VAT comprises a video encoder, an audio encoder, and a cross-modal encoder. To leverage the vast amount of unlabeled data, we introduce a contrastive loss to align the video and audio representations before fusing them through cross-modal attention. Additionally, to enhance the model's ability to learn from noisy data, we apply momentum distillation, a self-training method that learns from pseudo-targets generated by a momentum model. Furthermore, we fine-tune VAT on the annotated data specifically for emotion recognition. Experimental results have shown the effectiveness of the proposed approach. Notably, our model ranks first (0.891) on the MER-SEMI leaderboard. The project is publicly available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
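The abstract describes three ingredients: unimodal video/audio encoders, a video-audio contrastive loss applied before cross-modal fusion, and momentum distillation from soft pseudo-targets. The following is a minimal PyTorch-style sketch of how these pieces could fit together; all module names, feature dimensions, and the encoder/fusion architectures are illustrative assumptions, not the authors' released implementation (see the GitHub link above for the actual code).

```python
# Hypothetical sketch of the VAT training objective outlined in the abstract:
# unimodal encoders, a video-audio contrastive loss, cross-modal fusion, and
# momentum distillation with soft pseudo-targets. Names and sizes are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class VATSketch(nn.Module):
    def __init__(self, dim=256, num_classes=6, momentum=0.995, temperature=0.07):
        super().__init__()
        # Stand-ins for the video and audio encoders (e.g., transformer backbones).
        self.video_encoder = nn.Sequential(nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.audio_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Cross-modal encoder: one transformer layer fusing the two modalities.
        self.cross_encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)
        self.temperature = temperature
        self.momentum = momentum
        # Momentum copy of the model, updated by exponential moving average.
        self.momentum_model = copy.deepcopy(nn.ModuleList(
            [self.video_encoder, self.audio_encoder, self.cross_encoder, self.classifier]))
        for p in self.momentum_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def _update_momentum(self):
        online = nn.ModuleList([self.video_encoder, self.audio_encoder,
                                self.cross_encoder, self.classifier])
        for p_o, p_m in zip(online.parameters(), self.momentum_model.parameters()):
            p_m.data.mul_(self.momentum).add_(p_o.data, alpha=1 - self.momentum)

    def forward(self, video_feats, audio_feats, labels=None, alpha=0.4):
        # Unimodal embeddings.
        v = self.video_encoder(video_feats)          # (B, dim)
        a = self.audio_encoder(audio_feats)          # (B, dim)

        # Video-audio contrastive loss: matched pairs in the batch are positives.
        v_n, a_n = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
        logits = v_n @ a_n.t() / self.temperature    # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        loss_contrastive = (F.cross_entropy(logits, targets) +
                            F.cross_entropy(logits.t(), targets)) / 2

        # Cross-modal fusion and emotion classification.
        fused = self.cross_encoder(torch.stack([v, a], dim=1)).mean(dim=1)
        preds = self.classifier(fused)

        loss_cls = torch.tensor(0.0, device=v.device)
        if labels is not None:
            # Momentum distillation: soft pseudo-targets from the momentum model
            # are mixed with the hard labels to tolerate noisy annotations.
            with torch.no_grad():
                self._update_momentum()
                v_m = self.momentum_model[0](video_feats)
                a_m = self.momentum_model[1](audio_feats)
                fused_m = self.momentum_model[2](torch.stack([v_m, a_m], dim=1)).mean(dim=1)
                soft_targets = F.softmax(self.momentum_model[3](fused_m), dim=-1)
            loss_hard = F.cross_entropy(preds, labels)
            loss_soft = -(soft_targets * F.log_softmax(preds, dim=-1)).sum(dim=-1).mean()
            loss_cls = (1 - alpha) * loss_hard + alpha * loss_soft

        return loss_contrastive + loss_cls, preds
```

In this sketch, a forward pass on a batch of pre-extracted video and audio features returns the combined contrastive and (distilled) classification loss; on unlabeled batches only the contrastive term is active, which mirrors the semi-supervised setting described above.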