Learning Aligned Audiovisual Representations for Multimodal Sentiment Analysis
DOI: 10.1145/3607865.3613184
Publication Date: 2023-10-17
ABSTRACT
In this paper, we present our solutions to the MER-SEMI subchallenge of the Multimodal Emotion Recognition Challenge (MER 2023). This subchallenge focuses on predicting discrete emotions for a small subset of unlabeled videos within the context of semi-supervised learning. Participants are provided with a combination of labeled videos and a large amount of unlabeled videos. Our preliminary experiments demonstrate that the task is primarily driven by the video and audio modalities, while the text modality plays a relatively weaker role in emotion prediction. To address the challenge, we propose the Video-Audio Transformer (VAT), which takes raw video and audio signals as inputs and extracts multimodal representations. VAT comprises a video encoder, an audio encoder, and a cross-modal encoder. To leverage the vast amount of unlabeled data, we introduce a contrastive loss to align the video and audio representations before fusing them through cross-modal attention. Additionally, to enhance the model's ability to learn from noisy data, we apply momentum distillation, a self-training method that learns from pseudo-targets generated by a momentum model. Furthermore, we fine-tune VAT on the annotated data specifically for emotion recognition. Experimental results have shown the effectiveness of the proposed approach. Notably, our model ranks first (0.891) on the MER-SEMI leaderboard. The project is publicly available at https://github.com/dingchaoyue/Multimodal-Emotion-Recognition-MER-and-MuSe-2023-Challenges.
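The abstract describes three ingredients: unimodal video/audio encoders, a video-audio contrastive loss applied before cross-modal fusion, and momentum distillation from soft pseudo-targets. The following is a minimal PyTorch-style sketch of how these pieces could fit together; all module names, feature dimensions, and the encoder/fusion architectures are illustrative assumptions, not the authors' released implementation (see the GitHub link above for the actual code).

```python
# Hypothetical sketch of the VAT training objective outlined in the abstract:
# unimodal encoders, a video-audio contrastive loss, cross-modal fusion, and
# momentum distillation with soft pseudo-targets. Names and sizes are assumptions.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


class VATSketch(nn.Module):
    def __init__(self, dim=256, num_classes=6, momentum=0.995, temperature=0.07):
        super().__init__()
        # Stand-ins for the video and audio encoders (e.g., transformer backbones).
        self.video_encoder = nn.Sequential(nn.Linear(512, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.audio_encoder = nn.Sequential(nn.Linear(128, dim), nn.ReLU(), nn.Linear(dim, dim))
        # Cross-modal encoder: one transformer layer fusing the two modalities.
        self.cross_encoder = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)
        self.temperature = temperature
        self.momentum = momentum
        # Momentum copy of the model, updated by exponential moving average.
        self.momentum_model = copy.deepcopy(nn.ModuleList(
            [self.video_encoder, self.audio_encoder, self.cross_encoder, self.classifier]))
        for p in self.momentum_model.parameters():
            p.requires_grad_(False)

    @torch.no_grad()
    def _update_momentum(self):
        online = nn.ModuleList([self.video_encoder, self.audio_encoder,
                                self.cross_encoder, self.classifier])
        for p_o, p_m in zip(online.parameters(), self.momentum_model.parameters()):
            p_m.data.mul_(self.momentum).add_(p_o.data, alpha=1 - self.momentum)

    def forward(self, video_feats, audio_feats, labels=None, alpha=0.4):
        # Unimodal embeddings.
        v = self.video_encoder(video_feats)          # (B, dim)
        a = self.audio_encoder(audio_feats)          # (B, dim)

        # Video-audio contrastive loss: matched pairs in the batch are positives.
        v_n, a_n = F.normalize(v, dim=-1), F.normalize(a, dim=-1)
        logits = v_n @ a_n.t() / self.temperature    # (B, B) similarity matrix
        targets = torch.arange(v.size(0), device=v.device)
        loss_contrastive = (F.cross_entropy(logits, targets) +
                            F.cross_entropy(logits.t(), targets)) / 2

        # Cross-modal fusion and emotion classification.
        fused = self.cross_encoder(torch.stack([v, a], dim=1)).mean(dim=1)
        preds = self.classifier(fused)

        loss_cls = torch.tensor(0.0, device=v.device)
        if labels is not None:
            # Momentum distillation: soft pseudo-targets from the momentum model
            # are mixed with the hard labels to tolerate noisy annotations.
            with torch.no_grad():
                self._update_momentum()
                v_m = self.momentum_model[0](video_feats)
                a_m = self.momentum_model[1](audio_feats)
                fused_m = self.momentum_model[2](torch.stack([v_m, a_m], dim=1)).mean(dim=1)
                soft_targets = F.softmax(self.momentum_model[3](fused_m), dim=-1)
            loss_hard = F.cross_entropy(preds, labels)
            loss_soft = -(soft_targets * F.log_softmax(preds, dim=-1)).sum(dim=-1).mean()
            loss_cls = (1 - alpha) * loss_hard + alpha * loss_soft

        return loss_contrastive + loss_cls, preds
```

In this sketch, a forward pass on a batch of pre-extracted video and audio features returns the combined contrastive and (distilled) classification loss; on unlabeled batches only the contrastive term is active, which mirrors the semi-supervised setting described above.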