Enriching Multimodal Sentiment Analysis Through Textual Emotional Descriptions of Visual-Audio Content

Audio-Visual Content · Sentiment Analysis
DOI: 10.1609/aaai.v39i2.32152 Publication Date: 2025-04-11T09:36:51Z
ABSTRACT
Multimodal Sentiment Analysis (MSA) stands as a critical research frontier, seeking to comprehensively unravel human emotions by amalgamating text, audio, and visual data. Yet, discerning subtle emotional nuances within audio-visual expressions poses a formidable challenge, particularly when sentiment polarities across various segments appear similar. In this paper, our objective is to spotlight the emotion-relevant attributes of the audio and visual modalities and to facilitate multimodal fusion in the context of nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework founded on textual sentiment descriptions, aimed at accentuating the emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to transmute raw audio and visual data into textualized sentiment descriptions, thereby amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. Furthermore, DEVA incorporates a Text-guided Progressive Fusion Module (TPF), leveraging varying levels of text as the core modality guide. This module progressively fuses the minor (audio and visual) modalities to alleviate the disparities between modalities. Experimental results on widely used sentiment analysis benchmark datasets, including MOSI, MOSEI, and CH-SIMS, underscore significant enhancements compared to state-of-the-art models. Moreover, fine-grained emotion experiments corroborate DEVA's robust sensitivity to subtle emotional variations.
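The abstract's pipeline (describe each minor modality in text space, enhance the source features with that description, then fuse progressively with text as the guide) can be sketched in heavily simplified form. Everything below is an illustrative assumption: the function names, the linear projection standing in for the EDG, the additive enhancement, and the sigmoid similarity gate are not the paper's actual EDG/TPF implementations.

```python
import numpy as np

def describe(modality_feat, W_desc):
    # Hypothetical stand-in for the EDG: project raw audio/visual features
    # into the text embedding space, as if a textual sentiment description
    # had been generated and embedded.
    return np.tanh(modality_feat @ W_desc)

def enhance(modality_feat, description_emb, alpha=0.5):
    # Blend the source features with their description embedding to obtain
    # the "richer, enhanced features" the abstract mentions.
    return modality_feat + alpha * description_emb

def progressive_fuse(text, audio, visual):
    # Text-guided progressive fusion: text is the core modality, and each
    # minor modality is folded in one step at a time through a scalar
    # similarity gate (an assumed mechanism, not the paper's TPF).
    fused = text
    for minor in (audio, visual):
        gate = 1.0 / (1.0 + np.exp(-fused @ minor))  # sigmoid of dot product
        fused = fused + gate * minor
    return fused

# Toy usage with random feature vectors of a shared dimension d.
rng = np.random.default_rng(0)
d = 8
text, audio, visual = (rng.normal(size=d) for _ in range(3))
W = rng.normal(size=(d, d)) * 0.1

audio_enh = enhance(audio, describe(audio, W))
visual_enh = enhance(visual, describe(visual, W))
fused = progressive_fuse(text, audio_enh, visual_enh)
```

In a real model the projection and gates would be learned jointly with the encoders; the sketch only shows the order of operations implied by the abstract.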