Multi-level textual-visual alignment and fusion network for multimodal aspect-based sentiment analysis
DOI: 10.1007/s10462-023-10685-z
Publication Date: 2024-03-01
ABSTRACT
Multimodal Aspect-Based Sentiment Analysis (MABSA) is an essential task in sentiment analysis that has garnered considerable attention in recent years. Typical approaches to MABSA often utilize cross-modal Transformers to capture interactions between the textual and visual modalities. However, bridging the semantic gap between the modality spaces and addressing interference from irrelevant objects at different scales remain challenging. To tackle these limitations, we present the Multi-level Textual-Visual Alignment and Fusion Network (MTVAF) in this work, which incorporates three auxiliary tasks. Specifically, MTVAF first transforms multi-level image information into textual form: image descriptions, facial descriptions, and optical characters. These are then concatenated with the textual input to form a textual+visual input, facilitating comprehensive textual-visual alignment. Next, both inputs are fed into an integrated text model to obtain the relevant representations. Dynamic attention mechanisms are employed to generate visual prompts that control cross-modal fusion. Finally, we align the probability distributions of the textual modality space and the textual+visual modality space, effectively reducing the noise introduced during the alignment process. Experimental results on two benchmark datasets demonstrate the effectiveness of the proposed MTVAF, showcasing its superior performance compared to state-of-the-art approaches. Our code is available at https://github.com/MKMaS-GUET/MTVAF .
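
To make the multi-level transformation concrete, the sketch below shows one way the textual+visual input described in the abstract could be assembled. The helpers caption_image, describe_faces, and run_ocr, as well as the separator template, are hypothetical stand-ins for an image captioner, a facial-description model, and an OCR engine; the authors' actual pipeline is in the linked repository.

```python
# A minimal sketch (not the authors' implementation) of MTVAF's
# multi-level image-to-text transformation and concatenation.

def caption_image(image) -> str:
    # Placeholder: a captioning model would return a scene description.
    return "a man holding a trophy"

def describe_faces(image) -> str:
    # Placeholder: a facial-attribute model would describe detected faces.
    return "one smiling face"

def run_ocr(image) -> str:
    # Placeholder: an OCR engine would return optical characters in the image.
    return "CHAMPION 2023"

def build_textual_visual_input(text: str, image) -> str:
    # Concatenate the sentence with the multi-level image descriptions so a
    # single text model can process both modalities in one textual space.
    visual_text = " ".join([caption_image(image), describe_faces(image), run_ocr(image)])
    return f"{text} </s> {visual_text}"  # separator token is illustrative

print(build_textual_visual_input("Messi lifts the cup", image=None))
```

The design point this illustrates is that, once all visual signals are verbalized, alignment happens inside a single text encoder rather than across heterogeneous embedding spaces.

The final step, aligning the probability distributions of the textual space and the textual+visual space, is commonly realized as a KL-divergence objective; the snippet below is one plausible reading of that step, assuming PyTorch and classifier logits over the sentiment labels. It is a sketch under those assumptions, not the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def distribution_alignment_loss(text_logits: torch.Tensor,
                                text_visual_logits: torch.Tensor) -> torch.Tensor:
    # Pull the textual+visual prediction toward the text-only prediction,
    # which damps noise introduced by irrelevant visual content.
    log_p = F.log_softmax(text_visual_logits, dim=-1)  # log-probs, fused input
    q = F.softmax(text_logits, dim=-1)                 # probs, text-only input
    return F.kl_div(log_p, q, reduction="batchmean")

# Usage: logits over 3 sentiment labels for a batch of 4 examples.
loss = distribution_alignment_loss(torch.randn(4, 3), torch.randn(4, 3))
```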
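In this reading, the text-only distribution acts as the reference signal and the fused distribution is regularized toward it, which matches the abstract's claim that alignment reduces noise introduced during fusion.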