AVSegFormer: Audio-Visual Segmentation with Transformer

DOI: 10.1609/aaai.v38i11.29104 Publication Date: 2024-03-25T11:00:14Z
ABSTRACT
Audio-visual segmentation (AVS) aims to locate and segment the sounding objects in a given video, which demands audio-driven pixel-level scene understanding. Existing methods cannot fully process fine-grained correlations between audio and visual cues dynamically across various situations. They also face challenges in adapting to complex scenarios, such as evolving audio, the coexistence of multiple objects, and more. In this paper, we propose AVSegFormer, a novel framework for AVS that leverages the transformer architecture. Specifically, it comprises a dense audio-visual mixer, which can dynamically adjust the visual features of interest, and a sparse audio-visual decoder, which implicitly separates audio sources and automatically matches optimal visual features. Combining both components provides a more robust bidirectional conditional multi-modal representation, improving performance in different scenarios. Extensive experiments demonstrate that AVSegFormer achieves state-of-the-art results on the AVS benchmark. The code is available at https://github.com/vvvb-github/AVSegFormer.
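
The abstract describes a two-part design: a dense mixer that conditions visual features on audio, and a sparse decoder driven by audio queries. The snippet below is a minimal PyTorch sketch of the first idea only, audio embeddings cross-attending to visual tokens and gating the dense feature map. It is not the authors' implementation; the class name, feature dimensions, single-layer design, and gating scheme are all illustrative assumptions (see the linked repository for the actual architecture).

```python
# A minimal sketch of an audio-conditioned "dense mixer": audio features
# cross-attend to visual features, and the resulting context gates the
# dense visual feature map. All names and shapes here are assumptions
# made for illustration, not the paper's implementation.
import torch
import torch.nn as nn


class DenseAudioVisualMixer(nn.Module):
    """Cross-attention from audio to visual tokens, followed by a
    channel gate that modulates the visual feature map."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (B, H*W, C) flattened spatial feature map
        # audio:  (B, T, C)   per-segment audio embeddings
        # Audio queries attend over visual tokens to gather relevant context.
        ctx, _ = self.cross_attn(query=audio, key=visual, value=visual)
        # Pool the audio-conditioned context into a channel gate and apply
        # it to the dense visual features in residual form.
        gate = self.gate(ctx.mean(dim=1, keepdim=True))  # (B, 1, C)
        return visual + gate * visual


if __name__ == "__main__":
    mixer = DenseAudioVisualMixer(dim=256)
    vis = torch.randn(2, 56 * 56, 256)  # flattened visual feature map
    aud = torch.randn(2, 5, 256)        # 5 audio segments
    print(mixer(vis, aud).shape)        # torch.Size([2, 3136, 256])
```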