Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction

Keywords: Modality (human–computer interaction) · Audio-visual · Encoding · Representation · Sensory cue
DOI: 10.48550/arxiv.2109.08371 Publication Date: 2021-01-01
ABSTRACT
The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism for perceiving a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods only consider visual cues and neglect the accompanying audio information, which can provide complementary information for scene understanding. In fact, there exists a strong relation between auditory and visual cues, and humans generally perceive their surroundings by collaboratively sensing these cues. Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task, which explores the audio modality to better predict the saliency map by assisting the visual modality. The method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) collaborative integration. Firstly, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding audio features, while a modified 3D ResNet-50 is employed to learn visual features containing both spatial location and temporal motion information. Secondly, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between the two modalities. Thirdly, a collaborative integration part adaptively aggregates the audio-visual features and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, which show the superiority of the proposed method over state-of-the-art models.
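The abstract outlines a three-part pipeline: audio-visual encoding, audio-visual location, and collaborative integration. Below is a minimal PyTorch sketch of how such a pipeline could be wired together, assuming simplified stand-in encoders in place of the refined SoundNet and modified 3D ResNet-50, a cosine-similarity sound-source response map for the location part, and a learned center-bias map for the integration part; these concrete choices are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the three-part audio-visual saliency pipeline.
# Module designs, channel sizes, and the fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Stand-in for the refined SoundNet branch: 1-D convs over a raw waveform."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=8), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # (B, out_dim, 1)
        )

    def forward(self, waveform):                          # waveform: (B, 1, T_audio)
        return self.net(waveform).squeeze(-1)             # (B, out_dim)


class VisualEncoder(nn.Module):
    """Stand-in for the modified 3D ResNet-50: 3-D convs over a video clip."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, out_dim, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )

    def forward(self, clip):                              # clip: (B, 3, T, H, W)
        feat = self.net(clip)                             # (B, C, T', H', W')
        return feat.mean(dim=2)                           # pool time -> (B, C, H', W')


class AudioVisualSaliency(nn.Module):
    """Encodes both modalities, localizes the sound source via audio-visual
    similarity, and integrates the result with a learned center-bias prior."""
    def __init__(self, dim=256, map_size=(28, 28)):
        super().__init__()
        self.audio_enc = AudioEncoder(dim)
        self.visual_enc = VisualEncoder(dim)
        self.center_bias = nn.Parameter(torch.zeros(1, 1, *map_size))
        self.readout = nn.Conv2d(2, 1, kernel_size=1)     # fuses AV map + center bias

    def forward(self, clip, waveform):
        v = self.visual_enc(clip)                         # (B, C, H', W')
        a = self.audio_enc(waveform)                      # (B, C)
        # Audio-visual location: cosine similarity between the audio vector and
        # every visual location yields a sound-source response map.
        av_map = F.cosine_similarity(v, a[:, :, None, None], dim=1, eps=1e-6)
        av_map = av_map.unsqueeze(1)                      # (B, 1, H', W')
        bias = self.center_bias.expand(av_map.size(0), -1, -1, -1)
        bias = F.interpolate(bias, size=av_map.shape[-2:], mode='bilinear',
                             align_corners=False)
        # Collaborative integration: combine AV response and center-bias prior.
        fused = self.readout(torch.cat([av_map, bias], dim=1))
        return torch.sigmoid(fused)                       # saliency map in [0, 1]


if __name__ == "__main__":
    model = AudioVisualSaliency()
    clip = torch.randn(2, 3, 16, 112, 112)                # 16-frame RGB clip
    waveform = torch.randn(2, 1, 16000)                   # 1 s of 16 kHz audio
    print(model(clip, waveform).shape)                    # torch.Size([2, 1, 28, 28])

In the paper's setting, the saliency output would be compared against eye-tracking fixation maps from datasets such as DIEM or AVAD; the sketch above only illustrates the data flow between the three parts.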