Audio-Visual Collaborative Representation Learning for Dynamic Saliency Prediction

Keywords: Modality (human–computer interaction) · Audio-visual · Encoding · Representation · Sensory cue
DOI: 10.48550/arxiv.2109.08371 Publication Date: 2021-01-01
ABSTRACT
The Dynamic Saliency Prediction (DSP) task simulates the human selective attention mechanism for perceiving a dynamic scene, which is significant and imperative in many vision tasks. Most existing methods only consider visual cues and neglect the accompanying audio information, which can provide complementary information for scene understanding. In fact, there exists a strong relation between auditory and visual cues, and humans generally perceive their surroundings by collaboratively sensing these cues. Motivated by this, an audio-visual collaborative representation learning method is proposed for the DSP task, which explores the audio modality to better predict the saliency map by assisting the visual modality. The method consists of three parts: 1) audio-visual encoding, 2) audio-visual location, and 3) collaborative integration. Firstly, a refined SoundNet architecture is adopted to encode the audio modality and obtain the corresponding audio features, while a modified 3D ResNet-50 is employed to learn visual features containing both spatial location and temporal motion information. Secondly, an audio-visual location part is devised to locate the sound source in the visual scene by learning the correspondence between the two modalities. Thirdly, a collaborative integration part adaptively aggregates the audio-visual features and a center-bias prior to generate the final saliency map. Extensive experiments are conducted on six challenging audio-visual eye-tracking datasets, including DIEM, AVAD, Coutrot1, Coutrot2, SumMe, and ETMD, which show the superiority of the proposed method over state-of-the-art models.
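The abstract outlines a three-part pipeline: audio-visual encoding, audio-visual location, and collaborative integration. Below is a minimal PyTorch sketch of how such a pipeline could be wired together, assuming simplified stand-in encoders in place of the refined SoundNet and modified 3D ResNet-50, a cosine-similarity sound-source response map for the location part, and a learned center-bias map for the integration part; these concrete choices are illustrative assumptions, not the authors' released implementation.

# Minimal sketch of the three-part audio-visual saliency pipeline.
# Module designs, channel sizes, and the fusion scheme are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioEncoder(nn.Module):
    """Stand-in for the refined SoundNet branch: 1-D convs over a raw waveform."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=8), nn.ReLU(),
            nn.Conv1d(32, 64, kernel_size=32, stride=8), nn.ReLU(),
            nn.Conv1d(64, out_dim, kernel_size=16, stride=8), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),                      # (B, out_dim, 1)
        )

    def forward(self, waveform):                          # waveform: (B, 1, T_audio)
        return self.net(waveform).squeeze(-1)             # (B, out_dim)


class VisualEncoder(nn.Module):
    """Stand-in for the modified 3D ResNet-50: 3-D convs over a video clip."""
    def __init__(self, out_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1), nn.ReLU(),
            nn.Conv3d(64, out_dim, kernel_size=3, stride=(2, 2, 2), padding=1), nn.ReLU(),
        )

    def forward(self, clip):                              # clip: (B, 3, T, H, W)
        feat = self.net(clip)                             # (B, C, T', H', W')
        return feat.mean(dim=2)                           # pool time -> (B, C, H', W')


class AudioVisualSaliency(nn.Module):
    """Encodes both modalities, localizes the sound source via audio-visual
    similarity, and integrates the result with a learned center-bias prior."""
    def __init__(self, dim=256, map_size=(28, 28)):
        super().__init__()
        self.audio_enc = AudioEncoder(dim)
        self.visual_enc = VisualEncoder(dim)
        self.center_bias = nn.Parameter(torch.zeros(1, 1, *map_size))
        self.readout = nn.Conv2d(2, 1, kernel_size=1)     # fuses AV map + center bias

    def forward(self, clip, waveform):
        v = self.visual_enc(clip)                         # (B, C, H', W')
        a = self.audio_enc(waveform)                      # (B, C)
        # Audio-visual location: cosine similarity between the audio vector and
        # every visual location yields a sound-source response map.
        av_map = F.cosine_similarity(v, a[:, :, None, None], dim=1, eps=1e-6)
        av_map = av_map.unsqueeze(1)                      # (B, 1, H', W')
        bias = self.center_bias.expand(av_map.size(0), -1, -1, -1)
        bias = F.interpolate(bias, size=av_map.shape[-2:], mode='bilinear',
                             align_corners=False)
        # Collaborative integration: combine AV response and center-bias prior.
        fused = self.readout(torch.cat([av_map, bias], dim=1))
        return torch.sigmoid(fused)                       # saliency map in [0, 1]


if __name__ == "__main__":
    model = AudioVisualSaliency()
    clip = torch.randn(2, 3, 16, 112, 112)                # 16-frame RGB clip
    waveform = torch.randn(2, 1, 16000)                   # 1 s of 16 kHz audio
    print(model(clip, waveform).shape)                    # torch.Size([2, 1, 28, 28])

In the paper's setting, the saliency output would be compared against eye-tracking fixation maps from datasets such as DIEM or AVAD; the sketch above only illustrates the data flow between the three parts.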