NFDI4DS | UHH-SEMS - Publication Details

AVCap: Leveraging Audio-Visual Features as Text Tokens for Captioning

Subjects: FOS: Computer and information sciences; FOS: Electrical engineering, electronic engineering, information engineering; Machine Learning (cs.LG); Sound (cs.SD); Computation and Language (cs.CL); Audio and Speech Processing (eess.AS)

DOI: 10.21437/interspeech.2024-526

Publication Date: 2024-09-01T07:10:12Z

AUTHORS (3)

Jongsuk Kim

Jiwon Shin

Junmo Kim

ABSTRACT

Interspeech 2024

In recent years, advancements in representation learning and language models have propelled Automated Captioning (AC) to new heights, enabling the generation of human-level descriptions. Leveraging these advancements, we propose AVCap, an Audio-Visual Captioning framework, a simple yet powerful baseline approach applicable to audio-visual captioning. AVCap utilizes audio-visual features as text tokens, which has many advantages not only in performance but also in the extensibility and scalability of the model. AVCap is designed around three pivotal dimensions: the exploration of optimal audio-visual encoder architectures, the adaptation of pre-trained models according to the characteristics of generated text, and the investigation into the efficacy of modality fusion in captioning. Our method outperforms existing audio-visual captioning methods across all metrics, and the code is available at https://github.com/JongSuk1/AVCap.
