ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
Multimedia (cs.MM)
DOI:
10.48550/arxiv.2410.12813
Publication Date:
2024-10-01
AUTHORS (5)
ABSTRACT
Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to the given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. To address these challenges, we present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot temporal grounding. Our ChatVTG leverages Video Dialogue LLMs to generate multi-granularity segment captions and matches these captions with the given query for coarse grounding, circumventing the need for paired annotation data. Furthermore, to obtain more precise grounding results, we employ moment refinement for fine-grained caption proposals. Extensive experiments on three mainstream datasets, including Charades-STA, ActivityNet-Captions, and TACoS, demonstrate the effectiveness of ChatVTG, which surpasses the performance of current zero-shot methods.
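The abstract outlines a two-stage zero-shot pipeline: a video dialogue LLM captions candidate segments at multiple granularities, the captions are matched against the query for coarse grounding, and the coarse result is then sharpened by re-scoring finer moment proposals. The sketch below is one plausible reading of that pipeline, not the authors' implementation: the equal-split proposal scheme, the boundary-shift refinement, and every name in the code (Captioner, Scorer, multi_granularity_proposals, toy_score) are illustrative assumptions standing in for the paper's LLM captioner and text-similarity matcher.

```python
# Hypothetical sketch of a ChatVTG-style zero-shot grounding pipeline.
# `caption` stands in for a video dialogue LLM call and `score` for a
# text-embedding similarity; both are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List

Captioner = Callable[[float, float], str]  # (start_s, end_s) -> caption
Scorer = Callable[[str, str], float]       # (query, caption) -> similarity


@dataclass(frozen=True)
class Moment:
    start: float  # seconds
    end: float    # seconds


def multi_granularity_proposals(duration: float,
                                splits=(1, 2, 4, 8)) -> List[Moment]:
    """Cut the video into equal windows at several granularities
    (whole video, halves, quarters, eighths)."""
    proposals = []
    for n in splits:
        step = duration / n
        proposals += [Moment(i * step, (i + 1) * step) for i in range(n)]
    return proposals


def ground(query: str, duration: float,
           caption: Captioner, score: Scorer) -> Moment:
    # Stage 1 (coarse grounding): caption every multi-granularity proposal
    # and keep the one whose caption best matches the query.
    coarse = max(
        multi_granularity_proposals(duration),
        key=lambda m: score(query, caption(m.start, m.end)),
    )
    # Stage 2 (moment refinement): shift the coarse boundaries by a
    # fraction of its length to form finer proposals, then re-score.
    shift = (coarse.end - coarse.start) / 4
    refined = []
    for ds in (-shift, 0.0, shift):
        for de in (-shift, 0.0, shift):
            s = max(0.0, coarse.start + ds)
            e = min(duration, coarse.end + de)
            if s < e:
                refined.append(Moment(s, e))
    return max(refined, key=lambda m: score(query, caption(m.start, m.end)))


# Toy demo with word-overlap similarity and a fake captioner; a real
# system would call the video dialogue LLM and an embedding model here.
def toy_score(query: str, cap: str) -> float:
    q, c = set(query.lower().split()), set(cap.lower().split())
    return len(q & c) / max(len(q | c), 1)


if __name__ == "__main__":
    fake_captions = lambda s, e: ("a person opens the door" if s >= 20
                                  else "a person sits")
    print(ground("person opening a door", duration=40.0,
                 caption=fake_captions, score=toy_score))
```

Because both stages only rank captions against the query, the sketch needs no temporal annotations, which mirrors the abstract's claim that paired annotation data is circumvented; the cost is one captioner call per proposal, so the granularity set directly controls the inference budget.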