ChatVTG: Video Temporal Grounding via Chat with Video Dialogue Large Language Models
FOS: Computer and information sciences
Computer Vision and Pattern Recognition (cs.CV)
Multimedia (cs.MM)
DOI:
10.48550/arxiv.2410.12813
Publication Date:
2024-10-01
AUTHORS (5)
ABSTRACT
Video Temporal Grounding (VTG) aims to ground specific segments within an untrimmed video corresponding to the given natural language query. Existing VTG methods largely depend on supervised learning and extensive annotated data, which is labor-intensive and prone to human biases. To address these challenges, we present ChatVTG, a novel approach that utilizes Video Dialogue Large Language Models (LLMs) for zero-shot temporal grounding. Our ChatVTG leverages Video Dialogue LLMs to generate multi-granularity segment captions and matches these captions with the given query for coarse grounding, circumventing the need for paired annotation data. Furthermore, to obtain more precise grounding results, we employ moment refinement for fine-grained caption proposals. Extensive experiments on three mainstream datasets, including Charades-STA, ActivityNet-Captions, and TACoS, demonstrate the effectiveness of ChatVTG, which surpasses the performance of current zero-shot methods.
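The abstract outlines a two-stage zero-shot pipeline: a video dialogue LLM captions candidate segments at multiple granularities, the captions are matched against the query for coarse grounding, and the coarse result is then sharpened by re-scoring finer moment proposals. The sketch below is one plausible reading of that pipeline, not the authors' implementation: the equal-split proposal scheme, the boundary-shift refinement, and every name in the code (Captioner, Scorer, multi_granularity_proposals, toy_score) are illustrative assumptions standing in for the paper's LLM captioner and text-similarity matcher.

```python
# Hypothetical sketch of a ChatVTG-style zero-shot grounding pipeline.
# `caption` stands in for a video dialogue LLM call and `score` for a
# text-embedding similarity; both are assumptions, not the paper's API.
from dataclasses import dataclass
from typing import Callable, List

Captioner = Callable[[float, float], str]  # (start_s, end_s) -> caption
Scorer = Callable[[str, str], float]       # (query, caption) -> similarity


@dataclass(frozen=True)
class Moment:
    start: float  # seconds
    end: float    # seconds


def multi_granularity_proposals(duration: float,
                                splits=(1, 2, 4, 8)) -> List[Moment]:
    """Cut the video into equal windows at several granularities
    (whole video, halves, quarters, eighths)."""
    proposals = []
    for n in splits:
        step = duration / n
        proposals += [Moment(i * step, (i + 1) * step) for i in range(n)]
    return proposals


def ground(query: str, duration: float,
           caption: Captioner, score: Scorer) -> Moment:
    # Stage 1 (coarse grounding): caption every multi-granularity proposal
    # and keep the one whose caption best matches the query.
    coarse = max(
        multi_granularity_proposals(duration),
        key=lambda m: score(query, caption(m.start, m.end)),
    )
    # Stage 2 (moment refinement): shift the coarse boundaries by a
    # fraction of its length to form finer proposals, then re-score.
    shift = (coarse.end - coarse.start) / 4
    refined = []
    for ds in (-shift, 0.0, shift):
        for de in (-shift, 0.0, shift):
            s = max(0.0, coarse.start + ds)
            e = min(duration, coarse.end + de)
            if s < e:
                refined.append(Moment(s, e))
    return max(refined, key=lambda m: score(query, caption(m.start, m.end)))


# Toy demo with word-overlap similarity and a fake captioner; a real
# system would call the video dialogue LLM and an embedding model here.
def toy_score(query: str, cap: str) -> float:
    q, c = set(query.lower().split()), set(cap.lower().split())
    return len(q & c) / max(len(q | c), 1)


if __name__ == "__main__":
    fake_captions = lambda s, e: ("a person opens the door" if s >= 20
                                  else "a person sits")
    print(ground("person opening a door", duration=40.0,
                 caption=fake_captions, score=toy_score))
```

Because both stages only rank captions against the query, the sketch needs no temporal annotations, which mirrors the abstract's claim that paired annotation data is circumvented; the cost is one captioner call per proposal, so the granularity set directly controls the inference budget.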