Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning

DOI: 10.48550/arxiv.2401.00701 Publication Date: 2024-01-01
ABSTRACT
In recent years, text-to-video retrieval methods based on CLIP have experienced rapid development. The primary direction of evolution is to exploit a much wider gamut of visual and textual cues to achieve alignment. Concretely, methods with impressive performance often design a heavy fusion block for sentence (words)-video (frames) interaction, regardless of the prohibitive computation complexity. Nevertheless, these approaches are not optimal in terms of feature utilization and retrieval efficiency. To address this issue, we adopt multi-granularity visual feature learning, ensuring the model's comprehensiveness in capturing visual content features spanning from abstract to detailed levels during the training phase. To better leverage the multi-granularity features, we devise a two-stage retrieval architecture for the retrieval phase. This solution ingeniously balances the coarse and fine granularity of the retrieval content. Moreover, it also strikes a harmonious equilibrium between retrieval effectiveness and efficiency. Specifically, in the training phase, we design a parameter-free text-gated interaction block (TIB) for fine-grained video representation learning and embed an extra Pearson Constraint to optimize cross-modal representation learning. In the retrieval phase, we use coarse-grained video representations for fast recall of top-k candidates, which are then reranked by fine-grained video representations. Extensive experiments on four benchmarks demonstrate our efficiency and effectiveness. Notably, our method achieves performance comparable to current state-of-the-art methods while being nearly 50 times faster.
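To make the two-stage design concrete, below is a minimal PyTorch sketch of coarse-to-fine retrieval, assuming normalized CLIP-style embeddings: one coarse video-level vector per video for fast top-k recall, and per-frame vectors for reranking. The function names, tensor shapes, the softmax frame weighting (a simple parameter-free stand-in for the paper's TIB), and the `1 - correlation` form of the Pearson constraint are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of two-stage coarse-to-fine text-to-video retrieval.
# Shapes, names, and the toy data below are assumptions for illustration.
import torch
import torch.nn.functional as F


def pearson_constraint(sim_a: torch.Tensor, sim_b: torch.Tensor) -> torch.Tensor:
    """Illustrative Pearson-style loss: push two cross-modal similarity
    vectors toward linear correlation by minimizing 1 - Pearson r.
    (The paper's exact formulation may differ; this is an assumption.)"""
    a = sim_a - sim_a.mean()
    b = sim_b - sim_b.mean()
    corr = (a * b).sum() / (a.norm() * b.norm() + 1e-8)
    return 1.0 - corr


def coarse_to_fine_retrieval(text_emb, video_coarse, video_fine, k=10):
    """Stage 1: score all N videos with cheap coarse embeddings, keep top-k.
    Stage 2: rerank only those k candidates with fine-grained per-frame
    embeddings, aggregated by a text-conditioned softmax over frames
    (a parameter-free stand-in for the text-gated interaction block, TIB).

    text_emb:     (D,)       one query sentence embedding
    video_coarse: (N, D)     one global embedding per video
    video_fine:   (N, T, D)  T frame embeddings per video
    """
    text_emb = F.normalize(text_emb, dim=-1)

    # --- Stage 1: fast recall via coarse video-level cosine similarity ---
    coarse_sim = F.normalize(video_coarse, dim=-1) @ text_emb        # (N,)
    _, topk_idx = coarse_sim.topk(k)

    # --- Stage 2: fine-grained rerank of the k candidates only ---
    frames = F.normalize(video_fine[topk_idx], dim=-1)               # (k, T, D)
    frame_sim = frames @ text_emb                                    # (k, T)
    weights = frame_sim.softmax(dim=-1)                              # text-gated frame weights
    fine_sim = (weights * frame_sim).sum(dim=-1)                     # (k,)

    order = fine_sim.argsort(descending=True)
    return topk_idx[order], fine_sim[order]


if __name__ == "__main__":
    torch.manual_seed(0)
    N, T, D = 1000, 12, 512                       # toy gallery size
    text = torch.randn(D)
    coarse = torch.randn(N, D)
    fine = torch.randn(N, T, D)
    ids, scores = coarse_to_fine_retrieval(text, coarse, fine, k=10)
    print("reranked candidates:", ids.tolist())
    print("pearson loss demo:", pearson_constraint(torch.randn(32), torch.randn(32)).item())
```

The efficiency claim follows from this structure: stage 1 costs one dot product per video over the whole gallery, while the expensive frame-level interaction in stage 2 touches only the k recalled candidates, which is how the method can approach heavy-fusion accuracy at a fraction of the query cost.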