VLG-Net: Video-Language Graph Matching Network for Video Grounding

DOI: 10.48550/arxiv.2011.10132
Publication Date: 2020-01-01
ABSTRACT
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable mutual exchange of information across modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built atop video snippets and query tokens separately, which are used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
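
To make the pipeline in the abstract concrete, below is a minimal, hedged Python (PyTorch) sketch of the three ideas it names: per-modality graph convolution over video snippets and query tokens, a cross-modal matching step, and masked attention pooling of enriched snippet features into a moment candidate. This is not the authors' implementation; every module name, dimension, and adjacency construction here is an assumption made for illustration only.

# Illustrative sketch only: the class names, feature dimensions, and the
# simple chain/fully-connected adjacencies are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One graph-convolution layer: aggregate neighbor features through a
    row-normalized adjacency matrix, then apply a linear projection."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) adjacency with self-loops
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.proj(adj @ x))


class CrossModalMatching(nn.Module):
    """Exchange context across modalities: each video node attends over all
    query nodes (a stand-in for the paper's Graph Matching layer)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, video, query):
        attn = torch.softmax(
            self.q(video) @ self.k(query).T / video.shape[-1] ** 0.5, dim=-1
        )
        # Video nodes enriched with query context (residual connection)
        return video + attn @ self.v(query)


def masked_attention_pool(snippets, scores, start, end):
    """Fuse the enriched snippet features inside [start, end) into a single
    moment representation, with attention restricted to that span."""
    mask = torch.zeros(snippets.shape[0], dtype=torch.bool)
    mask[start:end] = True
    weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=0)
    return weights @ snippets


# Toy usage with random features: 16 video snippets, 8 query tokens, dim 64.
dim = 64
snippet_gcn, token_gcn, matcher = GraphConv(dim), GraphConv(dim), CrossModalMatching(dim)
# Snippet graph: self-loops plus forward temporal edges; token graph: fully connected.
video = snippet_gcn(torch.randn(16, dim), torch.eye(16) + torch.diag(torch.ones(15), 1))
query = token_gcn(torch.randn(8, dim), torch.ones(8, 8))
fused = matcher(video, query)
moment = masked_attention_pool(fused, fused.mean(dim=-1), start=4, end=9)
print(moment.shape)  # torch.Size([64])

The design choice worth noting is that the cross-modal step here is ordinary scaled dot-product attention used as a placeholder; the paper's actual graph-matching and pooling formulations should be taken from the publication itself.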