VLG-Net: Video-Language Graph Matching Network for Video Grounding

DOI: 10.48550/arxiv.2011.10132
Publication Date: 2020-01-01
ABSTRACT
Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. The solution to this challenging task demands understanding videos' and queries' semantic content and fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable mutual exchange of information across modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built atop video snippets and query tokens separately, which are used to model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked attention pooling by fusing the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
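
To make the pipeline in the abstract concrete, below is a minimal, hedged Python (PyTorch) sketch of the three ideas it names: per-modality graph convolution over video snippets and query tokens, a cross-modal matching step, and masked attention pooling of enriched snippet features into a moment candidate. This is not the authors' implementation; every module name, dimension, and adjacency construction here is an assumption made for illustration only.

# Illustrative sketch only: the class names, feature dimensions, and the
# simple chain/fully-connected adjacencies are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One graph-convolution layer: aggregate neighbor features through a
    row-normalized adjacency matrix, then apply a linear projection."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (N, dim) node features; adj: (N, N) adjacency with self-loops
        adj = adj / adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        return F.relu(self.proj(adj @ x))


class CrossModalMatching(nn.Module):
    """Exchange context across modalities: each video node attends over all
    query nodes (a stand-in for the paper's Graph Matching layer)."""
    def __init__(self, dim):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))

    def forward(self, video, query):
        attn = torch.softmax(
            self.q(video) @ self.k(query).T / video.shape[-1] ** 0.5, dim=-1
        )
        # Video nodes enriched with query context (residual connection)
        return video + attn @ self.v(query)


def masked_attention_pool(snippets, scores, start, end):
    """Fuse the enriched snippet features inside [start, end) into a single
    moment representation, with attention restricted to that span."""
    mask = torch.zeros(snippets.shape[0], dtype=torch.bool)
    mask[start:end] = True
    weights = torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=0)
    return weights @ snippets


# Toy usage with random features: 16 video snippets, 8 query tokens, dim 64.
dim = 64
snippet_gcn, token_gcn, matcher = GraphConv(dim), GraphConv(dim), CrossModalMatching(dim)
# Snippet graph: self-loops plus forward temporal edges; token graph: fully connected.
video = snippet_gcn(torch.randn(16, dim), torch.eye(16) + torch.diag(torch.ones(15), 1))
query = token_gcn(torch.randn(8, dim), torch.ones(8, 8))
fused = matcher(video, query)
moment = masked_attention_pool(fused, fused.mean(dim=-1), start=4, end=9)
print(moment.shape)  # torch.Size([64])

The design choice worth noting is that the cross-modal step here is ordinary scaled dot-product attention used as a placeholder; the paper's actual graph-matching and pooling formulations should be taken from the publication itself.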