Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos
DOI:
10.1609/aaai.v33i01.33018393
Publication Date:
2019-08-21T07:41:20Z
AUTHORS (6)
ABSTRACT
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding window over the entire video or exhaustively ranking all possible clip-sentence pairs in a pre-segmented video, which inevitably suffer from exhaustively enumerated candidates. To alleviate this problem, we formulate this task as a problem of sequential decision making by learning an agent which regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, and it shows steady performance gains by considering additional supervised boundary information during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet’18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or less clips per video.
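The abstract describes grounding as sequential decision making: an agent starts from an initial temporal window and progressively adjusts its boundaries under a learned policy. The sketch below illustrates that loop only; it is not the authors' implementation. The action names (shift_left, shift_right, expand, shrink, stop), the step size, the IoU-gain reward, and the random stand-in policy are all illustrative assumptions.

```python
import random

# Minimal sketch of boundary adjustment as sequential decision making.
# All names and hyperparameters here are hypothetical, not from the paper.

ACTIONS = ["shift_left", "shift_right", "expand", "shrink", "stop"]

def temporal_iou(a, b):
    """Temporal intersection-over-union between two (start, end) windows."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def apply_action(window, action, duration, step=0.1):
    """Move or resize the current window; clamp it to [0, duration]."""
    start, end = window
    delta = step * duration
    if action == "shift_left":
        start, end = start - delta, end - delta
    elif action == "shift_right":
        start, end = start + delta, end + delta
    elif action == "expand":
        start, end = start - delta / 2, end + delta / 2
    elif action == "shrink":
        start, end = start + delta / 2, end - delta / 2
    start = max(0.0, start)
    end = min(duration, max(end, start + 1e-3))
    return (start, end)

def ground(policy, duration, gt_window, max_steps=10):
    """Run one grounding episode; reward is the IoU gain after each action."""
    window = (0.25 * duration, 0.75 * duration)  # arbitrary initial guess
    for _ in range(max_steps):                   # bounded number of observed clips
        action = policy(window)
        if action == "stop":
            break
        new_window = apply_action(window, action, duration)
        # In training, this reward signal would drive policy-gradient updates;
        # the multi-task variant would add a supervised boundary loss.
        reward = temporal_iou(new_window, gt_window) - temporal_iou(window, gt_window)
        window = new_window
    return window

# Toy usage with a random policy standing in for the learned one.
random_policy = lambda w: random.choice(ACTIONS)
print(ground(random_policy, duration=60.0, gt_window=(12.0, 24.0)))
```

Because each episode takes at most a handful of actions, the agent inspects far fewer candidate windows than a sliding-window or exhaustive-ranking approach, which is the efficiency argument made in the abstract.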
CITATIONS (107)