Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos

DOI: 10.1609/aaai.v33i01.33018393
Publication date: 2019-08-21
ABSTRACT
The task of video grounding, which temporally localizes a natural language description in a video, plays an important role in understanding videos. Existing studies have adopted strategies of sliding a window over the entire video or exhaustively ranking all possible clip-sentence pairs in a presegmented video, and they inevitably suffer from the exhaustively enumerated candidates. To alleviate this problem, we formulate the task as a problem of sequential decision making by learning an agent that regulates the temporal grounding boundaries progressively based on its policy. Specifically, we propose a reinforcement learning based framework improved by multi-task learning, and it shows steady performance gains when additional supervised boundary information is considered during training. Our proposed framework achieves state-of-the-art performance on the ActivityNet'18 DenseCaption dataset (Krishna et al. 2017) and the Charades-STA dataset (Sigurdsson et al. 2016; Gao et al. 2017) while observing only 10 or fewer clips per video.
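
The following is a minimal, illustrative Python sketch of the sequential boundary-adjustment loop the abstract describes. The discrete action set, the 10%-of-video step size, the IoU-improvement reward, and the random stand-in policy are all assumptions made for illustration; they are not the paper's actual action space, reward design, or learned policy. The cap of 10 steps per episode mirrors the abstract's claim of observing only 10 or fewer clips per video.

import random

# Hypothetical action set for progressively adjusting a temporal window
# (assumed for illustration; the paper's action space may differ).
ACTIONS = ["start_left", "start_right", "end_left", "end_right", "stop"]

def temporal_iou(a, b):
    """Temporal IoU between two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0.0 else 0.0

def apply_action(window, action, delta, video_len):
    """Shift one boundary by delta, keeping 0 <= start < end <= video_len."""
    s, e = window
    if action == "start_left":
        s -= delta
    elif action == "start_right":
        s += delta
    elif action == "end_left":
        e -= delta
    elif action == "end_right":
        e += delta
    s = max(0.0, min(s, e - 1e-3))
    e = min(video_len, max(e, s + 1e-3))
    return (s, e)

def ground(policy, query, video_len, gt_segment, max_steps=10):
    """One episode: start from the whole video and refine the window.

    The reward here is shaped as the IoU improvement against the
    ground-truth segment (an assumption, not the paper's exact reward).
    """
    window = (0.0, video_len)
    trajectory = []
    for _ in range(max_steps):
        action = policy(query, window)
        if action == "stop":
            break
        new_window = apply_action(window, action, delta=0.1 * video_len,
                                  video_len=video_len)
        reward = (temporal_iou(new_window, gt_segment)
                  - temporal_iou(window, gt_segment))
        trajectory.append((window, action, reward))
        window = new_window
    return window, trajectory

def random_policy(query, window):
    """Stand-in for the learned policy network."""
    return random.choice(ACTIONS)

final_window, traj = ground(random_policy, query="person opens a door",
                            video_len=100.0, gt_segment=(20.0, 45.0))
print(final_window, len(traj))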