UniMD: Towards Unifying Moment Retrieval and Temporal Action Detection

FOS: Computer and information sciences; Computer Vision and Pattern Recognition (cs.CV)
DOI: 10.48550/arxiv.2404.04933
Publication Date: 2024-04-07
ABSTRACT
Temporal Action Detection (TAD) focuses on detecting pre-defined actions, while Moment Retrieval (MR) aims to identify the events described by open-ended natural language within untrimmed videos. Despite focusing on different events, we observe that they have a significant connection. For instance, most descriptions in MR involve multiple actions from TAD. In this paper, we aim to investigate the potential synergy between TAD and MR. Firstly, we propose a unified architecture, termed Unified Moment Detection (UniMD), for both TAD and MR. It transforms the inputs of the two tasks, namely actions for TAD or events for MR, into a common embedding space, and utilizes novel query-dependent decoders to generate a uniform output of classification scores and temporal segments. Secondly, we explore the efficacy of two task fusion learning approaches, pre-training and co-training, in order to enhance the mutual benefits between TAD and MR. Extensive experiments demonstrate that the proposed task fusion learning scheme enables the two tasks to help each other and outperform their separately trained counterparts. Impressively, UniMD achieves state-of-the-art results on three paired datasets: Ego4D, Charades-STA, and ActivityNet. Our code will be released at https://github.com/yingsen1/UniMD.
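ILLUSTRATIVE SKETCH
The abstract describes a query-conditioned design: action names (TAD) and sentences (MR) are mapped into a common embedding space, and query-dependent decoders emit a classification score and temporal segments per snippet. The PyTorch sketch below only illustrates that interface under stated assumptions; the module names, feature dimensions, and the simple multiplicative fusion are our own placeholders, not the authors' implementation (see the released code for the actual architecture).

# Minimal sketch of a query-conditioned decoder interface, assuming
# precomputed video snippet features and a text query embedding
# (e.g. from a CLIP-style text encoder). All design details here are
# assumptions for illustration, not the UniMD implementation.
import torch
import torch.nn as nn

class QueryDependentDecoders(nn.Module):
    def __init__(self, video_dim: int = 512, query_dim: int = 512, hidden: int = 256):
        super().__init__()
        # Project video snippets and the text query into a shared space.
        self.video_proj = nn.Linear(video_dim, hidden)
        self.query_proj = nn.Linear(query_dim, hidden)
        # Classification decoder: per-snippet score for the given query.
        self.cls_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        # Localization decoder: per-snippet (start, end) offsets in snippet units.
        self.reg_head = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, video_feats: torch.Tensor, query_emb: torch.Tensor):
        # video_feats: (B, T, video_dim) snippet features from a video backbone.
        # query_emb:   (B, query_dim) embedding of an action name (TAD) or a sentence (MR).
        v = self.video_proj(video_feats)              # (B, T, hidden)
        q = self.query_proj(query_emb).unsqueeze(1)   # (B, 1, hidden)
        fused = v * q                                 # multiplicative fusion (assumption)
        scores = self.cls_head(fused).squeeze(-1)     # (B, T) classification scores
        offsets = self.reg_head(fused).relu()         # (B, T, 2) distances to segment start/end
        return scores, offsets

if __name__ == "__main__":
    model = QueryDependentDecoders()
    video = torch.randn(2, 128, 512)   # 2 videos, 128 snippets each
    query = torch.randn(2, 512)        # one query per video (action or sentence)
    scores, offsets = model(video, query)
    print(scores.shape, offsets.shape)  # torch.Size([2, 128]) torch.Size([2, 128, 2])

Because both tasks share this single input/output format, the same network can in principle be pre-trained or co-trained on TAD and MR data, which is the task fusion setting the abstract refers to.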