A Tracking-Based Two-Stage Framework for Spatio-Temporal Action Detection
DOI: 10.3390/electronics13030479
Publication Date: 2024-01-24T12:42:16Z
AUTHORS (8)
ABSTRACT
Spatio-temporal action detection (STAD) is a task receiving widespread attention and has numerous application scenarios, such as video surveillance and smart education. Current studies follow a localization-based two-stage detection paradigm, which exploits a person detector for action localization and a feature processing model with a classifier for action classification. However, many issues occur due to the imbalance between task settings and model complexity in STAD. Firstly, the model complexity of heavy offline person detectors adds to the inference overhead. Secondly, the frame-level actor proposals are incompatible with the video-level feature aggregation and Region-of-Interest feature pooling in action classification, which limits the detection performance under diverse action motions and results in low detection accuracy. In this paper, we propose a tracking-based two-stage spatio-temporal action detection framework called TrAD. The key idea of TrAD is to build video-level consistency and reduce model complexity in our STAD framework by generating action track proposals among multiple video frames instead of actor proposals in a single frame. In particular, we utilize tailored tracking to simulate the behavior of human cognitive actions and use the captured motion trajectories as video-level proposals. We then integrate a proposal scaling method and a feature aggregation module into action classification to enhance feature pooling for detected tracks. Evaluations on the AVA dataset demonstrate that TrAD achieves SOTA performance with 29.7 mAP, while also facilitating a 58% reduction in overall computation compared to SlowFast.
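The difference between frame-level actor proposals and track-level proposals can be illustrated with a minimal sketch. This is not the TrAD implementation: greedy IoU linking stands in for the paper's tailored tracker, and the function names (`link_tracks`, `track_proposal`), the IoU threshold, and the scale factor are all hypothetical choices made for illustration. The sketch links per-frame person boxes into tracks, then turns each track into one enlarged box that could serve as a video-level region for feature pooling.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def link_tracks(frames: List[List[Box]],
                iou_thresh: float = 0.3) -> List[Dict[int, Box]]:
    """Greedily link per-frame boxes into tracks (frame index -> box).

    Assumption: a simple stand-in for a real tracker, not the paper's method.
    """
    tracks: List[Dict[int, Box]] = []
    for t, boxes in enumerate(frames):
        unmatched = list(boxes)
        for track in tracks:
            last_t = max(track)
            # Only extend tracks that were alive in the previous frame.
            if last_t != t - 1 or not unmatched:
                continue
            best = max(unmatched, key=lambda b: iou(track[last_t], b))
            if iou(track[last_t], best) >= iou_thresh:
                track[t] = best
                unmatched.remove(best)
        # Every box left unmatched starts a new track.
        tracks.extend({t: b} for b in unmatched)
    return tracks


def track_proposal(track: Dict[int, Box], scale: float = 1.2) -> Box:
    """Union box over a track, enlarged about its centre by `scale`.

    Assumption: a hypothetical form of proposal scaling, for illustration.
    """
    x1 = min(b[0] for b in track.values())
    y1 = min(b[1] for b in track.values())
    x2 = max(b[2] for b in track.values())
    y2 = max(b[3] for b in track.values())
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    hw, hh = (x2 - x1) / 2 * scale, (y2 - y1) / 2 * scale
    return (cx - hw, cy - hh, cx + hw, cy + hh)


# Toy input: two people in frames 0 and 1, one of them still visible in frame 2.
frames = [
    [(10, 10, 50, 100), (200, 20, 240, 110)],
    [(14, 10, 54, 100), (198, 20, 238, 110)],
    [(18, 10, 58, 100)],
]
tracks = link_tracks(frames)
proposals = [track_proposal(tr) for tr in tracks]
```

Here each track yields a single region covering the actor's motion across frames, whereas a frame-level proposal would cover only the actor's position in one keyframe.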