Boosting Video Representation Learning with Multi-Faceted Integration

Keywords: Representation Learning, Feature Learning, Boosting, Closed Captioning
DOI: 10.48550/arxiv.2201.04023 Publication Date: 2022-01-01
ABSTRACT
Video content is multifaceted, consisting of objects, scenes, interactions, and actions. Existing datasets mostly label only one of these facets for model training, resulting in a video representation that is biased toward the facet emphasized by the training dataset. There is no study yet on how to learn a video representation from multifaceted labels, or on whether such multifaceted information is helpful for representation learning. In this paper, we propose a new learning framework, MUlti-Faceted Integration (MUFI), to aggregate facets from different datasets and learn a representation that reflects the full spectrum of video content. Technically, MUFI formulates the problem as visual-semantic embedding learning, which explicitly maps video representations into a rich semantic embedding space and jointly optimizes them from two perspectives. One capitalizes on the intra-facet supervision between each video and its own label descriptions; the second predicts the "semantic representation" of each video from other facets as inter-facet supervision. Extensive experiments demonstrate that training a 3D CNN via our MUFI framework on the union of four large-scale video datasets plus two image datasets leads to superior video representation capability. The 3D CNN pre-learnt with MUFI also shows clear improvements over other approaches on several downstream video applications. More remarkably, MUFI achieves 98.1%/80.9% on UCF101/HMDB51 for action recognition and 101.5% in terms of CIDEr-D score on MSVD for video captioning.
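To make the two-perspective objective concrete, here is a minimal numpy sketch of a MUFI-style loss combining intra-facet alignment with inter-facet prediction. The function name, the choice of cosine distance for the intra-facet term, the mean-squared-error form of the inter-facet term, and the weighting factor `alpha` are all illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize rows to unit length (eps avoids division by zero)."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def mufi_style_loss(video_emb, own_label_emb, other_facet_emb, alpha=0.5):
    """Toy two-term objective in the spirit of MUFI (details are assumptions):

    - intra-facet term: pull each video embedding toward the embedding of
      its own facet's label description (here: 1 - cosine similarity).
    - inter-facet term: regress the video embedding onto the "semantic
      representation" derived from other facets (here: mean squared error).

    `alpha` is a hypothetical weight balancing the two terms.
    """
    v = l2_normalize(video_emb)
    t = l2_normalize(own_label_emb)
    intra = 1.0 - np.sum(v * t, axis=-1)              # per-sample cosine distance
    inter = np.mean((video_emb - other_facet_emb) ** 2, axis=-1)
    return np.mean(intra + alpha * inter)

# Toy usage: a batch of 4 videos with 8-dimensional embeddings.
rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 8))
labels = rng.normal(size=(4, 8))       # own-facet label embeddings
others = rng.normal(size=(4, 8))       # inter-facet semantic targets
loss = mufi_style_loss(videos, labels, others)
```

In this sketch the intra-facet term supervises each video against its own dataset's labels, while the inter-facet term injects supervision from the other datasets' facets, mirroring the joint optimization described in the abstract.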