Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning

DOI: 10.48550/arxiv.2502.13754 Publication Date: 2025-02-19
ABSTRACT
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. However, behavior is dynamic and complex. To comprehensively capture the essence of behavior, we propose a dynamic action semantic-aware graph transformer. Firstly, a multi-scale temporal modeling module is designed to flexibly learn long- and short-term latent features. It not only acquires features across multiple time scales but also considers local details, enhancing the coherence and sensitivity of the representations. Secondly, a visual-action semantic-aware module is proposed to adaptively capture semantic features related to behavior, improving the richness and accuracy of action representations. By harnessing the collaborative efforts of these two modules, we can acquire rich behavior representations to generate human-like natural descriptions. Finally, these representations are used to construct an objects-action graph, which is fed into a graph transformer to model the complex dependencies between objects and actions. To avoid adding complexity in the inference phase, the behavioral knowledge is distilled into a simple network through knowledge distillation. Experimental results on the MSVD and MSR-VTT datasets demonstrate that the proposed method achieves significant performance improvements across multiple metrics.
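The distillation step mentioned in the abstract (transferring behavioral knowledge from the graph-transformer teacher into a simpler student network) is commonly implemented as a temperature-softened KL-divergence loss between teacher and student logits. The sketch below illustrates that generic mechanism only; the function names, the temperature value, and the use of plain NumPy are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; subtracting the max is for numerical stability.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) at temperature T, averaged over the batch.

    The T**2 factor keeps gradient magnitudes comparable across temperatures
    (standard practice in knowledge distillation).
    """
    p = softmax(teacher_logits, T)          # softened teacher distribution
    q = softmax(student_logits, T)          # softened student distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

# Toy usage: two items, vocabulary of three tokens.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.8, 0.6, -0.9], [0.0, 1.2, 0.4]])
loss = distillation_loss(student, teacher)
```

When the student's logits match the teacher's exactly, the loss is zero; training the simple network to minimize this term lets it approximate the teacher's behavior distribution without carrying the graph transformer's cost at inference time.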