Capturing Rich Behavior Representations: A Dynamic Action Semantic-Aware Graph Transformer for Video Captioning

DOI: 10.48550/arxiv.2502.13754 Publication Date: 2025-02-19
ABSTRACT
Existing video captioning methods merely provide shallow or simplistic representations of object behaviors, resulting in superficial and ambiguous descriptions. However, behavior is dynamic and complex. To comprehensively capture the essence of behavior, we propose a dynamic action semantic-aware graph transformer. Firstly, a multi-scale temporal modeling module is designed to flexibly learn long- and short-term latent features. It not only acquires features across multiple time scales but also considers local details, enhancing the coherence and sensitivity of the representations. Secondly, a visual-action semantic-aware module is proposed to adaptively capture semantic features related to behavior, improving the richness and accuracy of action representations. By harnessing the collaborative efforts of these two modules, we can acquire rich behavior representations to generate human-like natural descriptions. Finally, these representations are used to construct an objects-action graph, which is fed into a graph transformer to model the complex dependencies between objects and actions. To avoid adding complexity in the inference phase, the behavioral knowledge is distilled into a simple network through knowledge distillation. Experimental results on the MSVD and MSR-VTT datasets demonstrate that the proposed method achieves significant performance improvements across multiple metrics.
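The distillation step mentioned in the abstract (transferring behavioral knowledge from the graph-transformer teacher into a simpler student network) is commonly implemented as a temperature-softened KL-divergence loss between teacher and student logits. The sketch below illustrates that generic mechanism only; the function names, the temperature value, and the use of plain NumPy are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax; subtracting the max is for numerical stability.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) at temperature T, averaged over the batch.

    The T**2 factor keeps gradient magnitudes comparable across temperatures
    (standard practice in knowledge distillation).
    """
    p = softmax(teacher_logits, T)          # softened teacher distribution
    q = softmax(student_logits, T)          # softened student distribution
    kl = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float(kl.mean() * T * T)

# Toy usage: two items, vocabulary of three tokens.
teacher = np.array([[2.0, 0.5, -1.0], [0.1, 1.5, 0.3]])
student = np.array([[1.8, 0.6, -0.9], [0.0, 1.2, 0.4]])
loss = distillation_loss(student, teacher)
```

When the student's logits match the teacher's exactly, the loss is zero; training the simple network to minimize this term lets it approximate the teacher's behavior distribution without carrying the graph transformer's cost at inference time.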