Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model

FOS: Computer and information sciences · Computer Vision and Pattern Recognition (cs.CV) · Human-Computer Interaction (cs.HC) · Multimedia (cs.MM)
DOI: 10.48550/arxiv.2404.01862 Publication Date: 2024-04-02
ABSTRACT
Co-speech gestures, if presented in the lively form of videos, can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons, resulting in the omission of appearance information, we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even for sequences of arbitrary length. To solve these problems, we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically, we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features that preserve essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and perform generation in the latent motion space, followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception, we further design a refinement network focusing on missing details in certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion-related and video-related evaluations. Our code, demos, and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion.
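The abstract's latent motion representation is built on a nonlinear thin-plate-spline (TPS) transformation. Below is a minimal, self-contained sketch of generic TPS warping of 2D points, given only as an illustration of the underlying transformation family under the usual TPS formulation; the function names and the toy keypoint setup are hypothetical and do not reflect the authors' code in the linked repository.

```python
# Generic thin-plate-spline (TPS) warp: fit a nonlinear mapping that sends a
# set of source control points to target points, then apply it to a dense grid.
# Illustration only; not the paper's actual motion-feature implementation.
import torch


def tps_fit(src_pts: torch.Tensor, dst_pts: torch.Tensor):
    """Solve for TPS coefficients mapping src control points (N, 2) to dst (N, 2).

    Returns radial weights W of shape (N, 2) and affine part A of shape (3, 2).
    """
    n = src_pts.shape[0]
    # Radial basis U(r) = r^2 log(r^2), with U(0) = 0.
    d2 = torch.cdist(src_pts, src_pts).pow(2)
    K = d2 * torch.log(d2 + 1e-9)
    P = torch.cat([torch.ones(n, 1), src_pts], dim=1)        # (N, 3)
    # Standard TPS linear system [[K, P], [P^T, 0]] [W; A] = [dst; 0].
    top = torch.cat([K, P], dim=1)                            # (N, N+3)
    bottom = torch.cat([P.t(), torch.zeros(3, 3)], dim=1)     # (3, N+3)
    L = torch.cat([top, bottom], dim=0)                       # (N+3, N+3)
    Y = torch.cat([dst_pts, torch.zeros(3, 2)], dim=0)        # (N+3, 2)
    sol = torch.linalg.solve(L, Y)
    return sol[:n], sol[n:]                                   # W, A


def tps_apply(pts: torch.Tensor, src_pts: torch.Tensor,
              W: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
    """Apply the fitted TPS transform to arbitrary 2D points of shape (M, 2)."""
    d2 = torch.cdist(pts, src_pts).pow(2)
    U = d2 * torch.log(d2 + 1e-9)
    affine = torch.cat([torch.ones(pts.shape[0], 1), pts], dim=1) @ A
    return affine + U @ W


if __name__ == "__main__":
    # Toy example: 5 keypoints nudged to new positions; warp a dense 32x32 grid.
    src = torch.rand(5, 2)
    dst = src + 0.05 * torch.randn(5, 2)
    W, A = tps_fit(src, dst)
    ys, xs = torch.meshgrid(torch.linspace(0, 1, 32),
                            torch.linspace(0, 1, 32), indexing="ij")
    grid = torch.stack([xs.flatten(), ys.flatten()], dim=1)   # (1024, 2)
    warped = tps_apply(grid, src, W, A)
    print(warped.shape)  # torch.Size([1024, 2])
```

In a keypoint-driven video pipeline, warps of this kind let motion be described by a small set of control points while the warped grid carries the appearance of a reference frame, which is the intuition behind decoupling motion from appearance in the abstract.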