- Human Pose and Action Recognition
- Human Motion and Animation
- Hand Gesture Recognition Systems
- Multimodal Machine Learning Applications
- Music and Audio Processing
- Speech and Dialogue Systems
- Robotics and Automated Systems
- Video Analysis and Summarization
- Hearing Impairment and Communication
Peking University
2022-2024
The automatic generation of stylized co-speech gestures has recently received increasing attention. Previous systems typically allow style control via predefined text labels or example motion clips, which are often not flexible enough to convey user intent accurately. In this work, we present GestureDiffuCLIP, a neural network framework for synthesizing realistic, stylized co-speech gestures with flexible style control. We leverage the power of the large-scale Contrastive-Language-Image-Pre-training (CLIP) model and present a novel CLIP-guided...
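The CLIP-guided mechanism is only named in the snippet above, but the core idea of steering a gesture generator with CLIP text features can be illustrated with a short sketch. This is a minimal illustration, not the paper's implementation: it assumes a hypothetical gesture embedding already projected into CLIP's joint space, and it uses OpenAI's open-source `clip` package to score that embedding against a style prompt.

```python
# Minimal sketch of CLIP-guided style supervision. The gesture embedding is a
# hypothetical stand-in for the output of a gesture encoder trained to map
# motion into CLIP's embedding space; only the `clip` package calls are real.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

def clip_style_loss(gesture_embedding: torch.Tensor, style_prompt: str) -> torch.Tensor:
    """Cosine-distance loss pulling a gesture embedding toward a text style prompt."""
    tokens = clip.tokenize([style_prompt]).to(device)
    with torch.no_grad():
        text_feat = clip_model.encode_text(tokens).float()
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    gest_feat = gesture_embedding / gesture_embedding.norm(dim=-1, keepdim=True)
    return 1.0 - (gest_feat * text_feat).sum(dim=-1).mean()
```

In a diffusion-based generator, a loss of this form could serve as style guidance during training or sampling, pushing generated motion toward the prompt's region of CLIP space.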
Automatic synthesis of realistic co-speech gestures is an increasingly important yet challenging task in artificial embodied agent creation. Previous systems mainly focus on generating gestures in an end-to-end manner, which leads to difficulties in mining the clear rhythm and semantics due to the complex yet subtle harmony between speech and gestures. We present a novel co-speech gesture synthesis method that achieves convincing results on both rhythm and semantics. For rhythm, our system contains a robust rhythm-based segmentation pipeline to ensure the temporal...
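As a rough illustration of what a rhythm-based segmentation pipeline can look like, the sketch below splits speech audio into windows at detected onsets using `librosa`. The onset-based boundary rule is an assumption made for illustration; the paper's actual pipeline is more elaborate than this.

```python
# Minimal sketch of rhythm-based audio segmentation: cut the signal at
# detected onset times. This approximates, rather than reproduces, a
# production segmentation pipeline.
import librosa

def rhythm_segments(wav_path: str):
    """Split speech audio into (start, end) windows, in seconds, at detected onsets."""
    audio, sr = librosa.load(wav_path, sr=16000)
    onsets = librosa.onset.onset_detect(y=audio, sr=sr, units="time", backtrack=True)
    boundaries = [0.0, *onsets, len(audio) / sr]
    # Drop degenerate zero-length windows that can arise at the edges.
    return [(s, e) for s, e in zip(boundaries, boundaries[1:]) if e - s > 1e-3]
```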
In this work, we present MoConVQ, a novel unified framework for physics-based motion control leveraging scalable discrete representations. Building upon vector quantized variational autoencoders (VQ-VAE) and model-based reinforcement learning, our approach effectively learns motion embeddings from a large, unstructured dataset spanning tens of hours of motion examples. The resultant representation not only captures diverse motion skills but also offers a robust and intuitive interface for various applications. We demonstrate...
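The VQ-VAE bottleneck that such discrete motion representations build on can be sketched compactly. The code below is a generic vector-quantization layer (snap each latent to its nearest codebook entry, with a straight-through gradient); the codebook size and dimensions are illustrative, not the paper's settings.

```python
# Minimal sketch of a VQ-VAE quantization bottleneck over motion latents.
import torch
import torch.nn as nn

class VectorQuantizer(nn.Module):
    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous latents from a motion encoder
        flat = z.reshape(-1, z.shape[-1])
        dists = torch.cdist(flat, self.codebook.weight)       # distance to every code
        indices = dists.argmin(dim=-1).reshape(z.shape[:-1])  # discrete motion tokens
        z_q = self.codebook(indices)                          # quantized latents
        # Codebook and commitment losses, stop-gradient on the opposite side.
        vq_loss = ((z.detach() - z_q) ** 2).mean() \
                + 0.25 * ((z - z_q.detach()) ** 2).mean()
        z_q = z + (z_q - z).detach()  # straight-through estimator for the decoder
        return z_q, indices, vq_loss
```

The discrete `indices` are what make the representation "scalable": downstream controllers and generative models can operate on token sequences instead of raw poses.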
In this work, we present Semantic Gesticulator, a novel framework designed to synthesize realistic gestures accompanying speech with strong semantic correspondence. Semantically meaningful gestures are crucial for effective non-verbal communication, but such gestures often fall within the long tail of the distribution of natural human motion. The sparsity of these movements makes it challenging for deep learning-based systems, trained on moderately sized datasets, to capture the relationship between the movements and the corresponding speech semantics...
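To make the idea of semantics-aware gesture selection concrete, here is a deliberately simple sketch: retrieve the motion clip whose text tag is closest to the spoken transcript under some text-embedding function. The library, its tags, and the `embed` callable are all hypothetical; the paper's framework goes well beyond nearest-neighbor retrieval.

```python
# Minimal sketch of semantics-driven gesture retrieval by cosine similarity.
# `library` and `embed` are hypothetical: `library` maps a text tag
# (e.g. "big", "refuse") to a motion clip id, and `embed` is any function
# returning a fixed-size text embedding as a NumPy vector.
import numpy as np

def retrieve_gesture(transcript: str, library: dict, embed) -> str:
    """Return the clip id whose tag best matches the transcript semantically."""
    query = embed(transcript)

    def score(tag: str) -> float:
        v = embed(tag)
        denom = np.linalg.norm(query) * np.linalg.norm(v) + 1e-8
        return float(np.dot(query, v) / denom)

    return library[max(library, key=score)]
```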
How to automatically synthesize natural-looking dance movements based on a piece of music is an increasingly popular yet challenging task. Most existing data-driven approaches require hard-to-get paired training data and fail to generate long sequences of motion due to the error accumulation of their autoregressive structure. We present a novel 3D dance synthesis system that only needs unpaired data for training and can generate realistic long-term motions at the same time. For training, we explore the disentanglement of beat and style, and propose...
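A music-to-dance system of this kind typically starts from estimated musical beats. The snippet below shows a standard way to extract tempo and beat times with `librosa`; the beat/style disentanglement described above is not reproduced here.

```python
# Minimal sketch of music-beat extraction, a common front end for
# beat-aligned dance synthesis.
import librosa

def music_beats(path: str):
    """Return estimated tempo (BPM) and beat times in seconds for a music file."""
    y, sr = librosa.load(path)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    return tempo, librosa.frames_to_time(beat_frames, sr=sr)
```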