- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Multimodal Machine Learning Applications
- Advanced Vision and Imaging
- Computer Graphics and Visualization Techniques
- 3D Shape Modeling and Analysis
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Video Analysis and Summarization
- Human Motion and Animation
- Advanced Image Processing Techniques
- Gaussian Processes and Bayesian Inference
- Reinforcement Learning in Robotics
- Video Surveillance and Tracking Methods
- Image and Signal Denoising Methods
- Image Processing Techniques and Applications
- 3D Surveying and Cultural Heritage
- Music and Audio Processing
- Neural Networks and Applications
- Anomaly Detection Techniques and Applications
- Topic Modeling
- Image Processing and 3D Reconstruction
- Machine Learning and Algorithms
- Stock Market Forecasting Methods
Beijing Academy of Artificial Intelligence
2022-2024
Shanghai Artificial Intelligence Laboratory
2022-2024
ShangHai JiAi Genetics & IVF Institute
2023-2024
University of Electronic Science and Technology of China
2009-2023
University of Shanghai for Science and Technology
2023
Google (United States)
2020-2022
State Key Laboratory of Mobile Networks and Mobile Multimedia Technology
2022
ZTE (China)
2022
Nanyang Technological University
2021-2022
China Mobile (China)
2022
Relationships among objects play a crucial role in image understanding. Despite the great success of deep learning techniques in recognizing individual objects, reasoning about the relationships among objects remains a challenging task. Previous methods often treat this as a classification problem, considering each type of relationship (e.g. "ride") or each distinct visual phrase (e.g. "person-ride-horse") as a category. Such approaches are faced with significant difficulties caused by the high diversity of visual appearance for each kind of relationship or the large number...
Despite the substantial progress in recent years, image captioning techniques are still far from being perfect. Sentences produced by existing methods, e.g. those based on RNNs, are often overly rigid and lacking in variability. This issue is related to a learning principle widely used in practice, that is, to maximize the likelihood of training samples. This principle encourages high resemblance to the "ground-truth" captions, while suppressing other reasonable descriptions. Conventional evaluation metrics, e.g. BLEU and METEOR, also...
Visual tempo characterizes the dynamics and the temporal scale of an action. Modeling such visual tempos of different actions facilitates their recognition. Previous works often capture the visual tempo through sampling raw videos at multiple rates and constructing an input-level frame pyramid, which usually requires a costly multi-branch network to handle. In this work we propose a generic Temporal Pyramid Network (TPN) at the feature level, which can be flexibly integrated into 2D or 3D backbone networks in a plug-and-play manner. Two...
On public benchmarks, current action recognition techniques have achieved great success. However, when used in real-world applications, e.g. sport analysis, which requires the capability of parsing an activity into phases and differentiating between subtly different actions, their performances remain far from being satisfactory. To take action recognition to a new level, we develop FineGym, a dataset built on top of gymnasium videos. Compared to existing datasets, FineGym is distinguished in richness, quality, and diversity....
Learning a good image prior is a long-term goal for image restoration and manipulation. While existing methods like deep image prior (DIP) capture low-level image statistics, there are still gaps toward an image prior that captures rich image semantics including color, spatial coherence, textures, and high-level concepts. This work presents an effective way to exploit the image prior captured by a generative adversarial network (GAN) trained on large-scale natural images. As shown in Fig. 1, the deep generative prior (DGP) provides compelling results to restore missing semantics,...
Existing image restoration methods mostly leverage the posterior distribution of natural images. However, they often assume known degradation and also require supervised training, which restricts their adaptation to complex real applications. In this work, we propose the Generative Diffusion Prior (GDP) to effectively model the posterior distributions in an unsupervised sampling manner. GDP utilizes a pre-trained denoising diffusion generative model (DDPM) for solving linear inverse, non-linear, or blind problems....
Generating speech-consistent body and gesture movements is a long-standing problem in virtual avatar creation. Previous studies often synthesize pose movement in a holistic manner, where poses of all joints are generated simultaneously. Such a straightforward pipeline fails to generate fine-grained co-speech gestures. One observation is that the hierarchical semantics in speech and the hierarchical structures of human gestures can be naturally described at multiple granularities and associated together. To fully utilize the rich...
Natural scene understanding is a challenging task, particularly when encountering images of multiple objects that are partially occluded. This obstacle arises from varying object ordering and positioning. Existing scene understanding paradigms are able to parse only the visible parts, resulting in incomplete and unstructured scene interpretation. In this paper, we investigate the problem of scene de-occlusion, which aims to recover the underlying occlusion ordering and complete the invisible parts of occluded objects. We make the first attempt to address the problem through a novel...
Sounds provide rich semantics, complementary to visual data, for many tasks. However, in practice, sounds from multiple sources are often mixed together. In this paper we propose a novel framework, referred to as MinusPlus Network (MP-Net), for the task of sound separation. MP-Net separates sounds recursively in the order of average energy, removing the separated sound from the mixture at the end of each prediction, until the mixture becomes empty or contains only noise. In this way, MP-Net could be applied to sound mixtures with arbitrary numbers and types of sounds. Moreover,...
Image captioning, a popular topic in computer vision, has achieved substantial progress in recent years. However, the distinctiveness of natural descriptions is often overlooked in previous work. It is closely related to the quality of captions, as distinctive captions are more likely to describe images with their unique aspects. In this work, we propose a new learning method, Contrastive Learning (CL), for image captioning. Specifically, via two constraints formulated on top of a reference model, the proposed method can...
Most 3D shape completion approaches rely heavily on partial-complete shape pairs and learn in a fully supervised manner. Despite their impressive performances on in-domain data, when generalizing to partial shapes in other forms or real-world partial scans, they often obtain unsatisfactory results due to domain gaps. In contrast to previous fully supervised approaches, in this paper we present ShapeInversion, which introduces Generative Adversarial Network (GAN) inversion to shape completion for the first time. ShapeInversion uses a GAN...
Recent advances like StyleGAN have promoted the growth of controllable facial editing. To address its core challenge of attribute decoupling in a single latent space, attempts have been made to adopt dual-space GAN for better disentanglement of style and content representations. Nonetheless, these methods are still unable to obtain plausible editing results with high controllability, especially for complicated attributes. In this study, we highlight the importance of interaction in a dual-space GAN for more controllable editing. We propose TransEditor,...
We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework. DiffBIR decouples the blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. Each stage is developed independently but they work seamlessly in a cascaded manner. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet...
Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs. However, due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category, e.g., "man-eating-pizza, giraffe-eating-leaf", and severe inter-class similarity between different classes, e.g., "man-holding-plate, man-eating-pizza", in the model's latent space. The above challenges prevent current...
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as a basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal...