- Generative Adversarial Networks and Image Synthesis
- Multimodal Machine Learning Applications
- Face Recognition and Analysis
- Advanced Vision and Imaging
- Advanced Image Processing Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Neuroimaging Techniques and Applications
- Video Analysis and Summarization
- Advanced Image and Video Retrieval Techniques
- Facial Nerve Paralysis Treatment and Research
- Video Coding and Compression Technologies
- Human Pose and Action Recognition
- Image Processing Techniques and Applications
- Data Management and Algorithms
- Model Reduction and Neural Networks
- Aesthetic Perception and Analysis
- Handwritten Text Recognition Techniques
- Visual Attention and Saliency Detection
- Vehicle License Plate Recognition
- Cell Image Analysis Techniques
- Image and Signal Denoising Methods
- Data Visualization and Analytics
- Computer Graphics and Visualization Techniques
- Advanced Data Processing Techniques
- Image Retrieval and Classification Techniques
University of California, Santa Barbara
2025
Nanyang Technological University
2021-2024
Southern University of Science and Technology
2022
Wuhan University of Technology
2022
The Ohio State University
2015
Diffusion models have recently arisen as a powerful generative tool. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the generation process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained...
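The multi-modal collaboration described above can be illustrated with a toy sketch: two uni-modal diffusion models each predict a noise estimate, and the estimates are fused with spatially varying influence weights (the paper learns these with a "dynamic diffuser"; here the weights are simply given and softmax-normalized per pixel, which is an illustrative assumption, not the paper's implementation).

```python
import numpy as np

def collaborate(eps_text, eps_mask, w_text, w_mask):
    """Toy fusion of two uni-modal diffusion predictions.

    eps_text / eps_mask: noise estimates from a text-driven and a
    mask-driven model. w_text / w_mask: per-pixel influence logits
    (in the paper, predicted by a learned dynamic diffuser).
    """
    w = np.stack([w_text, w_mask])
    w = np.exp(w) / np.exp(w).sum(axis=0, keepdims=True)  # per-pixel softmax
    # weighted combination of the two predictions at every pixel
    return w[0] * eps_text + w[1] * eps_mask
```

With equal logits the fusion reduces to a simple average, so each modality contributes half of the final denoising direction.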
Facial editing is an important task in vision and graphics with numerous applications. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing from a slightly smiling face to a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing...
This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as the basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded latent diffusion models, comprising a base T2V model, a temporal...
Diffusion models are gaining increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly focus on capturing object appearances (i.e., the "look"). However, how to invert relations, another important pillar in the visual world, remains unexplored. In this work, we propose the Relation Inversion task, which aims to learn a specific relation (represented as a "relation...
We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training...
The rising prevalence of mental health issues highlights the urgent need for accurate, scalable, and timely prediction systems. Deep learning, a subset of machine learning inspired by the structure of human neurons, has offered an opportunity for innovative solutions in diagnosis. The main idea of this paper is to analyze the application of deep learning in diagnosing mental disorders, including but not limited to Alzheimer's, Parkinson's, and Schizophrenia. An enormous number of techniques will be put into real life while dealing with diagnosis...
Multi-modality magnetic resonance (MR) images provide complementary information for disease diagnoses. However, modality missing is quite common in real-life clinical practice. Current methods usually employ a convolution-based generative adversarial network (GAN) or its variants to synthesize the missing modality. With the development of the vision transformer, we explore its application to the MRI synthesis task in this work. We propose a novel supervised deep learning method for synthesizing a missing modality, making use...
In this paper, we uncover the untapped potential of the diffusion U-Net, which serves as a "free lunch" that substantially improves generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method, termed...
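The backbone/skip observation above can be sketched in a few lines: amplify the (semantic, low-frequency) backbone features and damp the high-frequency content of the skip features before they are fused in the decoder. The scaling factors `b`/`s` and the hard radial low-pass mask below are illustrative assumptions for a single 2-D feature map, not the paper's exact implementation.

```python
import numpy as np

def reweight_unet_features(backbone_feat, skip_feat, b=1.2, s=0.8):
    """Sketch of re-weighting U-Net decoder inputs (assumed factors b, s).

    backbone_feat: features from the main backbone path (scaled up by b).
    skip_feat: features from a skip connection; their high-frequency
    Fourier components are attenuated by s via a crude radial mask.
    """
    boosted = backbone_feat * b  # amplify semantic backbone features
    # attenuate high frequencies of the skip features in Fourier space
    f = np.fft.fftshift(np.fft.fft2(skip_feat))
    h, w = skip_feat.shape
    yy, xx = np.mgrid[0:h, 0:w]
    radius = np.hypot(yy - h / 2, xx - w / 2)
    mask = np.where(radius < min(h, w) // 4, 1.0, s)  # keep lows, damp highs
    filtered = np.real(np.fft.ifft2(np.fft.ifftshift(f * mask)))
    # the decoder would normally concatenate the two feature maps
    return np.stack([boosted, filtered])
```

Because the intervention only rescales existing features at inference time, it requires no retraining, which is what makes the "free lunch" framing apt.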
Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that contributes to the unsatisfactory quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial latent at inference is intrinsically different from that at training, and 2) the denoising process is significantly...
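The frequency-domain gap described above suggests a simple remedy that can be sketched as follows: keep the low spatio-temporal frequencies of a latent derived from a first denoising pass, replace its high frequencies with fresh Gaussian noise, and sample again. The function name, the hard spherical cutoff, and the `cutoff` value are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np

def reinit_latent(z, rng, cutoff=0.25):
    """Sketch of frequency-mixed noise re-initialization (assumptions noted).

    z: a spatio-temporal latent (e.g., frames x height x width) obtained
    from a first sampling pass. Low frequencies of z are kept; high
    frequencies are swapped for fresh Gaussian noise.
    """
    noise = rng.standard_normal(z.shape)
    Fz = np.fft.fftn(z)
    Fn = np.fft.fftn(noise)
    # low-pass mask over all axes: normalized frequency radius < cutoff
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in z.shape], indexing="ij")
    radius = np.sqrt(sum(f**2 for f in freqs))
    lowpass = (radius < cutoff).astype(float)
    mixed = Fz * lowpass + Fn * (1.0 - lowpass)  # lows from z, highs from noise
    return np.real(np.fft.ifftn(mixed))
```

The mixed latent preserves coarse layout and motion (low frequencies) while restoring the high-frequency statistics the model saw during training.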
Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods...
Recent advancements in visual generative models have enabled high-quality image and video generation, opening up diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast,...
The deep learning community has made rapid progress in low-level visual perception tasks such as object localization, detection, and segmentation. However, for tasks such as Visual Question Answering (VQA) and language grounding that require high-level reasoning abilities, huge gaps still exist between artificial systems and human intelligence. In this work, we perform a diagnostic study on recent popular VQA models in terms of analogical reasoning. We term it Analogical VQA, where a system needs to reason over a group of images to find...