Ziqi Huang

ORCID: 0000-0001-8008-5873
Research Areas
  • Generative Adversarial Networks and Image Synthesis
  • Multimodal Machine Learning Applications
  • Face Recognition and Analysis
  • Advanced Vision and Imaging
  • Advanced Image Processing Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neuroimaging Techniques and Applications
  • Video Analysis and Summarization
  • Advanced Image and Video Retrieval Techniques
  • Facial Nerve Paralysis Treatment and Research
  • Video Coding and Compression Technologies
  • Human Pose and Action Recognition
  • Image Processing Techniques and Applications
  • Data Management and Algorithms
  • Model Reduction and Neural Networks
  • Aesthetic Perception and Analysis
  • Handwritten Text Recognition Techniques
  • Visual Attention and Saliency Detection
  • Vehicle License Plate Recognition
  • Cell Image Analysis Techniques
  • Image and Signal Denoising Methods
  • Data Visualization and Analytics
  • Computer Graphics and Visualization Techniques
  • Advanced Data Processing Techniques
  • Image Retrieval and Classification Techniques

University of California, Santa Barbara
2025

Nanyang Technological University
2021-2024

Southern University of Science and Technology
2022

Wuhan University of Technology
2022

The Ohio State University
2015

Diffusion models have recently emerged as a powerful generative tool. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained...

10.1109/cvpr52729.2023.00589 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
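The multi-modal control described in this abstract can be pictured as blending the noise predictions of several pre-trained uni-modal diffusion models with per-pixel influence weights. A minimal sketch, assuming a simple normalized weighted sum (the function name and the weight maps are illustrative, not the paper's actual API):

```python
import numpy as np

def blend_noise_predictions(noise_preds, influence_maps):
    """Blend per-modality noise predictions with normalized influence maps.

    noise_preds:    list of arrays, one noise prediction per uni-modal model
                    (e.g. a text-driven and a mask-driven model).
    influence_maps: list of same-shaped non-negative weight maps; hypothetical
                    stand-ins for learned per-pixel influence functions.
    """
    maps = [np.clip(m, 1e-8, None) for m in influence_maps]
    total = np.sum(maps, axis=0)
    weights = [m / total for m in maps]  # weights sum to 1 at each pixel
    return sum(w * eps for w, eps in zip(weights, noise_preds))
```

With equal influence maps, the result is simply the average of the two models' predictions at every pixel.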

10.1109/cvpr52733.2024.00453 article 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Facial editing is an important task in vision and graphics with numerous applications. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face into a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive facial editing framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key insight is to model a continual "semantic field" in the GAN latent space. 1) Unlike previous works that regard editing as traversing...

10.1109/iccv48922.2021.01354 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01

This work aims to learn a high-quality text-to-video (T2V) generative model by leveraging a pre-trained text-to-image (T2I) model as its basis. It is a highly desirable yet challenging task to simultaneously a) accomplish the synthesis of visually realistic and temporally coherent videos while b) preserving the strong creative generation nature of the pre-trained T2I model. To this end, we propose LaVie, an integrated video generation framework that operates on cascaded video latent diffusion models, comprising a base T2V model, a temporal...

10.48550/arxiv.2309.15103 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images, and existing inversion methods mainly focus on capturing object appearances (i.e., the "look"). However, how to invert relations, another important pillar in the visual world, remains unexplored. In this work, we propose the Relation Inversion task, which aims to learn a specific relation (represented as a "relation...

10.1145/3680528.3687658 article EN 2024-12-03

We present Vchitect-2.0, a parallel transformer architecture designed to scale up video diffusion models for large-scale text-to-video generation. The overall Vchitect-2.0 system has several key designs. (1) By introducing a novel Multimodal Diffusion Block, our approach achieves consistent alignment between text descriptions and generated video frames, while maintaining temporal coherence across sequences. (2) To overcome memory and computational bottlenecks, we propose a Memory-efficient Training...

10.48550/arxiv.2501.08453 preprint EN arXiv (Cornell University) 2025-01-14

The rising prevalence of mental health issues highlights the urgent need for accurate, scalable, and timely prediction systems. Deep learning, a subset of machine learning inspired by the structure of human neurons, has offered an opportunity for innovative solutions in diagnosis. The main idea of this paper is to analyze the application of deep learning in diagnosing mental disorders, including but not limited to Alzheimer's disease, Parkinson's disease, and schizophrenia. An enormous number of techniques will be put into real life while dealing with diagnosis...

10.54254/2755-2721/2025.21185 article EN cc-by Applied and Computational Engineering 2025-02-27

Diffusion models gain increasing popularity for their generative capabilities. Recently, there have been surging needs to generate customized images by inverting diffusion models from exemplar images. However, existing inversion methods mainly focus on capturing object appearances. How to invert relations, another important pillar in the visual world, remains unexplored. In this work, we propose ReVersion for the Relation Inversion task, which aims to learn a specific relation (represented as a "relation prompt")...

10.48550/arxiv.2303.13495 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Multi-modality magnetic resonance (MR) images provide complementary information for disease diagnoses. However, modality missing is quite common in real-life clinical practice. Current methods usually employ a convolution-based generative adversarial network (GAN) or its variants to synthesize the missing modality. With the development of the vision transformer, we explore its application to the MRI synthesis task in this work. We propose a novel supervised deep learning method for synthesizing a missing modality, making use...

10.1109/embc48229.2022.9871183 article EN 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC) 2022-07-11

In this paper, we uncover the untapped potential of the diffusion U-Net, which serves as a "free lunch" that substantially improves generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising, whereas its skip connections mainly introduce high-frequency features into the decoder module, causing the network to overlook the backbone semantics. Capitalizing on this discovery, we propose a simple yet effective method termed...

10.48550/arxiv.2309.11497 preprint EN other-oa arXiv (Cornell University) 2023-01-01
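The backbone/skip-connection imbalance described in this abstract suggests a simple intervention: strengthen backbone features and damp skip-connection features before the decoder fuses them. A hedged sketch of that idea only (the scaling factors `b` and `s` are hypothetical, and the actual method operates inside the U-Net rather than on standalone arrays):

```python
import numpy as np

def reweight_decoder_inputs(backbone_feat, skip_feat, b=1.2, s=0.9):
    """Scale the backbone feature map up (strengthening the denoising
    semantics it carries) and the skip-connection feature map down
    (attenuating the high-frequency detail it injects) before the two
    are combined in the decoder.
    """
    return backbone_feat * b, skip_feat * s
```

Because it only rescales existing feature maps, an intervention of this shape adds no parameters and no extra training, which is what makes it a "free lunch" at inference time.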

Though diffusion-based video generation has witnessed rapid progress, the inference results of existing models still exhibit unsatisfactory temporal consistency and unnatural dynamics. In this paper, we delve deep into the noise initialization of video diffusion models, and discover an implicit training-inference gap that attributes to the unsatisfactory inference quality. Our key findings are: 1) the spatial-temporal frequency distribution of the initial latent at inference is intrinsically different from that at training, and 2) the denoising process significantly...

10.48550/arxiv.2312.07537 preprint EN other-oa arXiv (Cornell University) 2023-01-01
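The frequency-distribution gap described above can be narrowed by re-initializing noise in the frequency domain: keep the low spatial frequencies of a previously denoised latent and refresh the high frequencies with new Gaussian noise. A minimal 2D sketch, assuming a hard low-pass mask and a hand-picked cutoff (both simplifications for illustration):

```python
import numpy as np

def reinitialize_noise(latent, fresh_noise, cutoff=0.25):
    """Blend two arrays in the 2D frequency domain: low spatial
    frequencies come from `latent`, high frequencies from `fresh_noise`.
    The hard mask and the `cutoff` value are illustrative assumptions.
    """
    h, w = latent.shape
    # Move DC to the center so the radial mask is easy to build.
    f_lat = np.fft.fftshift(np.fft.fft2(latent))
    f_noise = np.fft.fftshift(np.fft.fft2(fresh_noise))
    yy, xx = np.mgrid[-(h // 2):h - h // 2, -(w // 2):w - w // 2]
    low_pass = (np.sqrt((yy / h) ** 2 + (xx / w) ** 2) <= cutoff).astype(float)
    blended = f_lat * low_pass + f_noise * (1.0 - low_pass)
    return np.real(np.fft.ifft2(np.fft.ifftshift(blended)))
```

With `cutoff` large enough to cover every frequency, the latent passes through unchanged; with `cutoff=0`, the output is the fresh noise.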

Facial editing is to manipulate the facial attributes of a given face image. Nowadays, with the development of generative models, users can easily generate 2D and 3D facial images with high fidelity and 3D-aware consistency. However, existing works are incapable of delivering a continuous and fine-grained editing mode (e.g., editing a slightly smiling face into a big laughing one) with natural interactions with users. In this work, we propose Talk-to-Edit, an interactive framework that performs fine-grained attribute manipulation through dialog between the user and the system. Our key...

10.1109/tpami.2023.3347299 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-12-26

Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods....

10.48550/arxiv.2411.13503 preprint EN arXiv (Cornell University) 2024-11-20
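Dissecting "video generation quality" into disentangled dimensions implies that any single headline number is just an aggregation of per-dimension scores. A sketch of such an aggregation, where the dimension names and the uniform default weighting are hypothetical (a benchmark of this kind reports each dimension separately):

```python
def aggregate_dimension_scores(scores, weights=None):
    """Weighted average over per-dimension quality scores in [0, 1].

    scores:  dict mapping a dimension name (e.g. a temporal-consistency
             or aesthetics axis) to its score.
    weights: optional dict of per-dimension weights; defaults to uniform.
    """
    if weights is None:
        weights = {dim: 1.0 for dim in scores}
    total = sum(weights[dim] for dim in scores)
    return sum(scores[dim] * weights[dim] for dim in scores) / total
```

Keeping the dimensions disentangled until this final step is the point: two models with the same aggregate can fail in very different ways, which the per-dimension scores make visible.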

Recent advancements in visual generative models have enabled high-quality image and video generation, opening up diverse applications. However, evaluating these models often demands sampling hundreds or thousands of images or videos, making the process computationally expensive, especially for diffusion-based models with inherently slow sampling. Moreover, existing evaluation methods rely on rigid pipelines that overlook specific user needs and provide numerical results without clear explanations. In contrast,...

10.48550/arxiv.2412.09645 preprint EN arXiv (Cornell University) 2024-12-10

The deep learning community has made rapid progress in low-level visual perception tasks such as object localization, detection, and segmentation. However, for tasks such as Visual Question Answering (VQA) and language grounding that require high-level reasoning abilities, huge gaps still exist between artificial systems and human intelligence. In this work, we perform a diagnostic study on recent popular VQA models in terms of analogical reasoning. We term it Analogical VQA, where a system needs to reason over a group of images to find...

10.1109/icip42928.2021.9506539 article EN 2022 IEEE International Conference on Image Processing (ICIP) 2021-08-23

Diffusion models have recently emerged as a powerful generative tool. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained...

10.48550/arxiv.2304.10530 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Video generation has witnessed significant advancements, yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end, we present VBench, a comprehensive benchmark suite that dissects "video generation quality" into specific, hierarchical, and disentangled dimensions, each with tailored prompts and evaluation methods....

10.48550/arxiv.2311.17982 preprint EN other-oa arXiv (Cornell University) 2023-01-01