Zhouxia Wang

ORCID: 0000-0003-4677-5760
Research Areas
  • Advanced Vision and Imaging
  • Generative Adversarial Networks and Image Synthesis
  • Face recognition and analysis
  • Image Retrieval and Classification Techniques
  • Human Pose and Action Recognition
  • Advanced Image Processing Techniques
  • Advanced Neural Network Applications
  • Video Analysis and Summarization
  • Text and Document Classification Technologies
  • Advanced Image and Video Retrieval Techniques
  • Image and Signal Denoising Methods
  • Topic Modeling
  • Computer Graphics and Visualization Techniques
  • Image Enhancement Techniques
  • Facial Nerve Paralysis Treatment and Research
  • Video Coding and Compression Technologies
  • Visual Attention and Saliency Detection
  • Advanced Graph Neural Networks
  • Mental Health via Writing
  • Facial Rejuvenation and Surgery Techniques
  • Advanced Memory and Neural Computing
  • Video Surveillance and Tracking Methods
  • Image Processing and 3D Reconstruction
  • Ferroelectric and Negative Capacitance Devices
  • Human Motion and Animation

University of Hong Kong
2020-2024

Nanyang Technological University
2024

Group Sense (China)
2017-2020

The Sense Innovation and Research Center
2017-2018

Sun Yat-sen University
2017-2018

This paper proposes a novel deep architecture to address multi-label image recognition, a fundamental and practical task towards general visual understanding. Current solutions for this task usually rely on an extra step of extracting hypothesis regions (i.e., region proposals), resulting in redundant computation and sub-optimal performance. In this work, we achieve interpretable and contextualized multi-label classification by developing a recurrent memorized-attention module. This module consists of two alternately performed...

10.1109/iccv.2017.58 article EN 2017-10-01
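The attentional-region idea above can be illustrated with a minimal sketch: soft attention scores over the spatial locations of a feature map, followed by attention-weighted pooling and per-label classification. This is not the paper's recurrent memorized-attention module, only a generic spatial-attention baseline; `w_att` and `w_cls` stand in for learned parameters and are stubbed with random values here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(feat, w_att, w_cls):
    """Soft spatial attention followed by per-label classification.

    feat  : (H, W, C) convolutional feature map
    w_att : (C,)      attention scoring vector (hypothetical, normally learned)
    w_cls : (C, L)    per-label classifier weights (hypothetical, normally learned)
    """
    H, W, C = feat.shape
    scores = feat.reshape(-1, C) @ w_att      # (H*W,) one score per location
    alpha = softmax(scores)                   # attention distribution over locations
    pooled = alpha @ feat.reshape(-1, C)      # (C,) attention-weighted feature
    return pooled @ w_cls                     # (L,) per-label logits

rng = np.random.default_rng(0)
logits = attention_pool(rng.normal(size=(7, 7, 16)),
                        rng.normal(size=16),
                        rng.normal(size=(16, 5)))
```

In a recurrent variant, the pooled feature would additionally feed back into the next attention step, which is the direction the paper's module takes.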

Social relationships (e.g., friends, couple, etc.) form the basis of the social network in our daily life. Automatically interpreting such relationships bears great potential for intelligent systems to understand human behavior in depth and to better interact with people at a social level. Human beings interpret social relationships within a group not only based on the individuals alone; the interplay with the contextual information around them also plays a significant role. However, these additional cues are largely overlooked by previous studies. We found that two...

10.24963/ijcai.2018/142 preprint EN 2018-07-01

We observed that recent state-of-the-art results on single-image human pose estimation were achieved by multi-stage Convolutional Neural Networks (CNNs). Notwithstanding the superior performance on static images, the application of these models to videos is not only computationally intensive, it also suffers from performance degeneration and flicking. Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency and to handle severe image quality degradation (e.g. motion blur and occlusion), as well...

10.1109/cvpr.2018.00546 article EN 2018-06-01

Blind face restoration aims to recover a high-quality face image from one with unknown degradations. As a face image contains abundant contextual information, we propose a method, RestoreFormer, which explores fully-spatial attentions to model this contextual information, surpassing existing works that rely on local operators. RestoreFormer has several benefits compared to prior arts. First, unlike the conventional multi-head self-attention in previous Vision Transformers (ViTs), it incorporates a multi-head cross-attention layer to learn interactions between...

10.1109/cvpr52688.2022.01699 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
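The multi-head cross-attention mentioned above can be sketched in a few lines: queries come from one feature set (e.g., degraded-image features) while keys and values come from another (e.g., a high-quality prior), so every spatial position can attend to the prior. This is a generic illustration with identity projections, not RestoreFormer's actual layer, which uses learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_feat, prior_feat, n_heads=4):
    """Scaled dot-product cross-attention with several heads.

    query_feat : (N, C) flattened spatial features (queries)
    prior_feat : (M, C) flattened prior features (keys and values)
    C must be divisible by n_heads; projections are identity for brevity.
    """
    N, C = query_feat.shape
    d = C // n_heads
    q = query_feat.reshape(N, n_heads, d).transpose(1, 0, 2)   # (h, N, d)
    k = prior_feat.reshape(-1, n_heads, d).transpose(1, 0, 2)  # (h, M, d)
    v = k                                                      # values share the keys here
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))      # (h, N, M)
    out = attn @ v                                             # (h, N, d)
    return out.transpose(1, 0, 2).reshape(N, C)                # merge heads

rng = np.random.default_rng(1)
fused = multi_head_cross_attention(rng.normal(size=(16, 8)),
                                   rng.normal(size=(16, 8)), n_heads=2)
```

The key difference from self-attention is only where `k` and `v` come from; swapping `prior_feat` for `query_feat` recovers ordinary self-attention.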

Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. A well-cultivated training strategy is proposed to separate distinct...

10.1609/aaai.v39i5.32533 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2025-04-11

This paper proposes a novel deep architecture to address multi-label image recognition, a fundamental and practical task towards general visual understanding. Current solutions for this task usually rely on an extra step of extracting hypothesis regions (i.e., region proposals), resulting in redundant computation and sub-optimal performance. In this work, we achieve interpretable and contextualized multi-label classification by developing a recurrent memorized-attention module. This module consists of two alternately performed...

10.48550/arxiv.1711.02816 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Blind face restoration aims at recovering high-quality face images from those with unknown degradations. Current algorithms mainly introduce priors to complement high-quality details and achieve impressive progress. However, most of these algorithms ignore the abundant contextual information in the face and its interplay with the priors, leading to sub-optimal performance. Moreover, they pay less attention to the gap between synthetic and real-world scenarios, limiting their robustness and generalization in real-world applications. In this work, we propose RestoreFormer++, which...

10.1109/tpami.2023.3315753 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-09-15

This paper presents a LoRA-free method for stylized image generation that takes a text prompt and style reference images as inputs and produces an output image in a single pass. Unlike existing methods that rely on training a separate LoRA for each style, our method can adapt to various styles with a unified model. However, this poses two challenges: 1) the prompt loses controllability over the generated content, and 2) the output inherits both the semantic and style features of the reference image, compromising its content fidelity. To address these challenges, we introduce...

10.48550/arxiv.2309.01770 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Automatically selecting an exposure bracketing (images exposed differently) is important for obtaining a high-dynamic-range image via multi-exposure fusion. Unlike previous methods that impose many restrictions, such as requiring a camera response function, a sensor noise model, and a stream of preview images with different exposures (not accessible in some scenarios, e.g. mobile applications), we propose a novel deep neural network to automatically select exposure bracketing, named EBSNet, which sufficiently...

10.1109/cvpr42600.2020.00189 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
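For context, the multi-exposure fusion that bracketing selection feeds into can be sketched with a Mertens-style well-exposedness weight: each exposure contributes most where its pixels sit near mid-gray. This is a textbook illustration of the downstream fusion step, not EBSNet itself.

```python
import numpy as np

def well_exposedness(img, sigma=0.2):
    """Gaussian weight peaking at mid-gray (0.5); img values in [0, 1]."""
    return np.exp(-((img - 0.5) ** 2) / (2 * sigma ** 2))

def fuse_exposures(stack):
    """Per-pixel weighted average of an exposure stack of shape (K, H, W)."""
    w = well_exposedness(stack) + 1e-8        # tiny epsilon avoids divide-by-zero
    w = w / w.sum(axis=0, keepdims=True)      # normalize weights over exposures
    return (w * stack).sum(axis=0)            # (H, W) fused image

stack = np.stack([np.full((2, 2), 0.1),      # under-exposed frame
                  np.full((2, 2), 0.5),      # well-exposed frame
                  np.full((2, 2), 0.9)])     # over-exposed frame
fused = fuse_exposures(stack)                # dominated by the mid-exposure frame
```

Full pipelines add contrast and saturation weights and blend in a Laplacian pyramid; the well-exposedness term alone already shows why the choice of bracketing matters.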

Social relationships (e.g., friends, couple, etc.) form the basis of the social network in our daily life. Automatically interpreting such relationships bears great potential for intelligent systems to understand human behavior in depth and to better interact with people at a social level. Human beings interpret social relationships within a group not only based on the individuals alone; the interplay with the contextual information around them also plays a significant role. However, these additional cues are largely overlooked by previous studies. We found that two...

10.48550/arxiv.1807.00504 preprint EN other-oa arXiv (Cornell University) 2018-01-01

In the past few years, we witnessed rapid advancement in face super-resolution from very low resolution (VLR) images. However, most previous studies focus on solving this problem without explicitly considering the impact of severe real-life image degradation (e.g. blur and noise). We show that robustly recovering facial details from VLR images is a task beyond the ability of current state-of-the-art methods. In this paper, we borrow ideas from "facial composite" and propose an alternative approach to tackle this problem. We endow...

10.1109/ictai.2019.00079 article EN 2019-11-01

Due to the limitation of event sensors, the spatial resolution of event data is relatively low compared to that of conventional frame-based cameras. However, the low-spatial-resolution events recorded by event cameras are rich in temporal information, which is helpful for image deblurring, while the intensity images captured by frame cameras have high spatial resolution and the potential to promote the quality of events. Considering the complementarity between events and images, an alternately performed model is proposed in this paper to deblur high-resolution images with the help of low-resolution events. This model is composed...

10.3390/electronics11040631 article EN Electronics 2022-02-18
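The temporal richness of events mentioned above is commonly exploited by accumulating signed polarities into a per-pixel intensity-increment map, which deblurring models then consume. A minimal sketch of that accumulation step (timestamps dropped for brevity; this is standard event preprocessing, not the paper's alternating model):

```python
import numpy as np

def accumulate_events(events, shape):
    """Sum signed event polarities into a per-pixel increment map.

    events : iterable of (x, y, polarity) tuples with polarity in {-1, +1}
    shape  : (H, W) resolution of the event sensor
    """
    acc = np.zeros(shape)
    for x, y, p in events:
        acc[y, x] += p          # positive events brighten, negative darken
    return acc

# Two ON events at (0, 0) and one OFF event at (1, 1)
ev = [(0, 0, +1), (0, 0, +1), (1, 1, -1)]
m = accumulate_events(ev, (2, 2))
```

Real pipelines bin events into short time windows (voxel grids) so the temporal ordering survives; the flat sum above is the simplest degenerate case.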

Diffusion models have demonstrated remarkable capability in video generation, which further sparks interest in introducing trajectory control into the generation process. While existing works mainly focus on training-based methods (e.g., conditional adapters), we argue that the diffusion model itself allows decent control over the generated content without requiring any training. In this study, we introduce a tuning-free framework to achieve trajectory-controllable video generation by imposing guidance on both noise construction and...

10.48550/arxiv.2406.16863 preprint EN arXiv (Cornell University) 2024-06-24
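One simple way to impose guidance through noise construction, as described above, is to re-plant the same initial-noise patch along the desired trajectory across frames, so the sampler's correlated starting point biases content toward that path. The following toy sketch illustrates that idea under my own simplifying assumptions; it is not the paper's exact algorithm.

```python
import numpy as np

def trajectory_noise(n_frames, H, W, boxes, seed=0):
    """Build initial video noise whose patch inside a moving box is shared
    across frames (a simplified form of noise-construction guidance).

    boxes : list of (top, left, h, w) per frame; all boxes share h and w.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(size=(n_frames, H, W))       # independent base noise
    t0, l0, h, w = boxes[0]
    patch = noise[0, t0:t0 + h, l0:l0 + w].copy()   # anchor-frame patch
    for f, (t, l, _, _) in enumerate(boxes):
        noise[f, t:t + h, l:l + w] = patch          # re-plant along the path
    return noise

# A 2x2 box sliding diagonally across three frames
boxes = [(0, 0, 2, 2), (1, 1, 2, 2), (2, 2, 2, 2)]
z = trajectory_noise(3, 8, 8, boxes)
```

In practice such methods also steer attention maps toward the boxes during sampling; the noise prior alone is the cheapest half of the trick.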

Filmmaking and animation production often require sophisticated techniques for coordinating camera transitions and object movements, typically involving labor-intensive real-world capturing. Despite advancements in generative AI for video creation, achieving precise control over motion for interactive video asset generation remains challenging. To this end, we propose Image Conductor, a method for precise control of camera transitions and object movements to generate video assets from a single image. A well-cultivated training strategy is proposed to separate distinct...

10.48550/arxiv.2406.15339 preprint EN arXiv (Cornell University) 2024-06-21

Although deep learning-based image restoration methods have made significant progress, they still struggle with limited generalization to real-world scenarios due to the substantial domain gap caused by training on synthetic data. Existing methods address this issue by improving data synthesis pipelines, estimating degradation kernels, employing deep internal learning, and performing domain adaptation and regularization. Previous methods sought to bridge the domain gap by learning domain-invariant knowledge in either feature or pixel space. However,...

10.48550/arxiv.2406.18516 preprint EN arXiv (Cornell University) 2024-06-26

Recent progress in blind face restoration has resulted in producing high-quality restored results for static images. However, efforts to extend these advancements to video scenarios have been minimal, partly because of the absence of benchmarks that allow a comprehensive and fair comparison. In this work, we first present a fair evaluation benchmark, in which we introduce a Real-world Low-Quality Face Video benchmark (RFV-LQ), evaluate several leading image-based algorithms, and conduct a thorough and systematical analysis...

10.1109/tip.2024.3463414 article EN IEEE Transactions on Image Processing 2024-01-01

We observed that recent state-of-the-art results on single-image human pose estimation were achieved by multi-stage Convolutional Neural Networks (CNNs). Notwithstanding the superior performance on static images, the application of these models to videos is not only computationally intensive, it also suffers from performance degeneration and flicking. Such suboptimal results are mainly attributed to the inability to impose sequential geometric consistency and to handle severe image quality degradation (e.g. motion blur and occlusion), as well...

10.48550/arxiv.1712.06316 preprint EN other-oa arXiv (Cornell University) 2017-01-01

Blind face restoration aims at recovering high-quality face images from those with unknown degradations. Current algorithms mainly introduce priors to complement high-quality details and achieve impressive progress. However, most of these algorithms ignore the abundant contextual information in the face and its interplay with the priors, leading to sub-optimal performance. Moreover, they pay less attention to the gap between synthetic and real-world scenarios, limiting their robustness and generalization in real-world applications. In this work, we propose RestoreFormer++, which...

10.48550/arxiv.2308.07228 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Motions in a video primarily consist of camera motion, induced by camera movement, and object motion, resulting from object movement. Accurate control of both camera and object motion is essential for video generation. However, existing works either mainly focus on one type of motion or do not clearly distinguish between the two, limiting their control capabilities and diversity. Therefore, this paper presents MotionCtrl, a unified and flexible motion controller for video generation designed to effectively and independently control camera and object motion. The architecture and training strategy of MotionCtrl are...

10.48550/arxiv.2312.03641 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Recovering degraded low-resolution text images is challenging, especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently, diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work, we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. For...

10.48550/arxiv.2312.08886 preprint EN other-oa arXiv (Cornell University) 2023-01-01

Automatically selecting an exposure bracketing (images exposed differently) is important for obtaining a high-dynamic-range image via multi-exposure fusion. Unlike previous methods that impose many restrictions, such as requiring a camera response function, a sensor noise model, and a stream of preview images with different exposures (not accessible in some scenarios, e.g. mobile applications), we propose a novel deep neural network to automatically select exposure bracketing, named EBSNet, which sufficiently...

10.48550/arxiv.2005.12536 preprint EN other-oa arXiv (Cornell University) 2020-01-01