Xuansong Xie

ORCID: 0000-0002-3671-799X
Research Areas
  • Generative Adversarial Networks and Image Synthesis
  • Advanced Vision and Imaging
  • Advanced Image Processing Techniques
  • Image Enhancement Techniques
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Video Surveillance and Tracking Methods
  • Face recognition and analysis
  • Human Pose and Action Recognition
  • Image Processing Techniques and Applications
  • Video Analysis and Summarization
  • 3D Shape Modeling and Analysis
  • Human Motion and Animation
  • Computer Graphics and Visualization Techniques
  • Visual Attention and Saliency Detection
  • Anomaly Detection Techniques and Applications
  • Domain Adaptation and Few-Shot Learning
  • Image Retrieval and Classification Techniques
  • Image and Signal Denoising Methods
  • Autonomous Vehicle Technology and Safety
  • Multimodal Machine Learning Applications
  • Image and Video Quality Assessment
  • Digital Media Forensic Detection
  • Face and Expression Recognition
  • Machine Learning and Data Classification

Alibaba Group (China)
2019-2024

Alibaba Group (Cayman Islands)
2019-2024

Alibaba Group (United States)
2019-2024

Beijing University of Posts and Telecommunications
2023

ETH Zurich
2022

Blind face restoration (BFR) from severely degraded face images in the wild is a very challenging problem. Due to the high ill-posedness of the problem and the complex unknown degradation, directly training a deep neural network (DNN) usually cannot lead to acceptable results. Existing generative adversarial network (GAN) based methods can produce better results but tend to generate over-smoothed restorations. In this work, we propose a new method by first learning a GAN for high-quality face image generation and embedding it into a U-shaped...

10.1109/cvpr46437.2021.00073 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
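The idea of embedding a pretrained GAN generator as the decoder of a U-shaped restoration network can be sketched as below. This is a toy numpy illustration, not the paper's architecture: the "convolutions" are 1x1 channel mixes, and `gan_prior_decoder` stands in for a frozen generator whose stages are modulated by encoder skip features.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv(x, w):
    # 1x1 "convolution" as a channel-mixing matmul: (C_out, C_in) applied to (C, H, W)
    return np.einsum("oc,chw->ohw", w, x)

def encoder(img, weights):
    """U-shaped encoder: downsample by 2 at each stage, keep skip features."""
    skips, x = [], img
    for w in weights:
        x = np.maximum(conv(x, w), 0.0)          # conv + ReLU
        skips.append(x)
        x = x[:, ::2, ::2]                        # stride-2 downsampling
    return x, skips

def gan_prior_decoder(latent, skips, weights):
    """Stand-in for a frozen GAN generator used as decoder; encoder skip
    features modulate each resolution (a sketch of the embedding idea)."""
    x = latent
    for w, s in zip(weights, reversed(skips)):
        x = np.repeat(np.repeat(x, 2, axis=1), 2, axis=2)  # nearest upsample
        x = np.maximum(conv(x, w) + s, 0.0)       # inject skip feature
    return x

C, H = 4, 16
img = rng.standard_normal((C, H, H))              # degraded input (toy)
enc_w = [rng.standard_normal((C, C)) * 0.1 for _ in range(2)]
dec_w = [rng.standard_normal((C, C)) * 0.1 for _ in range(2)]
latent, skips = encoder(img, enc_w)
restored = gan_prior_decoder(latent, skips, dec_w)
print(restored.shape)  # (4, 16, 16)
```

In the real method the decoder weights come from a generator pretrained for high-quality face synthesis and supply the prior that prevents over-smoothing.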

Neural style transfer has drawn considerable attention from both academic and industrial fields. Although the visual effect and efficiency have been significantly improved, existing methods are unable to coordinate the spatial distribution of visual attention between the content image and the stylized image, or to render diverse levels of detail via different brush strokes. In this paper, we tackle these limitations by developing an attention-aware multi-stroke style transfer model. We first propose to assemble a self-attention mechanism into a...

10.1109/cvpr.2019.00156 preprint EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
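The self-attention building block mentioned above is standard scaled dot-product attention over spatial feature vectors. A minimal numpy sketch (the projection weights and shapes here are illustrative, not the paper's):

```python
import numpy as np

def self_attention(x, wq, wk, wv):
    """Scaled dot-product self-attention over N feature vectors: each
    position re-weights all others, letting the network coordinate
    attention across the whole image."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ v

rng = np.random.default_rng(1)
n, d = 6, 8                     # 6 spatial positions, 8 channels
x = rng.standard_normal((n, d))
wq, wk, wv = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
out = self_attention(x, wq, wk, wv)
print(out.shape)  # (6, 8)
```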

Image inpainting is an underdetermined inverse problem, which naturally allows diverse contents to fill up the missing or corrupted regions realistically. Prevalent approaches using convolutional neural networks (CNNs) can synthesize visually pleasant contents, but CNNs suffer from limited perception fields for capturing global features. With image-level attention, transformers can model long-range dependencies and generate diverse contents with autoregressive modeling of pixel-sequence distributions....

10.1145/3474085.3475436 article EN Proceedings of the 29th ACM International Conference on Multimedia 2021-10-17

Semantic human matting aims to estimate the per-pixel opacity of foreground human regions. It is quite challenging and usually requires user-interactive trimaps and plenty of high-quality annotated data. Annotating such data is labor intensive and requires great skill beyond that of normal users, especially considering the very detailed hair regions of humans. In contrast, coarsely annotated data is much easier to acquire and collect from public datasets. In this paper, we propose to leverage coarse annotated data coupled with fine annotations to boost end-to-end semantic human matting without trimaps as extra...

10.1109/cvpr42600.2020.00859 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01

Image inpainting aims to complete the missing or corrupted regions of images with realistic contents. The prevalent approaches adopt a hybrid objective of reconstruction and perceptual quality by using generative adversarial networks. However, the reconstruction loss and adversarial loss focus on synthesizing contents of different frequencies, and simply applying them together often leads to inter-frequency conflicts and compromised inpainting. This paper presents WaveFill, a wavelet-based inpainting network that decomposes images into multiple frequency bands and fills...

10.1109/iccv48922.2021.01385 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
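The wavelet decomposition underlying this kind of band-wise inpainting can be shown with a one-level 2D Haar transform: the image splits into a low-frequency band and three high-frequency bands that can be processed separately and then merged back losslessly. A self-contained numpy sketch (the normalization is one common choice, not necessarily WaveFill's):

```python
import numpy as np

def haar_decompose(img):
    """One-level 2D Haar transform: split an image into a low-frequency
    band (LL) and high-frequency bands (LH, HL, HH)."""
    a, b = img[0::2, :], img[1::2, :]
    lo, hi = (a + b) / 2.0, (a - b) / 2.0
    ll, lh = (lo[:, 0::2] + lo[:, 1::2]) / 2.0, (lo[:, 0::2] - lo[:, 1::2]) / 2.0
    hl, hh = (hi[:, 0::2] + hi[:, 1::2]) / 2.0, (hi[:, 0::2] - hi[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def haar_reconstruct(ll, lh, hl, hh):
    """Exact inverse of haar_decompose for this normalization."""
    lo = np.empty((ll.shape[0], ll.shape[1] * 2))
    hi = np.empty_like(lo)
    lo[:, 0::2], lo[:, 1::2] = ll + lh, ll - lh
    hi[:, 0::2], hi[:, 1::2] = hl + hh, hl - hh
    img = np.empty((lo.shape[0] * 2, lo.shape[1]))
    img[0::2, :], img[1::2, :] = lo + hi, lo - hi
    return img

rng = np.random.default_rng(2)
img = rng.standard_normal((8, 8))
bands = haar_decompose(img)
# Each band could now be completed by its own generator before merging.
print(np.allclose(haar_reconstruct(*bands), img))  # True
```

Filling each band with a separate sub-network is what lets the method apply reconstruction-style losses to low frequencies and adversarial-style losses to high frequencies without the two fighting over the same pixels.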

This paper reviews the NTIRE 2022 challenge on efficient single image super-resolution with a focus on the proposed solutions and results. The task of the challenge was to super-resolve an input image with a magnification factor of ×4 based on pairs of low and corresponding high resolution images. The aim was to design a network for single image super-resolution that achieved improved efficiency measured according to several metrics including runtime, parameters, FLOPs, activations, and memory consumption while at least maintaining a PSNR of 29.00dB on the DIV2K validation set. IMDN is set as...

10.1109/cvprw56347.2022.00118 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2022-06-01

This paper proposes a novel active boundary loss for semantic segmentation. It can progressively encourage the alignment between predicted boundaries and ground-truth boundaries during end-to-end training, which is not explicitly enforced by the commonly used cross-entropy loss. Based on the boundaries detected from the segmentation results using the current network parameters, we formulate the boundary alignment problem as a differentiable direction vector prediction problem to guide the movement of predicted boundaries in each iteration. Our loss is model-agnostic and can be plugged into the training of segmentation networks...

10.1609/aaai.v36i2.20139 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28

Recent attention in instance segmentation has focused on query-based models. Despite being non-maximum suppression (NMS)-free and end-to-end, the superiority of these models on high-accuracy real-time benchmarks has not been well demonstrated. In this paper, we show the strong potential of query-based models through efficient algorithm designs. We present FastInst, a simple, effective framework for real-time instance segmentation. FastInst can execute at real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells...

10.1109/cvpr52729.2023.02266 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Limited by the low-dimensional representational capacity of 3DMM, most 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or nonlinear operations; however, the results are still not vivid. To this end, we in this paper present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image. Specifically, we implement geometry disentanglement...

10.1109/cvpr52729.2023.00046 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Existing Visual Object Tracking (VOT) methods only take the target area in the first frame as a template. This causes tracking to inevitably fail in fast-changing and crowded scenes, as it cannot account for changes in object appearance between frames. To this end, we revamped the tracking framework with the Progressive Context Encoding Transformer Tracker (ProContEXT), which coherently exploits spatial and temporal contexts to predict object motion trajectories. Specifically, ProContEXT leverages a context-aware self-attention module to encode...

10.1109/icassp49357.2023.10094971 article EN ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023-05-05

Semantic segmentation on driving-scene images is vital for autonomous driving. Although encouraging performance has been achieved on daytime images, the results on nighttime images are less satisfactory due to insufficient exposure and the lack of labeled data. To address these issues, we present an add-on module called dual image-adaptive learnable filters (DIAL-Filters) to improve semantic segmentation in driving conditions, aiming at exploiting the intrinsic features of driving-scene images under different illuminations. DIAL-Filters consist of two parts,...

10.1109/tcsvt.2023.3260240 article EN IEEE Transactions on Circuits and Systems for Video Technology 2023-03-22
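The notion of an image-adaptive learnable filter can be illustrated with a deliberately tiny example: a "predictor" maps global image statistics to a filter parameter (here a gamma value), which is then applied to the whole image. This is only a sketch in the spirit of such filters; the function name, the statistics, and the hand-set weights are all illustrative, not the paper's design.

```python
import numpy as np

def adaptive_gamma(img, predictor_w):
    """Image-adaptive filtering sketch: a tiny predictor maps global
    statistics of the input to a per-image gamma, then applies it."""
    stats = np.array([img.mean(), img.std()])
    gamma = np.exp(stats @ predictor_w)       # positive, image-dependent
    return np.clip(img, 1e-6, 1.0) ** gamma

rng = np.random.default_rng(6)
night = rng.uniform(0.0, 0.3, size=(8, 8))   # under-exposed toy image
w = np.array([-2.0, 0.0])                     # hand-set weights for the demo
out = adaptive_gamma(night, w)
print(out.mean() > night.mean())  # True: gamma < 1 lifts dark pixels
```

In the actual method the filter parameters are predicted by small learnable sub-networks trained jointly with the segmentation model, so dark and bright scenes receive different enhancement.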

Image colorization is a challenging problem due to multi-modal uncertainty and high ill-posedness. Directly training a deep neural network usually leads to incorrect semantic colors and low color richness. While transformer-based methods can deliver better results, they often rely on manually designed priors, suffer from poor generalization ability, and introduce color bleeding effects. To address these issues, we propose DDColor, an end-to-end method with dual decoders for image colorization. Our approach...

10.1109/iccv51070.2023.00037 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

This paper reviews the video colorization challenge of the New Trends in Image Restoration and Enhancement (NTIRE) workshop, held in conjunction with CVPR 2023. The target of this challenge is converting grayscale videos into color videos with better performance and temporal consistency. The challenge consists of two tracks. For Track 1, the goal is achieving the best FID (Fréchet Inception Distance) while being constrained to maintain or improve over the baseline method in terms of the temporal-consistency metric. The Color Distribution Consistency (CDC) index is used as...

10.1109/cvprw59228.2023.00159 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) 2023-06-01

Despite the great success of GANs in image translation with different conditioned inputs such as semantic segmentation maps and edge maps, generating high-fidelity realistic images with reference styles remains a grand challenge in conditional image-to-image translation. This paper presents a general image translation framework that incorporates optimal transport for feature alignment between conditional inputs and style exemplars. The introduction of optimal transport mitigates the constraint of many-to-one feature matching significantly while building up accurate semantic correspondences...

10.1109/cvpr46437.2021.01478 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
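Optimal transport for feature alignment is commonly solved with entropy-regularized Sinkhorn iterations, which relax hard many-to-one matching into a soft coupling whose marginals match the two feature distributions. A minimal numpy sketch under that assumption (the feature sets, cost, and hyperparameters below are illustrative, not the paper's):

```python
import numpy as np

def sinkhorn(cost, a, b, eps=0.05, iters=200):
    """Entropy-regularized optimal transport (Sinkhorn iterations):
    returns a coupling P with row marginals a and column marginals b,
    giving every content feature a soft assignment over style features."""
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    for _ in range(iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

rng = np.random.default_rng(3)
content = rng.standard_normal((5, 4))     # 5 content features (toy)
style = rng.standard_normal((7, 4))       # 7 style-exemplar features (toy)
cost = ((content[:, None, :] - style[None, :, :]) ** 2).sum(-1)
cost = cost / cost.max()                   # normalize for numerical stability
a = np.full(5, 1 / 5)
b = np.full(7, 1 / 7)
P = sinkhorn(cost, a, b)
print(np.allclose(P.sum(axis=1), a, atol=1e-6))  # True: marginals respected
```

The coupling P can then be used to warp or aggregate style features onto content positions instead of picking a single nearest match per position.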

Human pose estimation is a challenging task due to its structured data sequence nature. Existing methods primarily focus on pair-wise interactions of body joints, which is insufficient for scenarios involving overlapping joints and rapidly changing poses. To overcome these issues, we introduce a novel approach, the High-order Directed Transformer (HDFormer), which leverages high-order bone and joint relationships for improved pose estimation. Specifically, HDFormer incorporates both self-attention and high-order attention...

10.24963/ijcai.2023/65 article EN 2023-08-01

Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks despite their intrinsic data-hungry nature. However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approaches and hard sample mining strategies are incompatible with ViT-based FR backbones due to the lack of tailored consideration of preserving...

10.1109/iccv51070.2023.01887 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

Illumination estimation from a single image is critical in 3D rendering and has been investigated extensively in the computer vision and graphics research communities. However, existing works estimate illumination by either regressing light parameters or generating illumination maps, which are often hard to optimize or tend to produce inaccurate predictions. We propose Earth Mover's Light (EMLight), an illumination estimation framework that leverages a regression network and a neural projector for accurate illumination estimation. We decompose the illumination map into spherical...

10.1609/aaai.v35i4.16440 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2021-05-18

Different from general photo retouching tasks, portrait photo retouching (PPR), which aims to enhance the visual quality of a collection of flat-looking portrait photos, has its own special and practical requirements such as human-region priority (HRP) and group-level consistency (GLC). HRP requires that more attention be paid to human regions, while GLC requires that a group of photos be retouched to a consistent tone. Models trained on existing datasets, however, can hardly meet these requirements of PPR. To facilitate research on this high-frequency task, we...

10.1109/cvpr46437.2021.00071 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

Skeleton-based action recognition aims to recognize human actions given joint coordinates with skeletal interconnections. By defining a graph with joints as vertices and their natural connections as edges, previous works successfully adopted Graph Convolutional Networks (GCNs) to model joint co-occurrences and achieved superior performance. More recently, a limitation of GCNs was identified, i.e., the topology is fixed after training. To relax such a restriction, the Self-Attention (SA) mechanism has been used to make the topology adaptive...

10.48550/arxiv.2211.09590 preprint EN other-oa arXiv (Cornell University) 2022-01-01
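A single graph-convolution layer over a skeleton, with an optional learned adjacency term added to relax the fixed topology, can be sketched as follows. This is a generic GCN illustration under that framing, not the paper's model; the 5-joint chain and all weights are toy values.

```python
import numpy as np

def normalize_adj(A):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of an adjacency matrix."""
    A = A + np.eye(A.shape[0])
    inv = 1.0 / np.sqrt(A.sum(axis=1))
    return inv[:, None] * A * inv[None, :]

def gcn_layer(X, A_hat, W, B=None):
    """One graph convolution over joints; B is an optional learned
    (e.g. attention-derived) adjacency added to the fixed skeleton graph."""
    A = A_hat if B is None else A_hat + B
    return np.maximum(A @ X @ W, 0.0)

# A toy 5-joint chain skeleton: joint i connects to joint i+1.
A = np.zeros((5, 5))
for i in range(4):
    A[i, i + 1] = A[i + 1, i] = 1.0

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3))           # 3-D coordinates per joint
W = rng.standard_normal((3, 8)) * 0.5     # lift to 8 channels
out = gcn_layer(X, normalize_adj(A), W)
print(out.shape)  # (5, 8)
```

With B produced by a self-attention module, the effective graph can differ per sample, which is exactly the adaptivity the fixed adjacency lacks.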

10.1109/wacv61041.2025.00255 article EN 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) 2025-02-26

The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving robustness to noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains open. In this paper, we propose a noise-resistant training technique for DML, which we name Probabilistic Ranking-based Instance Selection with Memory (PRISM). PRISM identifies noisy data in a minibatch using average similarity against image features extracted by several previous...

10.1109/cvpr46437.2021.00674 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
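The core filtering idea, scoring each sample by its average similarity to remembered features of its labeled class and discarding the lowest-scoring fraction, can be sketched in numpy. This is a simplified illustration of the selection step only (the function name, the keep-ratio heuristic, and the toy data are assumptions, and the probabilistic ranking of the real method is omitted):

```python
import numpy as np

def select_clean(feats, labels, memory, keep_ratio=0.8):
    """Flag likely-noisy samples: for each sample, compute average cosine
    similarity to memory-bank features of its labeled class, then keep
    the highest-similarity fraction of the minibatch."""
    feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    scores = []
    for f, y in zip(feats, labels):
        m = memory[y] / np.linalg.norm(memory[y], axis=1, keepdims=True)
        scores.append((m @ f).mean())       # mean cosine similarity
    k = int(np.ceil(keep_ratio * len(scores)))
    keep = np.argsort(-np.array(scores))[:k]
    return np.sort(keep)

rng = np.random.default_rng(5)
memory = {0: rng.standard_normal((10, 4)) + np.array([3, 0, 0, 0]),
          1: rng.standard_normal((10, 4)) + np.array([0, 3, 0, 0])}
feats = np.vstack([rng.standard_normal((4, 4)) + np.array([3, 0, 0, 0]),
                   rng.standard_normal((1, 4)) + np.array([0, 3, 0, 0])])
labels = [0, 0, 0, 0, 0]        # last sample is mislabeled as class 0
kept = select_clean(feats, labels, memory, keep_ratio=0.8)
print(len(kept))  # 4 samples survive the filter
```

Only the kept samples would then contribute to the metric-learning loss, while the memory bank is updated from past minibatch features.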

Leveraging StyleGAN's expressivity and its disentangled latent codes, existing methods can achieve realistic editing of different visual attributes, such as the age and gender of facial images. An intriguing yet challenging problem arises: can generative models synthesize counterfactual images against their learnt priors? Due to the lack of counterfactual samples in natural datasets, we investigate this problem in a text-driven manner with Contrastive-Language-Image-Pretraining (CLIP), which can offer rich semantic knowledge even for various counterfactual concepts....

10.1145/3503161.3547935 article EN Proceedings of the 30th ACM International Conference on Multimedia 2022-10-10

In the realm of autonomous driving, real-time perception, or streaming perception, remains under-explored. This research introduces DAMO-StreamNet, a novel framework that merges cutting-edge elements of the YOLO series with a detailed examination of spatial and temporal perception techniques. DAMO-StreamNet's main innovations include: (1) a robust neck structure employing deformable convolution, bolstering the receptive field and feature alignment capabilities; (2) a dual-branch structure synthesizing short-path semantic features and long-path temporal features,...

10.24963/ijcai.2023/90 article EN 2023-08-01