Ying Shan

ORCID: 0000-0001-7673-8325

Research Areas
  • Advanced Vision and Imaging
  • Multimodal Machine Learning Applications
  • Advanced Image and Video Retrieval Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Computer Graphics and Visualization Techniques
  • Domain Adaptation and Few-Shot Learning
  • Advanced Image Processing Techniques
  • Advanced Neural Network Applications
  • 3D Shape Modeling and Analysis
  • Music and Audio Processing
  • Human Motion and Animation
  • Topic Modeling
  • Face recognition and analysis
  • Natural Language Processing Techniques
  • Video Surveillance and Tracking Methods
  • Image Retrieval and Classification Techniques
  • Speech and Audio Processing
  • Anomaly Detection Techniques and Applications
  • Image Processing and 3D Reconstruction
  • Image Processing Techniques and Applications
  • Robotics and Sensor-Based Localization
  • Image and Signal Denoising Methods
  • Visual Attention and Saliency Detection

Tencent (China)
2020-2025

Peking University Shenzhen Hospital
2022-2025

Second Hospital of Shandong University
2025

Shandong Center for Disease Control and Prevention
2025

Shandong University
2025

Zhejiang University
2025

Beijing Jishuitan Hospital
2025

Capital Medical University
2025

Peking University
2022-2025

Fudan University
2022-2024

Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts...

10.1109/iccvw54120.2021.00217 article EN 2021-10-01
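
To make the high-order degradation idea concrete, here is a minimal NumPy/OpenCV sketch under assumed, illustrative parameters (kernel sizes, noise levels, and JPEG quality ranges are placeholders, not the paper's settings): the classical blur-resize-noise-JPEG pipeline is simply applied more than once.

```python
import cv2
import numpy as np

def degrade_once(img: np.ndarray) -> np.ndarray:
    """One round of the classical pipeline: blur -> resize -> noise -> JPEG."""
    img = cv2.GaussianBlur(img, (7, 7), sigmaX=np.random.uniform(0.2, 3.0))
    scale = np.random.uniform(0.25, 1.0)
    h, w = img.shape[:2]
    img = cv2.resize(img, (max(1, int(w * scale)), max(1, int(h * scale))))
    img = np.clip(img + np.random.normal(0, np.random.uniform(1, 25), img.shape), 0, 255)
    ok, buf = cv2.imencode(".jpg", img.astype(np.uint8),
                           [cv2.IMWRITE_JPEG_QUALITY, int(np.random.randint(30, 95))])
    return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED).astype(np.float32)

def high_order_degrade(hr: np.ndarray, order: int = 2) -> np.ndarray:
    """Apply the classical pipeline repeatedly ("high-order" modeling)."""
    lr = hr.astype(np.float32)
    for _ in range(order):
        lr = degrade_once(lr)
    return lr
```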

Manually crafted combinatorial features have been the "secret sauce" behind many successful models. For web-scale applications, however, the variety and volume of features make these manually crafted features expensive to create, maintain, and deploy. This paper proposes the Deep Crossing model, which is a deep neural network that automatically combines features to produce superior models. The input is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well...

10.1145/2939672.2939704 article EN 2016-08-08
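
A hedged PyTorch sketch of the architecture the abstract outlines (embeddings for sparse features, a stacking layer, residual units, a scoring layer); layer sizes here are placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class ResidualUnit(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))
    def forward(self, x):
        return torch.relu(x + self.net(x))  # identity shortcut, then ReLU

class DeepCrossing(nn.Module):
    def __init__(self, sparse_cardinalities, dense_dim, emb_dim=32, hidden=256, n_units=3):
        super().__init__()
        # one embedding per sparse (one-hot) feature
        self.embeddings = nn.ModuleList(nn.Embedding(c, emb_dim) for c in sparse_cardinalities)
        stacked = emb_dim * len(sparse_cardinalities) + dense_dim
        self.residuals = nn.Sequential(*[ResidualUnit(stacked, hidden) for _ in range(n_units)])
        self.score = nn.Linear(stacked, 1)
    def forward(self, sparse_idx, dense):
        # stacking layer: concatenate embedded sparse features with dense ones
        embs = [emb(sparse_idx[:, i]) for i, emb in enumerate(self.embeddings)]
        x = torch.cat(embs + [dense], dim=1)
        return torch.sigmoid(self.score(self.residuals(x)))  # e.g., click probability
```

The residual units are what let the network discover feature crossings implicitly, replacing hand-built combinatorial features.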

Blind face restoration usually relies on facial priors, such as a geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer an accurate geometric prior, while high-quality references are often inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN, which leverages the rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via...

10.1109/cvpr46437.2021.00905 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01
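
The abstract is truncated before the integration mechanism; as a hedged illustration, the sketch below shows a channel-split spatial feature transform (SFT) layer, the kind of modulation used to inject GAN-prior features while an identity branch preserves fidelity. Channel splits and conv configurations are assumptions.

```python
import torch
import torch.nn as nn

class ChannelSplitSFT(nn.Module):
    def __init__(self, channels: int, prior_channels: int):
        super().__init__()
        self.to_scale = nn.Conv2d(prior_channels, channels // 2, 3, padding=1)
        self.to_shift = nn.Conv2d(prior_channels, channels // 2, 3, padding=1)
    def forward(self, feat, prior_feat):
        # split: one half passes through untouched (fidelity), the other half
        # is spatially modulated by the generative prior (realness)
        identity, modulated = feat.chunk(2, dim=1)
        scale = self.to_scale(prior_feat)
        shift = self.to_shift(prior_feat)
        modulated = modulated * (1 + scale) + shift
        return torch.cat([identity, modulated], dim=1)
```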

The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated their strong power in learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate controlling (e.g., of structure and color) is needed. In this paper, we aim to ``dig out" the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly....

10.1609/aaai.v38i5.28226 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2024-03-24
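
As a rough illustration of the adapter idea, the sketch below maps a control signal (e.g., a sketch or color map) to multi-scale features that can be added to a frozen T2I UNet's encoder features; channel sizes are placeholders, not the released configuration.

```python
import torch
import torch.nn as nn

class TinyAdapter(nn.Module):
    def __init__(self, cond_channels=3, unet_channels=(320, 640, 1280, 1280)):
        super().__init__()
        blocks, prev = [], cond_channels
        for ch in unet_channels:
            blocks.append(nn.Sequential(
                nn.Conv2d(prev, ch, 3, stride=2, padding=1),  # downsample to the next UNet scale
                nn.SiLU(),
                nn.Conv2d(ch, ch, 3, padding=1),
            ))
            prev = ch
        self.blocks = nn.ModuleList(blocks)
    def forward(self, cond):
        feats, x = [], cond
        for blk in self.blocks:
            x = blk(x)
            feats.append(x)  # each is added to the matching UNet encoder feature
        return feats
```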

To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that...

10.1109/iccv51070.2023.00701 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
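
A hedged sketch of the kind of attention inflation used for one-shot video tuning: each frame's queries attend to keys/values from the first and the previous frame (a sparse-causal pattern), keeping content consistent across frames. Shapes are simplified.

```python
import torch
import torch.nn.functional as F

def sparse_causal_attention(q, k, v):
    """q, k, v: (frames, tokens, dim); keys/values come from frame 0 and frame t-1."""
    frames = q.shape[0]
    outs = []
    for t in range(frames):
        prev = max(t - 1, 0)
        kt = torch.cat([k[0], k[prev]], dim=0)  # (2 * tokens, dim)
        vt = torch.cat([v[0], v[prev]], dim=0)
        attn = F.softmax(q[t] @ kt.T / kt.shape[-1] ** 0.5, dim=-1)
        outs.append(attn @ vt)
    return torch.stack(outs)
```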

We present QueryInst, a new perspective for instance segmentation. QueryInst is a multi-stage end-to-end system that treats instances of interest as learnable queries, enabling query-based object detectors, e.g., Sparse R-CNN, to have strong instance segmentation performance. Instance attributes such as categories, bounding boxes, masks, and association embeddings are represented by queries in a unified manner. In QueryInst, a query is shared by both detection and segmentation via dynamic convolutions and driven by parallelly supervised multi-stage learning. We conduct...

10.1109/iccv48922.2021.00683 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
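
To illustrate query-driven dynamic convolution, the sketch below turns each instance query into per-instance 1x1 convolution weights applied to its RoI feature; dimensions are illustrative, not the exact head configuration.

```python
import torch
import torch.nn as nn

class DynamicConvHead(nn.Module):
    def __init__(self, query_dim=256, feat_dim=256, inner_dim=64):
        super().__init__()
        # each query generates the weights of two 1x1 convs
        self.param_gen = nn.Linear(query_dim, 2 * feat_dim * inner_dim)
        self.feat_dim, self.inner_dim = feat_dim, inner_dim
    def forward(self, queries, roi_feats):
        """queries: (N, query_dim); roi_feats: (N, S*S, feat_dim)."""
        params = self.param_gen(queries)
        w1 = params[:, : self.feat_dim * self.inner_dim].view(-1, self.feat_dim, self.inner_dim)
        w2 = params[:, self.feat_dim * self.inner_dim :].view(-1, self.inner_dim, self.feat_dim)
        x = torch.relu(torch.bmm(roi_feats, w1))  # (N, S*S, inner_dim)
        return torch.bmm(x, w2)                   # instance-specific features
```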

Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl,...

10.1109/iccv51070.2023.02062 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
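
A minimal sketch of the mutual self-attention idea the name suggests: during editing, queries come from the edited image's denoising pass while keys and values are taken from the source image's self-attention, so textures and identity are queried from the source. A real implementation would hook the UNet attention layers; this shows only the core computation.

```python
import torch
import torch.nn.functional as F

def mutual_self_attention(q_target, k_source, v_source):
    """All tensors: (tokens, dim). Keys/values are cached from the source pass."""
    scale = k_source.shape[-1] ** -0.5
    attn = F.softmax(q_target @ k_source.T * scale, dim=-1)
    return attn @ v_source  # target layout, source content
```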

Generating talking head videos from a face image and a piece of speech audio still contains many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly caused by learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render...

10.1109/cvpr52729.2023.00836 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
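
As an illustrative, hypothetical sketch of the audio-to-coefficient stage, a lightweight recurrent network can map a window of audio features to per-frame 3DMM expression and pose coefficients; the sizes and the GRU choice are placeholders, not SadTalker's actual networks.

```python
import torch
import torch.nn as nn

class AudioToCoeff(nn.Module):
    def __init__(self, n_mels=80, hidden=256, n_exp=64, n_pose=6):
        super().__init__()
        self.rnn = nn.GRU(n_mels, hidden, batch_first=True)
        self.exp_head = nn.Linear(hidden, n_exp)    # per-frame expression coefficients
        self.pose_head = nn.Linear(hidden, n_pose)  # per-frame head pose (rotation + translation)
    def forward(self, mel):
        # mel: (batch, frames, n_mels) -> two coefficient sequences
        h, _ = self.rnn(mel)
        return self.exp_head(h), self.pose_head(h)
```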

Pretraining a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with texts, but it results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel...

10.1109/cvpr52688.2022.01569 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
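
For context on why the two-encoder paradigm is efficient, a minimal sketch: video and text embeddings are computed independently, so ranking reduces to a single matrix product over pre-computed embeddings.

```python
import torch
import torch.nn.functional as F

def retrieval_scores(video_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """video_emb: (Nv, D); text_emb: (Nt, D); returns (Nt, Nv) cosine similarities."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    return t @ v.T  # each text is ranked against all pre-computed video embeddings
```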

In this work, we investigate a simple and widely known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its...

10.1109/cvpr52729.2023.01415 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
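
A hedged sketch of a corruption strategy of the kind described: during GPT training, a random subset of ground-truth motion tokens is replaced with random codebook indices so the model learns to continue from imperfect histories; the corruption ratio is a placeholder.

```python
import torch

def corrupt_tokens(tokens: torch.Tensor, codebook_size: int, ratio: float = 0.5) -> torch.Tensor:
    """tokens: (batch, seq) of VQ indices; replace a random subset with random codes."""
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    random_codes = torch.randint_like(tokens, codebook_size)
    return torch.where(mask, random_codes, tokens)
```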

10.1109/cvpr52733.2024.01599 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency on downstream tasks. In this work, we for the first time introduce an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified...

10.1109/cvpr52729.2023.00638 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
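
An illustrative sketch of the unified design: video patch tokens and text tokens are embedded and processed by a single shared Transformer rather than separate unimodal encoders plus a fusion module. All dimensions are placeholders.

```python
import torch
import torch.nn as nn

class UnifiedVLP(nn.Module):
    def __init__(self, vocab=30522, dim=768, patch_dim=3 * 16 * 16, depth=12, heads=12):
        super().__init__()
        self.text_emb = nn.Embedding(vocab, dim)
        self.patch_emb = nn.Linear(patch_dim, dim)  # flattened video frame patches
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
    def forward(self, patches, token_ids):
        # patches: (B, Np, patch_dim); token_ids: (B, Nt)
        x = torch.cat([self.patch_emb(patches), self.text_emb(token_ids)], dim=1)
        return self.encoder(x)  # joint video-text representation
```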

Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since they contain enormous randomness in the generation process, it is still challenging to apply such models to real-world visual content editing, especially for videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos that requires no per-prompt training or use-specific mask. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM...

10.1109/iccv51070.2023.01460 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
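
A heavily simplified, hedged sketch of attention fusion for zero-shot editing: attention maps captured while inverting the source video are blended into the editing pass, anchoring unedited regions to the source; the blending mask (e.g., derived from cross-attention over the edited words) is a stand-in here.

```python
import torch

def fuse_attention(edit_attn, inverted_attn, word_mask):
    """edit_attn / inverted_attn: (heads, q, k); word_mask: (q,) in [0, 1],
    1 where the prompt edit should take effect, 0 where the source is kept."""
    m = word_mask.view(1, -1, 1)
    return m * edit_attn + (1 - m) * inverted_attn
```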

Finding relevant moments and highlights in videos according to natural language queries is a highly valuable and common need in the current era of video content explosion. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization...

10.1109/cvpr52688.2022.00305 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
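
An illustrative sketch of the joint optimization: a shared per-clip multimodal feature feeds both a highlight (saliency) head and a moment-boundary head, so the two tasks are trained together; head shapes are placeholders.

```python
import torch
import torch.nn as nn

class JointHeads(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.saliency = nn.Linear(dim, 1)   # per-clip highlight score
        self.boundary = nn.Linear(dim, 2)   # per-clip (start, end) offsets for moments
    def forward(self, clip_feats):
        # clip_feats: (B, T, dim), fused video + query features
        return self.saliency(clip_feats).squeeze(-1), self.boundary(clip_feats)
```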

Recent CLIP-guided 3D optimization methods, such as DreamFields [19] and PureCLIPNeRF [24], have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the optimization process. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage as the shape prior. We then use it...

10.1109/cvpr52729.2023.02003 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
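
For context, a minimal sketch of the CLIP-guidance objective such optimization methods minimize: renderings of the 3D representation are embedded by CLIP and scored against the prompt embedding. Producing the embeddings (renderer, CLIP model) is outside this snippet.

```python
import torch
import torch.nn.functional as F

def clip_guidance_loss(image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
    """image_emb: CLIP embedding(s) of rendered views; text_emb: prompt embedding."""
    return 1.0 - F.cosine_similarity(image_emb, text_emb, dim=-1).mean()
```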

In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this baseline heavily relies on spatial cues while ignoring temporal relations during reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively...

10.1109/cvpr52729.2023.01399 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
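
A hedged sketch of spatial-attention dropout: a fraction of within-frame (spatial) attention logits is masked during reconstruction so the model must rely on patches from other frames, strengthening temporal matching. The drop ratio and masking scheme are placeholders.

```python
import torch
import torch.nn.functional as F

def attention_with_spatial_dropout(q, k, v, same_frame: torch.Tensor, p: float = 0.1):
    """q, k, v: (tokens, dim); same_frame: (tokens, tokens) bool, True where two
    tokens belong to the same frame. A fraction p of same-frame logits is dropped."""
    logits = q @ k.T / k.shape[-1] ** 0.5
    drop = same_frame & (torch.rand_like(logits) < p)
    logits = logits.masked_fill(drop, float("-inf"))
    return F.softmax(logits, dim=-1) @ v
```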

Recently, diffusion models have achieved great success in image synthesis. However, when it comes to layout-to-image generation, where an image often has a complex scene with multiple objects, how to exert strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we construct a structural patch...

10.1109/cvpr52729.2023.02154 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
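
As a rough illustration of conditioning on a layout, the sketch below embeds each object's class and normalized bounding box into tokens a diffusion model could attend to; the paper's structural patch fusion is more involved, and all sizes here are placeholders.

```python
import torch
import torch.nn as nn

class LayoutEmbedding(nn.Module):
    def __init__(self, n_classes=180, dim=256):
        super().__init__()
        self.cls_emb = nn.Embedding(n_classes, dim)
        self.box_emb = nn.Linear(4, dim)  # (x0, y0, x1, y1), normalized to [0, 1]
    def forward(self, classes, boxes):
        # classes: (B, N) long; boxes: (B, N, 4) float -> (B, N, dim) layout tokens
        return self.cls_emb(classes) + self.box_emb(boxes)
```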

10.1109/cvpr52733.2024.00097 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

10.1109/cvpr52733.2024.00698 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

10.1109/cvpr52733.2024.01263 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient for conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves...

10.1109/tvcg.2024.3365804 article EN IEEE Transactions on Visualization and Computer Graphics 2024-01-01
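
A minimal, hedged sketch of structure guidance by concatenation: per-frame depth maps resized to the latent resolution are appended to the noisy latent along channels before denoising. Whether this matches the paper's exact conditioning scheme is an assumption; the denoiser itself is not shown.

```python
import torch

def condition_on_depth(noisy_latent: torch.Tensor, depth: torch.Tensor) -> torch.Tensor:
    """noisy_latent: (B, C, T, H, W); depth: (B, 1, T, H, W), already resized.
    Returns the denoiser input with an extra structure channel."""
    return torch.cat([noisy_latent, depth], dim=1)
```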

10.1109/cvpr52733.2024.00825 article EN 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

Facial expressions exhibit not only facial feature motions, but also subtle changes in illumination and appearance (e.g., creases and wrinkles). These details are important visual cues, but they are difficult to synthesize. Traditional expression mapping techniques consider feature motions while the details in illumination changes are ignored. In this paper, we present a novel technique for facial expression mapping. We capture the illumination change of one person's expression in what we call an expression ratio image (ERI). Together with geometric warping, we map an ERI to any other person's face to generate more expressive...

10.1145/383259.383289 article EN 2001-08-01
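
The ratio-image idea admits a compact worked example: with a neutral and an expressive image of the same person, the per-pixel ratio captures the illumination change, and multiplying it onto another (already warped and aligned) face transfers the crease and wrinkle shading. A minimal NumPy sketch, assuming alignment has been done beforehand:

```python
import numpy as np

def expression_ratio_image(neutral: np.ndarray, expressive: np.ndarray, eps: float = 1e-3):
    """Float images in [0, 1], pixel-aligned by geometric warping."""
    return expressive / np.maximum(neutral, eps)  # per-pixel illumination change

def apply_eri(target: np.ndarray, eri: np.ndarray) -> np.ndarray:
    # multiplying transfers the expression's shading details onto the target face
    return np.clip(target * eri, 0.0, 1.0)
```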