- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Computer Graphics and Visualization Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Image Processing Techniques
- Advanced Neural Network Applications
- 3D Shape Modeling and Analysis
- Music and Audio Processing
- Human Motion and Animation
- Topic Modeling
- Face Recognition and Analysis
- Natural Language Processing Techniques
- Video Surveillance and Tracking Methods
- Image Retrieval and Classification Techniques
- Speech and Audio Processing
- Anomaly Detection Techniques and Applications
- Image Processing and 3D Reconstruction
- Image Processing Techniques and Applications
- Robotics and Sensor-Based Localization
- Image and Signal Denoising Methods
- Visual Attention and Saliency Detection
Tencent (China)
2020-2025
Peking University Shenzhen Hospital
2022-2025
Second Hospital of Shandong University
2025
Shandong Center for Disease Control and Prevention
2025
Shandong University
2025
Zhejiang University
2025
Beijing Jishuitan Hospital
2025
Capital Medical University
2025
Peking University
2022-2025
Fudan University
2022-2024
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts...
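The high-order degradation idea can be illustrated with a minimal numpy sketch: a classical blur/downsample/noise round applied repeatedly. This is only a toy stand-in, not Real-ESRGAN's actual pipeline (which also includes sinc filters and JPEG compression); the function names here are my own.

```python
import numpy as np

def blur(img, k=3):
    """Box blur via a sliding-window mean (toy stand-in for a real blur kernel)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def degrade_once(img, rng):
    """One classical degradation round: blur -> downsample -> additive noise."""
    img = blur(img)
    img = img[::2, ::2]                         # nearest-neighbour 2x downsample
    img = img + rng.normal(0, 0.05, img.shape)  # Gaussian noise
    return np.clip(img, 0.0, 1.0)

def high_order_degradation(img, order=2, seed=0):
    """Apply the classical round `order` times (second-order, as in Real-ESRGAN)."""
    rng = np.random.default_rng(seed)
    for _ in range(order):
        img = degrade_once(img, rng)
    return img

hr = np.ones((32, 32)) * 0.5
lr = high_order_degradation(hr, order=2)
print(lr.shape)  # (8, 8): each round halves the resolution
```

The key point of the high-order model is exactly this repetition: real-world images typically pass through several capture/compression stages, so one round of synthetic degradation is too clean to match them.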
Manually crafted combinatorial features have been the "secret sauce" behind many successful models. For web-scale applications, however, the variety and volume of features make these manually crafted features expensive to create, maintain, and deploy. This paper proposes the Deep Crossing model, a deep neural network that automatically combines features to produce superior models. The input is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well as...
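The embedding, stacking, and residual-unit structure can be sketched as a toy numpy forward pass. Dimensions, vocabulary sizes, and function names below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two sparse (categorical) features and one dense feature.
vocab_a, vocab_b, embed_dim, dense_dim = 100, 50, 8, 4
emb_a = rng.normal(0, 0.1, (vocab_a, embed_dim))  # embedding tables
emb_b = rng.normal(0, 0.1, (vocab_b, embed_dim))

stack_dim = 2 * embed_dim + dense_dim
w1 = rng.normal(0, 0.1, (stack_dim, stack_dim))
w2 = rng.normal(0, 0.1, (stack_dim, stack_dim))

def residual_unit(x, w1, w2):
    """Residual unit: two affine maps with ReLU, plus the identity shortcut."""
    h = np.maximum(0, x @ w1)
    return np.maximum(0, h @ w2 + x)

def deep_crossing_forward(idx_a, idx_b, dense):
    # 1) Embedding layer: look up each sparse feature's embedding.
    # 2) Stacking layer: concatenate embeddings with the dense features.
    stacked = np.concatenate([emb_a[idx_a], emb_b[idx_b], dense])
    # 3) Residual units then discover feature crossings implicitly.
    return residual_unit(stacked, w1, w2)

out = deep_crossing_forward(3, 7, np.ones(dense_dim))
print(out.shape)  # (20,)
```

No feature pair is crossed by hand; any useful interaction has to emerge inside the residual stack, which is the point of the architecture.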
Blind face restoration usually relies on facial priors, such as a facial geometry prior or a reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric priors, while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN, which leverages the rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via...
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., of structure and color) is needed. In this paper, we aim to ``dig out'' the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly....
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that...
We present QueryInst, a new perspective for instance segmentation. QueryInst is a multi-stage end-to-end system that treats instances of interest as learnable queries, enabling query-based object detectors, e.g., Sparse R-CNN, to have strong instance segmentation performance. The attributes of instances such as categories, bounding boxes, instance masks, and association embeddings are represented by queries in a unified manner. In QueryInst, a query is shared by both detection and segmentation via dynamic convolutions and driven by parallelly-supervised multi-stage learning. We conduct...
Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl,...
Generating talking head videos from a face image and a piece of speech audio still contains many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly caused by learning from coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render...
Pretraining a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with texts, but it results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel...
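Why separate encoders are efficient can be seen in a tiny numpy sketch of dual-encoder retrieval: each side is embedded independently, so all text-video similarities come from one matrix multiply rather than a joint forward pass per pair. The embeddings below are toy values, not any actual model's output.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy pre-computed embeddings from two separate encoders:
# 3 videos and 3 captions in a shared 4-d space (illustrative values).
video_emb = normalize(np.array([[1., 0., 0., 0.],
                                [0., 1., 0., 0.],
                                [0., 0., 1., 0.]]))
text_emb = normalize(np.array([[0.9, 0.1, 0., 0.],
                               [0., 1., 0.2, 0.],
                               [0.1, 0., 1., 0.]]))

# Efficient retrieval: one matmul scores every caption against every video.
sim = text_emb @ video_emb.T
ranks = sim.argmax(axis=1)
print(ranks)  # [0 1 2]: each caption retrieves its matching video
```

A joint encoder would instead need N×M forward passes for N texts and M videos, which is the efficiency gap the abstract refers to.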
In this work, we investigate a simple and must-known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a corruption strategy during training to alleviate the training-testing discrepancy. Despite its...
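The two recipes named above, EMA and Code Reset, can be sketched in numpy for a plain vector quantizer. This is a generic illustration of the training recipes, not the paper's motion VQ-VAE; sizes and the dead-code threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 4, 2                        # codebook size, code dimension (toy values)
codebook = rng.normal(0, 1, (K, D))
ema_count = np.ones(K)             # EMA of per-code usage counts
ema_sum = codebook.copy()          # EMA of sums of assigned vectors
decay = 0.9

def quantize(x):
    """Assign each row of x to its nearest codebook entry."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def ema_update(x, codes):
    """EMA codebook update plus reset of dead codes (EMA + Code Reset)."""
    global codebook, ema_count, ema_sum
    onehot = np.eye(K)[codes]                            # (N, K) assignments
    ema_count = decay * ema_count + (1 - decay) * onehot.sum(0)
    ema_sum = decay * ema_sum + (1 - decay) * (onehot.T @ x)
    codebook = ema_sum / np.maximum(ema_count, 1e-5)[:, None]
    # Code Reset: re-initialise codes that receive (almost) no assignments
    # from random encoder outputs, so no codebook entry stays dead.
    dead = ema_count < 0.5
    if dead.any():
        codebook[dead] = x[rng.integers(0, len(x), dead.sum())]

x = rng.normal(0, 1, (64, D))
for _ in range(10):
    ema_update(x, quantize(x))
print(codebook.shape)  # (4, 2)
```

EMA replaces the codebook gradient with a running average of assigned vectors, and the reset step combats codebook collapse, which is why these two recipes together yield high-quality discrete codes.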
Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency on downstream tasks. In this work, we for the first time introduce an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified...
Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since the generation process involves enormous randomness, it is still challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos without per-prompt training or use-specific masks. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM...
Finding relevant moments and highlights in videos according to natural language queries is a highly valuable and common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization...
Recent CLIP-guided 3D optimization methods, such as DreamFields [19] and PureCLIPNeRF [24], have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training from random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the optimization process. Specifically, we first generate a high-quality 3D shape from the text in the text-to-shape stage as a shape prior. We then use it...
In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the pixels. However, we find that this baseline heavily relies on spatial cues while ignoring temporal relations during reconstruction, thus leading to sub-optimal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively...
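The "simple extension" baseline the abstract criticises, randomly masking frame patches across a clip, can be sketched in a few lines of numpy. This shows only the baseline masking step, not DropMAE's adaptive spatial-attention dropout; the function name is my own.

```python
import numpy as np

def random_mask(num_frames, patches_per_frame, mask_ratio, seed=0):
    """Randomly mask frame patches across a video clip, MAE-style.

    Returns a boolean array of shape (num_frames, patches_per_frame);
    True marks a masked patch whose pixels must be reconstructed.
    """
    rng = np.random.default_rng(seed)
    total = num_frames * patches_per_frame
    n_mask = int(total * mask_ratio)
    flat = np.zeros(total, dtype=bool)
    flat[rng.choice(total, size=n_mask, replace=False)] = True
    return flat.reshape(num_frames, patches_per_frame)

mask = random_mask(num_frames=2, patches_per_frame=16, mask_ratio=0.75)
print(mask.sum())  # 24 of 32 patches masked (75%)
```

Because masking is uniform over the whole clip, a masked patch often has visible spatial neighbours in the same frame, which is exactly the spatial shortcut the paper argues the baseline over-relies on.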
Recently, diffusion models have achieved great success in image synthesis. However, when it comes to layout-to-image generation, where an image often contains a complex scene of multiple objects, how to exert strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we construct a structural patch...
Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control motion precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves...
Facial expressions exhibit not only facial feature motions, but also subtle changes in illumination and appearance (e.g., creases and wrinkles). These details are important visual cues, but they are difficult to synthesize. Traditional expression mapping techniques consider feature motions while the details in illumination changes are ignored. In this paper, we present a novel technique for facial expression mapping. We capture the illumination change of one person's expression in what we call an expression ratio image (ERI). Together with geometric warping, we map an ERI to any other person's face to generate more expressive...
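The ERI idea reduces to a per-pixel ratio, which a minimal numpy sketch makes concrete. This assumes the faces are already geometrically aligned (the warping step is omitted), uses grayscale images in [0, 1], and the function names are my own.

```python
import numpy as np

def expression_ratio_image(neutral, expressive, eps=1e-6):
    """ERI: per-pixel illumination ratio between expressive and neutral faces."""
    return expressive / np.maximum(neutral, eps)

def apply_eri(target_neutral, eri):
    """Transfer the illumination change onto another person's neutral face."""
    return np.clip(target_neutral * eri, 0.0, 1.0)

# Toy example: a 'crease' darkens one pixel of person A's expressive face.
a_neutral = np.full((4, 4), 0.8)
a_expressive = a_neutral.copy()
a_expressive[1, 1] = 0.4                       # crease darkens this pixel

eri = expression_ratio_image(a_neutral, a_expressive)
b_neutral = np.full((4, 4), 0.6)
b_expressive = apply_eri(b_neutral, eri)
print(b_expressive[1, 1])  # 0.3: the crease darkens B's face proportionally
```

Because the ratio is multiplicative, the same relative illumination change (here a halving) transfers to a target face with a different base brightness, which is what lets an ERI carry wrinkles and creases across identities.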