- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Computer Graphics and Visualization Techniques
- Domain Adaptation and Few-Shot Learning
- Advanced Image Processing Techniques
- Advanced Neural Network Applications
- 3D Shape Modeling and Analysis
- Music and Audio Processing
- Human Motion and Animation
- Topic Modeling
- Face Recognition and Analysis
- Natural Language Processing Techniques
- Video Surveillance and Tracking Methods
- Image Retrieval and Classification Techniques
- Speech and Audio Processing
- Anomaly Detection Techniques and Applications
- Image Processing and 3D Reconstruction
- Image Processing Techniques and Applications
- Robotics and Sensor-Based Localization
- Image and Signal Denoising Methods
- Visual Attention and Saliency Detection
Tencent (China)
2020-2025
Peking University Shenzhen Hospital
2022-2025
Second Hospital of Shandong University
2025
Shandong Center for Disease Control and Prevention
2025
Shandong University
2025
Zhejiang University
2025
Beijing Jishuitan Hospital
2025
Capital Medical University
2025
Peking University
2022-2025
Fudan University
2022-2024
Though many attempts have been made in blind super-resolution to restore low-resolution images with unknown and complex degradations, they are still far from addressing general real-world degraded images. In this work, we extend the powerful ESRGAN to a practical restoration application (namely, Real-ESRGAN), which is trained with pure synthetic data. Specifically, a high-order degradation modeling process is introduced to better simulate complex real-world degradations. We also consider the common ringing and overshoot artifacts...
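The high-order degradation idea can be illustrated with a minimal numpy sketch: a classical blur/downsample/noise round applied repeatedly. This is only a toy stand-in, not Real-ESRGAN's actual pipeline (which also includes sinc filters and JPEG compression); the function names here are my own.

```python
import numpy as np

def blur(img, k=3):
    """Box blur via a sliding-window mean (toy stand-in for a real blur kernel)."""
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].mean()
    return out

def degrade_once(img, rng):
    """One classical degradation round: blur -> downsample -> additive noise."""
    img = blur(img)
    img = img[::2, ::2]                         # nearest-neighbour 2x downsample
    img = img + rng.normal(0, 0.05, img.shape)  # Gaussian noise
    return np.clip(img, 0.0, 1.0)

def high_order_degradation(img, order=2, seed=0):
    """Apply the classical round `order` times (second-order, as in Real-ESRGAN)."""
    rng = np.random.default_rng(seed)
    for _ in range(order):
        img = degrade_once(img, rng)
    return img

hr = np.ones((32, 32)) * 0.5
lr = high_order_degradation(hr, order=2)
print(lr.shape)  # (8, 8): each round halves the resolution
```

The key point of the high-order model is exactly this repetition: real-world images typically pass through several capture/compression stages, so one round of synthetic degradation is too clean to match them.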
Manually crafted combinatorial features have been the "secret sauce" behind many successful models. For web-scale applications, however, the variety and volume of features make these manually crafted features expensive to create, maintain, and deploy. This paper proposes the Deep Crossing model, a deep neural network that automatically combines features to produce superior models. The input is a set of individual features that can be either dense or sparse. The important crossing features are discovered implicitly by the networks, which are comprised of an embedding and stacking layer, as well as...
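The embedding, stacking, and residual-unit structure can be sketched as a toy numpy forward pass. Dimensions, vocabulary sizes, and function names below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: two sparse (categorical) features and one dense feature.
vocab_a, vocab_b, embed_dim, dense_dim = 100, 50, 8, 4
emb_a = rng.normal(0, 0.1, (vocab_a, embed_dim))  # embedding tables
emb_b = rng.normal(0, 0.1, (vocab_b, embed_dim))

stack_dim = 2 * embed_dim + dense_dim
w1 = rng.normal(0, 0.1, (stack_dim, stack_dim))
w2 = rng.normal(0, 0.1, (stack_dim, stack_dim))

def residual_unit(x, w1, w2):
    """Residual unit: two affine maps with ReLU, plus the identity shortcut."""
    h = np.maximum(0, x @ w1)
    return np.maximum(0, h @ w2 + x)

def deep_crossing_forward(idx_a, idx_b, dense):
    # 1) Embedding layer: look up each sparse feature's embedding.
    # 2) Stacking layer: concatenate embeddings with the dense features.
    stacked = np.concatenate([emb_a[idx_a], emb_b[idx_b], dense])
    # 3) Residual units then discover feature crossings implicitly.
    return residual_unit(stacked, w1, w2)

out = deep_crossing_forward(3, 7, np.ones(dense_dim))
print(out.shape)  # (20,)
```

No feature pair is crossed by hand; any useful interaction has to emerge inside the residual stack, which is the point of the architecture.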
Blind face restoration usually relies on facial priors, such as a facial geometry prior or a reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric priors, while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN, which leverages the rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via...
The incredible generative ability of large-scale text-to-image (T2I) models has demonstrated strong power of learning complex structures and meaningful semantics. However, relying solely on text prompts cannot fully take advantage of the knowledge learned by the model, especially when flexible and accurate control (e.g., of structure and color) is needed. In this paper, we aim to ``dig out'' the capabilities that T2I models have implicitly learned, and then explicitly use them to control the generation more granularly....
To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting—One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that...
We present QueryInst, a new perspective for instance segmentation. QueryInst is a multi-stage end-to-end system that treats instances of interest as learnable queries, enabling query-based object detectors, e.g., Sparse R-CNN, to have strong instance segmentation performance. The attributes of instances such as categories, bounding boxes, instance masks, and association embeddings are represented by queries in a unified manner. In QueryInst, a query is shared by both detection and segmentation via dynamic convolutions and driven by parallelly-supervised multi-stage learning. We conduct...
Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl,...
Generating talking head videos from a face image and a piece of speech audio still contains many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly caused by learning from coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render...
Pretraining a model to learn transferable video-text representations for retrieval has attracted a lot of attention in recent years. Previous dominant works mainly adopt two separate encoders for efficient retrieval, but ignore local associations between videos and texts. Another line of research uses a joint encoder to interact video with texts, but it results in low efficiency since each text-video pair needs to be fed into the model. In this work, we enable fine-grained video-text interactions while maintaining high retrieval efficiency via a novel...
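Why separate encoders are efficient can be seen in a tiny numpy sketch of dual-encoder retrieval: each side is embedded independently, so all text-video similarities come from one matrix multiply rather than a joint forward pass per pair. The embeddings below are toy values, not any actual model's output.

```python
import numpy as np

def normalize(x):
    """L2-normalize rows so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

# Toy pre-computed embeddings from two separate encoders:
# 3 videos and 3 captions in a shared 4-d space (illustrative values).
video_emb = normalize(np.array([[1., 0., 0., 0.],
                                [0., 1., 0., 0.],
                                [0., 0., 1., 0.]]))
text_emb = normalize(np.array([[0.9, 0.1, 0., 0.],
                               [0., 1., 0.2, 0.],
                               [0.1, 0., 1., 0.]]))

# Efficient retrieval: one matmul scores every caption against every video.
sim = text_emb @ video_emb.T
ranks = sim.argmax(axis=1)
print(ranks)  # [0 1 2]: each caption retrieves its matching video
```

A joint encoder would instead need N×M forward passes for N texts and M videos, which is the efficiency gap the abstract refers to.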
In this work, we investigate a simple and must-known conditional generative framework based on the Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a corruption strategy during training to alleviate the training-testing discrepancy. Despite its...
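The two recipes named above, EMA and Code Reset, can be sketched in numpy for a plain vector quantizer. This is a generic illustration of the training recipes, not the paper's motion VQ-VAE; sizes and the dead-code threshold are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

K, D = 4, 2                        # codebook size, code dimension (toy values)
codebook = rng.normal(0, 1, (K, D))
ema_count = np.ones(K)             # EMA of per-code usage counts
ema_sum = codebook.copy()          # EMA of sums of assigned vectors
decay = 0.9

def quantize(x):
    """Assign each row of x to its nearest codebook entry."""
    d = ((x[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def ema_update(x, codes):
    """EMA codebook update plus reset of dead codes (EMA + Code Reset)."""
    global codebook, ema_count, ema_sum
    onehot = np.eye(K)[codes]                            # (N, K) assignments
    ema_count = decay * ema_count + (1 - decay) * onehot.sum(0)
    ema_sum = decay * ema_sum + (1 - decay) * (onehot.T @ x)
    codebook = ema_sum / np.maximum(ema_count, 1e-5)[:, None]
    # Code Reset: re-initialise codes that receive (almost) no assignments
    # from random encoder outputs, so no codebook entry stays dead.
    dead = ema_count < 0.5
    if dead.any():
        codebook[dead] = x[rng.integers(0, len(x), dead.sum())]

x = rng.normal(0, 1, (64, D))
for _ in range(10):
    ema_update(x, quantize(x))
print(codebook.shape)  # (4, 2)
```

EMA replaces the codebook gradient with a running average of assigned vectors, and the reset step combats codebook collapse, which is why these two recipes together yield high-quality discrete codes.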
Mainstream Video-Language Pre-training (VLP) models [10, 26, 64] consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters and lower efficiency on downstream tasks. In this work, we for the first time introduce an end-to-end VLP model, namely the all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified...
Diffusion-based generative models have achieved remarkable success in text-based image generation. However, since the generation process involves enormous randomness, it is still challenging to apply such models to real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method for real-world videos without per-prompt training or use-specific masks. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM...
Finding relevant moments and highlights in videos according to natural language queries is a highly valuable and common need in the current video content explosion era. Nevertheless, jointly conducting moment retrieval and highlight detection is an emerging research topic, even though its component problems and some related tasks have already been studied for a while. In this paper, we present the first unified framework, named Unified Multi-modal Transformers (UMT), capable of realizing such joint optimization...
Recent CLIP-guided 3D optimization methods, such as DreamFields [19] and PureCLIPNeRF [24], have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training from random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the optimization process. Specifically, we first generate a high-quality 3D shape from the text in the text-to-shape stage as a shape prior. We then use it...
In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the pixels. However, we find that this baseline heavily relies on spatial cues while ignoring temporal relations during reconstruction, thus leading to sub-optimal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively...
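The "simple extension" baseline the abstract criticises, randomly masking frame patches across a clip, can be sketched in a few lines of numpy. This shows only the baseline masking step, not DropMAE's adaptive spatial-attention dropout; the function name is my own.

```python
import numpy as np

def random_mask(num_frames, patches_per_frame, mask_ratio, seed=0):
    """Randomly mask frame patches across a video clip, MAE-style.

    Returns a boolean array of shape (num_frames, patches_per_frame);
    True marks a masked patch whose pixels must be reconstructed.
    """
    rng = np.random.default_rng(seed)
    total = num_frames * patches_per_frame
    n_mask = int(total * mask_ratio)
    flat = np.zeros(total, dtype=bool)
    flat[rng.choice(total, size=n_mask, replace=False)] = True
    return flat.reshape(num_frames, patches_per_frame)

mask = random_mask(num_frames=2, patches_per_frame=16, mask_ratio=0.75)
print(mask.sum())  # 24 of 32 patches masked (75%)
```

Because masking is uniform over the whole clip, a masked patch often has visible spatial neighbours in the same frame, which is exactly the spatial shortcut the paper argues the baseline over-relies on.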
Recently, diffusion models have achieved great success in image synthesis. However, when it comes to layout-to-image generation, where an image often contains a complex scene of multiple objects, how to exert strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we construct a structural patch...
Creating a vivid video from the event or scenario in our imagination is a truly fascinating experience. Recent advancements in text-to-video synthesis have unveiled the potential to achieve this with prompts only. While text is convenient in conveying the overall scene context, it may be insufficient to control motion precisely. In this paper, we explore customized video generation by utilizing text as context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, involves...
Facial expressions exhibit not only facial feature motions, but also subtle changes in illumination and appearance (e.g., creases and wrinkles). These details are important visual cues, but they are difficult to synthesize. Traditional expression mapping techniques consider feature motions while the details in illumination changes are ignored. In this paper, we present a novel technique for facial expression mapping. We capture the illumination change of one person's expression in what we call an expression ratio image (ERI). Together with geometric warping, we map an ERI to any other person's face to generate more expressive...
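The ERI idea reduces to a per-pixel ratio, which a minimal numpy sketch makes concrete. This assumes the faces are already geometrically aligned (the warping step is omitted), uses grayscale images in [0, 1], and the function names are my own.

```python
import numpy as np

def expression_ratio_image(neutral, expressive, eps=1e-6):
    """ERI: per-pixel illumination ratio between expressive and neutral faces."""
    return expressive / np.maximum(neutral, eps)

def apply_eri(target_neutral, eri):
    """Transfer the illumination change onto another person's neutral face."""
    return np.clip(target_neutral * eri, 0.0, 1.0)

# Toy example: a 'crease' darkens one pixel of person A's expressive face.
a_neutral = np.full((4, 4), 0.8)
a_expressive = a_neutral.copy()
a_expressive[1, 1] = 0.4                       # crease darkens this pixel

eri = expression_ratio_image(a_neutral, a_expressive)
b_neutral = np.full((4, 4), 0.6)
b_expressive = apply_eri(b_neutral, eri)
print(b_expressive[1, 1])  # 0.3: the crease darkens B's face proportionally
```

Because the ratio is multiplicative, the same relative illumination change (here a halving) transfers to a target face with a different base brightness, which is what lets an ERI carry wrinkles and creases across identities.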