- Generative Adversarial Networks and Image Synthesis
- Face recognition and analysis
- Advanced Image Processing Techniques
- Video Surveillance and Tracking Methods
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Multimodal Machine Learning Applications
- Image Enhancement Techniques
- Computer Graphics and Visualization Techniques
- Speech and Audio Processing
- Human Motion and Animation
- Image Retrieval and Classification Techniques
- Image Processing Techniques and Applications
- 3D Surveying and Cultural Heritage
- 3D Shape Modeling and Analysis
- Biological Activity of Diterpenoids and Biflavonoids
- Advanced Manufacturing and Logistics Optimization
- Digital Media Forensic Detection
- Gait Recognition and Analysis
- Visual Attention and Saliency Detection
- Stress Responses and Cortisol
- Blind Source Separation Techniques
- Beijing Institute of Technology, 2024
- Baidu (China), 2024
- Tencent (China), 2020-2023
- Zhejiang University, 2016-2021
- Seoul National University, 2021
- Inner Mongolia University for Nationalities, 2020
- Alibaba Group (China), 2017
- Huazhong University of Science and Technology, 2013
Prevailing video frame interpolation algorithms, which generate intermediate frames from consecutive inputs, typically rely on complex model architectures with heavy parameters or large inference delay, hindering them from diverse real-time applications. In this work, we devise an efficient encoder-decoder based network, termed IFRNet, for fast intermediate frame synthesis. It first extracts pyramid features from the given inputs, and then refines bilateral flow fields together with a powerful intermediate feature until...
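The flow-based synthesis step described above can be illustrated with a minimal numpy sketch of the general idea, backward-warping both input frames by their bilateral flows and blending them by temporal distance. This is an illustration of the generic technique, not IFRNet's implementation; the function names and nearest-neighbour sampling are simplifying assumptions.

```python
import numpy as np

def backward_warp(frame, flow):
    """Sample `frame` at positions displaced by `flow` (nearest-neighbour).

    frame: (H, W) grayscale image; flow: (H, W, 2) per-pixel (dy, dx) offsets.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip(np.round(ys + flow[..., 0]).astype(int), 0, h - 1)
    src_x = np.clip(np.round(xs + flow[..., 1]).astype(int), 0, w - 1)
    return frame[src_y, src_x]

def blend_intermediate(frame0, frame1, flow_t0, flow_t1, t=0.5):
    """Warp both inputs toward time t and blend them by temporal distance."""
    w0 = backward_warp(frame0, flow_t0)
    w1 = backward_warp(frame1, flow_t1)
    return (1 - t) * w0 + t * w1
```

With zero flows and identical inputs, the blended result reproduces the input frame exactly, which is a useful sanity check for warping code.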
Abnormal event detection in large videos is an important task for research and industrial applications, which has attracted considerable attention in recent years. Existing methods usually solve this problem by extracting local features and then learning an outlier model on training videos. However, most previous approaches merely employ hand-crafted visual features, a clear disadvantage due to their limited representation capacity. In this paper, we present a novel unsupervised deep feature learning algorithm for the...
In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing works that only use a face recognition model to keep identity similarity, we propose 3D shape-aware identity control with geometric supervision from a 3DMM reconstruction method. Meanwhile, we introduce a Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic....
Motion blur is a common photography artifact in dynamic environments that typically comes jointly with other types of degradation. This paper reviews the NTIRE 2021 Challenge on Image Deblurring. In this challenge report, we describe the challenge specifics and the evaluation results from the 2 competition tracks with the proposed solutions. While both tracks aim to recover a high-quality clean image from a blurry image, different artifacts are involved. In track 1, the images are of low resolution, while in track 2 they are compressed in JPEG format. In each competition, there...
Vehicle detection is a challenging problem in autonomous driving systems, due to its large structural and appearance variations. In this paper, we propose a novel vehicle detection scheme based on multi-task deep convolutional neural networks (CNNs) and region-of-interest (RoI) voting. In the design of the CNN architecture, we enrich the supervised information with subcategory, region overlap, bounding-box regression, and category of each training RoI in a multi-task learning framework. This allows the model to share visual knowledge among...
Model generalization to unseen scenes is crucial for real-world applications, such as autonomous driving, which requires robust vision systems. To enhance model generalization, domain generalization through learning domain-invariant representations has been widely studied. However, most existing works learn a shared feature space within multi-source domains but ignore the characteristics of the features themselves (e.g., their sensitivity to domain-specific styles). Therefore, we propose Domain-invariant Representation Learning (DIRL) for...
Blind face restoration, which aims to reconstruct high-quality images from low-quality inputs, can benefit many applications. Although existing generative-based methods achieve significant progress in producing high-quality images, they often fail to restore natural face shapes and high-fidelity facial details from severely-degraded inputs. In this work, we propose to integrate shape and generative priors to guide the challenging blind face restoration. Firstly, we set up a shape restoration module to recover reasonable facial geometry with a 3D...
Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, an image, or audio as the emotion condition, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of the generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style with text prompts and use an Aligned Multi-modal Emotion...
Recent advances in image inpainting have shown impressive results for generating plausible visual details on rather simple backgrounds. However, for complex scenes, it is still challenging to restore reasonable contents, as the contextual information within the missing regions tends to be ambiguous. To tackle this problem, we introduce pretext tasks that are semantically meaningful for estimating the missing contents. In particular, we perform knowledge distillation on the pretext models and adapt the features to image inpainting. The learned...
Super-Resolution (SR) is a fundamental computer vision task that aims to obtain a high-resolution clean image from the given low-resolution counterpart. This paper reviews the NTIRE 2021 Challenge on Video Super-Resolution. We present the evaluation results from two competition tracks as well as the proposed solutions. Track 1 aims to develop conventional video SR methods focusing on restoration quality. Track 2 assumes a more challenging environment with lower frame rates, casting it as a spatio-temporal SR problem. In each competition, 247...
In this paper, we present VideoGen, a text-to-video generation approach, which can generate a high-definition video with high frame fidelity and strong temporal consistency using reference-guided latent diffusion. We leverage an off-the-shelf text-to-image generation model, e.g., Stable Diffusion, to generate an image of high content quality from the text prompt, as a reference image to guide video generation. Then, we introduce an efficient cascaded latent diffusion module conditioned on both the reference image and the text prompt, for generating latent video representations, followed by a flow-based...
We propose a new attention model for video question answering. The main idea of attention models is to locate the most informative parts of the visual data. Attention mechanisms are quite popular these days. However, existing methods regard the question as a whole. They ignore word-level semantics, where each word can have different attention and some words need no attention. Neither do they consider the semantic structure of sentences. Although Extended Soft Attention (E-SA) for video question answering leverages word-level attention, it performs poorly on long sentences. In this paper,...
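The word-level attention contrasted above can be sketched in a few lines of numpy: score each word feature against a query, softmax the scores, and take the weighted sum. This is a minimal sketch of generic soft attention, not the paper's model; the function name and dot-product scoring are assumptions.

```python
import numpy as np

def word_level_attention(word_feats, query):
    """Soft attention over word features: score, softmax, weighted sum.

    word_feats: (n_words, d) matrix; query: (d,) vector.
    Returns the attended context vector and the per-word weights.
    """
    scores = word_feats @ query                      # (n_words,) relevance scores
    scores = scores - scores.max()                   # subtract max for stability
    weights = np.exp(scores) / np.exp(scores).sum()  # softmax over words
    context = weights @ word_feats                   # (d,) weighted combination
    return context, weights
```

Words aligned with the query receive larger weights, while irrelevant words contribute little to the context vector, which is exactly the per-word selectivity the abstract argues whole-question attention lacks.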
As one of the most popular unsupervised learning approaches, the autoencoder aims at transforming inputs to outputs with the least discrepancy. The conventional autoencoder and its variants only consider one-to-one reconstruction, which ignores the intrinsic structure of the data and may lead to overfitting. In order to preserve the latent geometric information in the data, we propose stacked similarity-aware autoencoders. To train each single autoencoder, we first obtain the pseudo class label of each sample by clustering the input features. Then the hidden...
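The pseudo-labelling step described above, assigning each sample a class label by clustering its input features, can be sketched with a plain k-means loop. This is a minimal numpy illustration of the clustering step only, assuming farthest-point initialisation; it is not the paper's training procedure, and the function name is illustrative.

```python
import numpy as np

def kmeans_pseudo_labels(x, k, iters=20):
    """Assign pseudo class labels to samples in x via k-means clustering.

    x: (n_samples, d) feature matrix. Returns an (n_samples,) label array.
    """
    # Farthest-point initialisation: start from x[0], then repeatedly pick
    # the sample farthest from all current centres.
    centers = [x[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(x - c, axis=1) for c in centers], axis=0)
        centers.append(x[d.argmax()])
    centers = np.stack(centers)
    # Standard Lloyd iterations: assign, then recompute centres.
    for _ in range(iters):
        dist = np.linalg.norm(x[:, None] - centers[None], axis=-1)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels
```

The resulting labels can then serve as supervision so that samples from the same cluster are encouraged toward similar hidden representations, which is the similarity-aware idea the abstract motivates.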
Most semantic segmentation models treat the task as pixel-wise classification and use the pixel-wise classification error as their optimization criterion. However, the pixel-wise error ignores the strong dependencies among pixels in an image, which limits the performance of the model. Several ways to incorporate the structure information of objects have been investigated, e.g., conditional random fields (CRF), image-prior based methods, and generative adversarial networks (GAN). Nevertheless, these methods usually require extra model branches or additional...
A caricature is an artistic form of a person's picture in which certain striking characteristics are abstracted or exaggerated in order to create a humor or sarcasm effect. For numerous related applications such as attribute recognition and editing, face parsing is an essential pre-processing step that provides a complete facial structure understanding. However, current state-of-the-art methods require large amounts of pixel-level labeled data, and the labeling process for caricatures is tedious and labor-intensive. For real photos, there...
Caricature is an artistic drawing created to abstract or exaggerate facial features of a person. Rendering visually pleasing caricatures is a difficult task that requires professional skills, and thus it is of great interest to design a method to automatically generate such drawings. To deal with large shape changes, we propose an algorithm based on a semantic shape transform to produce diverse and plausible shape exaggerations. Specifically, we predict pixel-wise semantic correspondences and perform image warping on the input photo to achieve...
Fine-grained object retrieval, which aims at finding objects belonging to the same sub-category as a probe from a large database, is becoming increasingly popular because of its research and application significance. Recently, convolutional neural network (CNN) based deep learning models have achieved promising retrieval performance, as they can learn both feature representations and discriminative distance metrics jointly. Specifically, a generic method is to extract the activations of a fully-connected layer...
In this article, we propose a novel deep Siamese architecture based on a convolutional neural network (CNN) and multi-level similarity perception for the person re-identification (re-ID) problem. According to the distinct characteristics of diverse feature maps, we effectively apply different similarity constraints to both low-level and high-level feature maps during the training stage. Due to the introduction of appropriate comparison mechanisms at multiple levels, the proposed approach can adaptively learn discriminative local and global representations,...
While considerable progress has been made in achieving accurate lip synchronization for 3D speech-driven talking face generation, the task of incorporating expressive facial detail synthesis aligned with the speaker's speaking status remains challenging. Existing efforts either focus on learning a dynamic head pose synchronized with the speech rhythm, or aim at stylized facial movements guided by an external reference such as emotion labels or video clips. The former works often yield coarse alignment, neglecting...