- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Generative Adversarial Networks and Image Synthesis
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Advanced Image Processing Techniques
- Advanced Vision and Imaging
- Image Retrieval and Classification Techniques
- Video Surveillance and Tracking Methods
- Image Processing Techniques and Applications
- Topic Modeling
- Image and Signal Denoising Methods
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
- Image Enhancement Techniques
- Robot Manipulation and Learning
- Human Motion and Animation
- Anomaly Detection Techniques and Applications
- Face Recognition and Analysis
- Adversarial Robustness in Machine Learning
- CCD and CMOS Imaging Sensors
- Music and Audio Processing
- Natural Language Processing Techniques
Microsoft Research (United Kingdom)
2018-2025
Yantai University
2025
Microsoft Research Asia (China)
2015-2024
Lanzhou University
2024
Microsoft Research (India)
2024
Northwest Normal University
2024
Microsoft (United States)
2017-2023
Université de Bordeaux
2023
Laboratoire Bordelais de Recherche en Informatique
2023
Peking University
2023
Recognizing fine-grained categories (e.g., bird species) is highly challenging due to the difficulty of discriminative region localization and fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that region detection and fine-grained feature learning are mutually correlated and thus can reinforce each other. In this paper, we propose a novel recurrent attention convolutional neural network (RA-CNN) which recursively learns discriminative region attention and region-based feature representation at multiple scales in a mutually...
Recognizing fine-grained categories (e.g., bird species) highly relies on discriminative part localization and part-based fine-grained feature learning. Existing approaches predominantly solve these challenges independently, while neglecting the fact that part localization (e.g., the head of a bird) and fine-grained feature learning (e.g., head shape) are mutually correlated. In this paper, we propose a novel part learning approach by a multi-attention convolutional neural network (MA-CNN), where part generation and feature learning can reinforce each other. MA-CNN consists of convolution, channel grouping...
We study image super-resolution (SR), which aims to recover realistic textures from a low-resolution (LR) image. Recent progress has been made by taking high-resolution (HR) images as references (Ref), so that relevant textures can be transferred to LR images. However, existing SR approaches neglect to use attention mechanisms to transfer HR textures from Ref images, which limits these approaches in challenging cases. In this paper, we propose a novel Texture Transformer Network for Image Super-Resolution (TTSR), in which the LR and Ref images are formulated...
In this paper, we present a new tracking architecture with an encoder-decoder transformer as the key component. The encoder models the global spatio-temporal feature dependencies between target objects and search regions, while the decoder learns a query embedding to predict the spatial positions of the target objects. Our method casts object tracking as a direct bounding box prediction problem, without using any proposals or predefined anchors. With the encoder-decoder transformer, the prediction of objects just uses a simple fully-convolutional network, which estimates...
High-quality image inpainting requires filling missing regions in a damaged image with plausible content. Existing works either fill the regions by copying image patches or by generating semantically coherent patches from the region context, while neglecting the fact that both visual and semantic plausibility are highly demanded. In this paper, we propose a Pyramid-context Encoder Network (denoted as PEN-Net) for image inpainting by deep generative models. The proposed PEN-Net is built upon a U-Net structure with three tailored components,...
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify significant parts to learn fine-grained details, which often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such fine-grained features from hundreds of part proposals by a Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps...
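The trilinear attention idea above (attention maps derived from inter-channel relationships, roughly softmax(X·Xᵀ)·X over a flattened feature map) can be sketched as follows. This is a minimal illustration of the general mechanism, not TASN's exact formulation; the function names and the softmax normalization choice are assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def trilinear_attention(x):
    """Sketch of a trilinear attention map.
    x: (C, HW) feature map flattened over spatial positions.
    Inter-channel relations (X X^T) redistribute each channel's
    spatial responses across correlated channels (... X)."""
    rel = softmax(x @ x.T, axis=-1)   # (C, C) inter-channel relations
    return rel @ x                    # (C, HW) attention maps
```

Each output row is a convex combination of channel response maps, so channels that co-activate on the same part reinforce one another.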
The Visual Object Tracking challenge VOT2019 is the seventh annual tracker benchmarking activity organized by the VOT initiative. Results of 81 trackers are presented; many are state-of-the-art trackers published at major computer vision conferences or in journals in recent years. The evaluation included the standard and other popular methodologies for short-term tracking analysis, as well as a methodology for long-term tracking analysis. The challenge was composed of five challenges focusing on different domains: (i) VOT-ST2019 focused on short-term tracking in RGB, (ii)...
We address the problem of retrieving a specific moment from an untrimmed video by a query sentence. This is challenging because a target moment may take place in relation to other temporal moments in the video. Existing methods cannot tackle this challenge well since they consider moments individually and neglect their temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates its end time. This 2D map can cover diverse moments with different lengths, while representing their adjacent...
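The 2D temporal map described above can be sketched as a grid whose entry (i, j) represents the candidate moment spanning clips i through j. A minimal version, assuming mean pooling of clip features as the moment representation (the pooling choice is an illustrative assumption, not necessarily the paper's):

```python
import numpy as np

def build_2d_moment_map(clip_feats):
    """Build a 2D temporal map of candidate moments.
    clip_feats: (N, D) array of per-clip features.
    Entry (i, j) holds the moment spanning clips i..j (inclusive),
    represented by mean pooling; entries with j < i are invalid zeros."""
    n, d = clip_feats.shape
    # Prefix sums give O(1) mean pooling for every candidate span.
    prefix = np.concatenate([np.zeros((1, d)), np.cumsum(clip_feats, axis=0)])
    moment_map = np.zeros((n, n, d))
    for i in range(n):
        for j in range(i, n):
            moment_map[i, j] = (prefix[j + 1] - prefix[i]) / (j - i + 1)
    return moment_map
```

The upper triangle enumerates all N(N+1)/2 moments of every length at once, so adjacency between moments becomes adjacency between cells of the map.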
We propose Pixel-BERT to align image pixels with text by deep multi-modal transformers that jointly learn visual and language embedding in a unified end-to-end framework. We aim to build a more accurate and thorough connection between image pixels and language semantics directly from image and sentence pairs, instead of using region-based image features as in most recent vision-language tasks. Our Pixel-BERT, which aligns semantic connections at the pixel level, solves the limitation of task-specific visual representation. It also relieves the cost of bounding box annotations and overcomes the unbalance between semantic labels...
Relative position encoding (RPE) is important for the transformer to capture the sequence ordering of input tokens. Its general efficacy has been proven in natural language processing. However, in computer vision, its efficacy is not well studied and even remains controversial, e.g., can relative position encoding work equally as well as absolute position encoding? In order to clarify this, we first review existing relative position encoding methods and analyze their pros and cons when applied in vision transformers. We then propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE)...
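A simplified version of 2D relative position encoding can be sketched as a lookup of learned biases indexed by each query-key pair's 2D offset. This illustrates the general bucketing idea only; iRPE's actual piecewise index functions and directed/undirected variants differ, and the function names here are illustrative.

```python
import numpy as np

def relative_position_index(h, w):
    """Map each (query, key) pair on an h x w grid to a bucket index
    determined by their 2D offset. Offsets span (2h-1) x (2w-1)
    buckets, so one bias table is shared across all token pairs."""
    coords = np.stack(
        np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), axis=-1
    ).reshape(-1, 2)                                # (h*w, 2) grid positions
    rel = coords[:, None, :] - coords[None, :, :]   # (hw, hw, 2) offsets
    rel[..., 0] += h - 1                            # shift offsets to >= 0
    rel[..., 1] += w - 1
    return rel[..., 0] * (2 * w - 1) + rel[..., 1]  # flatten to bucket id

def rpe_attention_bias(h, w, bias_table):
    """Gather per-pair biases to add to the attention logits."""
    idx = relative_position_index(h, w)
    return bias_table[idx]                          # (hw, hw)
```

Because the index depends only on the offset, tokens with the same relative displacement share one learned parameter, which is what makes the encoding translation-invariant.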
Inspired by the recent success of text-based question answering, visual question answering (VQA) is proposed to automatically answer natural language questions with reference to a given image. Compared with text-based QA, VQA is more challenging because the reasoning process in the visual domain needs both effective semantic embedding and fine-grained visual understanding. Existing approaches predominantly infer answers from abstract low-level visual features, while neglecting the modeling of high-level image semantics and the rich spatial context of regions. To solve...
Recently, pure transformer-based models have shown great potential for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles...
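The weight entanglement idea that AutoFormer is built around can be sketched as candidate sub-networks sharing slices of one supernet weight, so training any subnet updates the shared parameters. The class below is an illustrative toy, not AutoFormer's implementation; the slicing convention (taking the leading rows/columns) is an assumption.

```python
import numpy as np

class EntangledLinear:
    """Weight-entanglement sketch: all candidate dimensions share one
    supernet weight matrix, and a subnet with output dimension k simply
    uses the first k columns (and first in_dim rows) of that matrix."""
    def __init__(self, max_in, max_out, rng=None):
        rng = rng or np.random.default_rng(0)
        self.w = rng.standard_normal((max_in, max_out)) / np.sqrt(max_in)

    def forward(self, x, out_dim):
        in_dim = x.shape[-1]
        return x @ self.w[:in_dim, :out_dim]  # sliced shared weight
```

During one-shot search, different (depth, width, heads) choices then index different slices of the same tensors rather than maintaining separate weights per candidate.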
We study the joint learning of Convolutional Neural Network (CNN) and Transformer for vision-language pre-training (VLPT), which aims to learn cross-modal alignments from millions of image-text pairs. State-of-the-art approaches extract salient image regions and align regions with words step-by-step. As region-based visual features usually represent parts of an image, it is challenging for existing vision-language models to fully understand the semantics of paired natural languages. In this paper, we propose SOHO, to "See Out of tHe bOx", which...
Most current action localization methods follow an anchor-based pipeline: depicting action instances by pre-defined anchors, learning to select the anchors closest to the ground truth, and predicting anchor confidence with refinements. Pre-defined anchors set a prior about the location and duration of action instances, which facilitates localization for common instances but limits flexibility in tackling drastic varieties, especially extremely short or extremely long ones. To address this problem, this paper proposes a novel anchor-free module that assists action localization by temporal points...
Image inpainting that completes large free-form missing regions in images is a promising yet challenging task. State-of-the-art approaches have achieved significant progress by taking advantage of generative adversarial networks (GANs). However, these approaches can suffer from generating distorted structures and blurry textures at high resolution (e.g., 512×512). The challenges mainly derive from (1) image content reasoning from distant contexts, and (2) fine-grained texture synthesis for a large missing region. To overcome these two...
Object tracking has achieved significant progress over the past few years. However, state-of-the-art trackers have become increasingly heavy and expensive, which limits their deployment in resource-constrained applications. In this work, we present LightTrack, which uses neural architecture search (NAS) to design more lightweight and efficient object trackers. Comprehensive experiments show that our LightTrack is effective. It can find trackers that achieve superior performance compared with handcrafted SOTA trackers, such...
We study weakly-supervised object detection (WSOD), which plays a vital role in relieving human involvement from object-level annotations. Predominant works integrate region proposal mechanisms with convolutional neural networks (CNNs). Although a CNN is proficient at extracting discriminative local features, grand challenges still exist in measuring the likelihood of a bounding box containing a complete object (i.e., "objectness"). In this paper, we propose a novel WSOD framework with Objectness Distillation...
Dense crowd counting aims to predict thousands of human instances in an image by calculating the integral of a density map over image pixels. Existing approaches mainly suffer from extreme density variations. Such pattern shift poses challenges even for multi-scale model ensembling. In this paper, we propose a simple yet effective approach to tackle this problem. First, a patch-level density map is extracted by a density estimation model and further grouped into several density levels, which are determined over the full datasets. Second, each patch is automatically...
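The two basic operations in the abstract above, counting as the integral of a density map and grouping patches into density levels, can be sketched as follows. The level edges here are hypothetical thresholds standing in for the dataset-derived groupings the paper describes; the function names are illustrative.

```python
import numpy as np

def count_from_density(density_map):
    """Predicted crowd count is the integral (sum) of the density map,
    since each annotated head contributes unit mass to the map."""
    return float(density_map.sum())

def density_level(patch_count, level_edges):
    """Assign a patch to one of several density levels.
    level_edges: ascending thresholds (hypothetical here; in the paper
    they are determined over the full dataset)."""
    return int(np.searchsorted(level_edges, patch_count))
```

Binning patches by predicted count lets each level be normalized or modeled separately, which is how the approach copes with extreme density variation.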
Video super-resolution (VSR) aims to restore a sequence of high-resolution (HR) frames from their low-resolution (LR) counterparts. Although some progress has been made, there remain grand challenges in effectively utilizing temporal dependency across entire video sequences. Existing approaches usually align and aggregate a limited number of adjacent frames (e.g., 5 or 7 frames), which prevents these approaches from achieving satisfactory results. In this paper, we take one step further to enable effective spatio-temporal learning in videos. We propose...
Vision Transformer (ViT) models have recently drawn much attention in computer vision due to their high model capability. However, ViT models suffer from a huge number of parameters, restricting their applicability on devices with limited memory. To alleviate this problem, we propose MiniViT, a new compression framework, which achieves parameter reduction in vision transformers while retaining the same performance. The central idea of MiniViT is to multiplex the weights of consecutive transformer blocks. More specifically, we make...
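The weight multiplexing idea above can be sketched as several consecutive blocks reusing one shared weight matrix, with only a small per-block transformation so the blocks are not identical. This is a toy illustration of the parameter-sharing principle, not MiniViT's actual architecture (which also uses distillation and per-block transformations richer than the diagonal scale assumed here).

```python
import numpy as np

class MultiplexedBlocks:
    """Weight-multiplexing sketch: L blocks reuse one shared d x d
    weight; each block adds only a per-block diagonal scale.
    Parameters: d*d shared + L*d per-block, vs. L*d*d unshared."""
    def __init__(self, d, num_blocks, rng=None):
        rng = rng or np.random.default_rng(0)
        self.shared_w = rng.standard_normal((d, d)) / np.sqrt(d)
        self.scales = [np.ones(d) for _ in range(num_blocks)]

    def forward(self, x):
        for s in self.scales:                      # consecutive blocks
            x = np.tanh(x @ (self.shared_w * s))   # per-block modulation
        return x

    def num_params(self):
        return self.shared_w.size + sum(s.size for s in self.scales)
```

With d = 8 and 4 blocks the sketch stores 96 parameters instead of 256, showing how the per-block cost collapses to the cheap modulation terms.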