- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Video Surveillance and Tracking Methods
- Image Retrieval and Classification Techniques
- Advanced Image Processing Techniques
- Advanced Neural Network Applications
- Image Enhancement Techniques
- Video Analysis and Summarization
- Generative Adversarial Networks and Image Synthesis
- Advanced Vision and Imaging
- Image and Signal Denoising Methods
- Gait Recognition and Analysis
- Image Processing Techniques and Applications
- Topic Modeling
- Advanced Image Fusion Techniques
- Digital Media Forensic Detection
- Face and Expression Recognition
- Visual Attention and Saliency Detection
- Face recognition and analysis
- Text and Document Classification Technologies
- COVID-19 diagnosis using AI
- Handwritten Text Recognition Techniques
- Anomaly Detection Techniques and Applications
University of Science and Technology of China, 2016-2025
Hebei Agricultural University, 2025
Beijing Information Science & Technology University, 2025
Chinese Academy of Sciences, 2013-2024
Institute of Computing Technology, 2020-2024
University of Science and Technology Chittagong, 2023
Tianjin University, 2022
University of Science and Technology Beijing, 2020
Microsoft Research (United Kingdom), 2020
City University of Hong Kong, 2019
Due to the popularity of social media websites, extensive research efforts have been dedicated to tag-based image search. Both visual information and tags have been investigated in the field. However, most existing methods use these characteristics either separately or sequentially in order to estimate the relevance of images. In this paper, we propose an approach that simultaneously utilizes both visual and textual information to estimate the relevance of user tagged images. The relevance estimation is determined with a hypergraph learning approach. In this method, a hypergraph is constructed, where vertices represent...
Learning subtle yet discriminative features (e.g., beak and eyes for a bird) plays a significant role in fine-grained image recognition. Existing attention-based approaches localize and amplify parts to learn details, and often suffer from a limited number of parts and heavy computational cost. In this paper, we propose to learn such features from hundreds of part proposals by Trilinear Attention Sampling Network (TASN) in an efficient teacher-student manner. Specifically, TASN consists of 1) a trilinear attention module, which generates attention maps...
Recent deep learning based person re-identification approaches have steadily improved the performance on benchmarks; however, they often fail to generalize well from one domain to another. In this work, we propose a novel adaptive transfer network (ATNet) for effective cross-domain person re-identification. ATNet looks into the essential causes of the domain gap and addresses it following the principle of "divide-and-conquer". It decomposes the complicated cross-domain transfer into a set of factor-wise sub-transfers, each of which concentrates on style transfer with...
Taking full advantage of the information from both vision and language is critical for the video captioning task. Existing models lack adequate visual representation due to the neglect of interaction between objects, and lack sufficient training for content-related words due to long-tailed problems. In this paper, we propose a complete system including a novel model and an effective training strategy. Specifically, we propose an object relational graph (ORG) based encoder, which captures more detailed features to enrich the visual representation. Meanwhile, we design...
With the explosive growth of web videos on the Internet, it becomes challenging to efficiently browse hundreds or even thousands of videos. When searching an event query, users are often bewildered by the vast quantity of videos returned by search engines. Exploring such results will be time consuming and will also degrade the user experience. In this paper, we present an approach for event driven video summarization by tag localization and key-shot mining. We first localize the tags that are associated with each video into its shots. Then, we estimate...
Human actions in videos are three-dimensional (3D) signals. Recent attempts use 3D convolutional neural networks (CNNs) to explore spatio-temporal information for human action recognition. Though promising, 3D CNNs have not achieved high performance on this task with respect to their well-established two-dimensional (2D) counterparts for visual recognition in still images. We argue that the training complexity of spatio-temporal fusion and the huge memory cost of 3D convolution hinder current 3D CNNs, which stack 3D convolutions layer...
Visual grounding, a task to ground (i.e., localize) natural language in images, essentially requires composite visual reasoning. However, existing methods over-simplify the composite nature of language into a monolithic sentence embedding or a coarse composition of subject-predicate-object triplets. In this paper, we propose to ground natural language in an intuitive, explainable, and composite fashion as it should be. In particular, we develop a novel modular network called Neural Module Tree (NMTree) that regularizes the grounding along the dependency parsing tree...
Recently, the phenomenal advent of photo-sharing services, such as Flickr and Panoramio, has led to voluminous community-contributed photos with text tags, timestamps, and geographic references on the Internet. The photos, together with their time- and geo-references, become the digital footprints of photo takers and implicitly document their spatiotemporal movements. This study aims to leverage the wealth of these enriched online photos to analyze people’s travel patterns at the local level of a tour destination. Specifically, we focus our analysis...
Scene text detection has witnessed rapid development in recent years. However, there still exist two main challenges: 1) many methods suffer from false positives in their text representations; 2) the large scale variance of scene texts makes it hard for the network to learn samples. In this paper, we propose ContourNet, which effectively handles these two problems, taking a further step toward accurate arbitrary-shaped text detection. At first, a scale-insensitive Adaptive Region Proposal Network (Adaptive-RPN) is...
Vehicle Re-Identification is to find images of the same vehicle from various views in the cross-camera scenario. The main challenges of this task are the large intra-instance distance caused by different views and the subtle inter-instance discrepancy caused by similar vehicles. In this paper, we propose a parsing-based view-aware embedding network (PVEN) to achieve view-aware feature alignment and enhancement for vehicle ReID. First, we introduce a parsing network to parse a vehicle into four views, and then align the features by mask average pooling. Such alignment provides a fine-grained representation...
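The mask average pooling step mentioned above can be illustrated with a minimal sketch. This is not the PVEN implementation: the function name `masked_average_pool`, the toy feature map, and the two view masks are hypothetical, and plain Python lists stand in for tensors. It only shows the general idea of averaging feature vectors over the pixels a parsing mask assigns to one view.

```python
def masked_average_pool(feature_map, mask):
    """Average the feature vectors at locations where the mask is non-zero.

    feature_map: H x W x C nested lists of floats.
    mask:        H x W nested lists of 0/1 weights (one view of the vehicle).
    Returns a list of C floats (all zeros if the mask selects nothing).
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    weight = 0.0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, m in zip(feat_row, mask_row):
            weight += m
            for c in range(channels):
                total[c] += m * feat[c]
    if weight == 0:
        return total  # empty view: return zeros instead of dividing by zero
    return [t / weight for t in total]


# Toy 2x2 feature map with 2 channels and two hypothetical view masks.
features = [[[1.0, 2.0], [3.0, 4.0]],
            [[5.0, 6.0], [7.0, 8.0]]]
front_mask = [[1, 1], [0, 0]]
side_mask = [[0, 0], [1, 1]]

front_feat = masked_average_pool(features, front_mask)
side_feat = masked_average_pool(features, side_mask)
```

Each view gets its own pooled vector, so features from the same view of two vehicles can be compared directly.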
Person re-identification aims at identifying a certain person across non-overlapping multi-camera networks. It is a fundamental and challenging task in automated video surveillance. Most existing research relies mainly on hand-crafted features, resulting in unsatisfactory performance. In this paper, we propose a multi-scale triplet convolutional neural network which captures the visual appearance of a person at various scales. We propose to optimize the network parameters by a comparative similarity loss on massive sample triplets,...
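As a rough illustration of learning from sample triplets, the sketch below uses the standard margin-based triplet loss. The abstract does not spell out its comparative similarity loss, so this formulation, the `margin` value, and the toy embeddings are assumptions rather than the paper's definition.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: the positive should be closer to the anchor
    than the negative by at least `margin` (an assumed hyperparameter)."""
    return max(0.0, euclidean(anchor, positive)
               - euclidean(anchor, negative) + margin)


# Toy embeddings: the positive is at distance 5, the negative at distance 10.
anchor = [0.0, 0.0]
positive = [3.0, 4.0]
negative = [6.0, 8.0]

loss_satisfied = triplet_margin_loss(anchor, positive, negative)  # 0.0
loss_violated = triplet_margin_loss(anchor, negative, positive)   # 6.0
```

A triplet that already respects the margin contributes zero loss, so training effort goes to the triplets that are still ordered wrongly.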
Uyghur text localization in images with complex backgrounds is a challenging yet important task for many applications. Generally, Uyghur characters consist of strokes with uniform features, and they are distinct from the background in color, intensity, and texture. Based on these differences, we propose a FASTroke keypoint extractor, which is fast and stroke-specific. Compared with the commonly used MSER detector, FASTroke produces less than twice the amount of components and recognizes at least 10% more characters. While characters in a line usually have features such as...
Unsupervised Domain Adaptive (UDA) person re-identification (ReID) aims at adapting the model trained on a labeled source-domain dataset to a target-domain dataset without any further annotations. Most successful UDA-ReID approaches combine clustering-based pseudo-label prediction with representation learning and perform the two steps in an alternating fashion. However, offline interaction between these two steps may allow noisy pseudo labels to substantially hinder the capability of the model. In this paper, we propose...
Existing methods for single image super-resolution (SR) are typically evaluated with synthetic degradation models such as bicubic or Gaussian downsampling. In this paper, we investigate SR from the perspective of camera lenses, named CameraSR, which aims to alleviate the intrinsic tradeoff between resolution (R) and field-of-view (V) in realistic imaging systems. Specifically, we view the R-V degradation as a latent model in the SR process and learn to reverse it with low- and high-resolution image pairs. To obtain the paired images, we propose two novel...
Delicate attention of the discriminative regions plays a critical role in Fine-Grained Visual Categorization (FGVC). Unfortunately, most existing attention models perform poorly in FGVC, due to pivotal limitations in discriminative region proposing and region-based feature learning. 1) The discriminative regions are predominantly located based on filter responses over the images, which cannot be directly optimized with a performance metric. 2) Existing methods train the region-based feature extractor as a one-hot classification task individually, while neglecting the knowledge from...
Generalized zero-shot learning aims to recognize images from seen and unseen domains. Recent methods focus on learning a unified semantic-aligned visual representation to transfer knowledge between the two domains, while ignoring the effect of semantic-free visual representation in alleviating the biased recognition problem. In this paper, we propose a novel Domain-aware Visual Bias Eliminating (DVBE) network that constructs two complementary visual representations, i.e., semantic-free and semantic-aligned, to treat seen and unseen domains separately. Specifically, we explore...
Many unsupervised domain adaptive (UDA) person ReID approaches combine clustering-based pseudo-label prediction with feature fine-tuning. However, because of the domain gap, the pseudo-labels are not always reliable and there are noisy/incorrect labels. This would mislead the feature representation learning and deteriorate the performance. In this paper, we propose to estimate and exploit the credibility of the pseudo-label assigned to each sample to alleviate the influence of noisy labels, by suppressing the contribution of noisy samples. We build our baseline framework using...
Existing deep learning based de-raining approaches have resorted to convolutional architectures. However, the intrinsic limitations of convolution, including local receptive fields and independence of input content, hinder the model's ability to capture long-range and complicated rainy artifacts. To overcome these limitations, we propose an effective and efficient transformer-based architecture for image de-raining. First, we introduce general priors of vision tasks, i.e., locality and hierarchy, into the network so that...
Few-shot class-incremental learning is to recognize the new classes given few samples and not forget the old classes. It is a challenging task since representation optimization and prototype reorganization can only be achieved under little supervision. To address this problem, we propose a novel incremental prototype learning scheme. Our scheme consists of a random episode selection strategy that adapts the feature representation to various generated incremental episodes to enhance the corresponding extensibility, and a self-promoted prototype refinement mechanism which...
Deep convolutional neural networks (CNNs) have become dominant in the single image de-raining area. However, most deep CNN-based de-raining methods are designed by stacking vanilla convolutional layers, which can only be used to model local relations. Therefore, long-range contextual information is rarely considered for this specific task. To address the above problem, we propose a simple yet effective dual graph convolutional network (GCN) for single image rain removal. Specifically, we design two graphs to perform global relational modeling and...
Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised by lots of conjunctions that may not hold. In this paper, we investigate three important conjunctions: 1) only samples assigned as positive in the classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel...
Non-exemplar class-incremental learning is to recognize both the old and new classes when old class samples cannot be saved. It is a challenging task since representation optimization and feature retention can only be achieved under supervision from new classes. To address this problem, we propose a novel self-sustaining representation expansion scheme. Our scheme consists of a structure reorganization strategy that fuses main-branch expansion and side-branch updating to maintain the old features, and a main-branch distillation scheme to transfer the invariant knowledge. Furthermore,...
RGB-infrared person re-identification is an emerging cross-modality re-identification task, which is very challenging due to the significant modality discrepancy between RGB and infrared images. In this work, we propose a novel modality-adaptive mixup and invariant decomposition (MID) approach for RGB-infrared person re-identification, towards learning modality-invariant and discriminative representations. MID designs a modality-adaptive mixup scheme to generate suitable mixed modality images between RGB and infrared images for mitigating the inherent modality discrepancy at the pixel-level. It formulates the modality mixup procedure as a Markov decision process, where...
Change captioning aims to describe the semantic change between two similar images. In this process, as the most typical distractor, viewpoint change leads to pseudo changes in the appearance and position of objects, thereby overwhelming the real change. Besides, since the visual signal of change appears in a local region with weak features, it is difficult for the model to directly translate the learned change features into a sentence. In this paper, we propose a syntax-calibrated multi-aspect relation transformer to learn effective change features under different scenes,...