- Advanced Image and Video Retrieval Techniques
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Neural Network Applications
- Image Retrieval and Classification Techniques
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Video Analysis and Summarization
- Visual Attention and Saliency Detection
- Remote-Sensing Image Classification
- Topic Modeling
- Advanced Vision and Imaging
- Face and Expression Recognition
- Natural Language Processing Techniques
- Generative Adversarial Networks and Image Synthesis
- Medical Image Segmentation Techniques
- Anomaly Detection Techniques and Applications
- Image Processing Techniques and Applications
- Robotics and Sensor-Based Localization
- Image Enhancement Techniques
- Music and Audio Processing
- Gait Recognition and Analysis
- Image and Object Detection Techniques
- Text and Document Classification Technologies
- Infrared Target Detection Methodologies
- Chinese Academy of Sciences (2016-2025)
- Shandong Institute of Automation (2013-2025)
- Institute of Microelectronics (2025)
- Institute of Automation (2014-2024)
- Mitsubishi Electric (United States) (2024)
- University of Chinese Academy of Sciences (2018-2024)
- Beijing Academy of Artificial Intelligence (2020-2024)
- Shandong University of Traditional Chinese Medicine (2017-2024)
- Jinling Institute of Technology (2024)
- China University of Mining and Technology (2024)
In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of the traditional dilated FCN, which model the semantic interdependencies in spatial and channel dimensions, respectively. The position attention module...
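As a rough illustration of the position attention idea described above, here is a minimal PyTorch sketch of a spatial self-attention module over a feature map. The reduced query/key width (C // 8) and the zero-initialized residual weight gamma are common choices for this kind of module, assumed here for illustration rather than taken from the abstract.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Aggregates the feature at each position by a weighted sum over all positions."""
    def __init__(self, channels: int):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)       # (B, HW, C//8)
        k = self.key(x).flatten(2)                         # (B, C//8, HW)
        attn = torch.softmax(q @ k, dim=-1)                # (B, HW, HW) spatial affinities
        v = self.value(x).flatten(2)                       # (B, C, HW)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)  # attend over all positions
        return self.gamma * out + x                        # residual fusion

x = torch.randn(2, 64, 32, 32)
print(PositionAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```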
In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which feature selection is performed simultaneously. The joint learning of the cluster labels and the feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce redundant or even noisy features,...
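The abstract describes a joint optimization; the NumPy/scikit-learn sketch below approximates it in two decoupled stages (spectral clustering for pseudo cluster labels, then a ridge regression scored by row norms of the transform). This conveys the selection criterion but deliberately omits NDFS's joint l2,1-regularized objective and the nonnegative constraint; the dataset, lam, and k are illustrative choices.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import load_iris

X = load_iris().data                      # (n_samples, n_features)
n_clusters, lam, k = 3, 1.0, 2            # illustrative hyperparameters

# Step 1: pseudo cluster labels via spectral clustering.
labels = SpectralClustering(n_clusters=n_clusters, random_state=0).fit_predict(X)
Y = np.eye(n_clusters)[labels]            # one-hot cluster indicator matrix

# Step 2: ridge regression X W ~ Y, then rank features by row norms of W
# (a stand-in for the row-sparse transform NDFS learns jointly).
d = X.shape[1]
W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
scores = np.linalg.norm(W, axis=1)
selected = np.argsort(scores)[::-1][:k]
print("selected feature indices:", selected)
```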
In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. How to efficiently exploit context is essential for pixel-level recognition. To address this issue, we adaptively capture contextual information based on a relation-aware attention mechanism. Specifically, we append two types of attention modules on top of a dilated fully convolutional network (FCN), which model the contextual dependencies in spatial and channel dimensions, respectively. In these modules, we adopt a self-attention mechanism...
Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization was previously only applied outside SA, we introduce a novel method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limit that the Transformer fails to model the geometry structure of the input...
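To make "normalization inside SA" concrete, here is a minimal single-head sketch that applies LayerNorm to the query activations inside the attention block; the choice of LayerNorm on queries is an assumption standing in for the paper's actual reparameterization.

```python
import torch
import torch.nn as nn

class NormalizedSelfAttention(nn.Module):
    """Single-head SA with normalization applied to hidden activations inside
    the module (here: the queries), rather than only outside it."""
    def __init__(self, dim: int):
        super().__init__()
        self.q, self.k, self.v = (nn.Linear(dim, dim) for _ in range(3))
        self.norm = nn.LayerNorm(dim)  # normalization *inside* SA
        self.scale = dim ** -0.5

    def forward(self, x):
        q = self.norm(self.q(x))       # normalize hidden activations
        k, v = self.k(x), self.v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

x = torch.randn(2, 10, 64)
print(NormalizedSelfAttention(64)(x).shape)  # torch.Size([2, 10, 64])
```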
Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To this end, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, called SDN units, are stacked one by one to integrate contextual information and bring fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training...
Recent works attempt to improve scene parsing performance by exploring different levels of contexts, and typically train a well-designed convolutional network to exploit useful contexts across all pixels equally. However, in this paper, we find that context demands vary across pixels and regions in each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture pixel-aware contexts by a competitive fusion of global and local context according to per-pixel demands. Specifically, for a given pixel, the global context demand is...
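Below is a toy sketch of per-pixel competitive fusion between a global (pooled) feature and a local convolutional feature; the sigmoid gate standing in for ACNet's learned per-pixel demand is an assumption for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class AdaptiveContextFusion(nn.Module):
    """Per-pixel gated fusion of global and local context: each pixel decides
    how much image-level context it needs versus local detail."""
    def __init__(self, channels: int):
        super().__init__()
        self.gap = nn.AdaptiveAvgPool2d(1)                    # global context
        self.local = nn.Conv2d(channels, channels, 3, padding=1)
        self.gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, x):
        g = self.gap(x).expand_as(x)   # broadcast global feature to all pixels
        l = self.local(x)              # local context
        a = self.gate(x)               # per-pixel demand in [0, 1]
        return a * g + (1 - a) * l     # competitive fusion

x = torch.randn(2, 64, 32, 32)
print(AdaptiveContextFusion(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```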
In person re-identification (re-ID), extracting part-level features from person images has been verified to be crucial for offering fine-grained information. Most of the existing CNN-based methods only locate human parts coarsely, or rely on pretrained human parsing models and fail in locating identifiable nonhuman parts (e.g., a knapsack). In this article, we introduce an alignment scheme into the transformer architecture for the first time and propose the Auto-Aligned transformer (AAformer) to automatically locate both kinds of parts at the patch level. We introduce "part tokens...
In this paper, we propose an adversarial learning network for the task of multi-style image captioning (MSCap), using a standard factual caption dataset and a multi-stylized language corpus without paired images. How to learn a single model from such unpaired data is a challenging and necessary task, yet one rarely studied in previous works. The proposed framework mainly includes four contributive modules following a typical image encoder. First, a style-dependent caption generator outputs a sentence conditioned on an encoded image and a specified...
Image captioning attempts to generate a sentence composed of several linguistic words that describe the objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model the object semantics and geometry with Graph Convolutional Networks (GCNs) and to fully exploit the alignment between linguistic words and visual semantic units for image captioning. Particularly, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a (geometrical)...
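To make the graph-based encoding concrete, here is a generic graph-convolution layer over a handful of "visual semantic unit" nodes. The symmetric normalization and the toy adjacency matrix are standard GCN ingredients assumed for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """One graph-convolution step: each node (object / attribute / interaction)
    aggregates its neighbors' features through a normalized adjacency matrix."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # Symmetric normalization: D^{-1/2} (A + I) D^{-1/2}
        a_hat = adj + torch.eye(adj.size(0))
        d_inv_sqrt = a_hat.sum(-1).clamp(min=1e-6).pow(-0.5)
        a_norm = d_inv_sqrt[:, None] * a_hat * d_inv_sqrt[None, :]
        return torch.relu(self.proj(a_norm @ h))

# 4 nodes (e.g., two objects, an attribute, a relation), 16-d features.
h = torch.randn(4, 16)
adj = torch.tensor([[0, 1, 1, 0], [1, 0, 0, 1],
                    [1, 0, 0, 0], [0, 1, 0, 0]], dtype=torch.float)
print(GCNLayer(16, 32)(h, adj).shape)  # torch.Size([4, 32])
```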
In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose the CaPtion TransformeR (CPTR), which takes sequentialized raw images as input to the Transformer. Compared with the "CNN+Transformer" design paradigm, our model can capture global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model, which surpasses conventional methods on the MSCOCO dataset. Besides, we provide detailed visualizations...
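A compact sketch of the convolution-free encoder path: the image is sequentialized into patch tokens (via a patch-sized projection) and fed to a stock Transformer encoder. Patch size, width, and depth here are illustrative assumptions, and CPTR's caption decoder is omitted.

```python
import torch
import torch.nn as nn

class PatchSequencer(nn.Module):
    """Sequentializes a raw image into patch tokens for a Transformer encoder,
    so global context is available at every layer."""
    def __init__(self, img=224, patch=16, dim=512):
        super().__init__()
        n = (img // patch) ** 2
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, n, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, img):
        tokens = self.embed(img).flatten(2).transpose(1, 2)  # (B, N, dim)
        return self.encoder(tokens + self.pos)

x = torch.randn(1, 3, 224, 224)
print(PatchSequencer()(x).shape)  # torch.Size([1, 196, 512])
```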
Hashing has shown great potential in large-scale image retrieval due to its storage and computation efficiency, especially the recent deep supervised hashing methods. To achieve promising performance, these methods require a large amount of training data from different classes. However, when images of new categories emerge, existing methods have to retrain the CNN model and generate hash codes for all the database images again, which is impractical for a large-scale retrieval system. In this paper, we propose a novel framework, called Deep Incremental Hashing Network...
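A toy sketch of the incremental idea: the hash codes already stored for old database images stay frozen, and only a small hashing head is trained on images of new categories to agree with them, so the database never needs re-encoding. The similarity loss, dimensions, and names below are illustrative assumptions, not the paper's objective.

```python
import torch
import torch.nn as nn

bits, n_old, n_new = 32, 100, 20
old_codes = torch.sign(torch.randn(n_old, bits))            # frozen database codes
new_feats = torch.randn(n_new, 128)                         # CNN features of new images
sim = torch.randint(0, 2, (n_new, n_old)).float() * 2 - 1   # +1 similar / -1 dissimilar

hash_layer = nn.Linear(128, bits)
opt = torch.optim.Adam(hash_layer.parameters(), lr=1e-2)
for step in range(200):
    u = torch.tanh(hash_layer(new_feats))                   # relaxed new codes
    # Code inner products should match semantic similarity; old codes stay fixed.
    loss = ((u @ old_codes.t() / bits - sim) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

new_codes = torch.sign(torch.tanh(hash_layer(new_feats)))
print(new_codes.shape)  # torch.Size([20, 32])
```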
In intelligent traffic systems, real-time and accurate detection of vehicles in image and video data is important and challenging work. Especially in situations with complex scenes, different vehicle models, and high density, it is difficult to accurately locate and classify vehicles in traffic flows. Therefore, we propose a single-stage deep neural network, YOLOv3-DL, based on the TensorFlow framework, to address this problem. The structure is optimized by introducing the idea of spatial pyramid pooling, and then the loss function...
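The abstract mentions introducing spatial pyramid pooling; below is a generic SPP block (sketched in PyTorch rather than the paper's TensorFlow) with the kernel sizes commonly used in YOLOv3-SPP, which may differ from YOLOv3-DL's exact configuration.

```python
import torch
import torch.nn as nn

class SPPBlock(nn.Module):
    """YOLOv3-style spatial pyramid pooling: concatenate max-poolings with
    different receptive fields so multi-scale context reaches the detector."""
    def __init__(self, kernel_sizes=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)

x = torch.randn(1, 256, 13, 13)
print(SPPBlock()(x).shape)  # torch.Size([1, 1024, 13, 13])
```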
Image captioning is a challenging task, and it is important for the machine to understand the meaning of an image well. In recent years, models usually use long short-term memory (LSTM) as the decoder to generate the sentence, and these models show excellent performance. Although LSTM can memorize dependencies, its structure is complicated and inherently sequential across time. To address these issues, recent works have shown the benefits of the Transformer for machine translation. Inspired by their success, we develop a Captioning Transformer (CT) model...
The recently proposed Vision Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT) that progressively pools visual tokens to shrink the sequence length and hence reduce the computational cost, analogous to feature map downsampling in...
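A minimal sketch of hierarchical token pooling: max-pool along the token axis between Transformer stages, roughly halving the sequence length the way CNN feature maps are downsampled. Stage width, depth, and the pooling parameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class PooledStage(nn.Module):
    """A Transformer stage followed by 1D max-pooling over the token axis,
    roughly halving the sequence length between stages."""
    def __init__(self, dim=192, depth=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=3, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, tokens):                 # (B, N, dim)
        tokens = self.blocks(tokens)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)  # (B, ~N/2, dim)

t = torch.randn(2, 196, 192)
for stage in [PooledStage(), PooledStage()]:
    t = stage(t)
print(t.shape)  # torch.Size([2, 49, 192])
```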
Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over long sequence representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that the early layers still...
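A sketch of the "less attention" idea under the assumption that early, long-sequence stages use attention-free MLP blocks while later, shorter stages keep standard self-attention; the exact block design here is illustrative, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MLPBlock(nn.Module):
    """Attention-free block for early, high-resolution stages: an MLP over each
    token avoids the quadratic cost of attention on long sequences."""
    def __init__(self, dim=96, hidden=384):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, dim))

    def forward(self, x):                      # (B, N, dim)
        return x + self.mlp(self.norm(x))

def build_stages(dims=(96, 192), early=2, late=2):
    """Early stages: pure MLP blocks; later, shorter stages: standard attention."""
    early_blocks = nn.Sequential(*[MLPBlock(dims[0]) for _ in range(early)])
    layer = nn.TransformerEncoderLayer(d_model=dims[1], nhead=4, batch_first=True)
    late_blocks = nn.TransformerEncoder(layer, num_layers=late)
    return early_blocks, late_blocks

early_blocks, _ = build_stages()
print(early_blocks(torch.randn(2, 3136, 96)).shape)  # long sequence, no attention
```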
Multi-agent collaborative perception, as a potential application of vehicle-to-everything communication, could significantly improve the performance of autonomous vehicles over single-agent perception. However, several challenges remain in achieving pragmatic information sharing in this emerging research area. In this paper, we propose SCOPE, a novel framework that aggregates spatio-temporal awareness characteristics across on-road agents in an end-to-end manner. Specifically, SCOPE has three distinct...
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely-studied vision-language models, VALOR jointly models relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single-modality representations and a decoder for conditional text generation. We design two pretext tasks to pretrain the model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language,...
Image annotation has been an active research topic in recent years due to its potential impact on both image understanding and web retrieval. Existing relevance-model-based methods perform annotation by maximizing the joint probability of images and words, which is calculated as an expectation over training images. However, the semantic gap and the dependence on training data restrict their performance and scalability. In this paper, a dual cross-media relevance model (DCMRM) is proposed for automatic annotation, which estimates the joint probability by an expectation over words in a pre-defined...
This paper tries to separate fine-grained images by jointly learning the encoding parameters and codebooks through low-rank sparse coding (LRSC) with general and class-specific codebook generation. Instead of treating each local feature independently, we encode the local features within a spatial region jointly by LRSC. This ensures that spatially nearby features with similar visual characteristics are encoded with correlated parameters. In this way, we can make the encoding more consistent for image representation. Besides, we also learn a number of class-specific codebooks in combination...