Jing Liu

ORCID: 0000-0003-0903-9131
Research Areas
  • Advanced Image and Video Retrieval Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Neural Network Applications
  • Image Retrieval and Classification Techniques
  • Video Surveillance and Tracking Methods
  • Human Pose and Action Recognition
  • Video Analysis and Summarization
  • Visual Attention and Saliency Detection
  • Remote-Sensing Image Classification
  • Topic Modeling
  • Advanced Vision and Imaging
  • Face and Expression Recognition
  • Natural Language Processing Techniques
  • Generative Adversarial Networks and Image Synthesis
  • Medical Image Segmentation Techniques
  • Anomaly Detection Techniques and Applications
  • Image Processing Techniques and Applications
  • Robotics and Sensor-Based Localization
  • Image Enhancement Techniques
  • Music and Audio Processing
  • Gait Recognition and Analysis
  • Image and Object Detection Techniques
  • Text and Document Classification Technologies
  • Infrared Target Detection Methodologies

Affiliations

Chinese Academy of Sciences
2016-2025

Shandong Institute of Automation
2013-2025

Institute of Microelectronics
2025

Institute of Automation
2014-2024

Mitsubishi Electric (United States)
2024

University of Chinese Academy of Sciences
2018-2024

Beijing Academy of Artificial Intelligence
2020-2024

Shandong University of Traditional Chinese Medicine
2017-2024

Jinling Institute of Technology
2024

China University of Mining and Technology
2024

Publications

In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a traditional dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions respectively. The position attention module...

10.1109/cvpr.2019.00326 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
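
A minimal PyTorch sketch of the position attention module described above, computing pairwise affinities between all spatial positions and re-aggregating features through them. The c/8 reduction and the zero-initialized residual weight follow common implementations of this design, but they are assumptions here, not the authors' released code.

```python
import torch
import torch.nn as nn

class PositionAttention(nn.Module):
    """Spatial self-attention over a conv feature map (sketch of the
    position attention module; layer widths are assumptions)."""
    def __init__(self, channels):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // 8, 1)
        self.key = nn.Conv2d(channels, channels // 8, 1)
        self.value = nn.Conv2d(channels, channels, 1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (b, hw, c//8)
        k = self.key(x).flatten(2)                     # (b, c//8, hw)
        attn = torch.softmax(q @ k, dim=-1)            # (b, hw, hw) pixel affinities
        v = self.value(x).flatten(2)                   # (b, c, hw)
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)
        return self.gamma * out + x                    # residual fusion with the input
```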

In this paper, a new unsupervised learning algorithm, namely Nonnegative Discriminative Feature Selection (NDFS), is proposed. To exploit the discriminative information in unsupervised scenarios, we perform spectral clustering to learn the cluster labels of the input samples, during which feature selection is performed simultaneously. The joint learning of the cluster labels and the feature selection matrix enables NDFS to select the most discriminative features. To learn more accurate cluster labels, a nonnegative constraint is explicitly imposed on the class indicators. To reduce redundant or even noisy features,...

10.1609/aaai.v26i1.8289 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2012
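
From the description above, the joint objective NDFS optimizes can be reconstructed roughly as follows. The notation is an assumption (X the data matrix, W the feature selection matrix, F the nonnegative scaled cluster indicator matrix, L the graph Laplacian), so treat this as a sketch rather than the paper's exact formulation:

```latex
\min_{F,\,W}\ \operatorname{Tr}(F^{\top} L F)
  + \beta \left( \lVert X^{\top} W - F \rVert_F^2 + \alpha \lVert W \rVert_{2,1} \right)
\quad \text{s.t.}\quad F^{\top} F = I,\ F \ge 0
```

The first term is the spectral clustering objective, the regression term ties the selection matrix to the learned labels, and the l2,1-norm on W zeroes out entire rows, which is what discards redundant or noisy features.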

In this article, we propose a Dual Relation-aware Attention Network (DRANet) to handle the task of scene segmentation. How to efficiently exploit context is essential for pixel-level recognition. To address the issue, we adaptively capture contextual information based on a relation-aware attention mechanism. Specifically, we append two types of attention modules on top of a dilated fully convolutional network (FCN), which model contextual dependencies in the spatial and channel dimensions, respectively. In the attention modules, we adopt a self-attention mechanism...

10.1109/tnnls.2020.3006524 article EN IEEE Transactions on Neural Networks and Learning Systems 2020-08-03
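
For the channel-dimension counterpart mentioned above, a compact sketch follows, where the relation (attention) matrix is computed between channel maps instead of spatial positions. The structure is inferred from the abstract, not taken from the authors' implementation.

```python
import torch
import torch.nn as nn

class ChannelRelationAttention(nn.Module):
    """Channel-wise relation module sketch: self-attention between
    channel maps of a conv feature map."""
    def __init__(self):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))  # learned residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        f = x.flatten(2)                                      # (b, c, hw)
        attn = torch.softmax(f @ f.transpose(1, 2), dim=-1)   # (b, c, c) channel affinities
        out = (attn @ f).view(b, c, h, w)
        return self.gamma * out + x
```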

Self-attention (SA) networks have shown profound value in image captioning. In this paper, we improve SA from two aspects to promote the performance of image captioning. First, we propose Normalized Self-Attention (NSA), a reparameterization of SA that brings the benefits of normalization inside SA. While normalization is previously only applied outside SA, we introduce a novel method and demonstrate that it is both possible and beneficial to perform it on the hidden activations inside SA. Second, to compensate for the major limit of the Transformer that it fails to model the geometry structure of the input...

10.1109/cvpr42600.2020.01034 article EN 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
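
A toy sketch of the first idea, moving normalization inside self-attention by normalizing hidden activations before the similarity computation. Applying LayerNorm to the queries is one plausible illustration and an assumption here, since the abstract does not pin down the exact reparameterization.

```python
import torch
import torch.nn as nn

def normalized_self_attention(q, k, v, norm: nn.LayerNorm):
    """Scaled dot-product attention with normalization moved inside:
    query activations are normalized before similarities are computed."""
    q = norm(q)                                           # normalization inside SA
    scores = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

# usage sketch
d = 64
norm = nn.LayerNorm(d)
q = k = v = torch.randn(2, 10, d)                         # (batch, length, dim)
out = normalized_self_attention(q, k, v, norm)            # (2, 10, 64)
```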

Recent progress in semantic segmentation has been driven by improving the spatial resolution under Fully Convolutional Networks (FCNs). To address this problem, we propose a Stacked Deconvolutional Network (SDN) for semantic segmentation. In SDN, multiple shallow deconvolutional networks, which are called SDN units, are stacked one by one to integrate contextual information and bring fine recovery of localization information. Meanwhile, inter-unit and intra-unit connections are designed to assist network training...

10.1109/tip.2019.2895460 article EN IEEE Transactions on Image Processing 2019-01-25
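
One SDN unit, as described above, is a shallow encoder-decoder. A hedged sketch follows, with the inter-unit and intra-unit skip connections omitted and the channel width chosen arbitrarily for illustration.

```python
import torch.nn as nn

class SDNUnit(nn.Module):
    """One shallow encoder-decoder ('deconvolutional') unit; the paper
    stacks several such units, with skip connections (omitted here)
    assisting training. Widths are illustrative assumptions."""
    def __init__(self, ch=64):
        super().__init__()
        self.down = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=2, padding=1), nn.ReLU(inplace=True))
        self.up = nn.Sequential(
            nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.up(self.down(x))  # downsample, then recover resolution
```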

Recent works attempt to improve scene parsing performance by exploring different levels of contexts, and typically train a well-designed convolutional network to exploit useful contexts across all pixels equally. However, in this paper, we find that the context demands vary across different pixels or regions of each image. Based on this observation, we propose an Adaptive Context Network (ACNet) to capture pixel-aware contexts by a competitive fusion of global and local context according to per-pixel demands. Specifically, when given a pixel, the global context demand is...

10.1109/iccv.2019.00685 article EN 2019 IEEE/CVF International Conference on Computer Vision (ICCV) 2019-10-01
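
One way to read the "competitive fusion" above is a per-pixel gate that arbitrates between a global context branch and the local features. This sigmoid-gated sketch is an illustrative assumption rather than the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedContextFusion(nn.Module):
    """Per-pixel competitive fusion of global and local context (sketch)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Conv2d(channels, 1, kernel_size=1)  # per-pixel demand estimate

    def forward(self, x):
        b, c, h, w = x.shape
        global_ctx = F.adaptive_avg_pool2d(x, 1).expand(b, c, h, w)  # global branch
        g = torch.sigmoid(self.gate(x))                    # (b, 1, h, w) gate in [0, 1]
        return g * global_ctx + (1 - g) * x                # competitive fusion
```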

In person re-identification (re-ID), extracting part-level features from person images has been verified to be crucial to offer fine-grained information. Most of the existing CNN-based methods only locate human parts coarsely, or rely on pretrained human parsing models and fail in locating identifiable nonhuman parts (e.g., knapsack). In this article, we introduce an alignment scheme into the transformer architecture for the first time and propose the auto-aligned transformer (AAformer) to automatically locate both the human parts and nonhuman ones at patch level. We introduce the "Part tokens...

10.1109/tnnls.2023.3301856 article EN IEEE Transactions on Neural Networks and Learning Systems 2023-08-25

10.1016/j.patcog.2008.04.012 article EN Pattern Recognition 2008-05-04

In this paper, we propose an adversarial learning network for the task of multi-style image captioning (MSCap) with a standard factual caption dataset and a multi-stylized language corpus without paired images. How to learn a single model from such unpaired data is a challenging and necessary task, which is rarely studied in previous works. The proposed framework mainly includes four contributive modules following a typical image encoder. First, a style-dependent caption generator outputs a sentence conditioned on an encoded image and a specified...

10.1109/cvpr.2019.00433 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01

Image captioning attempts to generate a sentence composed of several linguistic words, which are used to describe objects, attributes, and interactions in an image, denoted as visual semantic units in this paper. Based on this view, we propose to explicitly model the object semantics and geometry based on Graph Convolutional Networks (GCNs), and to fully exploit the alignment between linguistic words and visual semantic units for image captioning. Particularly, we construct a semantic graph and a geometry graph, where each node corresponds to a visual semantic unit, i.e., an object, an attribute, or a (geometrical)...

10.1145/3343031.3350943 preprint EN Proceedings of the 27th ACM International Conference on Multimedia 2019-10-15
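
The graph convolution that the abstract applies to these graphs can be illustrated with a minimal Kipf-and-Welling-style layer. Node features for objects, attributes, and relations, and a normalized adjacency matrix, are assumed to be given.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Minimal graph-convolution layer over visual semantic units (sketch)."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, h, adj):
        # h:   (num_units, in_dim) node features for objects/attributes/relations
        # adj: (num_units, num_units) normalized adjacency of the graph
        return torch.relu(adj @ self.linear(h))
```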

In this paper, we consider the image captioning task from a new sequence-to-sequence prediction perspective and propose the CaPtion TransformeR (CPTR), which takes sequentialized raw images as input to the Transformer. Compared to the "CNN+Transformer" design paradigm, our model can exploit global context at every encoder layer from the beginning and is totally convolution-free. Extensive experiments demonstrate the effectiveness of the proposed model, which surpasses conventional methods on the MSCOCO dataset. Besides, we provide detailed visualizations...

10.48550/arxiv.2101.10804 preprint EN arXiv (Cornell University) 2021-01-01
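
The "sequentialized raw images" input can be sketched as a standard patch embedding that turns an image into a token sequence for the encoder. The patch size and embedding width are assumptions, and the strided convolution is just the usual implementation shortcut for a per-patch linear projection.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Splits an image into fixed-size patches and linearly projects each
    one, producing the token sequence a Transformer encoder consumes."""
    def __init__(self, patch=16, in_ch=3, dim=768):
        super().__init__()
        # stride == kernel size: each patch is projected independently
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)

    def forward(self, img):                  # (b, 3, H, W)
        x = self.proj(img)                   # (b, dim, H/16, W/16)
        return x.flatten(2).transpose(1, 2)  # (b, num_patches, dim)
```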

Hashing has shown great potential in large-scale image retrieval due to its storage and computation efficiency, especially the recent deep supervised hashing methods. To achieve promising performance, these methods require a large amount of training data from different classes. However, when images of new categories emerge, existing deep hashing methods have to retrain the CNN model and generate hash codes for all the database images again, which is impractical for a large-scale retrieval system. In this paper, we propose a novel deep hashing framework, called Deep Incremental Hashing Network...

10.1109/cvpr.2019.00928 article EN 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019-06-01
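
The incremental idea can be illustrated as a similarity-preserving loss in which the hash codes already stored for the original database are frozen, and only codes for new-category images are learned. This toy loss is an assumption for illustration, not the paper's exact objective.

```python
import torch

def incremental_hash_loss(new_codes, old_codes, sim):
    """new_codes: (n, bits) tanh outputs for new images (learnable);
    old_codes:   (m, bits) frozen {-1, +1} codes of the existing database;
    sim:         (n, m) pairwise labels, +1 similar / -1 dissimilar."""
    bits = new_codes.shape[1]
    inner = new_codes @ old_codes.t() / bits   # scaled inner products in [-1, 1]
    return ((inner - sim) ** 2).mean()         # push codes toward label agreement
```

Because old_codes never change, the existing database needs no re-encoding when new categories arrive, which is the practical point the abstract makes.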

In the intelligent traffic system, real-time and accurate detection of vehicles in image and video data is very important and challenging work. Especially in situations with complex scenes, different vehicle models, and high density, it is difficult to accurately locate and classify these vehicles during traffic flows. Therefore, we propose a single-stage deep neural network, YOLOv3-DL, based on the Tensorflow framework, to improve on this problem. The network structure is optimized by introducing the idea of spatial pyramid pooling, and then the loss function...

10.3390/app10093079 article EN cc-by Applied Sciences 2020-04-28
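
The spatial pyramid pooling idea mentioned above concatenates max-pooled copies of a feature map at several kernel sizes, enlarging the receptive field without extra parameters. The kernel sizes below follow the common YOLOv3-SPP choice and are assumptions here.

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    """Spatial pyramid pooling block: concatenates the input with
    max-pooled copies at several kernel sizes (stride 1 keeps the
    spatial size, so outputs can be stacked along channels)."""
    def __init__(self, kernels=(5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernels)

    def forward(self, x):
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)  # 4x channels
```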

Image captioning is a challenging task. Meanwhile, it is important for the machine to understand the meaning of an image better. In recent years, models usually use long short-term memory (LSTM) as the decoder to generate the sentence, and these models show excellent performance. Although LSTM can memorize dependencies, its structure has the problems of being complicated and inherently sequential across time. To address these issues, recent works have shown the benefits of the Transformer for machine translation. Inspired by their success, we develop a Captioning Transformer (CT) model...

10.3390/app8050739 article EN cc-by Applied Sciences 2018-05-07

In this paper, we address the scene segmentation task by capturing rich contextual dependencies based on the self-attention mechanism. Unlike previous works that capture contexts by multi-scale feature fusion, we propose a Dual Attention Network (DANet) to adaptively integrate local features with their global dependencies. Specifically, we append two types of attention modules on top of a traditional dilated FCN, which model the semantic interdependencies in the spatial and channel dimensions respectively. The position attention module...

10.48550/arxiv.1809.02983 preprint EN arXiv (Cornell University) 2018-01-01

The recently proposed Visual image Transformers (ViT) with pure attention have achieved promising performance on image recognition tasks, such as image classification. However, the routine of the current ViT model is to maintain a full-length patch sequence during inference, which is redundant and lacks hierarchical representation. To this end, we propose a Hierarchical Visual Transformer (HVT), which progressively pools visual tokens to shrink the sequence length and hence reduces the computational cost, analogous to feature map downsampling in...

10.1109/iccv48922.2021.00043 article EN 2021 IEEE/CVF International Conference on Computer Vision (ICCV) 2021-10-01
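
The progressive token pooling can be sketched as 1D max pooling applied to the patch sequence between transformer stages, halving the sequence length the way CNN feature maps are downsampled. Kernel and stride values are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TokenPool(nn.Module):
    """Shrinks a visual token sequence with 1D max pooling between
    transformer stages, analogous to feature-map downsampling in CNNs."""
    def __init__(self, kernel=3, stride=2):
        super().__init__()
        self.pool = nn.MaxPool1d(kernel, stride, padding=1)

    def forward(self, tokens):                             # (batch, length, dim)
        return self.pool(tokens.transpose(1, 2)).transpose(1, 2)

# usage sketch: 196 patch tokens -> 98 after one pooling stage
out = TokenPool()(torch.randn(2, 196, 384))                # (2, 98, 384)
```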

Transformers have become one of the dominant architectures in deep learning, particularly as a powerful alternative to convolutional neural networks (CNNs) in computer vision. However, Transformer training and inference in previous works can be prohibitively expensive due to the quadratic complexity of self-attention over long sequence representations, especially for high-resolution dense prediction tasks. To this end, we present a novel Less attention vIsion Transformer (LIT), building upon the fact that the early layers still...

10.1609/aaai.v36i2.20099 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2022-06-28

Multi-agent collaborative perception, as a potential application of vehicle-to-everything communication, could significantly improve the performance of autonomous vehicles over single-agent perception. However, several challenges remain in achieving pragmatic information sharing in this emerging research area. In this paper, we propose SCOPE, a novel collaborative perception framework that aggregates spatio-temporal awareness characteristics across on-road agents in an end-to-end manner. Specifically, SCOPE has three distinct...

10.1109/iccv51070.2023.02137 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model (VALOR) for multimodal understanding and generation. Unlike widely studied vision-language models, VALOR jointly models the relationships among vision, audio, and language in an end-to-end manner. It consists of three separate encoders for single-modality representations and a decoder for multimodal conditional text generation. We design two pretext tasks to pretrain the VALOR model: Multimodal Grouping Alignment (MGA) and Multimodal Grouping Captioning (MGC). MGA projects vision, language,...

10.1109/tpami.2024.3479776 article EN IEEE Transactions on Pattern Analysis and Machine Intelligence 2024-01-01

Image annotation has been an active research topic in recent years due to its potential impact on both image understanding and web image retrieval. Existing relevance-model-based methods perform annotation by maximizing the joint probability of images and words, which is calculated as an expectation over the training images. However, the semantic gap and the dependence on training data restrict their performance and scalability. In this paper, a dual cross-media relevance model (DCMRM) is proposed for automatic image annotation, which estimates the joint probability as an expectation over the words in a pre-defined...

10.1145/1291233.1291380 article EN Proceedings of the 15th ACM International Conference on Multimedia 2007-09-29

This paper tries to separate fine-grained images by jointly learning the encoding parameters and codebooks through low-rank sparse coding (LRSC) with general and class-specific codebook generation. Instead of treating each local feature independently, we encode the local features within a spatial region jointly by LRSC. This ensures that spatially nearby local features with similar visual characters are encoded by correlated parameters. In this way, we can make the encoding parameters more consistent for image representation. Besides, we also learn the number of codebooks in combination...

10.1109/tnnls.2016.2545112 article EN IEEE Transactions on Neural Networks and Learning Systems 2016-04-07