- Advanced Image and Video Retrieval Techniques
- Advanced Neural Network Applications
- Domain Adaptation and Few-Shot Learning
- Multimodal Machine Learning Applications
- Video Analysis and Summarization
- Video Surveillance and Tracking Methods
- Human Pose and Action Recognition
- Advanced Vision and Imaging
- Visual Attention and Saliency Detection
- Generative Adversarial Networks and Image Synthesis
- Robotics and Sensor-Based Localization
- Image Enhancement Techniques
- Remote-Sensing Image Classification
- Advanced Image Processing Techniques
- Brain Tumor Detection and Classification
- Image Retrieval and Classification Techniques
- Image and Object Detection Techniques
- Industrial Vision Systems and Defect Detection
- Robot Manipulation and Learning
- Face Recognition and Analysis
- Soft Robotics and Applications
- Advanced Measurement and Detection Methods
- Machine Learning and Data Classification
- Music and Audio Processing
- Optical Network Technologies
Harbin Institute of Technology
2004-2024
Sun Yat-sen University
2021-2024
China Electronics Technology Group Corporation
2024
Chongqing Three Gorges University
2024
Central South University
2024
First Affiliated Hospital of Zhengzhou University
2023
Guangxi University
2023
Huazhong Agricultural University
2023
Chinese Academy of Tropical Agricultural Sciences
2023
South Subtropical Crops Research Institute
2023
This paper aims to highlight vision-related tasks centered around "car", which has been largely neglected by the community in comparison to other objects. We show that there are still many interesting car-related problems and applications that are not yet well explored and researched. To facilitate future research, we present our on-going effort in collecting a large-scale dataset, "CompCars", which covers not only different car views, but also their internal and external parts, and rich attributes. Importantly, the dataset...
In this paper we present a new computer vision task, named video instance segmentation. The goal of this task is the simultaneous detection, segmentation and tracking of instances in videos. In other words, it is the first time that the image instance segmentation problem is extended to the video domain. To facilitate research on this new task, we propose a large-scale benchmark called YouTube-VIS, which consists of 2,883 high-resolution YouTube videos, a 40-category label set and 131k high-quality instance masks. In addition, we propose a novel algorithm, MaskTrack R-CNN, for this task. Our method introduces...
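The abstract is cut off before describing how instances are associated across frames. Below is a minimal sketch of the general idea of an embedding-based tracking head on top of a detector: each detection gets an embedding and is matched to a memory of previously seen instances by similarity. All names (`TrackingHead`, `feat_dim`, the "new instance" column) are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

class TrackingHead(nn.Module):
    """Associates current detections with a memory of tracked instances."""
    def __init__(self, feat_dim: int = 256, embed_dim: int = 128):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(feat_dim, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, roi_feats: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        """roi_feats: (N, feat_dim) features of current detections.
        memory: (M, embed_dim) embeddings of previously tracked instances.
        Returns (N, M+1) association logits; column 0 means "start a new track"."""
        cur = self.embed(roi_feats)                                 # (N, E)
        sim = cur @ memory.t()                                      # (N, M) dot-product similarity
        new_obj = torch.zeros(cur.size(0), 1, device=cur.device)    # fixed logit for "new instance"
        return torch.cat([new_obj, sim], dim=1)

# Toy usage: 3 detections in the current frame, 5 instances in memory.
head = TrackingHead()
logits = head(torch.randn(3, 256), torch.randn(5, 128))
assign = logits.argmax(dim=1)   # 0 -> new track, k > 0 -> existing instance k
```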
Video object segmentation targets segmenting a specific object throughout a video sequence when given only an annotated first frame. Recent deep learning based approaches find it effective to fine-tune a general-purpose segmentation model on the first frame using hundreds of iterations of gradient descent. Despite the high accuracy that these methods achieve, the fine-tuning process is inefficient and fails to meet the requirements of real-world applications. We propose a novel approach that uses a single forward pass to adapt the model to the appearance of the specific object...
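A minimal sketch of single-forward-pass adaptation via feature modulation, in the spirit of the approach above: a small "modulator" looks at the annotated first frame once and emits channel-wise scales that recalibrate the segmentation backbone's features for that object, with no gradient-descent fine-tuning. Module and dimension names here are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Modulator(nn.Module):
    """Maps the first frame + mask to per-channel scales in one forward pass."""
    def __init__(self, channels: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(4, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, channels)

    def forward(self, frame: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        x = torch.cat([frame, mask], dim=1)          # (B, 4, H, W): RGB + annotation mask
        z = self.encoder(x).flatten(1)               # (B, 64) global descriptor of the object
        return torch.sigmoid(self.fc(z))             # (B, C) channel-wise scales

def modulate(features: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Broadcast (B, C) scales over the spatial dims of (B, C, H, W) features.
    return features * scales[:, :, None, None]

frame = torch.randn(1, 3, 128, 128)
mask = torch.rand(1, 1, 128, 128)
feat = torch.randn(1, 256, 32, 32)                   # backbone features of a later frame
adapted = modulate(feat, Modulator()(frame, mask))   # object-specific features, no fine-tuning
```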
Learning long-term spatial-temporal features is critical for many video analysis tasks. However, existing video segmentation methods predominantly rely on static image segmentation techniques, and methods capturing temporal dependency have to depend on pretrained optical flow models, leading to suboptimal solutions to the problem. End-to-end sequential learning to explore spatial-temporal features is largely limited by the scale of available video segmentation datasets, i.e., even the largest dataset only contains 90 short clips. To solve this problem, we build a...
Vision transformers (ViTs) have been successfully applied to image classification tasks recently. In this paper, we show that, unlike convolutional neural networks (CNNs) that can be improved by stacking more convolutional layers, the performance of ViTs saturates quickly when they are scaled to be deeper. More specifically, we empirically observe that such scaling difficulty is caused by an attention collapse issue: as the transformer goes deeper, the attention maps gradually become similar and even much the same after certain layers...
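The attention collapse effect can be quantified directly by measuring how similar the attention maps of consecutive layers are. The helper below is a self-contained, illustrative check; in practice the maps would be collected from a real ViT's softmax(QK^T / sqrt(d)) outputs rather than the random stand-ins used here.

```python
import torch
import torch.nn.functional as F

def cross_layer_similarity(attn_maps: list[torch.Tensor]) -> list[float]:
    """attn_maps: per-layer attention tensors of shape (heads, tokens, tokens).
    Returns the cosine similarity between each pair of consecutive layers."""
    sims = []
    for a, b in zip(attn_maps[:-1], attn_maps[1:]):
        sims.append(F.cosine_similarity(a.flatten(), b.flatten(), dim=0).item())
    return sims

# Toy example: 12 layers, 6 heads, 197 tokens (CLS + 14x14 patches).
torch.manual_seed(0)
maps = [torch.softmax(torch.randn(6, 197, 197), dim=-1) for _ in range(12)]
print(cross_layer_similarity(maps))  # near-1.0 values in deep layers would signal collapse
```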
We introduce a robust, real-time, high-resolution human video matting method that achieves new state-of-the-art performance. Our method is much lighter than previous approaches and can process 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080Ti GPU. Unlike most existing methods that perform matting frame-by-frame as independent images, our method uses a recurrent architecture to exploit temporal information in videos, yielding significant improvements in temporal coherence and matting quality. Furthermore, we propose a novel training strategy that enforces our network...
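A minimal sketch of the recurrent idea behind the method above: instead of treating frames independently, a small ConvGRU carries a hidden state across frames so that alpha predictions stay temporally coherent. The encoder, ConvGRU cell, and alpha head here are simplified stand-ins, not the actual architecture from the paper.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x, h):
        z, r = torch.sigmoid(self.gates(torch.cat([x, h], dim=1))).chunk(2, dim=1)
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h], dim=1)))
        return (1 - z) * h + z * h_tilde

class RecurrentMatting(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.encoder = nn.Conv2d(3, channels, 3, padding=1)
        self.gru = ConvGRUCell(channels)
        self.alpha_head = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, video):                         # video: (T, 3, H, W)
        h = torch.zeros_like(self.encoder(video[:1]))
        alphas = []
        for frame in video:                           # hidden state flows across frames
            h = self.gru(self.encoder(frame[None]), h)
            alphas.append(torch.sigmoid(self.alpha_head(h)))
        return torch.cat(alphas, dim=0)               # (T, 1, H, W) alpha mattes

alphas = RecurrentMatting()(torch.randn(8, 3, 64, 64))
```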
Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g. local feature correlations and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel...
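As a point of contrast with RANSAC verification, the sketch below rescores retrieved candidates purely from local feature correlations: each candidate is ranked by how many mutual nearest-neighbour matches its local descriptors share with the query. This scoring rule is an illustrative stand-in, not the paper's learned reranking module.

```python
import torch
import torch.nn.functional as F

def mutual_nn_score(q_desc: torch.Tensor, c_desc: torch.Tensor) -> int:
    """q_desc: (Nq, D), c_desc: (Nc, D) L2-normalized local descriptors."""
    sim = q_desc @ c_desc.t()                      # (Nq, Nc) cosine similarities
    q_to_c = sim.argmax(dim=1)                     # best candidate match per query feature
    c_to_q = sim.argmax(dim=0)                     # best query match per candidate feature
    mutual = c_to_q[q_to_c] == torch.arange(q_desc.size(0))
    return int(mutual.sum())                       # count of mutual nearest neighbours

query = F.normalize(torch.randn(200, 128), dim=1)
candidates = [F.normalize(torch.randn(200, 128), dim=1) for _ in range(5)]
scores = [mutual_nn_score(query, c) for c in candidates]
reranked = sorted(range(5), key=lambda i: -scores[i])   # strongest correlation first
```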
Dense captioning is a newly emerging computer vision topic for understanding images with dense language descriptions. The goal is to densely detect visual concepts (e.g., objects, object parts, and interactions between them) from images, labeling each with a short descriptive phrase. We identify two key challenges of dense captioning that need to be properly addressed when tackling the problem. First, dense visual concept annotations in each image are associated with highly overlapping target regions, making accurate localization of each visual concept challenging...
We present a simple and general method to train a single neural network executable at different widths (number of channels in a layer), permitting instant and adaptive accuracy-efficiency trade-offs at runtime. Instead of training individual networks with different width configurations, we train a shared network with switchable batch normalization. At runtime, the network can adjust its width on the fly according to on-device benchmarks and resource constraints, rather than downloading and offloading different models. Our trained networks, named slimmable neural networks, achieve similar...
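A minimal sketch of the switchable batch normalization idea described above: convolution weights are shared across widths, but each width keeps its own BatchNorm statistics, so the layer can be sliced to any configured width at runtime. The layer names and the width list are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableBatchNorm2d(nn.Module):
    """One BatchNorm per width; each normalizes only the active channels."""
    def __init__(self, channels: int, widths: list[float]):
        super().__init__()
        self.bns = nn.ModuleList(nn.BatchNorm2d(int(w * channels)) for w in widths)

    def forward(self, x: torch.Tensor, width_idx: int) -> torch.Tensor:
        return self.bns[width_idx](x)

class SlimmableConv(nn.Module):
    def __init__(self, in_ch: int, out_ch: int, widths=(0.25, 0.5, 1.0)):
        super().__init__()
        self.widths = widths
        self.conv = nn.Conv2d(in_ch, out_ch, 3, padding=1)     # weights shared by all widths
        self.bn = SwitchableBatchNorm2d(out_ch, list(widths))

    def forward(self, x: torch.Tensor, width_idx: int) -> torch.Tensor:
        out_ch = int(self.widths[width_idx] * self.conv.out_channels)
        w, b = self.conv.weight[:out_ch], self.conv.bias[:out_ch]   # slice the shared kernel
        return self.bn(F.conv2d(x, w, b, padding=1), width_idx)

layer = SlimmableConv(3, 64)
x = torch.randn(2, 3, 32, 32)
thin, full = layer(x, 0), layer(x, 2)   # 16 vs. 64 output channels, chosen at runtime
```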
Dense video captioning is an extremely challenging task since accurate and coherent description of events in a video requires holistic understanding of video contents as well as contextual reasoning about individual events. Most existing approaches handle this problem by first detecting event proposals from a video and then captioning a subset of the proposals. As a result, the generated sentences are prone to be redundant or inconsistent since they fail to consider temporal dependency between events. To tackle this challenge, we propose a novel dense video captioning framework, which...
Deep neural networks with adaptive configurations have gained increasing attention due to the instant and flexible deployment of these models on platforms with different resource budgets. In this paper, we investigate a novel option to achieve this goal by enabling adaptive bit-widths of weights and activations in the model. We first examine the benefits and challenges of training a quantized model with adaptive bit-widths, and then experiment with several approaches including direct adaptation, progressive training and joint training. We discover that joint training is able to produce...
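To make the adaptive bit-width idea concrete, the sketch below fake-quantizes the same weights at several precisions, as a deployment target might request at runtime. Plain uniform symmetric quantization is used as an assumed stand-in; the paper's actual training schemes (direct / progressive / joint) are not reproduced here.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake quantization to `bits` bits (values stay float)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

weights = torch.randn(64, 64)
for bits in (8, 6, 4, 2):                       # one set of weights, several precisions
    w_q = fake_quantize(weights, bits)
    err = (weights - w_q).abs().mean().item()
    print(f"{bits}-bit mean abs error: {err:.4f}")
```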
Model quantization helps to reduce the model size and latency of deep neural networks. Mixed precision quantization is favorable with customized hardware supporting arithmetic operations at multiple bit-widths to achieve maximum efficiency. We propose a novel learning-based algorithm to derive mixed precision models end-to-end under target computation constraints and model sizes. During the optimization, the bit-width of each layer / kernel in the model is in a fractional status between two consecutive bit-widths, which can be adjusted gradually. With a differentiable...
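A minimal sketch of the fractional bit-width relaxation described above: a layer's bit-width b is kept as a continuous value, and its quantized weights are a linear interpolation between the two neighbouring integer precisions floor(b) and floor(b)+1, which keeps the precision choice differentiable. `fake_quantize` is the same illustrative uniform quantizer as in the previous sketch, not the paper's exact formulation.

```python
import torch

def fake_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

def fractional_quantize(x: torch.Tensor, bits: torch.Tensor) -> torch.Tensor:
    lo = int(torch.floor(bits))
    frac = bits - lo                                   # gradient flows through `frac`
    return (1 - frac) * fake_quantize(x, lo) + frac * fake_quantize(x, lo + 1)

w = torch.randn(128, 128)
bits = torch.tensor(4.3, requires_grad=True)           # learnable per-layer bit-width
loss = (w - fractional_quantize(w, bits)).pow(2).mean()
loss.backward()
print(bits.grad)                                        # the bit-width receives a gradient
```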
Search space design is very critical to neural architecture search (NAS) algorithms. We propose a fine-grained search space comprised of atomic blocks, a minimal search unit that is much smaller than the ones used in recent NAS algorithms. This search space allows a mix of operations by composing different types of atomic blocks, while previous methods only allow homogeneous operations. Based on this search space, we propose a resource-aware framework which automatically assigns computational resources (e.g., output channel numbers) for each operation by jointly considering the performance...
High-resolution representations (HR) are essential for dense prediction tasks such as segmentation, detection, and pose estimation. Learning HR representations is typically ignored in previous Neural Architecture Search (NAS) methods that focus on image classification. This work proposes a novel NAS method, called HR-NAS, which is able to find efficient and accurate networks for different tasks by effectively encoding multiscale contextual information while maintaining high-resolution representations. In HR-NAS, we renovate...
Non-Local (NL) blocks have been widely studied in various vision tasks. However, it has rarely been explored to embed NL blocks in mobile neural networks, mainly due to the following challenges: 1) NL blocks generally have a heavy computation cost, which makes them difficult to apply where computational resources are limited, and 2) it is an open problem to discover an optimal configuration to embed NL blocks into mobile networks. We propose AutoNL to overcome the above two obstacles. Firstly, we propose a Lightweight Non-Local (LightNL) block by squeezing the transformation...
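A minimal sketch of a non-local style block slimmed down for mobile use, in the spirit of the LightNL idea above: the query/key projection is shared and channel-reduced, and keys/values are spatially downsampled before attention. This is an illustrative simplification, not the exact LightNL design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightNonLocal(nn.Module):
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.theta = nn.Conv2d(channels, channels // reduction, 1)   # shared q/k projection
        self.out = nn.Conv2d(channels, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2)                                  # (B, C/r, HW)
        k = F.avg_pool2d(self.theta(x), 2).flatten(2)                 # (B, C/r, HW/4) downsampled keys
        v = F.avg_pool2d(x, 2).flatten(2)                             # (B, C,   HW/4) downsampled values
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=-1)           # (B, HW, HW/4)
        agg = (attn @ v.transpose(1, 2)).transpose(1, 2).reshape(b, c, h, w)
        return x + self.out(agg)                                      # residual connection

y = LightNonLocal(32)(torch.randn(2, 32, 28, 28))
```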
Video inpainting aims to fill spatiotemporal "corrupted" regions with plausible content. To achieve this goal, it is necessary to find correspondences from neighbouring frames to faithfully hallucinate the unknown content. Current methods achieve this goal through attention, flow-based warping, or 3D temporal convolution. However, flow-based warping can create artifacts when the optical flow is not accurate, while temporal convolution may suffer from spatial misalignment. We propose the ‘Progressive Temporal Feature Alignment Network’, which...
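The flow-based warping mentioned above, as a self-contained helper: features of a neighbouring frame are sampled along an optical-flow field with `grid_sample`. When the flow is wrong, the sampled content lands in the wrong place, which is exactly the artifact an alignment module aims to mitigate. The flow tensor here is random and purely illustrative.

```python
import torch
import torch.nn.functional as F

def flow_warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W) neighbouring-frame features; flow: (B, 2, H, W) in pixels."""
    b, _, h, w = feat.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(feat.device)       # (2, H, W) pixel grid
    coords = base[None] + flow                                        # absolute sampling positions
    # Normalize to [-1, 1] as grid_sample expects, in (x, y) order.
    grid_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    grid_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1)                      # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)

neighbour = torch.randn(1, 64, 32, 32)
flow = torch.randn(1, 2, 32, 32) * 2.0
aligned = flow_warp(neighbour, flow)    # candidate content for the corrupted region
```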
Video instance segmentation is a complex task in which we need to detect, segment, and track each object for any given video. Previous approaches only utilize single-frame features for the detection, segmentation, and tracking of objects, and they suffer in the video scenario due to several distinct challenges such as motion blur and drastic appearance change. To eliminate the ambiguities introduced by using only single-frame features, we propose a novel comprehensive feature aggregation approach (CompFeat) to refine features at both the frame level...
This paper introduces the COCONut-PanCap dataset, created to enhance panoptic segmentation and grounded image captioning. Building upon the COCO dataset with advanced COCONut masks, this dataset aims to overcome limitations in existing image-text datasets, which often lack detailed, scene-comprehensive descriptions. The dataset incorporates fine-grained, region-level captions, ensuring consistency and improving the detail of generated captions. Through human-edited, densely annotated descriptions, it supports improved training...
Quantization reduces the computation costs of neural networks but suffers from performance degeneration. Is this accuracy drop due to the reduced model capacity, or to inefficient training during the quantization procedure? After looking into the gradient propagation process by viewing weights and intermediate activations as random variables, we discover two critical rules for efficient training. Recent approaches violate these rules, which results in degenerated convergence. To deal with this problem, we propose a simple yet effective...