- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Domain Adaptation and Few-Shot Learning
- Advanced Image and Video Retrieval Techniques
- Topic Modeling
- Advanced Vision and Imaging
- Natural Language Processing Techniques
- Human Pose and Action Recognition
- Robotics and Sensor-Based Localization
- Image Retrieval and Classification Techniques
- Advanced Image Processing Techniques
- Visual Attention and Saliency Detection
- Generative Adversarial Networks and Image Synthesis
- Image Enhancement Techniques
- Machine Learning and Data Classification
- CCD and CMOS Imaging Sensors
- COVID-19 diagnosis using AI
- Anomaly Detection Techniques and Applications
- Video Analysis and Summarization
- Cardiovascular Health and Disease Prevention
- Video Surveillance and Tracking Methods
- Adversarial Robustness in Machine Learning
- Reinforcement Learning in Robotics
- Evaluation Methods in Various Fields
- Biometric Identification and Security
Shanghai Artificial Intelligence Laboratory
2022-2024
Tsinghua University
2010-2024
Kunming University of Science and Technology
2022-2024
Beijing Academy of Artificial Intelligence
2022-2024
ShangHai JiAi Genetics & IVF Institute
2023
InternetLab
2023
Group Sense (China)
2020-2022
Shanghai Jiao Tong University
2022
Sensetime (China)
2020-2021
Chinese University of Hong Kong
2020
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capability of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained...
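The sampling idea in this abstract can be illustrated with a minimal sketch. It assumes a single-channel feature map stored as a nested list, a 3x3 kernel, and offsets supplied by hand rather than predicted by a learned layer; the function names `bilinear_sample` and `deformable_conv_at` are hypothetical and are not taken from the authors' code.

```python
import math


def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x); zero outside."""
    h, w = len(feature), len(feature[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yi in (y0, y0 + 1):
        for xi in (x0, x0 + 1):
            if 0 <= yi < h and 0 <= xi < w:
                weight = (1.0 - abs(y - yi)) * (1.0 - abs(x - xi))
                value += weight * feature[yi][xi]
    return value


def deformable_conv_at(feature, kernel, offsets, cy, cx):
    """Response of a 3x3 deformable convolution at one location (cy, cx).

    kernel:  3x3 list of weights.
    offsets: 3x3 list of (dy, dx) pairs, one per kernel tap. In a real
             network these would come from a learned layer; here they are
             plain inputs (an assumption of this sketch).
    """
    out = 0.0
    for i in range(3):
        for j in range(3):
            dy, dx = offsets[i][j]
            # Regular grid position of the tap, shifted by its offset.
            y = cy + (i - 1) + dy
            x = cx + (j - 1) + dx
            out += kernel[i][j] * bilinear_sample(feature, y, x)
    return out
```

With all offsets set to (0, 0) the function reduces to an ordinary 3x3 convolution at that location, which is one way to see why the module can replace its plain counterpart.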
We present region-based, fully convolutional networks for accurate and efficient object detection. In contrast to previous region-based detectors such as Fast/Faster R-CNN that apply a costly per-region subnetwork hundreds of times, our region-based detector is fully convolutional with almost all computation shared on the entire image. To achieve this goal, we propose position-sensitive score maps to address a dilemma between translation-invariance in image classification and translation-variance in object detection. Our method can thus naturally adopt...
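A rough sketch of how position-sensitive score maps can be pooled is given below. It is simplified to one object class, integer RoI bounds that lie inside the maps, and average pooling followed by average voting; the function name `ps_roi_pool` and the data layout are assumptions made for illustration.

```python
import math


def ps_roi_pool(score_maps, roi, k):
    """Position-sensitive RoI pooling for a single class.

    score_maps: list of k*k 2D maps, one per spatial bin (row-major order).
    roi:        (y0, x0, y1, x1) integer bounds, end-exclusive, assumed to
                lie inside the maps and to be non-empty.
    Returns the k x k grid of bin scores and their average (the vote).
    """
    y0, x0, y1, x1 = roi
    bin_h = (y1 - y0) / k
    bin_w = (x1 - x0) / k
    bins = []
    for i in range(k):
        row = []
        for j in range(k):
            # Bin (i, j) reads only from its own dedicated score map.
            score_map = score_maps[i * k + j]
            ya = y0 + math.floor(i * bin_h)
            yb = y0 + math.ceil((i + 1) * bin_h)
            xa = x0 + math.floor(j * bin_w)
            xb = x0 + math.ceil((j + 1) * bin_w)
            cells = [score_map[y][x]
                     for y in range(ya, yb) for x in range(xa, xb)]
            row.append(sum(cells) / len(cells))
        bins.append(row)
    vote = sum(sum(row) for row in bins) / (k * k)
    return bins, vote
```

Because each bin looks at a different map, the same response pattern scores differently depending on where it falls inside the RoI, which is how the pooled score becomes sensitive to position while the maps themselves stay fully convolutional.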
The superior performance of Deformable Convolutional Networks arises from their ability to adapt to the geometric variations of objects. Through an examination of their adaptive behavior, we observe that while the spatial support for their neural features conforms more closely than regular ConvNets to object structure, this support may nevertheless extend well beyond the region of interest, causing features to be influenced by irrelevant image content. To address this problem, we present a reformulation of Deformable ConvNets that improves their ability to focus on pertinent image regions, through...
We present MMDetection, an object detection toolbox that contains a rich set of object detection and instance segmentation methods as well as related components and modules. The toolbox started from a codebase of the MMDet team, who won the detection track of the COCO Challenge 2018. It has gradually evolved into a unified platform that covers many popular detection methods and contemporary modules. It not only includes training and inference codes, but also provides weights for more than 200 network models. We believe this toolbox is by far the most complete detection toolbox. In this paper, we introduce the various features...
Although it has been widely believed for years that modeling relations between objects would help object recognition, there has not been evidence that the idea works in the deep learning era. All state-of-the-art object detection systems still rely on recognizing object instances individually, without exploiting their relations during learning. This work proposes an object relation module. It processes a set of objects simultaneously through interaction between their appearance feature and geometry, thus allowing modeling of their relations. It is lightweight and in-place. It does not require...
Semantic segmentation research has recently witnessed rapid progress, but many leading methods are unable to identify object instances. In this paper, we present Multitask Network Cascades for instance-aware semantic segmentation. Our model consists of three networks, respectively differentiating instances, estimating masks, and categorizing objects. These networks form a cascaded structure and are designed to share their convolutional features. We develop an algorithm for the nontrivial end-to-end...
We present the first fully convolutional end-to-end solution for the instance-aware semantic segmentation task. It inherits all the merits of FCNs for semantic segmentation [29] and instance mask proposal [5]. It performs instance mask prediction and classification jointly. The underlying convolutional representation is fully shared between the two sub-tasks, as well as between all regions of interest. The network architecture is highly integrated and efficient. It achieves state-of-the-art performance in both accuracy and efficiency. It wins the COCO 2016 segmentation competition by a large margin. Code would be released...
Large-scale data is of crucial importance for learning semantic segmentation models, but annotating per-pixel masks is a tedious and inefficient procedure. We note that for the topic of interactive image segmentation, scribbles are very widely used in academic research and commercial software, and are recognized as one of the most user-friendly ways of interacting. In this paper, we propose to use scribbles to annotate images, and develop an algorithm to train convolutional networks for semantic segmentation supervised by scribbles. Our algorithm is based on a graphical model that jointly...
Recent leading approaches to semantic segmentation rely on deep convolutional networks trained with human-annotated, pixel-level segmentation masks. Such pixel-accurate supervision demands expensive labeling effort and limits the performance of deep networks that usually benefit from more training data. In this paper, we propose a method that achieves competitive accuracy but only requires easily obtained bounding box annotations. The basic idea is to iterate between automatically generating region proposals and training convolutional networks. These...
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less...
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence or a region-of-interest (RoI) from the input image. It is designed to fit most of the visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions...
Deep convolutional neural networks have achieved great success on image recognition tasks. Yet, it is non-trivial to transfer the state-of-the-art image recognition networks to videos, as per-frame evaluation is too slow and unaffordable. We present deep feature flow, a fast and accurate framework for video recognition. It runs the expensive convolutional sub-network only on sparse key frames and propagates their deep feature maps to other frames via a flow field. It achieves significant speedup as flow computation is relatively fast. The end-to-end training of the whole architecture...
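The propagation step described here can be sketched as a bilinear warp of the key-frame feature map. The sketch assumes a single-channel map, a flow field given as per-location (dy, dx) pairs pointing from the current frame back into the key frame, and zero padding outside the map; `warp_features` is a hypothetical name, and the flow would normally come from a separate flow network that is not shown.

```python
import math


def warp_features(key_feat, flow):
    """Propagate a key-frame feature map to the current frame.

    key_feat: 2D list (H x W) of key-frame features.
    flow:     H x W grid of (dy, dx). Location (y, x) in the current frame
              is assumed to correspond to (y + dy, x + dx) in the key frame.
    """
    h, w = len(key_feat), len(key_feat[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            dy, dx = flow[y][x]
            sy, sx = y + dy, x + dx
            y0, x0 = math.floor(sy), math.floor(sx)
            # Bilinear interpolation over the four neighbouring cells.
            for yi in (y0, y0 + 1):
                for xi in (x0, x0 + 1):
                    if 0 <= yi < h and 0 <= xi < w:
                        weight = (1 - abs(sy - yi)) * (1 - abs(sx - xi))
                        out[y][x] += weight * key_feat[yi][xi]
    return out
```

Only the warp runs on non-key frames in this sketch, so its cost depends on the feature map size and not on the depth of the recognition network.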
Extending state-of-the-art object detectors from image to video is challenging. The accuracy of detection suffers from degenerated object appearances in videos, e.g., motion blur, video defocus, rare poses, etc. Existing work attempts to exploit temporal information on box level, but such methods are not trained end-to-end. We present flow-guided feature aggregation, an accurate and end-to-end learning framework for video object detection. It leverages temporal coherence on feature level instead. It improves the per-frame features by aggregation...
The topic of semantic segmentation has witnessed considerable progress due to the powerful features learned by convolutional neural networks (CNNs) [13]. The current leading approaches for semantic segmentation exploit shape information by extracting CNN features from masked image regions. This strategy introduces artificial boundaries on the images and may impact the quality of the extracted features. Besides, the operations on the raw image domain require computing thousands of networks on a single image, which is time-consuming. In this paper, we propose to exploit shape information via masking...
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for...
Current semantic segmentation methods focus only on mining "local" context, i.e., dependencies between pixels within individual images, by context-aggregation modules (e.g., dilated convolution, neural attention) or structure-aware optimization criteria (e.g., IoU-like loss). However, they ignore "global" context of the training data, i.e., rich semantic relations between pixels across different images. Inspired by the recent advance in unsupervised contrastive representation learning, we propose a pixel-wise contrastive algorithm for semantic segmentation in the fully...
Convolutional neural networks (CNNs) are inherently limited in modeling geometric transformations due to the fixed geometric structures in their building modules. In this work, we introduce two new modules to enhance the transformation modeling capacity of CNNs, namely, deformable convolution and deformable RoI pooling. Both are based on the idea of augmenting the spatial sampling locations in the modules with additional offsets and learning the offsets from the target tasks, without additional supervision. The new modules can readily replace their plain counterparts in existing CNNs and can be easily trained...
Attention mechanisms have become a popular component in deep neural networks, yet there has been little examination of how different influencing factors and methods for computing attention from these factors affect performance. Toward a better general understanding of attention mechanisms, we present an empirical study that ablates various spatial attention elements within a generalized attention formulation, encompassing the dominant Transformer attention as well as the prevalent deformable convolution and dynamic convolution modules. Conducted on a variety...
There has been significant progress in image object detection in recent years. Nevertheless, video object detection has received little attention, although it is more challenging and more important in practical scenarios. Built upon the recent works [37, 36], this work proposes a unified approach based on the principle of multi-frame end-to-end learning of features and cross-frame motion. Our approach extends prior works with three new techniques and steadily pushes forward the performance envelope (speed-accuracy tradeoff), towards high performance video object detection.