Shilong Liu

ORCID: 0009-0003-5796-0627
Research Areas
  • Advanced Neural Network Applications
  • Advanced Image and Video Retrieval Techniques
  • Multimodal Machine Learning Applications
  • Domain Adaptation and Few-Shot Learning
  • Advanced Vision and Imaging
  • Natural Language Processing Techniques
  • Handwritten Text Recognition Techniques
  • Topic Modeling
  • Anomaly Detection Techniques and Applications
  • Visual Attention and Saliency Detection
  • Video Analysis and Summarization
  • Speech and dialogue systems
  • Rock Mechanics and Modeling
  • Adversarial Robustness in Machine Learning
  • Machine Learning and Data Classification
  • Generative Adversarial Networks and Image Synthesis
  • Human Pose and Action Recognition
  • Video Surveillance and Tracking Methods
  • Geomechanics and Mining Engineering
  • Face recognition and analysis
  • Multi-Agent Systems and Negotiation
  • Industrial Vision Systems and Defect Detection
  • Geoscience and Mining Technology
  • Network Packet Processing and Optimization
  • Robotics and Sensor-Based Localization

Affiliations

Soonchunhyang University
2024

Northwest Normal University
2024

Southwest University of Science and Technology
2024

Tsinghua University
2021-2023

Shanghai University
2023

Robert Bosch (United States)
2023

Beijing University of Posts and Telecommunications
2019

Tianjin University of Science and Technology
2017

Discovery Institute
2014

We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, except for the Hungarian loss, our method additionally feeds ground-truth bounding boxes with noises into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces...

10.1109/cvpr52688.2022.01325 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2022-06-01
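The denoising recipe described above (jitter ground-truth boxes, then train the decoder to restore them) can be sketched in a few lines. The noise scales `lambda1` and `lambda2` below are illustrative hyperparameter names, not the authors' exact API:

```python
import random

def add_box_noise(box, lambda1=0.4, lambda2=0.4, rng=random):
    """Jitter a ground-truth box (cx, cy, w, h), all in [0, 1].

    The center is shifted by up to lambda1 * (w, h) / 2 and the size is
    scaled by a factor of up to (1 +/- lambda2), loosely following the
    paper's center-shifting and box-scaling noise.
    """
    cx, cy, w, h = box
    cx += (rng.random() * 2 - 1) * lambda1 * w / 2
    cy += (rng.random() * 2 - 1) * lambda1 * h / 2
    w *= 1 + (rng.random() * 2 - 1) * lambda2
    h *= 1 + (rng.random() * 2 - 1) * lambda2
    return (cx, cy, w, h)
```

In training, the noised boxes form an extra decoder query group whose reconstruction loss bypasses bipartite matching, since each noised query has a known target.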

We present DINO (DETR with Improved deNoising anchOr boxes), a state-of-the-art end-to-end object detector. DINO improves over previous DETR-like models in performance and efficiency by using a contrastive way for denoising training, a mixed query selection method for anchor initialization, and a look forward twice scheme for box prediction. DINO achieves 49.4 AP in 12 epochs and 51.3 AP in 24 epochs on COCO with a ResNet-50 backbone and multi-scale features, yielding a significant improvement...

10.48550/arxiv.2203.03605 preprint EN other-oa arXiv (Cornell University) 2022-01-01

In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, and scalable, and it can benefit...

10.1109/cvpr52729.2023.00297 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
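The mask branch described above, dot-producting query embeddings with a high-resolution pixel embedding map, can be sketched as follows (a simplified, projection-free version in plain Python; the real model applies learned projection layers first):

```python
import math

def predict_masks(query_embs, pixel_map):
    """query_embs: list of C-dim query vectors; pixel_map: H x W grid of
    C-dim pixel embeddings. Returns one mask of per-pixel probabilities
    per query, via a dot product followed by a sigmoid."""
    sigmoid = lambda x: 1 / (1 + math.exp(-x))
    masks = []
    for q in query_embs:
        mask = [[sigmoid(sum(qc * pc for qc, pc in zip(q, pix)))
                 for pix in row] for row in pixel_map]
        masks.append(mask)
    return masks
```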

We present in this paper a novel query formulation using dynamic anchor boxes for DETR (DEtection TRansformer) and offer a deeper understanding of the role of queries in DETR. This new formulation directly uses box coordinates as queries in Transformer decoders and dynamically updates them layer-by-layer. Using box coordinates not only helps using explicit positional priors to improve the query-to-feature similarity and eliminate the slow training convergence issue in DETR, but also allows us to modulate the positional attention map using the box width and height information. Such a design makes it...

10.48550/arxiv.2201.12329 preprint EN other-oa arXiv (Cornell University) 2022-01-01
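The layer-by-layer anchor update can be sketched as iterative refinement in inverse-sigmoid space. The per-layer deltas are stubbed as inputs here, whereas the real decoder predicts them from each layer's output:

```python
import math

def inv_sigmoid(x, eps=1e-6):
    x = min(max(x, eps), 1 - eps)
    return math.log(x / (1 - x))

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def refine_anchor(box, deltas_per_layer):
    """box: (cx, cy, w, h) in [0, 1]; deltas_per_layer: one 4-tuple of
    offsets per decoder layer. Each layer updates the anchor in
    inverse-sigmoid space, the usual iterative-refinement trick, which
    keeps the refined box inside [0, 1]."""
    for deltas in deltas_per_layer:
        box = tuple(sigmoid(inv_sigmoid(v) + d) for v, d in zip(box, deltas))
    return box
```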

In this paper, we present an open-set object detector, called Grounding DINO, by marrying the Transformer-based detector DINO with grounded pre-training, which can detect arbitrary objects with human inputs such as category names or referring expressions. The key solution of open-set object detection is introducing language to a closed-set detector for open-set concept generalization. To effectively fuse language and vision modalities, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer,...

10.48550/arxiv.2303.05499 preprint EN other-oa arXiv (Cornell University) 2023-01-01
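At inference, an open-set detector of this kind scores each candidate box against the text phrases and keeps confident matches. A hypothetical post-processing sketch (the function name and threshold value are illustrative, not the released API):

```python
def ground_phrases(box_scores, phrases, box_threshold=0.35):
    """box_scores: for each candidate box, a list of similarity scores
    to each phrase in `phrases` (as a cross-modality detector would
    produce). Keep boxes whose best phrase score clears the threshold
    and label each with its best-matching phrase."""
    kept = []
    for i, scores in enumerate(box_scores):
        best = max(range(len(phrases)), key=lambda j: scores[j])
        if scores[best] > box_threshold:
            kept.append((i, phrases[best], scores[best]))
    return kept
```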

Recent DEtection TRansformer-based (DETR) models have obtained remarkable performance. Their success cannot be achieved without the re-introduction of multi-scale feature fusion in the encoder. However, the excessively increased tokens in multi-scale features, of which about 75% are low-level features, are quite computationally inefficient, which hinders real applications of DETR models. In this paper, we present Lite DETR, a simple yet efficient end-to-end object detection framework that can effectively reduce the GFLOPs of the detection head by...

10.1109/cvpr52729.2023.01780 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
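The roughly 75% figure for low-level tokens follows directly from the token counts of a standard multi-scale feature pyramid, which can be checked with a few lines of arithmetic (the strides below are typical backbone values, assumed rather than taken from the paper):

```python
def token_share(h, w, strides=(8, 16, 32, 64)):
    """Fraction of multi-scale encoder tokens contributed by the
    highest-resolution (lowest-level) feature map, for an input of
    size h x w and the given feature-map strides."""
    counts = [(h // s) * (w // s) for s in strides]
    return counts[0] / sum(counts)
```

For a 1024x1024 input with strides 8/16/32/64, the highest-resolution map alone contributes about 75.3% of all encoder tokens.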

We present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that jointly learns from different segmentation and detection datasets. To bridge the gap of vocabulary and annotation granularity, we first introduce a pre-trained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with the counterparts trained on a single task only. To further reconcile the two tasks, we identify two discrepancies: i) task discrepancy – segmentation requires...

10.1109/iccv51070.2023.00100 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01

We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks into masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in mask-attention, the ground-truth masks serve as...

10.1109/cvpr52729.2023.01733 article EN 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01
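The mask-piloted training signal (feed a noised ground-truth mask, supervise reconstruction of the clean one) can be sketched with a simple point-noise scheme; the flip probability here is an illustrative stand-in for the paper's actual noising choices:

```python
import random

def noise_mask(mask, flip_prob=0.2, rng=random):
    """Flip each pixel of a binary ground-truth mask with probability
    flip_prob. The training branch feeds the noised mask into
    masked-attention and supervises reconstruction of the original."""
    return [[1 - px if rng.random() < flip_prob else px for px in row]
            for row in mask]
```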

We introduce Grounded SAM, which uses Grounding DINO as an open-set object detector to combine with the segment anything model (SAM). This integration enables the detection and segmentation of any regions based on arbitrary text inputs and opens a door to connecting various vision models. As shown in Fig.1, a wide range of vision tasks can be achieved by using the versatile Grounded SAM pipeline. For example, an automatic annotation pipeline based solely on input images can be realized by incorporating models such as BLIP and Recognize Anything....

10.48550/arxiv.2401.14159 preprint EN other-oa arXiv (Cornell University) 2024-01-01
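The pipeline's plumbing is simple: the detector turns text into boxes, and each box prompts the segmenter. A sketch with the two models injected as callables (stubs standing in for Grounding DINO and SAM, not their real APIs):

```python
def grounded_segment(image, text, detect, segment):
    """Compose an open-set detector with a promptable segmenter:
    detect(image, text) returns (box, label) pairs, and
    segment(image, box) returns a mask for each box prompt."""
    results = []
    for box, label in detect(image, text):
        results.append({"box": box, "label": label,
                        "mask": segment(image, box)})
    return results
```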

This paper presents a simple and effective approach to solving the multi-label classification problem. The proposed approach leverages Transformer decoders to query the existence of a class label. The use of Transformer is rooted in the need of extracting local discriminative features adaptively for different labels, which is a strongly desired property due to the existence of multiple objects in one image. The built-in cross-attention module in the Transformer decoder offers an effective way to use label embeddings as queries to probe and pool class-related features from a feature map computed by a vision backbone...

10.48550/arxiv.2107.10834 preprint EN other-oa arXiv (Cornell University) 2021-01-01
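The query-the-label idea can be sketched as one round of softmax dot-product cross-attention per label embedding, followed by scoring the label against its pooled feature (a one-head, projection-free simplification of the decoder described above):

```python
import math

def query2label_scores(label_embs, feature_map):
    """Each label embedding attends over the backbone feature map
    (a flat list of C-dim vectors), pools a label-specific feature
    with softmax dot-product attention, and scores the label by the
    dot product of the embedding with its pooled feature."""
    scores = []
    for q in label_embs:
        logits = [sum(a * b for a, b in zip(q, f)) for f in feature_map]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        pooled = [sum((w / z) * f[i] for w, f in zip(weights, feature_map))
                  for i in range(len(q))]
        scores.append(sum(a * b for a, b in zip(q, pooled)))
    return scores
```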

In this paper, we introduce Semantic-SAM, a universal image segmentation model that can segment and recognize anything at any desired granularity. Our model offers two key advantages: semantic-awareness and granularity-abundance. To achieve semantic-awareness, we consolidate multiple datasets across three granularities and introduce decoupled classification for objects and parts. This allows our model to capture rich semantic information. For the multi-granularity capability, we propose a multi-choice learning scheme during training,...

10.48550/arxiv.2307.04767 preprint EN other-oa arXiv (Cornell University) 2023-01-01

We present in this paper a novel denoising training method to speed up DETR (DEtection TRansformer) training and offer a deepened understanding of the slow convergence issue of DETR-like methods. We show that the slow convergence results from the instability of bipartite graph matching, which causes inconsistent optimization goals in early training stages. To address this issue, except for the Hungarian loss, our method additionally feeds GT bounding boxes with noises into the Transformer decoder and trains the model to reconstruct the original boxes, which effectively reduces the difficulty...

10.1109/tpami.2023.3335410 article EN cc-by-nc-nd IEEE Transactions on Pattern Analysis and Machine Intelligence 2023-12-01

We present the Recognize Anything Model (RAM): a strong foundation model for image tagging. RAM makes a substantial step for large models in computer vision, demonstrating the zero-shot ability to recognize any common category with high accuracy. RAM introduces a new paradigm for image tagging, leveraging large-scale image-text pairs for training instead of manual annotations. The development of RAM comprises four key steps. Firstly, annotation-free image tags are obtained at scale through automatic text semantic parsing....

10.48550/arxiv.2306.03514 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper presents a comprehensive survey of vision-language (VL) intelligence from the perspective of time. This survey is inspired by the remarkable progress in both computer vision and natural language processing, and recent trends shifting from single modality processing to multiple modality comprehension. We summarize the development of this field into three time periods, namely task-specific methods, vision-language pre-training (VLP) methods, and larger models empowered by large-scale weakly-labeled data. We first take some common VL tasks as examples to introduce...

10.48550/arxiv.2203.01922 preprint EN other-oa arXiv (Cornell University) 2022-01-01

This paper is concerned with the matching stability problem across different decoder layers in DEtection TRansformers (DETR). We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR. To address this problem, we show that the most important design is to use and only use positional metrics (like IOU) to supervise the classification scores of positive examples. Under this principle, we propose two simple yet effective modifications by integrating positional metrics into DETR's classification loss and matching cost, named...

10.1109/iccv51070.2023.00597 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
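The stated principle, supervising positive classification scores with positional metrics only, reduces to replacing the constant target 1 with something like the box IoU. A sketch (the paper's actual loss weighting differs):

```python
def iou(a, b):
    """IoU of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def position_supervised_target(pred_box, gt_box):
    """Classification target for a positive example: a positional
    metric of the predicted box (IoU here) rather than a constant 1."""
    return iou(pred_box, gt_box)
```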

In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based methods, either use estimated depth to get pseudo LiDAR point features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem. In contrast,...

10.1109/iccv51070.2023.00615 article EN 2023 IEEE/CVF International Conference on Computer Vision (ICCV) 2023-10-01
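The depth-weighted part of the lifting can be sketched at scalar level: each 2D location carries a distribution over discretized depth bins, and the lifted feature is the probability-weighted expectation over per-bin features (the deformable sampling itself is omitted):

```python
def depth_weighted_feature(depth_probs, features):
    """Expected feature over discretized depth bins: depth_probs is a
    probability per bin, features holds one feature vector per bin.
    Weighting by depth probability is what resolves the depth-ambiguity
    problem mentioned above."""
    dim = len(features[0])
    return [sum(p * f[i] for p, f in zip(depth_probs, features))
            for i in range(dim)]
```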

We study the problem of unsupervised discovery and segmentation of object parts, which, as an intermediate local representation, are capable of finding intrinsic object structure and providing more explainable recognition results. Recent unsupervised methods have greatly relaxed the dependency on annotated data, which are costly to obtain, but still rely on additional information such as an object segmentation mask or saliency map. To remove such a dependency and further improve the part segmentation performance, we develop a novel approach by disentangling the appearance and shape representations of object parts...

10.1109/cvpr46437.2021.00825 article EN 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2021-06-01

The DEtection TRansformer (DETR) algorithm has received considerable attention in the research community and is gradually emerging as a mainstream approach for object detection and other perception tasks. However, the current field lacks a unified and comprehensive benchmark specifically tailored for DETR-based models. To address this issue, we develop a unified, highly modular, and lightweight codebase called detrex, which supports a majority of the mainstream DETR-based instance recognition algorithms, covering various fundamental tasks,...

10.48550/arxiv.2306.07265 preprint EN other-oa arXiv (Cornell University) 2023-01-01

This paper presents a novel end-to-end framework with Explicit box Detection for multi-person Pose estimation, called ED-Pose, where it unifies the contextual learning between human-level (global) and keypoint-level (local) information. Different from previous one-stage methods, ED-Pose re-considers this task as two explicit box detection processes with a unified representation and regression supervision. First, we introduce a human detection decoder from encoded tokens to extract global features. It can provide a good...

10.48550/arxiv.2302.01593 preprint EN other-oa arXiv (Cornell University) 2023-01-01

In this paper, we study the problem of visual grounding by considering both phrase extraction and grounding (PEG). In contrast to the previous phrase-known-at-test setting, PEG requires a model to extract phrases from the text and locate objects in the image simultaneously, which is a more practical setting in real applications. As phrase extraction can be regarded as a 1D text segmentation problem, we formulate PEG as a dual detection problem and propose a novel DQ-DETR model, which introduces dual queries to probe different features for object prediction and phrase mask prediction. Each pair of dual queries are...

10.1609/aaai.v37i2.25261 article EN Proceedings of the AAAI Conference on Artificial Intelligence 2023-06-26

This paper introduces Grounding DINO 1.5, a suite of advanced open-set object detection models developed by IDEA Research, which aims to advance the "Edge" of open-set object detection. The suite encompasses two models: Grounding DINO 1.5 Pro, a high-performance model designed for stronger generalization capability across a wide range of scenarios, and Grounding DINO 1.5 Edge, an efficient model optimized for the faster speed demanded in many applications requiring edge deployment. The Pro model advances its predecessor by scaling up the model architecture, integrating an enhanced vision...

10.48550/arxiv.2405.10300 preprint EN arXiv (Cornell University) 2024-05-16