- Advanced Neural Network Applications
- Multimodal Machine Learning Applications
- Advanced Image and Video Retrieval Techniques
- Domain Adaptation and Few-Shot Learning
- Topic Modeling
- Advanced Vision and Imaging
- Natural Language Processing Techniques
- Image Processing Techniques and Applications
- Autonomous Vehicle Technology and Safety
- 3D Shape Modeling and Analysis
- Visual Attention and Saliency Detection
- Computer Graphics and Visualization Techniques
- Image Retrieval and Classification Techniques
- Robotics and Sensor-Based Localization
- Generative Adversarial Networks and Image Synthesis
- CCD and CMOS Imaging Sensors
- Visual perception and processing mechanisms
- Remote Sensing and LiDAR Applications
- Anomaly Detection Techniques and Applications
- Human Pose and Action Recognition
- Advanced Image Processing Techniques
- Industrial Vision Systems and Defect Detection
- Adversarial Robustness in Machine Learning
- 3D Surveying and Cultural Heritage
- Advanced Memory and Neural Computing
Group Sense (China), 2023-2024
The Sense Innovation and Research Center, 2023
Microsoft Research (United Kingdom), 2019
DETR has been recently proposed to eliminate the need for many hand-designed components in object detection while demonstrating good performance. However, it suffers from slow convergence and limited feature spatial resolution, due to the limitation of Transformer attention modules in processing image feature maps. To mitigate these issues, we propose Deformable DETR, whose attention modules only attend to a small set of key sampling points around a reference. Deformable DETR can achieve better performance than DETR (especially on small objects) with 10 times less...
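The sampling idea described above can be sketched in a few lines. This is a toy single-query version with hypothetical names, using nearest-neighbor lookup instead of the bilinear interpolation of the actual operator: instead of attending over every position of an H x W feature map, the query aggregates only K sampled points around its reference point.

```python
import numpy as np

def deformable_attention(value, ref_point, offsets, weights):
    """Toy deformable attention for one query (illustrative sketch).

    value:     H x W x C feature map
    ref_point: (y, x) reference location of the query
    offsets:   K x 2 learned sampling offsets around the reference
    weights:   K attention weights (sum to 1)
    """
    H, W, C = value.shape
    out = np.zeros(C)
    for k in range(len(offsets)):
        # sample location = reference + learned offset (nearest-neighbor
        # here; the real operator uses bilinear interpolation)
        y = int(np.clip(round(ref_point[0] + offsets[k, 0]), 0, H - 1))
        x = int(np.clip(round(ref_point[1] + offsets[k, 1]), 0, W - 1))
        out += weights[k] * value[y, x]
    return out

rng = np.random.default_rng(0)
value = rng.standard_normal((32, 32, 8))   # H x W x C feature map
offsets = rng.standard_normal((4, 2))      # K = 4 sampling offsets
weights = np.full(4, 0.25)                 # uniform attention weights
out = deformable_attention(value, (16.0, 16.0), offsets, weights)
print(out.shape)  # (8,)
```

The cost per query is O(K) rather than O(HW), which is where the speed-up over dense attention on image feature maps comes from.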
We introduce a new pre-trainable generic representation for visual-linguistic tasks, called Visual-Linguistic BERT (VL-BERT for short). VL-BERT adopts the simple yet powerful Transformer model as the backbone, and extends it to take both visual and linguistic embedded features as input. In it, each element of the input is either a word from the input sentence, or a region-of-interest (RoI) from the input image. It is designed to fit most visual-linguistic downstream tasks. To better exploit the generic representation, we pre-train VL-BERT on the massive-scale Conceptual Captions...
Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for...
A modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized...
We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective view supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final...
Transformer, as a strong and flexible architecture for modelling long-range relations, has been widely explored in vision tasks. However, when used for video inpainting that requires fine-grained representation, existing methods still suffer from yielding blurry edges in detail due to hard patch splitting. Here we aim to tackle this problem by proposing FuseFormer, a Transformer model designed for video inpainting via fine-grained feature fusion based on novel Soft Split and Soft Composition operations. The soft split divides the feature map into many...
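The contrast with hard patch splitting can be illustrated with a small sketch (hypothetical helper names, a simplified 2D single-channel version, not the paper's implementation): because the stride is smaller than the patch size, neighboring patches overlap, and the composition step averages the overlapping contributions so patch borders blend instead of producing hard seams.

```python
import numpy as np

def soft_split(feat, patch=4, stride=2):
    """Split a feature map into overlapping patches; stride < patch,
    so neighboring patches share pixels (unlike hard splitting)."""
    H, W = feat.shape
    patches = []
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            patches.append(feat[y:y + patch, x:x + patch].ravel())
    return np.stack(patches)

def soft_composition(tokens, shape, patch=4, stride=2):
    """Fold overlapping patches back into a map, averaging the
    contributions at pixels covered by several patches."""
    H, W = shape
    out = np.zeros((H, W))
    cnt = np.zeros((H, W))
    i = 0
    for y in range(0, H - patch + 1, stride):
        for x in range(0, W - patch + 1, stride):
            out[y:y + patch, x:x + patch] += tokens[i].reshape(patch, patch)
            cnt[y:y + patch, x:x + patch] += 1
            i += 1
    return out / np.maximum(cnt, 1)

feat = np.arange(64, dtype=float).reshape(8, 8)
tokens = soft_split(feat)            # (9, 16): 3 x 3 overlapping patches
recon = soft_composition(tokens, (8, 8))
print(np.allclose(recon, feat))      # split/compose round-trips: True
```

In a real model the tokens would pass through Transformer blocks between the split and the composition; the round-trip here just checks that the two operations are consistent inverses when nothing modifies the tokens.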
Learning powerful representations in bird's-eye-view (BEV) for perception tasks is trending and drawing extensive attention both from industry and academia. Conventional approaches for most autonomous driving algorithms perform detection, segmentation, tracking, etc., in a front or perspective view. As sensor configurations get more complex, integrating multi-source information from different sensors and representing features in a unified view comes to be of vital importance. BEV perception inherits several advantages, as representing the surrounding...
A human driver can easily describe the complex traffic scene through the visual system. Such an ability of precise perception is essential for a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed as Occupancy, would be desirable. Compared to the form of bounding box, a key insight behind occupancy is that it could capture the fine-grained details of critical obstacles in the scene, and thereby facilitate subsequent tasks...
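The grid representation described above can be sketched as a simple voxelization of labeled points (an illustrative sketch with assumed names and shapes, not any paper's pipeline): each cell stores the majority semantic label of the points falling into it, or -1 when empty.

```python
import numpy as np

def voxelize(points, labels, grid_min, voxel_size, grid_shape, n_classes):
    """Quantize labeled 3D points into a semantic occupancy grid.

    points:  N x 3 array of (x, y, z) coordinates
    labels:  N integer semantic labels in [0, n_classes)
    Returns a grid where each cell holds the majority label, or -1.
    """
    votes = np.zeros(grid_shape + (n_classes,), dtype=int)
    idx = ((points - grid_min) / voxel_size).astype(int)
    keep = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    for (i, j, k), c in zip(idx[keep], labels[keep]):
        votes[i, j, k, c] += 1      # per-cell label voting
    occ = np.where(votes.sum(-1) > 0, votes.argmax(-1), -1)
    return occ

pts = np.array([[0.1, 0.1, 0.1], [0.2, 0.15, 0.1], [1.6, 0.1, 0.1]])
lab = np.array([2, 2, 0])  # hypothetical classes, e.g. 2 = vehicle, 0 = road
occ = voxelize(pts, lab, grid_min=np.zeros(3), voxel_size=0.5,
               grid_shape=(4, 4, 4), n_classes=3)
print(occ[0, 0, 0], occ[3, 0, 0], occ[1, 1, 1])  # 2 0 -1
```

Unlike a bounding box, the grid can represent obstacles of arbitrary shape at the resolution of the voxel size, which is the fine-grained property the abstract refers to.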
To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources are proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been proved that combining multiple pre-training strategies and data from multiple modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of pre-training. It is thus desirable that these strategies be integrated in...
Video inpainting aims to fill the given spatiotemporal holes with realistic appearance but is still a challenging task even with prosperous deep learning approaches. Recent works introduce the promising Transformer architecture into deep video inpainting and achieve better performance. However, it still suffers from synthesizing blurry texture as well as huge computational cost. Towards this end, we propose a novel Decoupled Spatial-Temporal Transformer (DSTT) for improving video inpainting with exceptional efficiency. Our proposed DSTT disentangles the task of...
The captivating realm of Minecraft has attracted substantial research interest in recent years, serving as a rich platform for developing intelligent agents capable of functioning in open-world environments. However, the current landscape of research predominantly focuses on specific objectives, such as the popular "ObtainDiamond" task, and has not yet shown effective generalization to a broader spectrum of tasks. Furthermore, the current leading success rate for the "ObtainDiamond" task stands at around 20%, highlighting the limitations of Reinforcement Learning...
Developing generative models for interleaved image-text data has both research and practical value. It requires models to understand the interleaved sequences and subsequently generate images and text. However, existing attempts are limited by the issue that a fixed number of visual tokens cannot efficiently capture image details, which is particularly problematic in multi-image scenarios. To address this, this paper presents MM-Interleaved, an end-to-end generative model for interleaved image-text data. It introduces a multi-scale feature synchronizer module,...
Transformers have revolutionized computer vision and natural language processing, but their high computational complexity limits their application in high-resolution image processing and long-context analysis. This paper introduces Vision-RWKV (VRWKV), a model adapted from the RWKV model used in the NLP field, with necessary modifications for vision tasks. Similar to the Vision Transformer (ViT), our model is designed to efficiently handle sparse inputs and demonstrate robust global processing capabilities, while also scaling up effectively,...
Embedding-based retrieval models have made significant strides in retrieval-augmented generation (RAG) techniques for text and multimodal large language model (LLM) applications. However, when it comes to speech large language models (SLLMs), these methods are limited to a two-stage process, where automatic speech recognition (ASR) is combined with text-based retrieval. This sequential architecture suffers from high latency and error propagation. To address these limitations, we propose a unified embedding framework that eliminates...
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. The prevailing world models mainly build on video prediction models. Although these models can produce high-fidelity video sequences with an advanced diffusion-based generator, they are constrained by their predictive duration and overall generalization capabilities. In this paper, we explore solving this problem by combining the generation loss with MAE-style feature-level context learning. In particular,...
Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform close-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between the language decisions and the vehicle control commands by standardizing the decision states according to the off-the-shelf motion planning module. (2) We employ a...
Multi-camera 3D object detection has blossomed in recent years, and most of the state-of-the-art methods are built upon bird's-eye-view (BEV) representations. Albeit remarkable performance, these works suffer from low efficiency. Typically, knowledge distillation can be used for model compression. However, due to unclear 3D geometry reasoning, expert features usually contain some noisy and confusing areas. In this work, we investigate how to distill the knowledge from an imperfect expert. We propose FD3D, a Focal Distiller...
The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact...
This article introduces the solutions of team lvisTraveler for the LVIS Challenge 2020. In this work, two characteristics of the LVIS dataset are mainly considered: the long-tailed distribution and high-quality instance segmentation masks. We adopt a two-stage training pipeline. In the first stage, we incorporate EQL and self-training to learn a generalized representation. In the second stage, we utilize Balanced GroupSoftmax to promote the classifier, and propose a novel proposal assignment strategy and a new balanced mask loss for the mask head to get more precise...
We introduce Deformable Convolution v4 (DCNv4), a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor, DCNv3, with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power, and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase...
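The first enhancement can be illustrated with a toy aggregation step (a simplified sketch with assumed names, not the actual operator): with softmax, the K sampled values can only be mixed as a convex combination with weights in [0, 1] summing to 1; without it, the weights are unbounded, like a regular convolution's kernel weights, so the output can amplify or negate individual samples.

```python
import numpy as np

def aggregate(values, weights, use_softmax):
    """Aggregate K sampled values with per-sample weights.

    use_softmax=True  : convex combination (softmax-bounded weights)
    use_softmax=False : unbounded weights, convolution-like mixing
    """
    if use_softmax:
        w = np.exp(weights - weights.max())
        w = w / w.sum()             # weights forced into [0, 1], sum to 1
    else:
        w = weights                 # weights may be negative or exceed 1
    return (w[:, None] * values).sum(axis=0)

vals = np.array([[1.0, 0.0],
                 [0.0, 1.0],
                 [1.0, 1.0]])       # K = 3 sampled feature vectors
w = np.array([2.0, -1.0, 0.5])
print(aggregate(vals, w, use_softmax=False))  # [ 2.5 -0.5]
```

With `use_softmax=True` the same weights would be squashed to positive values summing to 1, so the output stays inside the convex hull of the samples; removing the normalization is what the abstract calls enhancing the operator's dynamic property.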