- Advanced Neural Network Applications
- 3D Surveying and Cultural Heritage
- 3D Shape Modeling and Analysis
- Robotics and Sensor-Based Localization
- Human Pose and Action Recognition
- Autonomous Vehicle Technology and Safety
- Natural Language Processing Techniques
- Computer Graphics and Visualization Techniques
- Video Surveillance and Tracking Methods
- Multimodal Machine Learning Applications
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Robot Manipulation and Learning
- Remote Sensing and LiDAR Applications
- Optical Imaging and Spectroscopy Techniques
- Non-Invasive Vital Sign Monitoring
- Text Readability and Simplification
- Fire Detection and Safety Systems
- Air Quality Monitoring and Forecasting
- Semantic Web and Ontologies
- 3D Modeling in Geospatial Applications
- Hand Gesture Recognition Systems
- Urinary Bladder and Prostate Research
- Soft Robotics and Applications
University of Southern California
2024
Southern California University for Professional Studies
2024
Chinese University of Hong Kong
2019-2023
Zhejiang University
2018
We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to their limited receptive fields. In this paper, we resolve this problem by introducing a Transformer-based architecture that enables long-range relationships between voxels through self-attention. Given the fact that non-empty voxels are naturally...
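As a minimal illustration of the long-range voxel attention described above, here is a toy PyTorch sketch: self-attention applied only to the features of non-empty voxels. The module name, feature sizes, and the use of full (rather than sparse local or dilated) attention are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch: self-attention over non-empty voxel features only.
# Hypothetical shapes; VoTr itself uses sparse local/dilated attention.
import torch
import torch.nn as nn

class VoxelSelfAttention(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (1, N, C) features of the N non-empty voxels, so every
        # voxel can attend to every other one regardless of spatial distance.
        out, _ = self.attn(voxel_feats, voxel_feats, voxel_feats)
        return self.norm(voxel_feats + out)  # residual connection

feats = torch.randn(1, 500, 64)              # 500 non-empty voxels
print(VoxelSelfAttention()(feats).shape)     # torch.Size([1, 500, 64])
```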
Point cloud is an important type of 3D representation. However, directly applying convolutions on point clouds is challenging due to their sparse, irregular and unordered data structure. In this paper, we propose a novel Interpolated Convolution operation, InterpConv, to tackle the point cloud feature learning and understanding problem. The key idea is to utilize a set of discrete kernel weights and to interpolate point features to neighboring kernel-weight coordinates by an interpolation function for convolution. A normalization term is introduced...
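The interpolation step is what makes a discrete kernel usable on irregular points. Below is a NumPy toy for a single output location, assuming a Gaussian interpolation function and a 3x3x3 kernel lattice; both choices, and all names, are illustrative rather than the paper's exact formulation.

```python
# Toy sketch of an interpolated convolution at one center point: point
# features are interpolated onto discrete kernel-weight coordinates
# (with density normalization), then multiplied by the kernel weights.
import numpy as np

def interp_conv_single(points, feats, center, weights, sigma=0.5, radius=1.0):
    # points: (N, 3), feats: (N, C), center: (3,), weights: (27, C)
    offsets = np.stack(np.meshgrid([-1, 0, 1], [-1, 0, 1], [-1, 0, 1],
                                   indexing="ij"), -1).reshape(-1, 3) * radius
    coords = center + offsets                      # 3x3x3 kernel coordinates
    out = 0.0
    for k, c in enumerate(coords):
        d = np.linalg.norm(points - c, axis=1)     # point-to-coordinate dists
        w = np.exp(-(d / sigma) ** 2)              # interpolation weights
        if w.sum() < 1e-8:
            continue                               # no points near this coord
        f = (w[:, None] * feats).sum(0) / w.sum()  # normalized interpolation
        out += (f * weights[k]).sum()              # convolve with the kernel
    return out

pts, fts = np.random.randn(100, 3), np.random.randn(100, 8)
print(interp_conv_single(pts, fts, np.zeros(3), np.random.randn(27, 8)))
```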
We present a flexible and high-performance framework, named Pyramid R-CNN, for two-stage 3D object detection from point clouds. Current approaches generally rely on the points or voxels of interest for RoI feature extraction in the second stage, but cannot effectively handle the sparsity and non-uniform distribution of those points, which may result in failures in detecting objects that are far away. To resolve these problems, we propose a novel second-stage module, the pyramid RoI head, to adaptively learn features from the sparse points of interest...
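To make the pyramid intuition concrete, the sketch below samples RoI grid points at several enlargement ratios, so a sparse or faraway box can still collect context from beyond its tight extent. This is a simplified stand-in (nearest-neighbor pooling in NumPy); the ratios, grid size, and function names are assumptions.

```python
# Toy sketch: RoI grids at multiple enlargement ratios ("pyramid levels"),
# each grid point pooling the nearest point's feature.
import numpy as np

def pyramid_roi_features(points, feats, center, size,
                         ratios=(1.0, 1.5, 2.0), grid=3):
    lin = np.linspace(-0.5, 0.5, grid)
    offsets = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"),
                       -1).reshape(-1, 3)
    levels = []
    for r in ratios:                               # one grid per level
        grid_pts = center + offsets * size * r     # enlarge the RoI by r
        d = np.linalg.norm(points[None] - grid_pts[:, None], axis=-1)
        levels.append(feats[d.argmin(1)])          # nearest point's feature
    return np.concatenate(levels, 0)               # (levels * grid^3, C)

pts, fts = np.random.randn(200, 3) * 3, np.random.randn(200, 16)
print(pyramid_roi_features(pts, fts, np.zeros(3),
                           np.array([4.0, 2.0, 1.5])).shape)  # (81, 16)
```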
Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solution for next-generation industry-level robust perception in autonomous driving. However, the research community has generally suffered from an inadequacy of those essential real-world scene data, which hampers future...
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D understanding, we propose...
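The general recipe behind transferring CLIP-style pre-training to 3D is a symmetric contrastive loss that pulls paired 3D and text embeddings together. The sketch below shows only that loss in PyTorch; the encoders producing the embeddings are hypothetical stand-ins, and this is not the specific method proposed in the paper.

```python
# CLIP-style symmetric InfoNCE loss between paired 3D and text embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(feat_3d, feat_text, temperature=0.07):
    # feat_3d, feat_text: (B, D) embeddings of matched 3D-text pairs
    f3d = F.normalize(feat_3d, dim=-1)
    ftx = F.normalize(feat_text, dim=-1)
    logits = f3d @ ftx.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(f3d.size(0))        # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```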
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose an approach that capitalizes on the strong reasoning potential...
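In this spirit, a minimal way to use an LLM as a planner is to serialize the ego state and perception outputs into a text prompt, ask for waypoints, and parse them back. The sketch below uses the OpenAI Python client; the prompt wording and the fragile line-based parsing are hypothetical, not the paper's prompt design.

```python
# Toy sketch: prompt an LLM for waypoints and parse them back into numbers.
# Assumes the openai>=1.0 client and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
prompt = (
    "You are a motion planner for an autonomous vehicle.\n"
    "Ego speed: 5.2 m/s. Goal: keep lane, drive forward.\n"
    "Detected: car 12 m ahead moving at 4.8 m/s.\n"
    "Output 6 waypoints for the next 3 s as (x, y) in meters, one per line."
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic plans are easier to validate
)
text = resp.choices[0].message.content
waypoints = [tuple(map(float, line.strip("() ").split(",")))
             for line in text.splitlines() if "," in line]
print(waypoints)
```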
Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale dataset for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest dataset to date. Existing autonomous driving systems heavily rely on "perfect" visual perception models (i.e., detection) trained using extensive annotated data to ensure safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances such as night,...
Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains...
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds. In contrast to previous methods that normally predict the attributes of 3D objects all at once, we expressively model the interdependencies between the attributes of objects, which in turn enables better detection accuracy. Specifically, we view each 3D object as a sequence of words and reformulate the 3D detection task as decoding words from 3D scenes in an auto-regressive manner. We further propose a lightweight scene-to-sequence decoder that can auto-regressively generate words conditioned on...
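The auto-regressive "object as a sequence of words" idea can be shown with a toy decoder that emits one attribute at a time, each conditioned on a scene feature and the previously decoded words. The word order, dimensions, and module names below are assumptions for illustration.

```python
# Toy sketch: decode an object's attributes one "word" at a time,
# conditioned on a scene feature and on previously generated words.
import torch
import torch.nn as nn

class SceneToSeqDecoder(nn.Module):
    def __init__(self, scene_dim=128, hidden=128, num_words=7):
        super().__init__()
        self.num_words = num_words                 # e.g. x, y, z, w, l, h, yaw
        self.init = nn.Linear(scene_dim, hidden)   # condition on the scene
        self.cell = nn.GRUCell(1, hidden)          # feeds back previous word
        self.head = nn.Linear(hidden, 1)           # regress the next word

    def forward(self, scene_feat):
        h = self.init(scene_feat)                  # (B, hidden)
        word = torch.zeros(scene_feat.size(0), 1)  # start token
        words = []
        for _ in range(self.num_words):
            h = self.cell(word, h)
            word = self.head(h)                    # next word given the last
            words.append(word)
        return torch.cat(words, dim=1)             # (B, num_words)

print(SceneToSeqDecoder()(torch.randn(2, 128)).shape)  # torch.Size([2, 7])
```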
Human-level driving is an ultimate goal of autonomous driving. Conventional approaches formulate autonomous driving as a perception-prediction-planning framework, yet their systems do not capitalize on the inherent reasoning ability and experiential knowledge of humans. In this paper, we propose a fundamental paradigm shift from current pipelines, exploiting Large Language Models (LLMs) as a cognitive agent to integrate human-like intelligence into autonomous driving systems. Our approach, termed Agent-Driver, transforms the traditional...
Autonomous driving, in recent years, has been receiving increasing attention for its potential to relieve drivers' burdens and improve the safety of driving. In modern autonomous driving pipelines, the perception system is an indispensable component that aims to accurately estimate the status of surrounding environments and provide reliable observations for prediction and planning. 3D object detection, which intelligently predicts the locations, sizes, and categories of the critical 3D objects near an autonomous vehicle, is an important part of a perception system. This...
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structural and contextual information of point clouds is not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. We therefore propose a novel Gridding Residual Network (GRNet) for point cloud completion. In...
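Gridding, i.e. scattering an unordered point cloud onto a regular 3D grid so that ordinary 3D convolutions apply, can be illustrated with trilinear weights. The NumPy toy below is a hedged stand-in for the paper's differentiable gridding layer; the resolution and weighting scheme are assumed details.

```python
# Toy sketch of gridding: each point spreads a unit of weight over the 8
# surrounding grid vertices via trilinear interpolation.
import numpy as np

def gridding(points, res=32):
    # points: (N, 3) with coordinates normalized to [0, 1)
    grid = np.zeros((res, res, res))
    scaled = points * (res - 1)
    base = np.floor(scaled).astype(int)
    frac = scaled - base
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, frac[:, 0], 1 - frac[:, 0])
                     * np.where(dy, frac[:, 1], 1 - frac[:, 1])
                     * np.where(dz, frac[:, 2], 1 - frac[:, 2]))
                np.add.at(grid, (base[:, 0] + dx, base[:, 1] + dy,
                                 base[:, 2] + dz), w)
    return grid

pts = np.random.rand(1000, 3)
print(gridding(pts).sum())  # weights sum to the point count: 1000.0
```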
Adapting driving behavior to new environments, customs, and laws is a long-standing problem in autonomous driving, precluding the widespread deployment of autonomous vehicles (AVs). In this paper, we present LLaDA, a simple yet powerful tool that enables human drivers and autonomous vehicles alike to drive everywhere by adapting their tasks and motion plans to the traffic rules of new locations. LLaDA achieves this by leveraging the impressive zero-shot generalizability of large language models (LLMs) in interpreting the local driver handbook. Through an extensive user...
The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates their union as a trajectory refinement problem, where the first pose is the detection (current time) and subsequent poses are the waypoints of the multiple forecasts (future...
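Framing both tasks as refinement of a single pose sequence (current detection plus future waypoints) can be sketched with a small network that repeatedly predicts and applies corrections. Everything below (the MLP block, horizon, iteration count) is a hypothetical toy, not the paper's architecture.

```python
# Toy sketch: iteratively refine one pose sequence whose first pose is the
# detection and whose later poses are forecast waypoints.
import torch
import torch.nn as nn

class TrajRefiner(nn.Module):
    def __init__(self, horizon=6, dim=3, iters=3):   # pose = (x, y, yaw)
        super().__init__()
        self.iters = iters
        self.block = nn.Sequential(
            nn.Linear((horizon + 1) * dim, 64), nn.ReLU(),
            nn.Linear(64, (horizon + 1) * dim),
        )

    def forward(self, traj):                         # (B, horizon+1, dim)
        for _ in range(self.iters):
            delta = self.block(traj.flatten(1))      # predict a correction
            traj = traj + delta.view_as(traj)        # apply it, then repeat
        return traj

init = torch.zeros(2, 7, 3)                          # coarse initial poses
print(TrajRefiner()(init).shape)                     # torch.Size([2, 7, 3])
```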
This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data,...
Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces...
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with...