- Advanced Neural Network Applications
- 3D Surveying and Cultural Heritage
- 3D Shape Modeling and Analysis
- Robotics and Sensor-Based Localization
- Human Pose and Action Recognition
- Autonomous Vehicle Technology and Safety
- Natural Language Processing Techniques
- Computer Graphics and Visualization Techniques
- Video Surveillance and Tracking Methods
- Multimodal Machine Learning Applications
- Topic Modeling
- Domain Adaptation and Few-Shot Learning
- Anomaly Detection Techniques and Applications
- Robot Manipulation and Learning
- Remote Sensing and LiDAR Applications
- Optical Imaging and Spectroscopy Techniques
- Non-Invasive Vital Sign Monitoring
- Text Readability and Simplification
- Fire Detection and Safety Systems
- Air Quality Monitoring and Forecasting
- Semantic Web and Ontologies
- 3D Modeling in Geospatial Applications
- Hand Gesture Recognition Systems
- Urinary Bladder and Prostate Research
- Soft Robotics and Applications
University of Southern California
2024
Southern California University for Professional Studies
2024
Chinese University of Hong Kong
2019-2023
Zhejiang University
2018
We present Voxel Transformer (VoTr), a novel and effective voxel-based Transformer backbone for 3D object detection from point clouds. Conventional 3D convolutional backbones in voxel-based detectors cannot efficiently capture large context information, which is crucial for object recognition and localization, owing to their limited receptive fields. In this paper, we resolve this problem by introducing a Transformer-based architecture that enables long-range relationships between voxels through self-attention. Given the fact that non-empty voxels are naturally...
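As a minimal illustration of the long-range voxel attention described above, here is a toy PyTorch sketch: self-attention applied only to the features of non-empty voxels. The module name, feature sizes, and the use of full (rather than sparse local or dilated) attention are illustrative assumptions, not the paper's implementation.

```python
# Toy sketch: self-attention over non-empty voxel features only.
# Hypothetical shapes; VoTr itself uses sparse local/dilated attention.
import torch
import torch.nn as nn

class VoxelSelfAttention(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, voxel_feats: torch.Tensor) -> torch.Tensor:
        # voxel_feats: (1, N, C) features of the N non-empty voxels, so every
        # voxel can attend to every other one regardless of spatial distance.
        out, _ = self.attn(voxel_feats, voxel_feats, voxel_feats)
        return self.norm(voxel_feats + out)  # residual connection

feats = torch.randn(1, 500, 64)              # 500 non-empty voxels
print(VoxelSelfAttention()(feats).shape)     # torch.Size([1, 500, 64])
```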
Point cloud is an important type of 3D representation. However, directly applying convolutions on point clouds is challenging due to their sparse, irregular and unordered data structure. In this paper, we propose a novel Interpolated Convolution operation, InterpConv, to tackle the point cloud feature learning and understanding problem. The key idea is to utilize a set of discrete kernel weights and to interpolate point features to neighboring kernel-weight coordinates by an interpolation function for convolution. A normalization term is introduced...
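The interpolation step is what makes a discrete kernel usable on irregular points. Below is a NumPy toy for a single output location, assuming a Gaussian interpolation function and a 3x3x3 kernel lattice; both choices, and all names, are illustrative rather than the paper's exact formulation.

```python
# Toy sketch of an interpolated convolution at one center point: point
# features are interpolated onto discrete kernel-weight coordinates
# (with density normalization), then multiplied by the kernel weights.
import numpy as np

def interp_conv_single(points, feats, center, weights, sigma=0.5, radius=1.0):
    # points: (N, 3), feats: (N, C), center: (3,), weights: (27, C)
    offsets = np.stack(np.meshgrid([-1, 0, 1], [-1, 0, 1], [-1, 0, 1],
                                   indexing="ij"), -1).reshape(-1, 3) * radius
    coords = center + offsets                      # 3x3x3 kernel coordinates
    out = 0.0
    for k, c in enumerate(coords):
        d = np.linalg.norm(points - c, axis=1)     # point-to-coordinate dists
        w = np.exp(-(d / sigma) ** 2)              # interpolation weights
        if w.sum() < 1e-8:
            continue                               # no points near this coord
        f = (w[:, None] * feats).sum(0) / w.sum()  # normalized interpolation
        out += (f * weights[k]).sum()              # convolve with the kernel
    return out

pts, fts = np.random.randn(100, 3), np.random.randn(100, 8)
print(interp_conv_single(pts, fts, np.zeros(3), np.random.randn(27, 8)))
```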
We present a flexible and high-performance framework, named Pyramid R-CNN, for two-stage 3D object detection from point clouds. Current approaches generally rely on the points or voxels of interest for RoI feature extraction in the second stage, but cannot effectively handle the sparsity and non-uniform distribution of those points, which may result in failures in detecting objects that are far away. To resolve these problems, we propose a novel second-stage module, the pyramid RoI head, to adaptively learn features from the sparse points of interest...
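To make the pyramid intuition concrete, the sketch below samples RoI grid points at several enlargement ratios, so a sparse or faraway box can still collect context from beyond its tight extent. This is a simplified stand-in (nearest-neighbor pooling in NumPy); the ratios, grid size, and function names are assumptions.

```python
# Toy sketch: RoI grids at multiple enlargement ratios ("pyramid levels"),
# each grid point pooling the nearest point's feature.
import numpy as np

def pyramid_roi_features(points, feats, center, size,
                         ratios=(1.0, 1.5, 2.0), grid=3):
    lin = np.linspace(-0.5, 0.5, grid)
    offsets = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"),
                       -1).reshape(-1, 3)
    levels = []
    for r in ratios:                               # one grid per level
        grid_pts = center + offsets * size * r     # enlarge the RoI by r
        d = np.linalg.norm(points[None] - grid_pts[:, None], axis=-1)
        levels.append(feats[d.argmin(1)])          # nearest point's feature
    return np.concatenate(levels, 0)               # (levels * grid^3, C)

pts, fts = np.random.randn(200, 3) * 3, np.random.randn(200, 16)
print(pyramid_roi_features(pts, fts, np.zeros(3),
                           np.array([4.0, 2.0, 1.5])).shape)  # (81, 16)
```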
Current perception models in autonomous driving have become notorious for greatly relying on a mass of annotated data to cover unseen cases and address the long-tail problem. On the other hand, learning from unlabeled large-scale collected data and incrementally self-training powerful recognition models have received increasing attention and may become the solution for next-generation industry-level robust perception in autonomous driving. However, the research community has generally suffered from an inadequacy of those essential real-world scene data, which hampers future...
Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D understanding, we propose...
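The general recipe behind transferring CLIP-style pre-training to 3D is a symmetric contrastive loss that pulls paired 3D and text embeddings together. The sketch below shows only that loss in PyTorch; the encoders producing the embeddings are hypothetical stand-ins, and this is not the specific method proposed in the paper.

```python
# CLIP-style symmetric InfoNCE loss between paired 3D and text embeddings.
import torch
import torch.nn.functional as F

def clip_style_loss(feat_3d, feat_text, temperature=0.07):
    # feat_3d, feat_text: (B, D) embeddings of matched 3D-text pairs
    f3d = F.normalize(feat_3d, dim=-1)
    ftx = F.normalize(feat_text, dim=-1)
    logits = f3d @ ftx.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(f3d.size(0))        # matching pairs on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.t(), labels)) / 2

print(clip_style_loss(torch.randn(8, 512), torch.randn(8, 512)).item())
```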
We present a simple yet effective approach that can transform the OpenAI GPT-3.5 model into a reliable motion planner for autonomous vehicles. Motion planning is a core challenge in autonomous driving, aiming to plan a driving trajectory that is safe and comfortable. Existing motion planners predominantly leverage heuristic methods to forecast driving trajectories, yet these approaches demonstrate insufficient generalization capabilities in the face of novel and unseen driving scenarios. In this paper, we propose an approach that capitalizes on the strong reasoning potential...
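In this spirit, a minimal way to use an LLM as a planner is to serialize the ego state and perception outputs into a text prompt, ask for waypoints, and parse them back. The sketch below uses the OpenAI Python client; the prompt wording and the fragile line-based parsing are hypothetical, not the paper's prompt design.

```python
# Toy sketch: prompt an LLM for waypoints and parse them back into numbers.
# Assumes the openai>=1.0 client and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()
prompt = (
    "You are a motion planner for an autonomous vehicle.\n"
    "Ego speed: 5.2 m/s. Goal: keep lane, drive forward.\n"
    "Detected: car 12 m ahead moving at 4.8 m/s.\n"
    "Output 6 waypoints for the next 3 s as (x, y) in meters, one per line."
)
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic plans are easier to validate
)
text = resp.choices[0].message.content
waypoints = [tuple(map(float, line.strip("() ").split(",")))
             for line in text.splitlines() if "," in line]
print(waypoints)
```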
Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale dataset for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest dataset to date. Existing autonomous driving systems heavily rely on "perfect" visual perception models (i.e., detection) trained using extensive annotated data to ensure safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances such as night,...
Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains...
We present a simple and effective framework, named Point2Seq, for 3D object detection from point clouds. In contrast to previous methods that normally predict the attributes of 3D objects all at once, we expressively model the interdependencies between the attributes of objects, which in turn enables better detection accuracy. Specifically, we view each 3D object as a sequence of words and reformulate the 3D detection task as decoding words from 3D scenes in an auto-regressive manner. We further propose a lightweight scene-to-sequence decoder that can auto-regressively generate words conditioned on...
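The auto-regressive "object as a sequence of words" idea can be shown with a toy decoder that emits one attribute at a time, each conditioned on a scene feature and the previously decoded words. The word order, dimensions, and module names below are assumptions for illustration.

```python
# Toy sketch: decode an object's attributes one "word" at a time,
# conditioned on a scene feature and on previously generated words.
import torch
import torch.nn as nn

class SceneToSeqDecoder(nn.Module):
    def __init__(self, scene_dim=128, hidden=128, num_words=7):
        super().__init__()
        self.num_words = num_words                 # e.g. x, y, z, w, l, h, yaw
        self.init = nn.Linear(scene_dim, hidden)   # condition on the scene
        self.cell = nn.GRUCell(1, hidden)          # feeds back previous word
        self.head = nn.Linear(hidden, 1)           # regress the next word

    def forward(self, scene_feat):
        h = self.init(scene_feat)                  # (B, hidden)
        word = torch.zeros(scene_feat.size(0), 1)  # start token
        words = []
        for _ in range(self.num_words):
            h = self.cell(word, h)
            word = self.head(h)                    # next word given the last
            words.append(word)
        return torch.cat(words, dim=1)             # (B, num_words)

print(SceneToSeqDecoder()(torch.randn(2, 128)).shape)  # torch.Size([2, 7])
```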
Human-level driving is an ultimate goal of autonomous driving. Conventional approaches formulate autonomous driving as a perception-prediction-planning framework, yet their systems do not capitalize on the inherent reasoning ability and experiential knowledge of humans. In this paper, we propose a fundamental paradigm shift from current pipelines, exploiting Large Language Models (LLMs) as a cognitive agent to integrate human-like intelligence into autonomous driving systems. Our approach, termed Agent-Driver, transforms the traditional...
Autonomous driving, in recent years, has been receiving increasing attention for its potential to relieve drivers' burdens and improve the safety of driving. In modern autonomous driving pipelines, the perception system is an indispensable component that aims to accurately estimate the status of surrounding environments and provide reliable observations for prediction and planning. 3D object detection, which intelligently predicts the locations, sizes, and categories of the critical 3D objects near an autonomous vehicle, is an important part of a perception system. This...
Estimating the complete 3D point cloud from an incomplete one is a key problem in many vision and robotics applications. Mainstream methods (e.g., PCN and TopNet) use Multi-layer Perceptrons (MLPs) to directly process point clouds, which may cause the loss of details because the structural and contextual information of point clouds is not fully considered. To solve this problem, we introduce 3D grids as intermediate representations to regularize unordered point clouds. We therefore propose a novel Gridding Residual Network (GRNet) for point cloud completion. In...
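Gridding, i.e. scattering an unordered point cloud onto a regular 3D grid so that ordinary 3D convolutions apply, can be illustrated with trilinear weights. The NumPy toy below is a hedged stand-in for the paper's differentiable gridding layer; the resolution and weighting scheme are assumed details.

```python
# Toy sketch of gridding: each point spreads a unit of weight over the 8
# surrounding grid vertices via trilinear interpolation.
import numpy as np

def gridding(points, res=32):
    # points: (N, 3) with coordinates normalized to [0, 1)
    grid = np.zeros((res, res, res))
    scaled = points * (res - 1)
    base = np.floor(scaled).astype(int)
    frac = scaled - base
    for dx in (0, 1):
        for dy in (0, 1):
            for dz in (0, 1):
                w = (np.where(dx, frac[:, 0], 1 - frac[:, 0])
                     * np.where(dy, frac[:, 1], 1 - frac[:, 1])
                     * np.where(dz, frac[:, 2], 1 - frac[:, 2]))
                np.add.at(grid, (base[:, 0] + dx, base[:, 1] + dy,
                                 base[:, 2] + dz), w)
    return grid

pts = np.random.rand(1000, 3)
print(gridding(pts).sum())  # weights sum to the point count: 1000.0
```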
Adapting driving behavior to new environments, customs, and laws is a long-standing problem in autonomous driving, precluding the widespread deployment of autonomous vehicles (AVs). In this paper, we present LLaDA, a simple yet powerful tool that enables human drivers and autonomous vehicles alike to drive everywhere by adapting their tasks and motion plans to the traffic rules of new locations. LLaDA achieves this by leveraging the impressive zero-shot generalizability of large language models (LLMs) in interpreting the local driver handbook. Through an extensive user...
The tasks of object detection and trajectory forecasting play a crucial role in understanding the scene for autonomous driving. These tasks are typically executed in a cascading manner, making them prone to compounding errors. Furthermore, there is usually a very thin interface between the two tasks, creating a lossy information bottleneck. To address these challenges, our approach formulates their union as a trajectory refinement problem, where the first pose is the detection (current time) and subsequent poses are the waypoints of the multiple forecasts (future...
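Framing both tasks as refinement of a single pose sequence (current detection plus future waypoints) can be sketched with a small network that repeatedly predicts and applies corrections. Everything below (the MLP block, horizon, iteration count) is a hypothetical toy, not the paper's architecture.

```python
# Toy sketch: iteratively refine one pose sequence whose first pose is the
# detection and whose later poses are forecast waypoints.
import torch
import torch.nn as nn

class TrajRefiner(nn.Module):
    def __init__(self, horizon=6, dim=3, iters=3):   # pose = (x, y, yaw)
        super().__init__()
        self.iters = iters
        self.block = nn.Sequential(
            nn.Linear((horizon + 1) * dim, 64), nn.ReLU(),
            nn.Linear(64, (horizon + 1) * dim),
        )

    def forward(self, traj):                         # (B, horizon+1, dim)
        for _ in range(self.iters):
            delta = self.block(traj.flatten(1))      # predict a correction
            traj = traj + delta.view_as(traj)        # apply it, then repeat
        return traj

init = torch.zeros(2, 7, 3)                          # coarse initial poses
print(TrajRefiner()(init).shape)                     # torch.Size([2, 7, 3])
```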
This work proposes a retrieve-and-transfer framework for zero-shot robotic manipulation, dubbed RAM, featuring generalizability across various objects, environments, and embodiments. Unlike existing approaches that learn manipulation from expensive in-domain demonstrations, RAM capitalizes on a retrieval-based affordance transfer paradigm to acquire versatile manipulation capabilities from abundant out-of-domain data. First, RAM extracts unified affordance at scale from diverse sources of demonstrations including robotic data,...
Scalable learning of humanoid robots is crucial for their deployment in real-world applications. While traditional approaches primarily rely on reinforcement learning or teleoperation to achieve whole-body control, they are often limited by the diversity of simulated environments and the high costs of demonstration collection. In contrast, human videos are ubiquitous and present an untapped source of semantic and motion information that could significantly enhance the generalization capabilities of humanoid robots. This paper introduces...
Synthesizing photo-realistic visual observations from an ego vehicle's driving trajectory is a critical step towards scalable training of self-driving models. Reconstruction-based methods create 3D scenes from driving logs and synthesize geometry-consistent driving videos through neural rendering, but their dependence on costly object annotations limits their ability to generalize to in-the-wild driving scenarios. On the other hand, generative models can synthesize action-conditioned driving videos in a more generalizable way but often struggle with...