- Advanced Vision and Imaging
- Robotics and Sensor-Based Localization
- 3D Shape Modeling and Analysis
- Multimodal Machine Learning Applications
- Advanced Neural Network Applications
- 3D Surveying and Cultural Heritage
- Robot Manipulation and Learning
- Advanced Image and Video Retrieval Techniques
- Reinforcement Learning in Robotics
- Human Pose and Action Recognition
- Optical Measurement and Interference Techniques
- Robotic Path Planning Algorithms
- Constraint Satisfaction and Optimization
- Historical Geography and Cartography
- Hand Gesture Recognition Systems
- Image Processing Techniques and Applications
- Adversarial Robustness in Machine Learning
- Speech and Dialogue Systems
- Industrial Vision Systems and Defect Detection
- Human-Automation Interaction and Safety
- Domain Adaptation and Few-Shot Learning
- Robotics and Automated Systems
- Cell Image Analysis Techniques
- Soft Robotics and Applications
- Image Processing and 3D Reconstruction
Shanghai University
2024
Peking University
2023-2024
Beijing Academy of Artificial Intelligence
2023
Chang'an University
2023
National University of Defense Technology
2019-2022
Online reconstruction based on RGB-D sequences has thus far been restrained to relatively slow camera motions (<1 m/s). Under very fast motion (e.g., 3 m/s), the reconstruction can easily crumble even for state-of-the-art methods. Fast motion brings two challenges to depth fusion: 1) the high nonlinearity of camera pose optimization due to large inter-frame rotations, and 2) the lack of reliably trackable features due to motion blur. We propose to tackle the difficulties of fast-motion camera tracking in the absence of inertial measurements using random optimization, ...
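The random-optimization idea above can be illustrated with a deliberately simplified 2D toy: sample pose perturbations around the current best estimate, keep the best-scoring candidate, and anneal the sampling spread. The alignment objective and all names below are illustrative stand-ins, not the paper's actual particle-based optimizer:

```python
import numpy as np

def transform(points, pose):
    """Apply a 2D rigid transform (theta, tx, ty) to an (N, 2) point set."""
    theta, tx, ty = pose
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return points @ R.T + np.array([tx, ty])

def random_pose_optimization(src, dst, iters=60, n_samples=200, seed=0):
    """Derivative-free pose search: sample perturbations around the best
    pose so far, keep the lowest-residual candidate, shrink the spread.
    A toy stand-in for sampling-based fast-motion tracking."""
    rng = np.random.default_rng(seed)
    best = np.zeros(3)
    best_err = np.mean(np.sum((transform(src, best) - dst) ** 2, axis=1))
    sigma = np.array([0.5, 0.5, 0.5])  # search radius, annealed over time
    for _ in range(iters):
        candidates = best + rng.normal(0.0, 1.0, (n_samples, 3)) * sigma
        errs = [np.mean(np.sum((transform(src, c) - dst) ** 2, axis=1))
                for c in candidates]
        i = int(np.argmin(errs))
        if errs[i] < best_err:
            best, best_err = candidates[i], errs[i]
        sigma *= 0.9  # anneal the sampling spread
    return best, best_err

# Ground-truth motion: rotate by 0.4 rad, translate by (0.3, -0.2).
rng = np.random.default_rng(1)
src = rng.uniform(-1, 1, (100, 2))
dst = transform(src, np.array([0.4, 0.3, -0.2]))
pose, err = random_pose_optimization(src, dst)
```

Because no gradients of the residual are needed, such a sampler tolerates the highly nonlinear objectives that large inter-frame rotations produce, at the cost of many function evaluations.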
High-dimensional nonlinear state estimation is at the heart of inertial-aided navigation systems (INS). Traditional methods usually rely on good initialization and have difficulty handling large inter-frame transformations due to fast camera motion. We opt to tackle these challenges by solving the depth-inertial odometry (DIO) problem with random optimization. To address the exponentially increased amount of candidate states sampled from the high-dimensional state space, we propose a highly efficient variant ...
In this work, we tackle 6-DoF grasp detection for transparent and specular objects, which is an important yet challenging problem in vision-based robotic systems, due to the failure of depth cameras in sensing their geometry. We, for the first time, propose a multiview RGB-based network, GraspNeRF, that leverages a generalizable neural radiance field (NeRF) to achieve material-agnostic object grasping in clutter. Compared to existing NeRF-based 3-DoF grasp methods that rely on densely captured input images and time-consuming ...
Online semantic 3D segmentation in company with real-time RGB-D reconstruction poses special challenges, such as how to perform convolution directly over the progressively fused geometric data, and how to smartly fuse information from frame to frame. We propose a novel fusion-aware point convolution which operates on the surface being reconstructed and effectively exploits inter-frame correlation for high-quality feature learning. This is enabled by a dedicated dynamic data structure that organizes the online acquired point cloud ...
Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering that this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively unpractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose ...
We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeting at the recognition and segmentation of objects from the scene. Our algorithm is built on top of a volumetric depth fusion framework (e.g., KinectFusion) and performs real-time voxel-based semantic labeling over the reconstructed volume. The robot is guided by an online estimated discrete viewing score field (VSF) parameterized over the 3D ...
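As a loose illustration of view planning over a discrete score field, the toy sketch below greedily picks the unvisited candidate pose with the highest score; the grid, the scores, and the function names are all hypothetical, not the paper's VSF formulation:

```python
import numpy as np

def next_best_view(scores, visited):
    """Pick the unvisited grid pose with the highest viewing score."""
    masked = np.where(visited, -np.inf, scores)
    return np.unravel_index(np.argmax(masked), scores.shape)

# Hypothetical 3x3 grid of candidate camera poses; each score could mix
# recognition uncertainty and coverage gain (weighting is illustrative).
scores = np.array([[0.2, 0.9, 0.1],
                   [0.5, 0.8, 0.3],
                   [0.7, 0.4, 0.6]])
visited = np.zeros_like(scores, dtype=bool)
order = []
for _ in range(3):
    nbv = next_best_view(scores, visited)
    order.append(nbv)
    visited[nbv] = True  # scan from this pose, then mark it visited
```

In a real system the scores would be re-estimated after every scan as the reconstruction and labeling improve, rather than held fixed as here.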
We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation: the multi-implicit-submap. Different from existing methods, which lack either flexibility with a single map or scalability due to the extra storage of feature grids, we propose a pure neural representation tackling both difficulties with a divide-and-conquer design. In our method, submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local bundle adjustments. The submaps can be ...
A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching for objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse navigation tasks ...
Object goal navigation (ObjectNav) is a fundamental task of embodied AI that requires the agent to find a target object in unseen environments. This task is particularly challenging as it demands both perceptual and cognitive processes for effective perception and decision-making. While the perception side has gained significant progress, powered by rapidly developed visual foundation models, the cognitive side remains limited to either implicitly learning from massive demonstrations or explicitly leveraging pre-defined heuristic rules ...
Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics, we propose a novel metric, view consensus rate, to enhance the utilization of multi-view ...
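A toy rendition of a multi-view consensus metric may help make the idea concrete. The point-id sets, frame structure, and function name below are illustrative assumptions, not the paper's exact definition:

```python
def view_consensus_rate(mask_a_points, mask_b_points, frame_masks):
    """Toy multi-view consensus: over the frames that observe both
    candidate 3D masks, count how often a single 2D mask in that frame
    contains points from both sets (suggesting one underlying instance)."""
    supporting, observing = 0, 0
    for frame in frame_masks:  # each frame: list of 2D masks as point-id sets
        sees_a = any(m & mask_a_points for m in frame)
        sees_b = any(m & mask_b_points for m in frame)
        if sees_a and sees_b:
            observing += 1
            # consensus: some single 2D mask covers points from both sets
            if any(m & mask_a_points and m & mask_b_points for m in frame):
                supporting += 1
    return supporting / observing if observing else 0.0

mask_a = {1, 2, 3}          # 3D point ids of candidate mask A
mask_b = {4, 5}             # 3D point ids of candidate mask B
frames = [
    [{1, 2, 4}],            # one 2D mask spans both -> consensus
    [{1, 2}, {4, 5}],       # both visible but in separate masks -> no consensus
    [{1, 3}],               # only A visible -> frame does not count
]
rate = view_consensus_rate(mask_a, mask_b, frames)
```

Unlike a metric computed only between two neighboring frames, this rate aggregates evidence over every frame that sees both candidates; two masks would then be merged when the rate exceeds some threshold.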
Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision-language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art navigation performance ...
Recent research on Vision-and-Language Navigation (VLN) indicates that agents suffer from poor generalization in unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes have high costs, and the extension of instructions mainly relies on predefined templates or rules, lacking adaptability. To alleviate the issue, we propose InstruGen, a VLN path-instruction pairs generation paradigm. Specifically, we use YouTube house tour videos ...
Choosing appropriate hyperparameters plays a crucial role in the success of neural networks, as hyperparameters directly control the behavior and performance of training algorithms. To obtain efficient tuning, Bayesian optimization methods based on Gaussian process (GP) models are widely used. Despite numerous applications in deep learning, existing methodologies are developed under the convenient but restrictive assumption that the tuning parameters are independent of each other. However, tuning parameters with conditional dependence are common ...
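For background, a minimal Bayesian-optimization loop with a GP surrogate and expected improvement, in exactly the independent-parameter setting the abstract criticizes, can be sketched as follows; the toy objective and all names are assumptions:

```python
import math
import numpy as np

def rbf(A, B, ls=0.3):
    """Squared-exponential (RBF) kernel between two (n, d) point sets."""
    d2 = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-0.5 * d2 / ls**2)

def gp_posterior(X, y, Xs, noise=1e-6):
    """GP posterior mean and std at query points Xs given data (X, y)."""
    L = np.linalg.cholesky(rbf(X, X) + noise * np.eye(len(X)))
    Ks = rbf(X, Xs)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Ks)
    var = np.clip(np.diag(rbf(Xs, Xs) - v.T @ v), 1e-12, None)
    return Ks.T @ alpha, np.sqrt(var)

def expected_improvement(mu, sigma, best):
    """EI acquisition for minimization."""
    z = (best - mu) / sigma
    Phi = 0.5 * (1 + np.vectorize(math.erf)(z / math.sqrt(2)))
    phi = np.exp(-0.5 * z**2) / math.sqrt(2 * math.pi)
    return (best - mu) * Phi + sigma * phi

def f(x):  # toy "validation loss" of a single hyperparameter
    return np.sin(3 * x) + 0.5 * x**2

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, (3, 1))           # random initial evaluations
y = f(X[:, 0])
grid = np.linspace(-2, 2, 201)[:, None]  # candidate hyperparameter values
for _ in range(20):
    mu, sigma = gp_posterior(X, y, grid)
    x_next = grid[int(np.argmax(expected_improvement(mu, sigma, y.min())))]
    X = np.vstack([X, x_next[None, :]])
    y = np.append(y, f(x_next[0]))
best_x = X[np.argmin(y), 0]
```

Note the kernel treats each input dimension symmetrically and independently; modeling conditionally dependent tuning parameters, as the abstract proposes, requires going beyond this standard setup.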
Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity of part-based interactions required for flexible ...
Camera placement is crucial in multi-camera systems such as virtual reality, autonomous driving, and high-quality reconstruction. The camera placement challenge lies in the nonlinear nature of the high-dimensional parameters and the unavailability of gradients for target functions like coverage and visibility. Consequently, most existing methods tackle this challenge by leveraging non-gradient-based optimization methods. In this work, we present a hybrid approach that incorporates both gradient-based and non-gradient-based methods. This design allows our method ...
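A toy hybrid scheme in the same spirit combines a non-gradient global stage (random proposals over candidate placements) with gradient-based local refinement of a smoothed coverage surrogate. The coverage model and every name below are illustrative assumptions, not the paper's method:

```python
import numpy as np

def soft_coverage(cams, targets, radius=0.6):
    """Differentiable surrogate for coverage: each target earns credit
    that decays smoothly with distance to its nearest camera."""
    d2 = np.sum((targets[:, None, :] - cams[None, :, :]) ** 2, axis=2)
    return np.mean(np.max(np.exp(-d2 / radius**2), axis=1))

def coverage_grad(cams, targets, eps=1e-4):
    """Central finite-difference gradient w.r.t. camera positions."""
    g = np.zeros_like(cams)
    for i in np.ndindex(cams.shape):
        e = np.zeros_like(cams)
        e[i] = eps
        g[i] = (soft_coverage(cams + e, targets)
                - soft_coverage(cams - e, targets)) / (2 * eps)
    return g

def hybrid_place(targets, n_cams=2, restarts=5, steps=150, lr=0.3, seed=0):
    rng = np.random.default_rng(seed)
    best, best_score = None, -np.inf
    for _ in range(restarts):            # global: random placement proposals
        cams = targets[rng.choice(len(targets), n_cams, replace=False)].copy()
        for _ in range(steps):           # local: gradient ascent on coverage
            cams = cams + lr * coverage_grad(cams, targets)
        s = soft_coverage(cams, targets)
        if s > best_score:
            best, best_score = cams, s
    return best, best_score

# Two target clusters; two cameras should settle near the cluster centers.
targets = np.vstack([
    np.random.default_rng(1).normal([-0.7, 0.0], 0.05, (20, 2)),
    np.random.default_rng(2).normal([0.7, 0.0], 0.05, (20, 2)),
])
cams, score = hybrid_place(targets)
```

The smoothed exponential falloff stands in for the non-differentiable coverage/visibility functions the abstract mentions: the global stage escapes poor basins, while the gradient stage refines placements the random proposals alone would miss.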