Jiazhao Zhang

ORCID: 0000-0001-9459-293X
Research Areas
  • Advanced Vision and Imaging
  • Robotics and Sensor-Based Localization
  • 3D Shape Modeling and Analysis
  • Multimodal Machine Learning Applications
  • Advanced Neural Network Applications
  • 3D Surveying and Cultural Heritage
  • Robot Manipulation and Learning
  • Advanced Image and Video Retrieval Techniques
  • Reinforcement Learning in Robotics
  • Human Pose and Action Recognition
  • Optical measurement and interference techniques
  • Robotic Path Planning Algorithms
  • Constraint Satisfaction and Optimization
  • Historical Geography and Cartography
  • Hand Gesture Recognition Systems
  • Image Processing Techniques and Applications
  • Adversarial Robustness in Machine Learning
  • Speech and dialogue systems
  • Industrial Vision Systems and Defect Detection
  • Human-Automation Interaction and Safety
  • Domain Adaptation and Few-Shot Learning
  • Robotics and Automated Systems
  • Cell Image Analysis Techniques
  • Soft Robotics and Applications
  • Image Processing and 3D Reconstruction

Shanghai University
2024

Peking University
2023-2024

Beijing Academy of Artificial Intelligence
2023

Chang'an University
2023

National University of Defense Technology
2019-2022

Online reconstruction based on RGB-D sequences has thus far been restrained to relatively slow camera motions (<1 m/s). Under very fast motion (e.g., 3 m/s), the reconstruction can easily crumble even for state-of-the-art methods. Fast motion brings two challenges to depth fusion: 1) the high nonlinearity of pose optimization due to large inter-frame rotations and 2) the lack of reliably trackable features due to motion blur. We propose to tackle the difficulties of fast-motion camera tracking in the absence of inertial measurements using random optimization, ...

10.48550/arxiv.2105.05600 preprint EN cc-by arXiv (Cornell University) 2021-01-01
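The random-optimization idea above (derivative-free camera tracking without inertial measurements) can be illustrated with a simple sampling-and-shrinking loop. This is a generic sketch, not the paper's algorithm: the function names are mine, and the toy quadratic stands in for a real depth-alignment residual.

```python
import numpy as np

def random_optimize_pose(fitness, pose0, n_particles=64, n_iters=10,
                         sigma0=0.1, shrink=0.7, rng=None):
    """Derivative-free pose search: sample perturbations around the best
    pose found so far and shrink the search radius each iteration.
    `pose` is a 6-vector (3 rotation + 3 translation parameters);
    `fitness` is lower-is-better (e.g. a depth alignment residual)."""
    rng = np.random.default_rng(rng)
    best, best_f = np.asarray(pose0, float), fitness(pose0)
    sigma = sigma0
    for _ in range(n_iters):
        cands = best + sigma * rng.standard_normal((n_particles, 6))
        scores = np.array([fitness(c) for c in cands])
        i = scores.argmin()
        if scores[i] < best_f:
            best, best_f = cands[i], scores[i]
        sigma *= shrink  # anneal the sampling radius
    return best, best_f

# Toy residual with a known optimum, standing in for a depth-fusion error:
target = np.array([0.3, -0.2, 0.1, 0.05, 0.0, -0.4])
pose, err = random_optimize_pose(lambda p: np.sum((p - target) ** 2),
                                 np.zeros(6), rng=0)
```

The appeal of this family of methods, as the abstract notes, is that it sidesteps the high nonlinearity of pose optimization: no gradient of the residual is ever needed.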

High-dimensional nonlinear state estimation is at the heart of inertial-aided navigation systems (INS). Traditional methods usually rely on good initialization and have difficulty in handling large interframe transformations due to fast camera motion. We opt to tackle these challenges by solving the depth inertial odometry (DIO) problem with random optimization. To address the exponentially increased amount of candidate states sampled for a high-dimensional state space, we propose a highly efficient variant ...

10.1109/tro.2022.3208503 article EN IEEE Transactions on Robotics 2022-10-11

In this work, we tackle 6-DoF grasp detection for transparent and specular objects, which is an important yet challenging problem in vision-based robotic systems, due to the failure of depth cameras in sensing their geometry. We, for the first time, propose a multiview RGB-based network, GraspNeRF, that leverages a generalizable neural radiance field (NeRF) to achieve material-agnostic object grasping in clutter. Compared with existing NeRF-based 3-DoF methods, which rely on densely captured input images and time-consuming ...

10.1109/icra48891.2023.10160842 article EN 2023-05-29

Online semantic 3D segmentation in company with real-time RGB-D reconstruction poses special challenges, such as how to perform convolution directly over the progressively fused geometric data and how to smartly fuse information from frame to frame. We propose a novel fusion-aware point convolution which operates on the surface being reconstructed and effectively exploits the inter-frame correlation for high-quality feature learning. This is enabled by a dedicated dynamic data structure that organizes the online acquired point cloud ...

10.1109/cvpr42600.2020.00459 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2020-06-01
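A dynamic structure for organizing an online-acquired point cloud, as the abstract describes, can be approximated with a simple voxel hash that supports incremental insertion and radius queries. This is an illustrative stand-in under my own naming (`VoxelHash`, `neighbors`), not the paper's actual data structure.

```python
import numpy as np
from collections import defaultdict

class VoxelHash:
    """Minimal dynamic point store: buckets points by voxel so that
    per-point neighborhoods can be gathered as new frames are fused.
    Illustrative sketch, not the paper's implementation."""
    def __init__(self, voxel=0.05):
        self.voxel = voxel
        self.cells = defaultdict(list)   # voxel index -> list of points

    def _key(self, p):
        return tuple(np.floor(p / self.voxel).astype(int))

    def insert(self, pts):
        # Called once per fused frame with the newly acquired points.
        for p in np.asarray(pts, float):
            self.cells[self._key(p)].append(p)

    def neighbors(self, q, radius):
        # Gather all stored points within `radius` of query point `q`
        # by scanning only the voxels that can intersect the ball.
        q = np.asarray(q, float)
        r = int(np.ceil(radius / self.voxel))
        kx, ky, kz = self._key(q)
        out = []
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    for p in self.cells.get((kx + dx, ky + dy, kz + dz), []):
                        if np.linalg.norm(p - q) <= radius:
                            out.append(p)
        return np.array(out)
```

A point convolution layer would then aggregate features over each `neighbors(...)` set; the hash keeps that query cheap even as frames keep arriving.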

Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering that this task happens in 3D space, a 3D-aware agent can advance its capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively unpractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose ...

10.1109/cvpr52729.2023.00645 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2023-06-01

Online reconstruction based on RGB-D sequences has thus far been restrained to relatively slow camera motions (<1 m/s). Under very fast motion (e.g., 3 m/s), the reconstruction can easily crumble even for state-of-the-art methods. Fast motion brings two challenges to depth fusion: 1) the high nonlinearity of pose optimization due to large inter-frame rotations and 2) the lack of reliably trackable features due to motion blur. We propose to tackle the difficulties of fast-motion camera tracking in the absence of inertial measurements using random optimization, ...

10.1145/3450626.3459676 article EN ACM Transactions on Graphics 2021-07-19

We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeted at the recognition and segmentation of objects from the scene. Our algorithm is built on top of a volumetric depth fusion framework and performs real-time voxel-based semantic labeling over the reconstructed volume. The robot scanning is guided by an estimated discrete viewing score field (VSF) parameterized over the 3D ...

10.1111/cgf.13820 article EN Computer Graphics Forum 2019-10-01

10.1109/cvpr52733.2024.02671 article EN 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2024-06-16

We introduce MIPS-Fusion, a robust and scalable online RGB-D reconstruction method based on a novel neural implicit representation: the multi-implicit-submap. Different from existing methods, which lack either flexibility with a single neural map or scalability due to the extra storage of feature grids, we propose a purely neural representation that tackles both difficulties with a divide-and-conquer design. In our method, neural submaps are incrementally allocated alongside the scanning trajectory and efficiently learned with local bundle adjustments. The submaps can be ...

10.1145/3618363 article EN ACM Transactions on Graphics 2023-12-05
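The divide-and-conquer allocation described above can be sketched as a rule that opens a new submap whenever the camera leaves the active submap's coverage. This is a deliberately simplified illustration of the general idea, not MIPS-Fusion's actual allocation policy; the function name and radius criterion are mine.

```python
import numpy as np

def allocate_submaps(trajectory, radius=1.0):
    """Start a new submap whenever the camera moves farther than
    `radius` from the active submap's anchor position. Returns a list
    of (anchor, [frame indices]) pairs, one per submap. Illustrative
    sketch only; the paper's criterion may differ."""
    submaps = []
    anchor, frames = None, []
    for i, pos in enumerate(np.asarray(trajectory, float)):
        if anchor is None or np.linalg.norm(pos - anchor) > radius:
            if frames:
                submaps.append((anchor, frames))  # close current submap
            anchor, frames = pos, []              # open a new one
        frames.append(i)
    if frames:
        submaps.append((anchor, frames))
    return submaps
```

Each returned group of frames would then train its own local implicit map, with local bundle adjustment confined to the frames inside the submap.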

10.1145/3476576.3476604 article ACM Transactions on Graphics 2021-07-17

A practical navigation agent must be capable of handling a wide range of interaction demands, such as following instructions, searching for objects, answering questions, tracking people, and more. Existing models for embodied navigation fall short of serving as generalists in the real world, as they are often constrained by specific task configurations or pre-defined maps with discretized waypoints. In this work, we present Uni-NaVid, the first video-based vision-language-action (VLA) model designed to unify diverse embodied navigation tasks ...

10.48550/arxiv.2412.06224 preprint EN arXiv (Cornell University) 2024-12-09

Object goal navigation (ObjectNav) is a fundamental task of embodied AI that requires the agent to find a target object in unseen environments. This task is particularly challenging as it demands both perceptual and cognitive processes for effective perception and decision-making. While perception has gained significant progress, powered by rapidly developed visual foundation models, progress on the cognitive side remains limited to either implicitly learning from massive navigation demonstrations or explicitly leveraging pre-defined heuristic rules ...

10.48550/arxiv.2412.10439 preprint EN arXiv (Cornell University) 2024-12-11

Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics, we propose a novel metric, the view consensus rate, to enhance the utilization of multi-view ...

10.48550/arxiv.2401.07745 preprint EN cc-by arXiv (Cornell University) 2024-01-01
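The intuition behind a multi-view agreement score, as opposed to a metric between two neighboring frames, can be sketched as follows. This is loosely inspired by the "view consensus rate" idea; the exact definition in the paper may differ, and the function name, `cover` threshold, and mask representation here are my own simplifications.

```python
def view_consensus_rate(seg_a, seg_b, view_masks, cover=0.8):
    """Fraction of views in which a single 2D mask covers most of the
    visible points of BOTH candidate 3D segments. A high rate suggests
    the two segments belong to one instance and should be merged.
    Each view is given as a list of masks, each mask a set of 3D point
    ids it covers (an illustrative simplification)."""
    a, b = set(seg_a), set(seg_b)
    votes = total = 0
    for masks in view_masks:
        visible = set().union(*masks) if masks else set()
        if not (a & visible and b & visible):
            continue                  # this view cannot judge the pair
        total += 1
        for m in masks:
            if (len(a & m) >= cover * len(a & visible)
                    and len(b & m) >= cover * len(b & visible)):
                votes += 1            # one mask spans both segments
                break
    return votes / total if total else 0.0
```

Because every view that sees both segments gets a vote, the score pools evidence across the whole sequence instead of relying on any single pair of neighboring frames.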

Vision-and-Language Navigation (VLN) stands as a key research problem of Embodied AI, aiming at enabling agents to navigate in unseen environments following linguistic instructions. In this field, generalization is a long-standing challenge, either to out-of-distribution scenes or from Sim to Real. In this paper, we propose NaVid, a video-based large vision language model (VLM), to mitigate such a generalization gap. NaVid makes the first endeavour to showcase the capability of VLMs to achieve state-of-the-art level navigation performance ...

10.48550/arxiv.2402.15852 preprint EN arXiv (Cornell University) 2024-02-24

Recent research on Vision-and-Language Navigation (VLN) indicates that agents suffer from poor generalization in unseen environments due to the lack of realistic training environments and high-quality path-instruction pairs. Most existing methods for constructing realistic navigation scenes have high costs, and the extension of instructions mainly relies on predefined templates or rules, lacking adaptability. To alleviate the issue, we propose InstruGen, a VLN path-instruction pairs generation paradigm. Specifically, we use YouTube house tour videos ...

10.48550/arxiv.2411.11394 preprint EN arXiv (Cornell University) 2024-11-18

Choosing appropriate hyperparameters plays a crucial role in the success of neural networks, as hyperparameters directly control the behavior and performance of training algorithms. To obtain efficient tuning, Bayesian optimization methods based on Gaussian process (GP) models are widely used. Despite numerous applications in deep learning, the existing methodologies are developed under a convenient but restrictive assumption that the tuning parameters are independent of each other. However, tuning parameters with conditional dependence are common ...

10.48550/arxiv.2402.04885 preprint EN arXiv (Cornell University) 2024-01-19
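As background for the GP-based tuning the abstract builds on, the standard loop under the independence assumption the paper relaxes can be sketched with a tiny NumPy Gaussian process and a lower-confidence-bound acquisition. All names, kernel choices, and constants here are illustrative, not the paper's method.

```python
import numpy as np

def rbf(A, B, ls=0.2):
    # Squared-exponential kernel between row-vector sets A and B.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

def gp_posterior(X, y, Xs, noise=1e-6):
    # GP posterior mean and std at candidates Xs given data (X, y).
    K = rbf(X, X) + noise * np.eye(len(X))
    Ks = rbf(X, Xs)
    sol = np.linalg.solve(K, Ks)               # K^{-1} Ks
    mu = sol.T @ y
    var = np.clip(1.0 - (Ks * sol).sum(0), 1e-12, None)
    return mu, np.sqrt(var)

def bo_minimize(f, lo, hi, n_init=5, n_iter=15, seed=0):
    """Vanilla Bayesian optimization: fit a GP to the evaluations so
    far, then evaluate the random candidate with the lowest
    lower-confidence bound (mean minus 1.96 std)."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(lo, hi, (n_init, len(lo)))
    y = np.array([f(x) for x in X])
    for _ in range(n_iter):
        cand = rng.uniform(lo, hi, (256, len(lo)))
        mu, sd = gp_posterior(X, (y - y.mean()) / (y.std() + 1e-9), cand)
        x_next = cand[np.argmin(mu - 1.96 * sd)]
        X = np.vstack([X, x_next])
        y = np.append(y, f(x_next))
    i = y.argmin()
    return X[i], y[i]
```

The single isotropic RBF kernel over all coordinates is exactly the "independent parameters" modeling convenience the abstract criticizes: it has no way to express that one hyperparameter is only meaningful conditional on another.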

Effectively manipulating articulated objects in household scenarios is a crucial step toward achieving general embodied artificial intelligence. Mainstream research in 3D vision has primarily focused on manipulation through depth perception and pose detection. However, in real-world environments, these methods often face challenges due to imperfect depth perception, such as with transparent lids and reflective handles. Moreover, they generally lack the diversity of part-based interactions required for flexible ...

10.48550/arxiv.2411.18276 preprint EN arXiv (Cornell University) 2024-11-27

Camera placement is crucial in multi-camera systems such as virtual reality, autonomous driving, and high-quality reconstruction. The camera placement challenge lies in the nonlinear nature of the high-dimensional parameters and the unavailability of gradients for target functions like coverage and visibility. Consequently, most existing methods tackle this challenge by leveraging non-gradient-based optimization methods. In this work, we present a hybrid approach that incorporates both gradient-based and non-gradient-based optimization methods. This design allows our method ...

10.48550/arxiv.2412.08266 preprint EN arXiv (Cornell University) 2024-12-11
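The hybrid pattern described above, a derivative-free global stage feeding a gradient-based local stage, can be sketched generically. The sketch below uses random sampling plus finite-difference descent on a toy objective; the paper's actual objectives (coverage, visibility) and its gradient formulation are not reproduced here, and all names are mine.

```python
import numpy as np

def hybrid_optimize(f, dim, n_samples=200, n_grad=50, lr=0.05,
                    eps=1e-4, seed=0):
    """Two-stage optimization: (1) non-gradient global search picks the
    best of `n_samples` random points; (2) gradient-based refinement
    runs finite-difference descent from that seed (standing in for an
    analytic gradient when one is available)."""
    rng = np.random.default_rng(seed)
    cands = rng.uniform(-1, 1, (n_samples, dim))
    x = min(cands, key=f).copy()          # stage 1: global exploration
    for _ in range(n_grad):               # stage 2: local refinement
        g = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                      for e in np.eye(dim)])
        x -= lr * g
    return x, f(x)
```

The split mirrors the motivation in the abstract: sampling copes with the nonlinear, multi-modal landscape, while the gradient stage converges quickly once a good basin is found.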

We propose a novel approach to robot-operated active understanding of unknown indoor scenes, based on online RGBD reconstruction with semantic segmentation. In our method, the exploratory robot scanning is both driven by and targeted at the recognition and segmentation of objects from the scene. Our algorithm is built on top of a volumetric depth fusion framework (e.g., KinectFusion) and performs real-time voxel-based semantic labeling over the reconstructed volume. The robot scanning is guided by an estimated discrete viewing score field (VSF) ...

10.48550/arxiv.1906.07409 preprint EN other-oa arXiv (Cornell University) 2019-01-01

Online semantic 3D segmentation in company with real-time RGB-D reconstruction poses special challenges, such as how to perform convolution directly over the progressively fused geometric data and how to smartly fuse information from frame to frame. We propose a novel fusion-aware point convolution which operates on the surface being reconstructed and effectively exploits the inter-frame correlation for high-quality feature learning. This is enabled by a dedicated dynamic data structure that organizes the online acquired point cloud ...

10.48550/arxiv.2003.06233 preprint EN other-oa arXiv (Cornell University) 2020-01-01