- Advanced Vision and Imaging
- Robotics and Sensor-Based Localization
- Optical Measurement and Interference Techniques
- Image Processing Techniques and Applications
- Advanced Neural Network Applications
- Advanced Image and Video Retrieval Techniques
- Video Surveillance and Tracking Methods
- 3D Shape Modeling and Analysis
- Computer Graphics and Visualization Techniques
- Autonomous Vehicle Technology and Safety
- Human Pose and Action Recognition
- 3D Surveying and Cultural Heritage
- Multimodal Machine Learning Applications
- Robot Manipulation and Learning
- Advanced Optical Sensing Technologies
- Robotic Path Planning Algorithms
- Domain Adaptation and Few-Shot Learning
- Industrial Vision Systems and Defect Detection
- Advanced Image Processing Techniques
- Image and Object Detection Techniques
- Soft Robotics and Applications
- Adversarial Robustness in Machine Learning
- Tactile and Sensory Interactions
- Reinforcement Learning in Robotics
- Remote Sensing and LiDAR Applications
Toyota Research Institute
2018-2024
Toyota Industries (United States)
2019-2023
Toyota Motor Corporation (Switzerland)
2019-2021
KTH Royal Institute of Technology
2014-2020
Constructor University
2008
Although cameras are ubiquitous, robotic platforms typically rely on active sensors like LiDAR for direct 3D perception. In this work, we propose a novel self-supervised monocular depth estimation method combining geometry with a new deep network, PackNet, learned only from unlabeled videos. Our architecture leverages novel symmetrical packing and unpacking blocks to jointly learn to compress and decompress detail-preserving representations using 3D convolutions. Although self-supervised, our method outperforms other self-,...
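As a rough illustration of the packing idea, the sketch below folds spatial detail into channels with space-to-depth and learns to compress it with a 3D convolution instead of discarding it through pooling. The class name `PackingBlock` and all layer sizes are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class PackingBlock(nn.Module):
    """Sketch of a PackNet-style packing block: space-to-depth folds
    spatial detail into channels, and a 3D convolution learns to compress
    the folded representation rather than discarding it via pooling.
    Layer sizes are illustrative, not the paper's."""
    def __init__(self, in_channels, out_channels, r=2, d=8):
        super().__init__()
        self.unshuffle = nn.PixelUnshuffle(r)       # H,W -> H/r,W/r; C -> C*r^2
        self.conv3d = nn.Conv3d(1, d, kernel_size=3, padding=1)
        self.conv2d = nn.Conv2d(in_channels * r * r * d, out_channels,
                                kernel_size=3, padding=1)

    def forward(self, x):
        x = self.unshuffle(x)                       # (B, C*r^2, H/r, W/r)
        x = self.conv3d(x.unsqueeze(1))             # (B, d, C*r^2, H/r, W/r)
        b, d, c, h, w = x.shape
        x = x.reshape(b, d * c, h, w)               # fold 3D features back to 2D
        return self.conv2d(x)

feat = torch.randn(1, 16, 64, 64)
print(PackingBlock(16, 32)(feat).shape)             # torch.Size([1, 32, 32, 32])
```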
Recent progress in 3D object detection from single images leverages monocular depth estimation as a way to produce 3D point clouds, turning cameras into pseudo-lidar sensors. These two-stage detectors improve with the accuracy of the intermediate depth network, which can itself be improved without manual labels via large-scale self-supervised learning. However, they tend to suffer from overfitting more than end-to-end methods, are more complex, and the gap with similar lidar-based detectors remains significant. In this work, we propose...
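The pseudo-lidar conversion itself is a simple unprojection of each pixel by its estimated depth. Below is a minimal NumPy sketch assuming a pinhole camera; the KITTI-like intrinsics are placeholders.

```python
import numpy as np

def depth_to_pseudo_lidar(depth, K):
    """Back-project a depth map into a 3D point cloud ("pseudo-lidar").
    depth: (H, W) metric depth; K: (3, 3) pinhole intrinsics.
    Returns (H*W, 3) points in the camera frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))            # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)
    rays = pix @ np.linalg.inv(K).T                           # unit-depth rays
    return rays * depth.reshape(-1, 1)                        # scale by depth

# Placeholder KITTI-like intrinsics and a constant 10 m depth map.
K = np.array([[721.5, 0.0, 609.6], [0.0, 721.5, 172.9], [0.0, 0.0, 1.0]])
points = depth_to_pseudo_lidar(np.full((375, 1242), 10.0), K)
print(points.shape)   # (465750, 3)
```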
Recent techniques in self-supervised monocular depth estimation are approaching the performance of supervised methods, but operate at low resolution only. We show that high resolution is key towards high-fidelity depth prediction. Inspired by recent deep learning methods for Single-Image Super-Resolution, we propose a sub-pixel convolutional layer extension for depth super-resolution that accurately synthesizes high-resolution disparities from their corresponding low-resolution convolutional features. In addition, we introduce a differentiable...
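For reference, a sub-pixel convolutional layer predicts r² values per output pixel and rearranges them with `PixelShuffle` into an r-times higher-resolution map, instead of naive interpolation. This is a generic sketch of the mechanism, not the paper's head; `SubpixelDisparityHead` and the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

class SubpixelDisparityHead(nn.Module):
    """Sketch of sub-pixel convolutional upsampling: a convolution predicts
    r^2 channels per output disparity and PixelShuffle rearranges them into
    an r-times higher-resolution map. Channel sizes are illustrative."""
    def __init__(self, in_channels, r=4):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)   # (B, r^2, H, W) -> (B, 1, r*H, r*W)

    def forward(self, feats):
        return torch.sigmoid(self.shuffle(self.conv(feats)))

low_res_feats = torch.randn(1, 64, 48, 160)
disp = SubpixelDisparityHead(64)(low_res_feats)
print(disp.shape)   # torch.Size([1, 1, 192, 640])
```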
Multi-object tracking is an important ability for an autonomous vehicle to safely navigate a traffic scene. Current state-of-the-art methods follow the tracking-by-detection paradigm, where existing tracks are associated with detected objects through some distance metric. The key challenges to increasing tracking accuracy lie in data association and track life cycle management. We propose a probabilistic, multi-modal, multi-object tracking system consisting of different trainable modules to provide robust, data-driven tracking results. First,...
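A minimal sketch of the data-association step in tracking-by-detection, using SciPy's Hungarian solver: build a pairwise distance matrix between predicted track states and new detections, solve the assignment, and gate matches whose distance is too large. Plain Euclidean distance and the `gate` value stand in for the learned/Mahalanobis metric and gating used in practice.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, gate=5.0):
    """Associate predicted track positions (N, 3) with detections (M, 3).
    Euclidean distance is a stand-in for a Mahalanobis / learned metric;
    the gate threshold is illustrative."""
    cost = np.linalg.norm(tracks[:, None, :] - detections[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)            # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols) if cost[r, c] < gate]
    unmatched_dets = set(range(len(detections))) - {c for _, c in matches}
    return matches, unmatched_dets   # unmatched detections spawn new tracks

tracks = np.array([[0.0, 0.0, 0.0], [10.0, 5.0, 0.0]])
dets = np.array([[0.4, -0.2, 0.1], [30.0, 2.0, 0.0]])
print(associate(tracks, dets))      # ([(0, 0)], {1})
```

Track life cycle management then acts on the outputs: matched tracks are updated, unmatched detections are born as tentative tracks, and tracks unmatched for several frames are terminated.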
Building 3D perception systems for autonomous vehicles that do not rely on high-density LiDAR is a critical research problem because of the expense of LiDAR compared to cameras and other sensors. Recent research has developed a variety of camera-only methods, where features are differentiably "lifted" from multi-camera images onto the 2D ground plane, yielding a "bird's eye view" (BEV) feature representation of the space around the vehicle. This line of work has produced a variety of novel "lifting" methods, but we observe that details in the training setups have...
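A parameter-free version of the "lifting" step can be written in a few lines: project a grid of 3D points near the ground plane into the image and bilinearly sample camera features at those locations. The sketch below assumes a single camera in its own frame and treats the feature map as image-sized; intrinsics and grid extents are placeholders.

```python
import torch
import torch.nn.functional as F

def lift_to_bev(feats, K, grid_xyz):
    """Bilinear-sampling lift: project BEV grid points into the image and
    sample features there. feats: (1, C, Hf, Wf) treated as image-sized;
    K: (3, 3) intrinsics; grid_xyz: (X, Z, 3) points in the camera frame."""
    pts = grid_xyz.reshape(-1, 3) @ K.T                   # pinhole projection
    uv = pts[:, :2] / pts[:, 2:3].clamp(min=1e-5)         # pixel coordinates
    H, W = feats.shape[-2:]
    uv = uv / torch.tensor([W - 1.0, H - 1.0]) * 2 - 1    # to [-1, 1] for grid_sample
    uv = uv.reshape(1, *grid_xyz.shape[:2], 2)
    return F.grid_sample(feats, uv, align_corners=True)   # (1, C, X, Z) BEV features

K = torch.tensor([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
xs, zs = torch.meshgrid(torch.linspace(-20, 20, 100),
                        torch.linspace(1, 50, 100), indexing="ij")
grid = torch.stack([xs, torch.full_like(xs, 1.5), zs], dim=-1)  # ground 1.5 m below a y-down camera
print(lift_to_bev(torch.randn(1, 64, 60, 80), K, grid).shape)   # torch.Size([1, 64, 100, 100])
```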
Thanks to the efforts of the robotics and autonomous systems community, robots are becoming ever more capable. There is also an increasing demand from end-users for service robots that can operate in real environments for extended periods. In the STRANDS project we are tackling this demand head-on by integrating state-of-the-art artificial intelligence research into mobile robots, and deploying these in long-term installations in security and care environments. Over four deployments, our robots have been operational for a combined duration of 104...
Self-supervised learning is showing great promise for monocular depth estimation, using geometry as the only source of supervision. Depth networks are indeed capable of learning representations that relate visual appearance to 3D properties by implicitly leveraging category-level patterns. In this work we investigate how to leverage this semantic structure more directly to guide geometric representation learning, while remaining in the self-supervised regime. Instead of using semantic labels and proxy losses in a multi-task approach,...
Multi-frame depth estimation improves over single-frame approaches by also leveraging geometric relationships between images via feature matching, in addition to learning appearance-based features. In this paper we revisit feature matching for self-supervised monocular depth estimation, and propose a novel transformer architecture for cost volume generation. We use depth-discretized epipolar sampling to select matching candidates, and refine predictions through a series of self- and cross-attention layers. These layers sharpen the...
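The sampling step can be sketched as a plane sweep: each target pixel is back-projected at a set of hypothesis depths, warped into the source view with the relative pose, and matched against sampled source features by dot product, yielding a (D, H, W) cost volume. This is a generic sketch under assumed shapes and poses, not the paper's transformer module.

```python
import torch
import torch.nn.functional as F

def epipolar_cost_volume(feat_t, feat_s, K, T_ts, depths):
    """Depth-discretized matching sketch: back-project target pixels at D
    hypothesis depths, warp into the source view via T_ts, and score
    matches by feature dot product. Single batch, illustrative shapes."""
    B, C, H, W = feat_t.shape
    u, v = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                          torch.arange(H, dtype=torch.float32), indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], -1).reshape(-1, 3)  # (HW, 3)
    rays = pix @ torch.linalg.inv(K).T                                # (HW, 3)
    volume = []
    for d in depths:                                 # one hypothesis plane at a time
        pts = rays * d                               # back-project at depth d
        pts = pts @ T_ts[:3, :3].T + T_ts[:3, 3]     # move into source frame
        uv = pts @ K.T
        uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-5)
        uv = uv / torch.tensor([W - 1.0, H - 1.0]) * 2 - 1
        sampled = F.grid_sample(feat_s, uv.reshape(1, H, W, 2), align_corners=True)
        volume.append((feat_t * sampled).sum(1))     # (B, H, W) matching score
    return torch.stack(volume, 1)                    # (B, D, H, W) cost volume

K = torch.eye(3); K[0, 0] = K[1, 1] = 100.0; K[0, 2], K[1, 2] = 32.0, 24.0
T = torch.eye(4); T[0, 3] = 0.5                      # half-metre sideways baseline
cv = epipolar_cost_volume(torch.randn(1, 8, 48, 64), torch.randn(1, 8, 48, 64),
                          K, T, depths=torch.linspace(1.0, 20.0, 16))
print(cv.shape)   # torch.Size([1, 16, 48, 64])
```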
Estimating scene geometry from data obtained with cost-effective sensors is key for robots and self-driving cars. In this paper, we study the problem of predicting dense depth from a single RGB image (monodepth) and optional sparse measurements from low-cost active sensors. We introduce Sparse Auxiliary Networks (SANs), a new module enabling monodepth networks to perform both depth prediction and completion, depending on whether only images or also sparse point clouds are available at inference time. First, we decouple...
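The "optional input" behavior is the interesting interface. A minimal sketch, assuming dense convolutions in place of the sparse ones used in the paper: the same network runs with or without a sparse depth map, and an auxiliary branch injects sparse-depth features when they exist.

```python
import torch
import torch.nn as nn

class MonodepthWithSAN(nn.Module):
    """Sketch of the SAN interface: a single depth network whose forward
    pass accepts an optional sparse depth map. When sparse points are
    given, an auxiliary branch encodes them and its features are added to
    the RGB features, switching from prediction to completion. Dense
    convolutions stand in for sparse ones; sizes are illustrative."""
    def __init__(self, c=32):
        super().__init__()
        self.rgb_encoder = nn.Conv2d(3, c, 3, padding=1)
        self.sparse_aux = nn.Conv2d(2, c, 3, padding=1)    # depth + validity mask
        self.decoder = nn.Conv2d(c, 1, 3, padding=1)

    def forward(self, rgb, sparse_depth=None):
        feats = self.rgb_encoder(rgb)
        if sparse_depth is not None:                       # completion mode
            mask = (sparse_depth > 0).float()
            feats = feats + self.sparse_aux(torch.cat([sparse_depth, mask], 1))
        return self.decoder(feats).relu()

net = MonodepthWithSAN()
rgb = torch.randn(1, 3, 96, 128)
print(net(rgb).shape, net(rgb, torch.rand(1, 1, 96, 128)).shape)
```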
We present a novel method for re-creating the static structure of cluttered office environments - which we define as the "meta-room" - from multiple observations collected by an autonomous robot equipped with an RGB-D depth camera over extended periods of time. Our method works directly with point clusters, identifying what has changed from one observation to the next, removing dynamic elements and at the same time adding previously occluded objects to reconstruct the underlying static structure as accurately as possible. The process of constructing meta-rooms...
We present an automatic approach for the task of reconstructing a 2-D floor plan from unstructured point clouds of building interiors. Our approach emphasizes accurate and robust detection of structural elements and, unlike previous approaches, does not require prior knowledge of scanning device poses. The reconstruction is formulated as a multiclass labeling problem that we solve using energy minimization. We use intuitive priors to define the costs for the minimization and rely on wall opening detection algorithms to ensure robustness. We provide...
In this article, we present and evaluate a system which allows a mobile robot to autonomously detect, model, and re-recognize objects in everyday environments. While other systems have demonstrated one of these elements, to our knowledge ours is the first capable of doing all of these things, without human interaction, in normal indoor scenes. Our system detects objects to learn by modeling the static part of the environment and extracting dynamic elements. It then creates and executes a view plan around a dynamic element to gather additional views for learning....
Self-supervised monocular depth estimation enables robots to learn 3D perception from raw video streams. This scalable approach leverages projective geometry and ego-motion via view synthesis, assuming the world is mostly static. Dynamic scenes, which are common in autonomous driving and human-robot interaction, violate this assumption. Therefore, they require modeling dynamic objects explicitly, for instance by estimating pixel-wise 3D motion, i.e. scene flow. However, the simultaneous self-supervised...
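The underlying supervision can be sketched as a warp-and-compare loss: back-project target pixels with predicted depth, move the points by ego-motion plus a pixel-wise scene-flow residual for dynamic objects, reproject into the source image, and penalize the photometric difference. Everything below (shapes, intrinsics, the zero flow) is illustrative.

```python
import torch
import torch.nn.functional as F

def photometric_loss(img_t, img_s, depth_t, K, T_ts, scene_flow):
    """View-synthesis supervision sketch with explicit object motion.
    img_t/img_s: (1, 3, H, W); depth_t: (1, 1, H, W); K: (3, 3);
    T_ts: (4, 4) ego-motion; scene_flow: (1, 3, H, W) per-pixel 3D motion."""
    _, _, H, W = img_t.shape
    u, v = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                          torch.arange(H, dtype=torch.float32), indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], -1).reshape(-1, 3)
    pts = (pix @ torch.linalg.inv(K).T) * depth_t.reshape(-1, 1)  # back-project
    pts = pts @ T_ts[:3, :3].T + T_ts[:3, 3]                      # ego-motion
    pts = pts + scene_flow.permute(0, 2, 3, 1).reshape(-1, 3)     # dynamic objects
    uv = pts @ K.T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-5)
    uv = uv / torch.tensor([W - 1.0, H - 1.0]) * 2 - 1
    warped = F.grid_sample(img_s, uv.reshape(1, H, W, 2), align_corners=True)
    return (img_t - warped).abs().mean()                          # L1 photometric term

K = torch.eye(3); K[0, 0] = K[1, 1] = 100.0; K[0, 2], K[1, 2] = 32.0, 24.0
T = torch.eye(4); T[2, 3] = 0.1                                   # small forward motion
loss = photometric_loss(torch.rand(1, 3, 48, 64), torch.rand(1, 3, 48, 64),
                        torch.full((1, 1, 48, 64), 5.0), K, T,
                        torch.zeros(1, 3, 48, 64))
print(loss)
```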
Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains...
Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views, hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360° scenes...
Self-supervised monocular depth and ego-motion estimation is a promising approach to replace or supplement expensive depth sensors such as LiDAR for robotics applications like autonomous driving. However, most research in this area focuses on a single monocular camera or stereo pairs that cover only a fraction of the scene around the vehicle. In this work, we extend self-supervised depth and ego-motion estimation to large-baseline multi-camera rigs. Using generalized spatio-temporal contexts, pose consistency constraints, and carefully designed photometric loss...
We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4%...
Simulators can efficiently generate large amounts of labeled synthetic data with perfect supervision for hard-to-label tasks like semantic segmentation. However, they introduce a domain gap that severely hurts real-world performance. We propose to use self-supervised monocular depth estimation as a proxy task to bridge this gap and improve sim-to-real unsupervised domain adaptation (UDA). Our Geometric Unsupervised Domain Adaptation method (GUDA)...
3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning about and decoding object bounding boxes from multi-view camera input. In this work we gain intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting...
Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or Gaussians, to achieve multi-view consistent appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide...
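A raymap itself is easy to construct: per pixel, the camera origin and the world-space viewing direction, stacked into a 6-channel map that can be concatenated with image features so the network knows where each view looks from. The sketch below assumes a pinhole model; how MVGD actually injects this conditioning is not shown here.

```python
import torch

def build_raymap(K, T_wc, H, W):
    """Per-pixel ray encoding: camera centre and unit viewing direction in
    world coordinates, returned as a (6, H, W) conditioning map.
    K: (3, 3) intrinsics; T_wc: (4, 4) camera-to-world pose."""
    u, v = torch.meshgrid(torch.arange(W, dtype=torch.float32),
                          torch.arange(H, dtype=torch.float32), indexing="xy")
    pix = torch.stack([u, v, torch.ones_like(u)], -1)            # (H, W, 3)
    dirs = pix @ torch.linalg.inv(K).T @ T_wc[:3, :3].T          # rotate rays to world
    dirs = dirs / dirs.norm(dim=-1, keepdim=True)                # unit directions
    origins = T_wc[:3, 3].expand(H, W, 3)                        # camera centre
    return torch.cat([origins, dirs], -1).permute(2, 0, 1)       # (6, H, W)

K = torch.eye(3); K[0, 0] = K[1, 1] = 100.0; K[0, 2], K[1, 2] = 32.0, 24.0
raymap = build_raymap(K, torch.eye(4), H=48, W=64)
print(raymap.shape)   # torch.Size([6, 48, 64])
```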