- Robot Manipulation and Learning
- Multimodal Machine Learning Applications
- Reinforcement Learning in Robotics
- Robotics and Sensor-Based Localization
- Natural Language Processing Techniques
- Advanced Vision and Imaging
- Topic Modeling
- Soft Robotics and Applications
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- 3D Shape Modeling and Analysis
- Robotic Path Planning Algorithms
- 3D Surveying and Cultural Heritage
- Advanced Neural Network Applications
- Muscle Activation and Electromyography Studies
- Advanced Image and Video Retrieval Techniques
- Motor Control and Adaptation
- Remote Sensing and LiDAR Applications
- Robotic Mechanisms and Dynamics
- Tactile and Sensory Interactions
- Machine Learning and Data Classification
- Modular Robots and Swarm Intelligence
- Video Analysis and Summarization
- Anomaly Detection Techniques and Applications
- Adversarial Robustness in Machine Learning
Google (United States)
2019-2024
Princeton University
2016-2020
Carnegie Mellon University
2015
Access to large, diverse RGB-D datasets is critical for training RGB-D scene understanding algorithms. However, existing datasets still cover only a limited number of views or a restricted scale of spaces. In this paper, we introduce Matterport3D, a large-scale RGB-D dataset containing 10,800 panoramic views from 194,400 RGB-D images of 90 building-scale scenes. Annotations are provided with surface reconstructions, camera poses, and 2D and 3D semantic segmentations. The precise global alignment and comprehensive, diverse set of panoramic views over entire buildings...
This paper focuses on semantic scene completion, a task for producing a complete 3D voxel representation of volumetric occupancy and semantic labels for a scene from a single-view depth map observation. Previous work has considered scene completion and semantic labeling of depth maps separately. However, we observe that these two problems are tightly intertwined. To leverage the coupled nature of these two tasks, we introduce the semantic scene completion network (SSCNet), an end-to-end 3D convolutional network that takes a single depth image as input and simultaneously outputs occupancy and semantic labels for all voxels in the camera view frustum. Our...
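A minimal PyTorch sketch of this kind of architecture (an end-to-end 3D CNN mapping a voxelized depth observation to per-voxel class logits, where an "empty" class lets occupancy and semantics fall out together) might look as follows; the layer sizes and TSDF-style input encoding are illustrative assumptions, not the published SSCNet.

```python
import torch
import torch.nn as nn

class TinySSCNet(nn.Module):
    """Toy stand-in for SSCNet: a 3D conv trunk emitting per-voxel logits.

    Input: a voxelized encoding of a single depth map (e.g. a TSDF volume),
    shape (B, 1, D, H, W). Output: (B, num_classes, D, H, W); class 0 can be
    read as "empty", so completion and labeling are predicted jointly.
    """

    def __init__(self, num_classes: int = 12):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            # dilation widens the receptive field for scene-level context
            nn.Conv3d(32, 32, kernel_size=3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv3d(32, num_classes, kernel_size=1),  # per-voxel logits
        )

    def forward(self, tsdf: torch.Tensor) -> torch.Tensor:
        return self.net(tsdf)

logits = TinySSCNet()(torch.randn(1, 1, 32, 32, 32))
print(logits.shape)  # torch.Size([1, 12, 32, 32, 32])
```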
Matching local geometric features on real-world depth images is a challenging task due to the noisy, low-resolution, and incomplete nature of 3D scan data. These difficulties limit the performance of current state-of-art methods, which are typically based on histograms over geometric properties. In this paper, we present 3DMatch, a data-driven model that learns a local volumetric patch descriptor for establishing correspondences between partial 3D data. To amass training data for our model, we propose a self-supervised feature learning...
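To make the descriptor-learning setup concrete, here is a hedged sketch: a small 3D conv encoder trained with a triplet loss on matching and non-matching volumetric patches. The patch size, layer widths, and loss choice are assumptions for illustration, not the published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchEncoder(nn.Module):
    """Toy volumetric patch descriptor: 3D convs -> fixed-length vector."""

    def __init__(self, dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 16, 3), nn.ReLU(), nn.MaxPool3d(2),
            nn.Conv3d(16, 32, 3), nn.ReLU(), nn.MaxPool3d(2),
        )
        self.fc = nn.LazyLinear(dim)

    def forward(self, patch):                  # (B, 1, 16, 16, 16)
        x = self.conv(patch).flatten(1)
        return F.normalize(self.fc(x), dim=1)  # unit-length descriptor

# Correspondence labels can come "for free" from RGB-D reconstructions:
# two patches observed at the same 3D point in different frames form a
# positive pair; any other patch serves as a negative.
enc = PatchEncoder()
anchor, pos, neg = (torch.randn(8, 1, 16, 16, 16) for _ in range(3))
loss = nn.TripletMarginLoss(margin=0.5)(enc(anchor), enc(pos), enc(neg))
loss.backward()
```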
Skilled robotic manipulation benefits from complex synergies between non-prehensile (e.g. pushing) and prehensile (e.g. grasping) actions: pushing can help rearrange cluttered objects to make space for arms and fingers; likewise, grasping can help displace objects to make pushing movements more precise and collision-free. In this work, we demonstrate that it is possible to discover and learn these synergies from scratch through model-free deep reinforcement learning. Our method involves training two fully convolutional networks that map from visual observations to actions: one...
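The pixel-wise action parameterization can be sketched as follows: two hypothetical fully convolutional networks each emit a per-pixel Q-value map, and a greedy policy picks the primitive and location jointly. Rotated inputs (covering different push and grasp angles) are omitted for brevity.

```python
import torch

def select_action(push_net, grasp_net, heightmap):
    """Greedy selection over pixel-wise Q-values. `push_net` and `grasp_net`
    are placeholder fully convolutional nets mapping a (1, C, H, W) visual
    heightmap to a (1, 1, H, W) Q-value map; each pixel corresponds to
    executing that primitive at that location in the scene."""
    with torch.no_grad():
        q = torch.cat([push_net(heightmap), grasp_net(heightmap)], dim=1)
    flat = q.flatten()
    idx = int(flat.argmax())
    primitive, yx = divmod(idx, q.shape[2] * q.shape[3])
    y, x = divmod(yx, q.shape[3])
    return ("push", "grasp")[primitive], (y, x), float(flat[idx])

# demo with an untrained stand-in network
net = torch.nn.Conv2d(4, 1, 1)
print(select_action(net, net, torch.randn(1, 4, 64, 64)))
```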
Robot warehouse automation has attracted significant interest in recent years, perhaps most visibly in the Amazon Picking Challenge (APC) [1]. A fully autonomous pick-and-place system requires robust vision that reliably recognizes and locates objects amid cluttered environments, self-occlusions, sensor noise, and a large variety of objects. In this paper we present an approach that leverages multiview RGB-D data and self-supervised, data-driven learning to overcome those difficulties. The approach was part...
This paper presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses a category-agnostic affordance prediction algorithm to select and execute among four different grasping primitive behaviors. It then recognizes picked objects with a cross-domain image classification framework that matches...
Large language models excel at a wide range of complex tasks. However, enabling general inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We propose embodied language models to directly incorporate real-world continuous sensor modalities into language models and thereby establish the link between words and percepts. Inputs to our model are multi-modal sentences that interleave visual, continuous state estimation, and textual input encodings. We train these encodings end-to-end, in conjunction with a pre-trained large...
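A toy illustration of the "multi-modal sentence" idea, under assumed dimensions: image features are projected into the language model's embedding space and spliced between word embeddings, so the LM consumes them like ordinary tokens.

```python
import torch
import torch.nn as nn

# Dimensions and the single-vector image encoding are illustrative
# assumptions, not the published model's configuration.
vocab, d_model, d_img = 1000, 256, 512
word_emb = nn.Embedding(vocab, d_model)
img_proj = nn.Linear(d_img, d_model)    # trained end-to-end with the LM

tokens = torch.tensor([5, 17, 0, 42])   # "... <img> ..." with 0 = placeholder
seq = word_emb(tokens)                  # (4, d_model)
img_feat = torch.randn(1, d_img)        # from a vision encoder
seq = torch.cat([seq[:2], img_proj(img_feat), seq[3:]])  # splice at <img>
# `seq` is then fed to the pre-trained LM like any other embedded sentence.
```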
We investigate whether a robot arm can learn to pick and throw arbitrary rigid objects into selected boxes quickly and accurately. Throwing has the potential to increase the physical reachability and picking speed of a robot arm. However, precisely throwing arbitrary objects in unstructured settings presents many challenges: from acquiring grasps suitable for reliable throwing, to handling varying object-centric properties (e.g., mass distribution, friction, shape) and complex aerodynamics. In this work, we propose an end-to-end...
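One way to read "end-to-end" here is the residual-physics pattern: a ballistic estimate supplies an initial release speed and a learned network corrects it. The sketch below assumes a simplified 45-degree throw landing at release height; the feature and layer sizes are illustrative.

```python
import math
import torch

G = 9.81  # gravitational acceleration, m/s^2

def ballistic_release_speed(range_m: float) -> float:
    """Release speed for a 45-degree throw landing at release height,
    from the projectile range equation r = v^2 / g. A deliberate
    simplification of a physics-based controller used as a prior."""
    return math.sqrt(G * range_m)

# A small network predicts a residual on top of the physics estimate:
# physics supplies a reasonable initial velocity, and learning absorbs
# object-specific effects such as grasp offset, friction, and drag.
residual_net = torch.nn.Sequential(
    torch.nn.Linear(64, 32), torch.nn.ReLU(), torch.nn.Linear(32, 1))

def release_speed(visual_feat: torch.Tensor, range_m: float) -> float:
    v0 = ballistic_release_speed(range_m)
    return v0 + float(residual_net(visual_feat))  # learned delta-v

print(release_speed(torch.randn(64), 1.5))
```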
Large language models (LLMs) trained on code-completion have been shown to be capable of synthesizing simple Python programs from docstrings [1]. We find that these code-writing LLMs can be re-purposed to write robot policy code, given natural language commands. Specifically, policy code can express functions or feedback loops that process perception outputs (e.g., from object detectors [2], [3]) and parameterize control primitive APIs. When provided as input several example commands (formatted as comments) followed by...
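A minimal sketch of the prompting format this describes, assuming a generic code-completion function `complete` and hypothetical robot API names (`get_obj_pos`, `put_first_on_second`, `detect_objects`):

```python
# Few-shot prompt: example commands as comments, each followed by policy
# code that calls perception and control primitive APIs.
PROMPT = '''
# move the red block to the left of the bowl.
pos = get_obj_pos("bowl")
put_first_on_second("red block", (pos[0] - 0.1, pos[1]))

# if there is a sponge, put it in the tray.
if "sponge" in detect_objects():
    put_first_on_second("sponge", get_obj_pos("tray"))
'''.strip()

def write_policy(command: str, complete) -> str:
    """Ask a code-writing LLM to continue the few-shot prompt with policy
    code for a new natural language command."""
    return complete(PROMPT + f"\n\n# {command}\n")

# The generated string can then be executed against the robot's API
# namespace, e.g.:
# exec(write_policy("stack all blocks", complete), robot_api_namespace)
```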
Transparent objects are a common part of everyday life, yet they possess unique visual properties that make them incredibly difficult for standard 3D sensors to produce accurate depth estimates for. In many cases, they often appear as noisy or distorted approximations of the surfaces that lie behind them. To address these challenges, we present ClearGrasp - a deep learning approach for estimating accurate 3D geometry of transparent objects from a single RGB-D image for robotic manipulation. Given a single RGB-D image of transparent objects, ClearGrasp uses deep convolutional networks to infer...
Recent works have shown how the reasoning capabilities of Large Language Models (LLMs) can be applied to domains beyond natural language processing, such as planning and interaction for robots. These embodied problems require an agent to understand many semantic aspects of the world: the repertoire of skills available, how these skills influence the world, and how changes to the world map back to language. LLMs planning in embodied environments need to consider not just what to do, but also when to do it - answers that change over time in response to the agent's own choices...
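A compact sketch of such a closed loop, with every model and skill callable left as a placeholder: textual feedback after each step is appended to the prompt, so the planner can retry, re-order, or stop.

```python
def embodied_loop(goal: str, llm, skills: dict, detect_success, max_steps=10):
    """Closed-loop LLM planning sketch. `llm` returns the next skill name
    given the dialogue so far; `skills` maps names to executable primitives;
    `detect_success` is a placeholder feedback source (e.g. a success
    detector or scene describer whose output is verbalized as text)."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = llm("\n".join(history))       # e.g. "pick up the sponge"
        if action == "done":
            break
        skills[action]()                       # execute the primitive
        ok = detect_success(action)            # feedback in language
        history.append(f"Robot: {action}. Success: {ok}.")
    return history
```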
Intelligent manipulation benefits from the capacity to flexibly control an end-effector with high degrees of freedom (DoF) and dynamically react to the environment. However, due to challenges in collecting effective training data and learning efficiently, most grasping algorithms today are limited to top-down movements and open-loop execution. In this work, we propose a new low-cost hardware interface for collecting grasping demonstrations by people in diverse environments. This makes it possible to train a robust end-to-end 6DoF...
Large pretrained (e.g., "foundation") models exhibit distinct capabilities depending on the domain of data they are trained on. While these domains are generic, they may only barely overlap. For example, visual-language models (VLMs) are trained on Internet-scale image captions, but large language models (LMs) are further trained on Internet-scale text with no images (e.g., spreadsheets, SAT questions, code). As a result, these models store different forms of commonsense knowledge across different domains. In this work, we show that this diversity is symbiotic, and can be leveraged through...
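As a toy example of the composition pattern (not the paper's exact pipeline), language can serve as the intermediate representation between models: a VLM verbalizes the image, and an LM reasons over the resulting text. Both callables below are placeholders for pretrained models; no fine-tuning is involved, only prompting.

```python
def socratic_answer(frame, question, vlm_caption, lm_complete):
    """Zero-shot multimodal reasoning via model-to-model dialogue in text.
    `vlm_caption` turns an image into a description; `lm_complete` is any
    text-completion LLM."""
    caption = vlm_caption(frame)          # e.g. "a person holding a mug"
    prompt = (f"Scene: {caption}\n"
              f"Question: {question}\n"
              f"Answer:")
    return lm_complete(prompt)
```

The design point is that the models never share weights or embeddings; their only interface is natural language, which is what both were trained on.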
Grounding language to the visual observations of a navigating agent can be performed using off-the-shelf visual-language models pretrained on Internet-scale data (e.g., image captions). While this is useful for matching images to natural language descriptions of object goals, it remains disjoint from the process of mapping the environment, so that it lacks the spatial precision of classic geometric maps. To address this problem, we propose VLMaps, a spatial map representation that directly fuses pretrained visual-language features with a 3D reconstruction of the physical world...
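A hedged numpy sketch of the fusion and querying steps, with arbitrary grid resolution and feature dimensions; the encoders that would supply `feats` and `text_emb` are assumed, not included.

```python
import numpy as np

def fuse_features(grid, counts, points_xy, feats, cell=0.05):
    """Average per-pixel visual-language features into a top-down grid.
    `points_xy`: (N, 2) world coordinates back-projected from depth;
    `feats`: (N, D) features for the same pixels (e.g. from an
    open-vocabulary VLM). Grid size and resolution are assumptions."""
    ij = (points_xy / cell).astype(int)
    for (i, j), f in zip(ij, feats):
        grid[i, j] += f
        counts[i, j] += 1
    return grid / np.maximum(counts, 1)[..., None]

def localize(map_feats, text_emb):
    """Query the map with a text embedding: cosine similarity per cell."""
    norms = np.linalg.norm(map_feats, axis=-1) * np.linalg.norm(text_emb)
    sim = map_feats @ text_emb / np.maximum(norms, 1e-8)
    return np.unravel_index(np.argmax(sim), sim.shape)

g = fuse_features(np.zeros((40, 40, 8)), np.zeros((40, 40)),
                  np.random.rand(100, 2), np.random.rand(100, 8))
print(localize(g, np.random.rand(8)))  # grid cell best matching the query
```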
This article presents a robotic pick-and-place system that is capable of grasping and recognizing both known and novel objects in cluttered environments. The key new feature of the system is that it handles a wide range of object categories without needing any task-specific training data for novel objects. To achieve this, it first uses an object-agnostic grasping framework to map from visual observations to actions: inferring dense pixel-wise probability maps of the affordances for four different grasping primitive actions. It then executes the action with...
We investigate whether a robot arm can learn to pick and throw arbitrary objects into selected boxes quickly and accurately. Throwing has the potential to increase the physical reachability and picking speed of a robot arm. However, precisely throwing arbitrary objects in unstructured settings presents many challenges: from acquiring reliable pre-throw conditions (e.g. grasp of the object) to handling varying object-centric properties (e.g. mass distribution, friction, shape) and dynamics (e.g. aerodynamics). In this work, we propose an end-to-end...
Robotic manipulation can be formulated as inducing a sequence of spatial displacements: where the space being moved can encompass an object, part of an object, or an end effector. In this work, we propose the Transporter Network, a simple model architecture that rearranges deep features to infer spatial displacements from visual input - which can parameterize robot actions. It makes no assumptions of objectness (e.g. canonical poses, models, keypoints), it exploits spatial symmetries, and is orders of magnitude more sample efficient than our...
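The feature-rearranging step can be sketched as cross-correlation: crop deep features around the pick location and convolve the crop over the scene to score every candidate placement. In the sketch below, raw tensors stand in for learned feature maps, and rotations of the crop (which cover different place orientations) are omitted.

```python
import torch
import torch.nn.functional as F

def transport_scores(scene_feat, pick_yx, crop=8):
    """Transporter-style placement scoring sketch: use the feature patch
    around the picked location as a convolution kernel over the whole
    scene, so each output pixel scores "place the picked patch here"."""
    y, x = pick_yx
    kernel = scene_feat[:, :, y - crop:y + crop, x - crop:x + crop]
    return F.conv2d(scene_feat, kernel, padding=crop)  # (1, 1, H+1, W+1)

scene = torch.randn(1, 16, 64, 64)   # stand-in for FCN features
scores = transport_scores(scene, (32, 32))
place = torch.nonzero(scores[0, 0] == scores.max())[0]
print(place)                          # best place location (y, x)
```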
Is it possible to learn policies for robotic assembly that can generalize to new objects? We explore this idea in the context of the kit assembly task. Since classic methods rely heavily on object pose estimation, they often struggle to generalize to new objects without 3D CAD models or task-specific training data. In this work, we propose to formulate the kit assembly task as a shape matching problem, where the goal is to learn a shape descriptor that establishes geometric correspondences between object surfaces and their target placement locations from visual input. This formulation...
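A minimal sketch of the matching step under assumed descriptor shapes: given a descriptor for the picked object and dense descriptors over the kit image (both produced by trained networks not shown here), the placement is the most similar cell.

```python
import numpy as np

def best_placement(obj_desc, kit_desc):
    """Shape-matching sketch: `obj_desc` is a (D,) descriptor of the object
    surface; `kit_desc` is (H, W, D) dense descriptors over the kit image.
    Returns the kit pixel whose descriptor best matches the object's."""
    sim = kit_desc @ obj_desc
    sim /= np.linalg.norm(kit_desc, axis=-1) * np.linalg.norm(obj_desc) + 1e-8
    return np.unravel_index(np.argmax(sim), sim.shape)

print(best_placement(np.random.rand(32), np.random.rand(48, 48, 32)))
```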
Rearranging and manipulating deformable objects such as cables, fabrics, and bags is a long-standing challenge in robotic manipulation. The complex dynamics and high-dimensional configuration spaces of deformables, compared to rigid objects, make manipulation difficult not only for multi-step planning, but even for goal specification. Goals cannot be as easily specified as object poses, and may involve relative spatial relations such as "place the item inside the bag". In this work, we develop a suite of simulated benchmarks with...
For a robot to personalize physical assistance effectively, it must learn user preferences that can be generally reapplied to future scenarios. In this work, we investigate personalization of household cleanup with robots that can tidy up rooms by picking up objects and putting them away. A key challenge is determining the proper place to put each object, as people's preferences can vary greatly depending on personal taste or cultural background. For instance, one person may prefer storing shirts in the drawer, while another may prefer them on the shelf....
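One plausible reading of the personalization mechanism, sketched with a placeholder text-completion function `complete`: summarize a few observed placements into a general rule, then let the model extend the rule to unseen objects. The prompt format here is an assumption for illustration, not the paper's exact prompt.

```python
# Few observed placements, formatted so a summary can be elicited.
EXAMPLES = """objects = ["yellow shirt", "dark purple shirt", "white socks"]
receptacles = ["drawer", "shelf"]
pick_and_place("yellow shirt", "drawer")
pick_and_place("dark purple shirt", "drawer")
pick_and_place("white socks", "drawer")
# Summary:"""

def summarize_preferences(complete) -> str:
    """Compress user examples into a reusable rule via LLM summarization,
    e.g. "put clothing in the drawer"."""
    return complete(EXAMPLES)

def place_new_object(rule: str, obj: str, complete) -> str:
    """Apply the summarized rule to a previously unseen object."""
    return complete(f'# Rule: {rule}\npick_and_place("{obj}", "')
```

The appeal of summarization here is data efficiency: a handful of examples becomes one short rule that transfers to new objects without retraining.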
Large language models (LLMs) have demonstrated exciting progress in acquiring diverse new capabilities through in-context learning, ranging from logical reasoning to code-writing. Robotics researchers have also explored using LLMs to advance the capabilities of robotic control. However, since low-level robot actions are hardware-dependent and underrepresented in LLM training corpora, existing efforts in applying LLMs to robotics have largely treated them as semantic planners or relied on human-engineered control primitives to interface...
Large language models (LLMs) exhibit a wide range of promising capabilities -- from step-by-step planning to commonsense reasoning -- that may provide utility for robots, but remain prone to confidently hallucinated predictions. In this work, we present KnowNo, a framework for measuring and aligning the uncertainty of LLM-based planners such that they know when they don't know and ask for help when needed. KnowNo builds on the theory of conformal prediction to provide statistical guarantees on task completion while minimizing human help in complex...
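The conformal-prediction core can be sketched in a few lines, following the standard split-conformal recipe (the quantile level and score definitions below are the textbook ones, not necessarily the paper's exact choices).

```python
import numpy as np

def conformal_threshold(true_option_conf, eps=0.1):
    """Split conformal calibration sketch. `true_option_conf` holds the
    planner's confidence in the ground-truth option on held-out calibration
    tasks. Nonconformity is 1 - confidence; the ceil((n+1)(1-eps))/n
    quantile of those scores yields a confidence threshold such that the
    prediction set covers the correct option with probability >= 1-eps."""
    scores = 1.0 - np.asarray(true_option_conf)
    n = len(scores)
    qhat = np.quantile(scores, np.ceil((n + 1) * (1 - eps)) / n,
                       method="higher")
    return 1.0 - qhat

def prediction_set(option_scores, thresh):
    """Keep every option the planner cannot rule out; if more than one
    survives, the robot should stop and ask a human for help."""
    keep = [i for i, s in enumerate(option_scores) if s >= thresh]
    return keep, len(keep) > 1       # (options, ask_for_help)

t = conformal_threshold(np.random.beta(5, 2, size=400), eps=0.15)
print(prediction_set([0.55, 0.30, 0.15], t))
```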
People employ expressive behaviors to effectively communicate and coordinate their actions with others, such as nodding to acknowledge a person glancing at them or saying "excuse me" to pass people in a busy corridor. We would like robots to also demonstrate such expressive behaviors in human-robot interaction. Prior work proposes rule-based methods that struggle to scale to new communication modalities or social situations, while data-driven methods require specialized datasets for each social situation the robot is used in. We propose to leverage the rich...