- Robot Manipulation and Learning
- Multimodal Machine Learning Applications
- Reinforcement Learning in Robotics
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Advanced Vision and Imaging
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Topic Modeling
- Modular Robots and Swarm Intelligence
- Image Processing Techniques and Applications
- Natural Language Processing Techniques
- Reservoir Engineering and Simulation Methods
- Space Satellite Systems and Control
- Advanced Control Systems Optimization
- Robotic Locomotion and Control
- Muscle Activation and Electromyography Studies
- Soft Robotics and Applications
- Generative Adversarial Networks and Image Synthesis
- Data Stream Mining Techniques
- Machine Learning and Data Classification
- Space Science and Extraterrestrial Life
- Robotic Path Planning Algorithms
- Machine Learning and Algorithms
- Image and Object Detection Techniques
Imperial College London
2020-2024
Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this letter, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase informs the robot what it can do with an object. Second, an alignment phase informs the robot where to interact with the object. And third, a replay phase informs the robot how to interact with it. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings...
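As a toy illustration of this three-phase decomposition (not the paper's implementation), the control flow might be sketched as follows, with placeholder retrieval, alignment, and replay helpers:

```python
import numpy as np

def retrieve_demo(live_image: np.ndarray, demos: list[dict]) -> dict:
    """Retrieval: pick the demonstration whose object looks most like the live one.
    Placeholder nearest-neighbour over raw pixels; the paper uses learned features."""
    dists = [np.linalg.norm(live_image - d["image"]) for d in demos]
    return demos[int(np.argmin(dists))]

def align_end_effector(live_image: np.ndarray, demo: dict) -> np.ndarray:
    """Alignment: move the end-effector to the pose it had relative to the object
    at the start of the demonstration. Placeholder: return the stored pose."""
    return demo["start_pose"]

def replay_actions(demo: dict) -> list:
    """Replay: execute the demonstrator's recorded end-effector actions open-loop."""
    return demo["actions"]
```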
We present DOME, a novel method for one-shot imitation learning, where a task can be learned from just a single demonstration and then deployed immediately, without any further data collection or training. DOME does not require prior task or object knowledge, and can perform the task in novel configurations and in the presence of distractors. At its core, it uses an image-conditioned segmentation network followed by a learned visual servoing network, to move the robot's end-effector to the same relative pose to the object as during the demonstration, after which the task can be completed...
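A rough sketch of the kind of visual servoing loop the abstract describes, with the segmentation and servoing networks abstracted as user-supplied callables (all names here are hypothetical):

```python
import numpy as np

def servo_to_demo_pose(get_image, segment, predict_velocity, step,
                       max_steps=200, tol=1e-3):
    """Segment the target object in the live image, predict an end-effector
    velocity that moves towards the demonstration's relative pose, and step the
    robot until the predicted motion is negligible (a sketch of the idea)."""
    for _ in range(max_steps):
        mask = segment(get_image())      # image-conditioned segmentation
        v = predict_velocity(mask)       # learned visual servoing network
        if np.linalg.norm(v) < tol:      # converged: same relative pose as demo
            return True
        step(v)                          # move the end-effector
    return False
```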
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range...
In this paper, we study the problem of zero-shot sim-to-real transfer when the task requires both highly precise control, with sub-millimetre error tolerance, and wide task space generalisation. Our framework involves a coarse-to-fine controller, where trajectories begin with classical motion planning using ICP-based pose estimation, and transition to a learned end-to-end controller which maps images to actions and is trained in simulation with domain randomisation. In this way, we achieve sub-millimetre precision whilst also generalising across wide task spaces, keeping...
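The coarse-to-fine hand-over can be illustrated with a minimal sketch, assuming a distance-based switching rule (the switching criterion and the 5 cm threshold are illustrative assumptions):

```python
import numpy as np

def coarse_to_fine_step(ee_pos, target_pos, image, learned_policy,
                        switch_dist=0.05):
    """Coarse phase: step along a straight-line plan towards the ICP-estimated
    target pose (placeholder for classical motion planning). Fine phase: once
    within switch_dist metres, hand over to the learned image-to-action controller."""
    if np.linalg.norm(ee_pos - target_pos) > switch_dist:
        direction = target_pos - ee_pos
        return 0.01 * direction / (np.linalg.norm(direction) + 1e-9)  # coarse step
    return learned_policy(image)  # fine, end-to-end control near the object
```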
Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience...
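One way such a framework might query a language model for exploration subgoals, sketched with a hypothetical text-in/text-out `llm` callable:

```python
def propose_exploration_goal(llm, scene_description: str,
                             achieved_goals: list[str]) -> str:
    """Ask a language model for the next exploration subgoal, given a textual
    scene description; `llm` is any text-in/text-out callable (hypothetical)."""
    prompt = (
        "You control a robot learning by reinforcement.\n"
        f"Scene: {scene_description}\n"
        f"Already achieved: {', '.join(achieved_goals) or 'nothing'}\n"
        "Suggest one new, useful subgoal to explore, as a short imperative sentence."
    )
    return llm(prompt).strip()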
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences which emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, these Transformers excel at...
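A minimal sketch of the serialisation idea behind KAT, assuming a toy 2-D setup; the exact textual format is an assumption, not the paper's:

```python
def to_kat_prompt(demos, live_keypoints):
    """Serialise keypoint observations and action trajectories as plain text so a
    text-only LLM can do in-context imitation. `demos` is a list of
    (keypoints, actions) pairs, each a sequence of 2-D points in this toy setup."""
    fmt = lambda pts: ";".join(f"{x:.0f},{y:.0f}" for x, y in pts)
    lines = []
    for kps, actions in demos:  # few-shot examples: observation -> action sequence
        lines.append(f"observation: {fmt(kps)}")
        lines.append(f"actions: {fmt(actions)}")
    lines.append(f"observation: {fmt(live_keypoints)}")
    lines.append("actions:")  # the model completes the action token sequence
    return "\n".join(lines)
```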
One of the main issues in Imitation Learning is the erroneous behavior of an agent when facing out-of-distribution situations, not covered by the set of demonstrations given by the expert. In this work, we tackle this problem by introducing a novel active learning and control algorithm, SAFARI. During training, it allows the agent to request further human demonstrations when these out-of-distribution situations are met. At deployment, it combines model-free acting using behavioural cloning with model-based planning to reduce state-distribution shift, using future state...
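A common proxy for detecting the out-of-distribution situations mentioned above is autoencoder reconstruction error; the sketch below assumes that proxy and is not SAFARI's exact test:

```python
import numpy as np

def is_out_of_distribution(state: np.ndarray, autoencoder, threshold: float) -> bool:
    """Flag unfamiliar states by reconstruction error: an autoencoder trained on
    the expert demonstrations reconstructs familiar states well, so a large
    error suggests an out-of-distribution input."""
    recon = autoencoder(state)
    return float(np.mean((state - recon) ** 2)) > threshold

# Deployment sketch: act with behavioural cloning while in-distribution,
# otherwise fall back to model-based planning or request a new demonstration.
```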
Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that regularized trajectory optimization leads to rapid initial learning in a set of popular motor control...
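A minimal random-shooting sketch of the idea: the planning objective is penalised by the denoising autoencoder's error on the imagined trajectory, discouraging plans that wander where the learned model is unreliable. All models are passed in as callables, and the 2-D action space is an illustrative assumption:

```python
import numpy as np

def plan(dynamics, reward, dae_error, state, horizon=10, n_cand=256, alpha=1.0):
    """Gradient-free (random shooting) trajectory optimisation with DAE
    regularisation: score = imagined return - alpha * denoising error."""
    best_actions, best_score = None, -np.inf
    for _ in range(n_cand):
        actions = np.random.uniform(-1, 1, size=(horizon, 2))  # 2-D actions
        s, ret, traj = state, 0.0, []
        for a in actions:
            s = dynamics(s, a)              # learned model rollout
            ret += reward(s, a)
            traj.append(np.concatenate([s, a]))
        score = ret - alpha * dae_error(np.stack(traj))  # DAE regularisation
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions
```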
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this demonstration to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both...
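The image-level retrieval step can be sketched as cosine-similarity search over global DINO embeddings (e.g. the ViT [CLS] token); feature extraction is assumed to happen upstream:

```python
import numpy as np

def retrieve_most_similar(live_feat: np.ndarray,
                          demo_feats: list[np.ndarray]) -> int:
    """Return the index of the demonstration whose global DINO embedding is
    closest to the live object's embedding, by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return int(np.argmax([cos(live_feat, f) for f in demo_feats]))
```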
We introduce Diffusion Augmented Agents (DAAG), a novel framework that leverages large language models, vision language models, and diffusion models to improve sample efficiency and transfer learning in reinforcement learning for embodied agents. DAAG hindsight relabels the agent's past experience by using diffusion models to transform videos in a temporally and geometrically consistent way to align them with target instructions, a technique we call Hindsight Experience Augmentation. A large language model orchestrates this autonomous process without requiring human...
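A sketch of the relabelling loop under stated assumptions: `llm_matches` and `diffuse_to_task` are hypothetical stand-ins for the language-model gating and the diffusion-based video editing:

```python
def hindsight_augment(buffer: list[dict], target_task: str,
                      llm_matches, diffuse_to_task) -> list[dict]:
    """Scan past episodes, let a language model decide which could plausibly be
    relabelled as the target task, and use a diffusion model to edit the video
    so it visually matches that task (a sketch of the idea)."""
    augmented = []
    for episode in buffer:
        if llm_matches(episode["description"], target_task):
            new_video = diffuse_to_task(episode["video"], target_task)
            augmented.append({**episode, "video": new_video, "task": target_task})
    return augmented
```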
We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, it can perform commanded skills immediately,...
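The retrieval step might be sketched as VLM-based scoring of clips against the command, with `vlm_score` a hypothetical clip-versus-text scorer:

```python
def retrieve_clips(videos: list, command: str, vlm_score, k: int = 3) -> list:
    """Score each clip's relevance to the language command with a VLM and keep
    the top-k; these clips then condition the in-context imitation learner."""
    ranked = sorted(videos, key=lambda clip: vlm_score(clip, command), reverse=True)
    return ranked[:k]
```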
Model-based predictions of future trajectories of a dynamical system often suffer from inaccuracies, forcing model-based control algorithms to re-plan often, thus being computationally expensive, suboptimal and unreliable. In this work, we propose a model-agnostic method for estimating the uncertainty of a model's predictions based on reconstruction error, and using it in control and exploration. As our experiments show, this uncertainty estimation can be used to improve performance on a wide variety of environments by choosing predictions in which the model is confident. It also enables active learning...
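A minimal sketch of using reconstruction error as a trust signal over a predicted trajectory, deciding how many planned actions to execute before re-planning:

```python
import numpy as np

def trusted_horizon(traj: list[np.ndarray], autoencoder, threshold: float) -> int:
    """Walk along a predicted state trajectory and stop at the first state whose
    autoencoder reconstruction error exceeds the threshold; the controller then
    executes only the first t actions before re-planning (a sketch)."""
    for t, s in enumerate(traj):
        err = float(np.mean((s - autoencoder(s)) ** 2))
        if err > threshold:
            return t
    return len(traj)
```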
In this work, we introduce a novel method to learn everyday-like multi-stage tasks from a single human demonstration, without requiring any prior object knowledge. Inspired by the recent Coarse-to-Fine Imitation Learning method, we model imitation learning as a learned reaching phase followed by an open-loop replay of the demonstrator's actions. We build upon this for multi-stage tasks where, following the demonstration, the robot can autonomously collect image data for the entire task, by reaching towards the next object in the sequence and then replaying the demonstration, repeating in a loop for all stages of the task....
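The stage loop can be sketched as reach-then-replay with image logging for autonomous data collection; all robot interfaces are passed in as hypothetical callables:

```python
def run_multi_stage(stages: list[dict], reach, replay, record_images):
    """Self-replay sketch: for each stage, log images for autonomous data
    collection, servo to the next object with the learned reaching phase, then
    replay the demonstrator's recorded actions open-loop."""
    for stage in stages:
        record_images()                  # autonomously collected training data
        reach(stage["object"])           # learned reaching phase
        replay(stage["demo_actions"])    # open-loop replay of the demonstration
```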
Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access only to object detection and segmentation vision models. We study how well a single...
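A sketch of how such a prompt could be assembled from detections; the JSON output schema is an illustrative assumption, not the paper's prompt:

```python
import json

def build_pose_prompt(task: str, detections: list[dict]) -> str:
    """Assemble a prompt asking an LLM for a dense end-effector pose sequence,
    given only the task and object detections (names plus 3D positions)."""
    return (
        f"Task: {task}\n"
        f"Detected objects: {json.dumps(detections)}\n"
        "Return a JSON list of end-effector poses, each as "
        "{\"xyz\": [x, y, z], \"rpy\": [roll, pitch, yaw], \"gripper\": 0 or 1}, "
        "densely sampled along the motion."
    )
```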