- Robot Manipulation and Learning
- Multimodal Machine Learning Applications
- Reinforcement Learning in Robotics
- Human Pose and Action Recognition
- Domain Adaptation and Few-Shot Learning
- Advanced Vision and Imaging
- Robotics and Sensor-Based Localization
- Advanced Image and Video Retrieval Techniques
- Topic Modeling
- Modular Robots and Swarm Intelligence
- Image Processing Techniques and Applications
- Natural Language Processing Techniques
- Reservoir Engineering and Simulation Methods
- Space Satellite Systems and Control
- Advanced Control Systems Optimization
- Robotic Locomotion and Control
- Muscle Activation and Electromyography Studies
- Soft Robotics and Applications
- Generative Adversarial Networks and Image Synthesis
- Data Stream Mining Techniques
- Machine Learning and Data Classification
- Space Science and Extraterrestrial Life
- Robotic Path Planning Algorithms
- Machine Learning and Algorithms
- Image and Object Detection Techniques
Imperial College London
2020-2024
Imitation learning with visual observations is notoriously inefficient when addressed with end-to-end behavioural cloning methods. In this letter, we explore an alternative paradigm which decomposes reasoning into three phases. First, a retrieval phase informs the robot what it can do with an object. Second, an alignment phase informs the robot where to interact with the object. And third, a replay phase informs the robot how to interact with it. Through a series of real-world experiments on everyday tasks, such as grasping, pouring, and inserting objects, we show that this decomposition brings...
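As a toy illustration of this three-phase decomposition (not the paper's implementation), the control flow might be sketched as follows, with placeholder retrieval, alignment, and replay helpers:

```python
import numpy as np

def retrieve_demo(live_image: np.ndarray, demos: list[dict]) -> dict:
    """Retrieval: pick the demonstration whose object looks most like the live one.
    Placeholder nearest-neighbour over raw pixels; the paper uses learned features."""
    dists = [np.linalg.norm(live_image - d["image"]) for d in demos]
    return demos[int(np.argmin(dists))]

def align_end_effector(live_image: np.ndarray, demo: dict) -> np.ndarray:
    """Alignment: move the end-effector to the pose it had relative to the object
    at the start of the demonstration. Placeholder: return the stored pose."""
    return demo["start_pose"]

def replay_actions(demo: dict) -> list:
    """Replay: execute the demonstrator's recorded end-effector actions open-loop."""
    return demo["actions"]
```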
We present DOME, a novel method for one-shot imitation learning, where a task can be learned from just a single demonstration and then deployed immediately, without any further data collection or training. DOME does not require prior task or object knowledge, and can perform the task in novel configurations and in the presence of distractors. At its core, it uses an image-conditioned segmentation network followed by a learned visual servoing network, to move the robot's end-effector to the same relative pose to the object as during the demonstration, after which the task can be completed...
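A rough sketch of the kind of visual servoing loop the abstract describes, with the segmentation and servoing networks abstracted as user-supplied callables (all names here are hypothetical):

```python
import numpy as np

def servo_to_demo_pose(get_image, segment, predict_velocity, step,
                       max_steps=200, tol=1e-3):
    """Segment the target object in the live image, predict an end-effector
    velocity that moves towards the demonstration's relative pose, and step the
    robot until the predicted motion is negligible (a sketch of the idea)."""
    for _ in range(max_steps):
        mask = segment(get_image())      # image-conditioned segmentation
        v = predict_velocity(mask)       # learned visual servoing network
        if np.linalg.norm(v) < tol:      # converged: same relative pose as demo
            return True
        step(v)                          # move the end-effector
    return False
```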
Recent advancements in large multimodal models have led to the emergence of remarkable generalist capabilities in digital domains, yet their translation to physical agents such as robots remains a significant challenge. This report introduces a new family of AI models purposefully designed for robotics and built upon the foundation of Gemini 2.0. We present Gemini Robotics, an advanced Vision-Language-Action (VLA) model capable of directly controlling robots. Gemini Robotics executes smooth and reactive movements to tackle a wide range...
In this paper, we study the problem of zero-shot sim-to-real transfer when the task requires both highly precise control, with sub-millimetre error tolerance, and wide task space generalisation. Our framework involves a coarse-to-fine controller, where trajectories begin with classical motion planning using ICP-based pose estimation, and transition to a learned end-to-end controller which maps images to actions and is trained in simulation with domain randomisation. In this way, we achieve sub-millimetre precision whilst also generalising across wide task spaces, keeping...
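The coarse-to-fine hand-over can be illustrated with a minimal sketch, assuming a distance-based switching rule (the switching criterion and the 5 cm threshold are illustrative assumptions):

```python
import numpy as np

def coarse_to_fine_step(ee_pos, target_pos, image, learned_policy,
                        switch_dist=0.05):
    """Coarse phase: step along a straight-line plan towards the ICP-estimated
    target pose (placeholder for classical motion planning). Fine phase: once
    within switch_dist metres, hand over to the learned image-to-action controller."""
    if np.linalg.norm(ee_pos - target_pos) > switch_dist:
        direction = target_pos - ee_pos
        return 0.01 * direction / (np.linalg.norm(direction) + 1e-9)  # coarse step
    return learned_policy(image)  # fine, end-to-end control near the object
```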
Language Models and Vision Language Models have recently demonstrated unprecedented capabilities in terms of understanding human intentions, reasoning, scene understanding, and planning-like behaviour in text form, among many others. In this work, we investigate how to embed and leverage such abilities in Reinforcement Learning (RL) agents. We design a framework that uses language as the core reasoning tool, exploring how this enables an agent to tackle a series of fundamental RL challenges, such as efficient exploration, reusing experience...
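One way such a framework might query a language model for exploration subgoals, sketched with a hypothetical text-in/text-out `llm` callable:

```python
def propose_exploration_goal(llm, scene_description: str,
                             achieved_goals: list[str]) -> str:
    """Ask a language model for the next exploration subgoal, given a textual
    scene description; `llm` is any text-in/text-out callable (hypothetical)."""
    prompt = (
        "You control a robot learning by reinforcement.\n"
        f"Scene: {scene_description}\n"
        f"Already achieved: {', '.join(achieved_goals) or 'nothing'}\n"
        "Suggest one new, useful subgoal to explore, as a short imperative sentence."
    )
    return llm(prompt).strip()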
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning, mapping visual observations to action sequences which emulate the demonstrator's behaviour. We achieve this by transforming visual observations (inputs) and trajectories of actions (outputs) into sequences of tokens that a text-pretrained Transformer (GPT-4 Turbo) can ingest and generate, via a framework we call Keypoint Action Tokens (KAT). Despite being trained only on language, these Transformers excel at...
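A minimal sketch of the serialisation idea behind KAT, assuming a toy 2-D setup; the exact textual format is an assumption, not the paper's:

```python
def to_kat_prompt(demos, live_keypoints):
    """Serialise keypoint observations and action trajectories as plain text so a
    text-only LLM can do in-context imitation. `demos` is a list of
    (keypoints, actions) pairs, each a sequence of 2-D points in this toy setup."""
    fmt = lambda pts: ";".join(f"{x:.0f},{y:.0f}" for x, y in pts)
    lines = []
    for kps, actions in demos:  # few-shot examples: observation -> action sequence
        lines.append(f"observation: {fmt(kps)}")
        lines.append(f"actions: {fmt(actions)}")
    lines.append(f"observation: {fmt(live_keypoints)}")
    lines.append("actions:")  # the model completes the action token sequence
    return "\n".join(lines)
```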
One of the main issues in Imitation Learning is the erroneous behavior of an agent when facing out-of-distribution situations, not covered by the set of demonstrations given by the expert. In this work, we tackle this problem by introducing a novel active learning and control algorithm, SAFARI. During training, it allows the agent to request further human demonstrations when these out-of-distribution situations are met. At deployment, it combines model-free acting using behavioural cloning with model-based planning to reduce state-distribution shift, using future state...
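A common proxy for detecting the out-of-distribution situations mentioned above is autoencoder reconstruction error; the sketch below assumes that proxy and is not SAFARI's exact test:

```python
import numpy as np

def is_out_of_distribution(state: np.ndarray, autoencoder, threshold: float) -> bool:
    """Flag unfamiliar states by reconstruction error: an autoencoder trained on
    the expert demonstrations reconstructs familiar states well, so a large
    error suggests an out-of-distribution input."""
    recon = autoencoder(state)
    return float(np.mean((state - recon) ** 2)) > threshold

# Deployment sketch: act with behavioural cloning while in-distribution,
# otherwise fall back to model-based planning or request a new demonstration.
```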
Trajectory optimization using a learned model of the environment is one of the core elements of model-based reinforcement learning. This procedure often suffers from exploiting inaccuracies of the learned model. We propose to regularize trajectory optimization by means of a denoising autoencoder that is trained on the same trajectories as the model of the environment. We show that the proposed regularization leads to improved planning with both gradient-based and gradient-free optimizers. We also demonstrate that regularized trajectory optimization leads to rapid initial learning in a set of popular motor control...
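A minimal random-shooting sketch of the idea: the planning objective is penalised by the denoising autoencoder's error on the imagined trajectory, discouraging plans that wander where the learned model is unreliable. All models are passed in as callables, and the 2-D action space is an illustrative assumption:

```python
import numpy as np

def plan(dynamics, reward, dae_error, state, horizon=10, n_cand=256, alpha=1.0):
    """Gradient-free (random shooting) trajectory optimisation with DAE
    regularisation: score = imagined return - alpha * denoising error."""
    best_actions, best_score = None, -np.inf
    for _ in range(n_cand):
        actions = np.random.uniform(-1, 1, size=(horizon, 2))  # 2-D actions
        s, ret, traj = state, 0.0, []
        for a in actions:
            s = dynamics(s, a)              # learned model rollout
            ret += reward(s, a)
            traj.append(np.concatenate([s, a]))
        score = ret - alpha * dae_error(np.stack(traj))  # DAE regularisation
        if score > best_score:
            best_actions, best_score = actions, score
    return best_actions
```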
We propose DINOBot, a novel imitation learning framework for robot manipulation, which leverages the image-level and pixel-level capabilities of features extracted from Vision Transformers trained with DINO. When interacting with a novel object, DINOBot first uses these features to retrieve the most visually similar object experienced during human demonstrations, and then uses this demonstration to align its end-effector with the novel object to enable effective interaction. Through a series of real-world experiments on everyday tasks, we show that exploiting both...
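The image-level retrieval step can be sketched as cosine-similarity search over global DINO embeddings (e.g. the ViT [CLS] token); feature extraction is assumed to happen upstream:

```python
import numpy as np

def retrieve_most_similar(live_feat: np.ndarray,
                          demo_feats: list[np.ndarray]) -> int:
    """Return the index of the demonstration whose global DINO embedding is
    closest to the live object's embedding, by cosine similarity."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))
    return int(np.argmax([cos(live_feat, f) for f in demo_feats]))
```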
We introduce Diffusion Augmented Agents (DAAG), a novel framework that leverages large language models, vision language models, and diffusion models to improve sample efficiency and transfer learning in reinforcement learning for embodied agents. DAAG hindsight relabels the agent's past experience by using diffusion models to transform videos in a temporally and geometrically consistent way to align them with target instructions, a technique we call Hindsight Experience Augmentation. A large language model orchestrates this autonomous process without requiring human...
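A sketch of the relabelling loop under stated assumptions: `llm_matches` and `diffuse_to_task` are hypothetical stand-ins for the language-model gating and the diffusion-based video editing:

```python
def hindsight_augment(buffer: list[dict], target_task: str,
                      llm_matches, diffuse_to_task) -> list[dict]:
    """Scan past episodes, let a language model decide which could plausibly be
    relabelled as the target task, and use a diffusion model to edit the video
    so it visually matches that task (a sketch of the idea)."""
    augmented = []
    for episode in buffer:
        if llm_matches(episode["description"], target_task):
            new_video = diffuse_to_task(episode["video"], target_task)
            augmented.append({**episode, "video": new_video, "task": target_task})
    return augmented
```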
We present R+X, a framework which enables robots to learn skills from long, unlabelled, first-person videos of humans performing everyday tasks. Given a language command from a human, R+X first retrieves short video clips containing relevant behaviour, and then executes the skill by conditioning an in-context imitation learning method on this behaviour. By leveraging a Vision Language Model (VLM) for retrieval, R+X does not require any manual annotation of the videos, and by leveraging in-context learning for execution, it can perform commanded skills immediately,...
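The retrieval step might be sketched as VLM-based scoring of clips against the command, with `vlm_score` a hypothetical clip-versus-text scorer:

```python
def retrieve_clips(videos: list, command: str, vlm_score, k: int = 3) -> list:
    """Score each clip's relevance to the language command with a VLM and keep
    the top-k; these clips then condition the in-context imitation learner."""
    ranked = sorted(videos, key=lambda clip: vlm_score(clip, command), reverse=True)
    return ranked[:k]
```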
Model-based predictions of future trajectories of a dynamical system often suffer from inaccuracies, forcing model-based control algorithms to re-plan often, thus being computationally expensive, suboptimal and unreliable. In this work, we propose a model-agnostic method for estimating the uncertainty of a model's predictions based on reconstruction error, and using it in control and exploration. As our experiments show, this uncertainty estimation can be used to improve performance on a wide variety of environments by choosing predictions in which the model is confident. It also enables active learning...
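A minimal sketch of using reconstruction error as a trust signal over a predicted trajectory, deciding how many planned actions to execute before re-planning:

```python
import numpy as np

def trusted_horizon(traj: list[np.ndarray], autoencoder, threshold: float) -> int:
    """Walk along a predicted state trajectory and stop at the first state whose
    autoencoder reconstruction error exceeds the threshold; the controller then
    executes only the first t actions before re-planning (a sketch)."""
    for t, s in enumerate(traj):
        err = float(np.mean((s - autoencoder(s)) ** 2))
        if err > threshold:
            return t
    return len(traj)
```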
In this work, we introduce a novel method to learn everyday-like multi-stage tasks from a single human demonstration, without requiring any prior object knowledge. Inspired by the recent Coarse-to-Fine Imitation Learning method, we model imitation learning as a learned reaching phase followed by an open-loop replay of the demonstrator's actions. We build upon this for multi-stage tasks where, following the demonstration, the robot can autonomously collect image data for the entire task, by reaching towards the next object in the sequence and then replaying the demonstration, repeating in a loop for all stages of the task....
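The stage loop can be sketched as reach-then-replay with image logging for autonomous data collection; all robot interfaces are passed in as hypothetical callables:

```python
def run_multi_stage(stages: list[dict], reach, replay, record_images):
    """Self-replay sketch: for each stage, log images for autonomous data
    collection, servo to the next object with the learned reaching phase, then
    replay the demonstrator's recorded actions open-loop."""
    for stage in stages:
        record_images()                  # autonomously collected training data
        reach(stage["object"])           # learned reaching phase
        replay(stage["demo_actions"])    # open-loop replay of the demonstration
```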
Large Language Models (LLMs) have recently shown promise as high-level planners for robots when given access to a selection of low-level skills. However, it is often assumed that LLMs do not possess sufficient knowledge to be used for the low-level trajectories themselves. In this work, we address this assumption thoroughly, and investigate if an LLM (GPT-4) can directly predict a dense sequence of end-effector poses for manipulation skills, when given access only to object detection and segmentation vision models. We study how well a single...
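A sketch of how such a prompt could be assembled from detections; the JSON output schema is an illustrative assumption, not the paper's prompt:

```python
import json

def build_pose_prompt(task: str, detections: list[dict]) -> str:
    """Assemble a prompt asking an LLM for a dense end-effector pose sequence,
    given only the task and object detections (names plus 3D positions)."""
    return (
        f"Task: {task}\n"
        f"Detected objects: {json.dumps(detections)}\n"
        "Return a JSON list of end-effector poses, each as "
        "{\"xyz\": [x, y, z], \"rpy\": [roll, pitch, yaw], \"gripper\": 0 or 1}, "
        "densely sampled along the motion."
    )
```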