Instruct2Act: Mapping Multi-modality Instructions to Robotic Actions with Large Language Model

DOI: 10.48550/arxiv.2305.11176 Publication Date: 2023-01-01
ABSTRACT
Foundation models have made significant strides in various applications, including text-to-image generation, panoptic segmentation, and natural language processing. This paper presents Instruct2Act, a framework that utilizes Large Language Models to map multi-modal instructions to sequential actions for robotic manipulation tasks. Specifically, Instruct2Act employs the LLM model to generate Python programs that constitute a comprehensive perception, planning, and action loop. In the perception section, pre-defined APIs are used to access multiple foundation models, where the Segment Anything Model (SAM) accurately locates candidate objects and CLIP classifies them. In this way, the framework leverages the expertise of foundation models and robotic abilities to convert complex high-level instructions into precise policy codes. Our approach is adjustable and flexible, accommodating various instruction modalities and input types and catering to specific task demands. We validated the practicality and efficiency of our approach by assessing it on robotic tasks in different scenarios within tabletop manipulation domains. Furthermore, our zero-shot method outperformed many state-of-the-art learning-based policies on several tasks. The code for the proposed approach is available at https://github.com/OpenGVLab/Instruct2Act, serving as a robust benchmark with assorted modality inputs.
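
CODE SKETCH

The loop described in the abstract, where SAM proposes candidate object masks, CLIP labels them, and the LLM emits a Python policy program that composes these calls, can be illustrated with the following minimal Python sketch. The function names (get_object_masks, classify_masks, pick_and_place), the checkpoint path, and the example instruction are hypothetical stand-ins for illustration and are not the actual Instruct2Act API; see the repository linked above for the real interface.

# Hypothetical sketch of a perception-planning-action loop in the spirit of
# Instruct2Act. Names and paths are illustrative, not the official API.
import numpy as np
import torch
import clip                                   # OpenAI CLIP package
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

device = "cuda" if torch.cuda.is_available() else "cpu"

# --- Perception APIs (assumed to be pre-defined and exposed to the LLM) ---
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth").to(device)  # placeholder path
mask_generator = SamAutomaticMaskGenerator(sam)
clip_model, clip_preprocess = clip.load("ViT-B/32", device=device)

def get_object_masks(image: np.ndarray):
    """SAM proposes class-agnostic masks for every candidate object in the scene."""
    return mask_generator.generate(image)

def classify_masks(image: np.ndarray, masks, labels):
    """CLIP scores each masked crop against the open-vocabulary labels."""
    text = clip.tokenize(labels).to(device)
    results = []
    for m in masks:
        x, y, w, h = m["bbox"]
        crop = Image.fromarray(image[y:y + h, x:x + w])
        with torch.no_grad():
            logits, _ = clip_model(clip_preprocess(crop).unsqueeze(0).to(device), text)
        results.append((labels[int(logits.argmax())], m))
    return results

# --- Action API (robot-specific; a stub here) ---
def pick_and_place(source_mask, target_mask):
    """Placeholder for the low-level primitive that executes the motion."""
    raise NotImplementedError

# --- A policy program of the kind the LLM would emit for the instruction
#     "Put the red block into the green bowl." ---
def policy(image: np.ndarray):
    masks = get_object_masks(image)
    labeled = classify_masks(image, masks, ["red block", "green bowl"])
    source = next(m for lbl, m in labeled if lbl == "red block")
    target = next(m for lbl, m in labeled if lbl == "green bowl")
    pick_and_place(source, target)

The split mirrors the paper's division of labour: frozen foundation models handle open-vocabulary perception behind simple APIs, so the LLM only has to compose those APIs into task-specific policy code rather than reason about pixels directly.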