VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

FOS: Computer and information sciences; Computer Science - Robotics (cs.RO)
DOI: 10.48550/arxiv.2502.13508 Publication Date: 2025-02-19
ABSTRACT
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process that empowers the model with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
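The voice RAG paradigm mentioned in the abstract can be pictured as a retrieval step keyed by the speaker's voiceprint rather than by transcribed text. Below is a minimal Python sketch of that idea, assuming a generic speaker-embedding encoder and an in-memory knowledge base; the names (VoiceRAG, speaker_encoder, build_prompt) are illustrative and are not taken from the paper's implementation.

```python
# Illustrative sketch only: a voiceprint-keyed retrieval step feeding a
# speech-conditioned policy. The encoder and knowledge base are hypothetical
# stand-ins, not components released with VLAS.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


class VoiceRAG:
    """Retrieve user-specific facts keyed by a voiceprint embedding and
    attach them as extra context for the downstream policy model."""

    def __init__(self, speaker_encoder, knowledge_base):
        # speaker_encoder: callable mapping a waveform to a voiceprint vector
        # knowledge_base: list of (voiceprint_embedding, facts_text) pairs
        self.speaker_encoder = speaker_encoder
        self.knowledge_base = knowledge_base

    def retrieve(self, speech_waveform: np.ndarray, top_k: int = 1) -> list[str]:
        # Embed the raw speech to identify the speaker (no transcription needed).
        query = self.speaker_encoder(speech_waveform)
        scored = sorted(
            self.knowledge_base,
            key=lambda item: cosine(query, item[0]),
            reverse=True,
        )
        return [facts for _, facts in scored[:top_k]]

    def build_prompt(self, speech_waveform: np.ndarray) -> dict:
        # The policy still consumes the raw speech; retrieved user-specific
        # facts are injected as textual context for customized tasks.
        facts = self.retrieve(speech_waveform)
        return {"speech": speech_waveform, "context": "\n".join(facts)}
```

In this reading, keeping the raw waveform preserves non-semantic cues such as voiceprint, while the retrieved facts supply the individual-specific knowledge the policy model would otherwise lack.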