VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation

FOS: Computer and information sciences; Computer Science - Robotics (cs.RO)
DOI: 10.48550/arxiv.2502.13508 Publication Date: 2025-02-19
ABSTRACT
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel VLA that integrates speech recognition directly into the robot policy model. VLAS allows the robot to understand spoken commands through inner speech-text alignment and produces corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, to support a three-stage tuning process that empowers the model with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
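The voice RAG paradigm mentioned in the abstract can be pictured as a retrieval step keyed by the speaker's voiceprint rather than by transcribed text. Below is a minimal Python sketch of that idea, assuming a generic speaker-embedding encoder and an in-memory knowledge base; the names (VoiceRAG, speaker_encoder, build_prompt) are illustrative and are not taken from the paper's implementation.

```python
# Illustrative sketch only: a voiceprint-keyed retrieval step feeding a
# speech-conditioned policy. The encoder and knowledge base are hypothetical
# stand-ins, not components released with VLAS.
import numpy as np


def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))


class VoiceRAG:
    """Retrieve user-specific facts keyed by a voiceprint embedding and
    attach them as extra context for the downstream policy model."""

    def __init__(self, speaker_encoder, knowledge_base):
        # speaker_encoder: callable mapping a waveform to a voiceprint vector
        # knowledge_base: list of (voiceprint_embedding, facts_text) pairs
        self.speaker_encoder = speaker_encoder
        self.knowledge_base = knowledge_base

    def retrieve(self, speech_waveform: np.ndarray, top_k: int = 1) -> list[str]:
        # Embed the raw speech to identify the speaker (no transcription needed).
        query = self.speaker_encoder(speech_waveform)
        scored = sorted(
            self.knowledge_base,
            key=lambda item: cosine(query, item[0]),
            reverse=True,
        )
        return [facts for _, facts in scored[:top_k]]

    def build_prompt(self, speech_waveform: np.ndarray) -> dict:
        # The policy still consumes the raw speech; retrieved user-specific
        # facts are injected as textual context for customized tasks.
        facts = self.retrieve(speech_waveform)
        return {"speech": speech_waveform, "context": "\n".join(facts)}
```

In this reading, keeping the raw waveform preserves non-semantic cues such as voiceprint, while the retrieved facts supply the individual-specific knowledge the policy model would otherwise lack.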