VLAS: Vision-Language-Action Model With Speech Instructions For Customized Robot Manipulation
FOS: Computer and information sciences
Computer Science - Robotics
Robotics (cs.RO)
DOI:
10.48550/arxiv.2502.13508
Publication Date:
2025-02-19
AUTHORS (7)
ABSTRACT
Vision-language-action models (VLAs) have become increasingly popular in robot manipulation for their end-to-end design and remarkable performance. However, existing VLAs rely heavily on vision-language models (VLMs) that only support text-based instructions, neglecting the more natural speech modality for human-robot interaction. Traditional speech integration methods usually involve a separate speech recognition system, which complicates the model and introduces error propagation. Moreover, the transcription procedure would lose non-semantic information in the raw speech, such as voiceprint, which may be crucial for robots to successfully complete customized tasks. To overcome the above challenges, we propose VLAS, a novel end-to-end VLA that integrates speech recognition directly into the robot policy model. VLAS allows the model to understand spoken commands through inner speech-text alignment and produces the corresponding actions to fulfill the task. We also present two new datasets, SQA and CSI, together with a three-stage tuning process that empowers the model with the ability of multimodal interaction across text, image, speech, and robot actions. Taking a step further, a voice retrieval-augmented generation (RAG) paradigm is designed to enable our model to effectively handle tasks that require individual-specific knowledge. Our extensive experiments show that VLAS can accomplish robot manipulation tasks with diverse speech commands, offering a seamless and customized interaction experience.
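To make the voice RAG idea in the abstract concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of how a speech-conditioned policy might use a voiceprint embedding to retrieve individual-specific knowledge before predicting an action. All class names, feature dimensions, and the dummy action head are illustrative assumptions.

```python
# Hypothetical sketch of voice retrieval-augmented generation (RAG) for a
# speech-conditioned robot policy. Names and shapes are illustrative only.
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two 1-D embeddings.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

class VoiceKnowledgeBase:
    """Maps enrolled voiceprint embeddings to user-specific knowledge snippets."""
    def __init__(self):
        self.entries = []  # list of (voiceprint_embedding, knowledge_text)

    def add(self, voiceprint, knowledge):
        self.entries.append((voiceprint, knowledge))

    def retrieve(self, query_voiceprint, threshold=0.8):
        # Return the knowledge of the closest enrolled speaker, if similar enough.
        best = max(self.entries,
                   key=lambda e: cosine_sim(e[0], query_voiceprint),
                   default=None)
        if best is not None and cosine_sim(best[0], query_voiceprint) >= threshold:
            return best[1]
        return None

def speech_conditioned_policy(image_feat, speech_feat, voiceprint, kb):
    """Toy policy head: fuses visual, speech, and retrieved-context features and
    maps them to a dummy 7-DoF action. A real VLA would use a VLM backbone."""
    knowledge = kb.retrieve(voiceprint)
    context = np.zeros(8) if knowledge is None else np.ones(8)  # placeholder encoding
    fused = np.concatenate([image_feat, speech_feat, context])
    w = np.random.default_rng(0).standard_normal((7, fused.size)) * 0.01
    return w @ fused  # e.g., end-effector deltas + gripper command

# Usage: enroll a speaker, then run the policy on a spoken command.
kb = VoiceKnowledgeBase()
kb.add(np.ones(16), "Alice's cup is the red one on the left shelf.")
action = speech_conditioned_policy(np.zeros(32), np.zeros(32), np.ones(16), kb)
print(action.shape)  # (7,)
```

The point of the sketch is only the data flow: the raw speech contributes both semantic features and a voiceprint, and the voiceprint keys a retrieval step whose result conditions action generation, which is what allows customized task execution without a separate transcription stage.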