ClipRover: Zero-shot Vision-Language Exploration and Target Discovery by Mobile Robots

DOI: 10.48550/arxiv.2502.08791 Publication Date: 2025-02-12
ABSTRACT
Vision-language navigation (VLN) has emerged as a promising paradigm, enabling mobile robots to perform zero-shot inference and execute tasks without specific pre-programming. However, current systems often separate map exploration from path planning, with exploration relying on inefficient algorithms due to limited (partially observed) environmental information. In this paper, we present a novel navigation pipeline named "ClipRover" for simultaneous exploration and target discovery in unknown environments, leveraging the capabilities of the vision-language model CLIP. Our approach requires only monocular vision and operates without any prior map or knowledge about the target. For comprehensive evaluations, we design the functional prototype UGV (unmanned ground vehicle) system "Rover Master", a customized platform for general-purpose VLN tasks. We integrate and deploy the ClipRover pipeline on Rover Master to evaluate its throughput, obstacle avoidance capability, and trajectory performance across various real-world scenarios. Experimental results demonstrate that ClipRover consistently outperforms traditional map traversal algorithms and achieves performance comparable to path-planning methods that depend on prior map and target knowledge. Notably, ClipRover offers real-time active navigation without requiring pre-captured candidate images or pre-built node graphs, addressing key limitations of existing VLN pipelines.
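The abstract describes steering a monocular robot toward a target described in free-form text by scoring what the camera sees against that description with a vision-language model. As an illustration only (the paper's actual pipeline is not reproduced here, and all function names below are hypothetical), the core scoring step can be sketched as cosine similarity between CLIP-style image embeddings of candidate view directions and the text embedding of the target description:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def pick_heading(view_embeddings: list, target_embedding: np.ndarray):
    """Pick the candidate heading whose image embedding best matches the target.

    view_embeddings: one image-embedding vector per candidate heading
                     (in practice produced by a CLIP image encoder)
    target_embedding: embedding of the target's text description
                      (in practice produced by a CLIP text encoder)
    Returns the index of the best heading and the per-heading scores.
    """
    scores = [cosine_similarity(v, target_embedding) for v in view_embeddings]
    return int(np.argmax(scores)), scores

# Toy example with hand-made 2-D "embeddings" (real CLIP vectors are ~512-D):
target = np.array([1.0, 0.0])
views = [np.array([0.0, 1.0]),   # heading 0: dissimilar
         np.array([0.9, 0.1]),   # heading 1: closest to target
         np.array([0.5, 0.5])]   # heading 2: partial match
best, _ = pick_heading(views, target)  # best == 1
```

In a real deployment the embeddings would come from a pretrained CLIP model (e.g. via the `open_clip` or `transformers` libraries), and the chosen heading would feed into the robot's obstacle-avoidance and motion-control loop rather than being followed blindly.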