Vision-Language Models are Zero-Shot Reward Models for Reinforcement Learning
DOI:
10.48550/arxiv.2310.12921
Publication Date:
2023-01-01
AUTHORS (5)
ABSTRACT
Reinforcement learning (RL) requires either manually specifying a reward function, which is often infeasible, or learning a reward model from a large amount of human feedback, which is often very expensive. We study a more sample-efficient alternative: using pretrained vision-language models (VLMs) as zero-shot reward models (RMs) to specify tasks via natural language. We propose a natural and general approach to using VLMs as reward models, which we call VLM-RMs. We use VLM-RMs based on CLIP to train a MuJoCo humanoid to learn complex tasks without a manually specified reward function, such as kneeling, doing the splits, and sitting in a lotus position. For each of these tasks, we provide only a single sentence text prompt describing the desired task with minimal prompt engineering. We provide videos of the trained agents at: https://sites.google.com/view/vlm-rm. We can improve performance by providing a second "baseline" prompt and projecting out parts of the CLIP embedding space that are irrelevant to distinguishing between the goal and the baseline. Further, we find a strong scaling effect for VLM-RMs: larger VLMs trained with more compute and data are better reward models. The failure modes we encountered are all related to known capability limitations of current VLMs, such as limited spatial reasoning ability or visually unrealistic environments that are far off-distribution for the VLM. We find that VLM-RMs are remarkably robust as long as the VLM is large enough. This suggests that future VLMs will become more and more useful reward models for a wide range of RL applications.
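The approach described in the abstract can be summarized in a few lines of code: embed the rendered frame and the goal prompt with CLIP and use their similarity as the reward, optionally projecting the frame embedding onto the goal-baseline direction before scoring. The sketch below illustrates this, assuming the open_clip package; the class name CLIPRewardModel, the default model choice, and the exact projection and reward formulas are illustrative assumptions, not the authors' implementation.

```python
import torch
import open_clip  # assumed dependency; any CLIP implementation exposes the same pieces
from PIL import Image


class CLIPRewardModel:
    """Illustrative VLM-RM: score each rendered frame against a single-sentence
    goal prompt using CLIP embeddings. Optionally uses a second "baseline"
    prompt to project the frame embedding onto the direction that separates
    goal from baseline before scoring (the abstract's goal-baseline idea)."""

    def __init__(self, goal_prompt, baseline_prompt=None, alpha=1.0,
                 model_name="ViT-L-14", pretrained="openai", device="cpu"):
        self.device = device
        self.alpha = alpha  # 0 = plain cosine-similarity reward, 1 = full projection
        self.model, _, self.preprocess = open_clip.create_model_and_transforms(
            model_name, pretrained=pretrained, device=device)
        self.tokenizer = open_clip.get_tokenizer(model_name)
        self.goal = self._embed_text(goal_prompt)
        self.baseline = self._embed_text(baseline_prompt) if baseline_prompt else None

    def _embed_text(self, prompt):
        tokens = self.tokenizer([prompt]).to(self.device)
        with torch.no_grad():
            emb = self.model.encode_text(tokens)
        return emb / emb.norm(dim=-1, keepdim=True)

    def reward(self, frame: Image.Image) -> float:
        """Return a scalar reward for one observation (a PIL image of the frame)."""
        pixels = self.preprocess(frame).unsqueeze(0).to(self.device)
        with torch.no_grad():
            state = self.model.encode_image(pixels)
        state = state / state.norm(dim=-1, keepdim=True)

        if self.baseline is None or self.alpha == 0.0:
            # Zero-shot reward: cosine similarity between the frame embedding
            # and the goal-prompt embedding.
            return (state @ self.goal.T).item()

        # Goal-baseline regularization (sketch): project the frame embedding
        # onto the line through the baseline and goal embeddings, keeping the
        # component that distinguishes goal from baseline, then blend with the
        # original embedding via alpha.
        direction = self.goal - self.baseline
        direction = direction / direction.norm(dim=-1, keepdim=True)
        projected = self.baseline + ((state - self.baseline) @ direction.T) * direction
        regularized = self.alpha * projected + (1.0 - self.alpha) * state

        # With normalized embeddings and alpha = 0 this reduces to the cosine
        # similarity above, since 1 - 0.5 * ||s - g||^2 = s · g.
        return (1.0 - 0.5 * (regularized - self.goal).pow(2).sum()).item()
```

In use, the reward call would stand in for the environment reward on each rendered frame, e.g. `CLIPRewardModel("a humanoid robot kneeling", baseline_prompt="a humanoid robot").reward(Image.fromarray(env.render()))`; the prompt wording and the value of alpha are hyperparameters here, not values taken from the paper.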