True Knowledge Comes from Practice: Aligning LLMs with Embodied Environments via Reinforcement Learning
DOI:
10.48550/arxiv.2401.14151
Publication Date:
2024-01-01
AUTHORS (6)
ABSTRACT
Despite their impressive performance across numerous tasks, large language models (LLMs) often fail at simple decision-making tasks because the knowledge in LLMs is misaligned with the environments. On the contrary, reinforcement learning (RL) agents learn policies from scratch, which keeps them aligned with environments but makes it difficult to incorporate prior knowledge for efficient exploration. To narrow the gap, we propose TWOSOME, a novel general online framework that deploys LLMs as decision-making agents to efficiently interact and align with embodied environments via RL, without requiring any prepared datasets or prior knowledge of the environments. First, we query the joint probabilities of each valid action with LLMs to form behavior policies. Then, to enhance the stability and robustness of the policies, we propose two normalization methods and summarize four prompt design principles. Finally, we design a novel parameter-efficient training architecture in which the actor and critic share one frozen LLM equipped with low-rank adapters (LoRA) updated by PPO. We conduct extensive experiments to evaluate TWOSOME. i) TWOSOME exhibits significantly better sample efficiency and performance compared with the conventional RL method, PPO, and the prompt tuning method, SayCan, in both the classical decision-making environment, Overcooked, and the simulated household environment, VirtualHome. ii) Benefiting from LLMs' open-vocabulary feature, TWOSOME shows superior generalization ability to unseen tasks. iii) Under our framework, there is no significant loss of the LLMs' original ability during online PPO finetuning.
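To make the action-scoring step concrete, below is a minimal sketch (not the authors' released code) of how a frozen causal LLM can turn the joint token probabilities of each valid action into a behavior policy. The Hugging Face transformers API, the backbone model name, the prompt, and the action strings are illustrative assumptions, and only one normalization choice (dividing the summed log-probability by the action's token count) is shown; the paper's second normalization method, prompt design principles, and the LoRA/PPO update are omitted.

```python
# Sketch: score a fixed set of valid actions with a causal LLM and form a policy.
# Model name, prompt, and actions are hypothetical placeholders, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # assumed backbone
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "You are in the kitchen and the dish needs a chopped tomato. You should "  # illustrative observation prompt
actions = ["pick up the tomato", "chop the tomato", "serve the dish"]               # valid actions from the environment


def action_logprob(prompt: str, action: str) -> tuple[float, int]:
    """Sum the log-probabilities of the action's tokens conditioned on the prompt."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + action, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total, n_tokens = 0.0, 0
    # In a causal LM, the logits at position t-1 predict the token at position t.
    for t in range(prompt_len, full_ids.shape[1]):
        total += log_probs[0, t - 1, full_ids[0, t]].item()
        n_tokens += 1
    return total, n_tokens


# Token-level normalization: average log-probability per token, then softmax over valid actions.
scores = torch.tensor([lp / n for lp, n in (action_logprob(prompt, a) for a in actions)])
policy = torch.softmax(scores, dim=0)
print(dict(zip(actions, policy.tolist())))
```

Normalizing by token count keeps longer actions from being unfairly penalized when their token log-probabilities are summed, which is the kind of instability the normalization methods described in the abstract are meant to address.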