Adaptive Advantage-Guided Policy Regularization for Offline Reinforcement Learning
DOI:
10.48550/arxiv.2405.19909
Publication Date:
2024-05-30
AUTHORS (6)
ABSTRACT
In offline reinforcement learning, the challenge of out-of-distribution (OOD) actions is pronounced. To address this, existing methods often constrain the learned policy through policy regularization. However, these methods often suffer from the issue of unnecessary conservativeness, which hampers policy improvement. This occurs due to the indiscriminate use of all actions from the behavior policy that generates the offline dataset as constraints. The problem becomes particularly noticeable when the quality of the dataset is suboptimal. Thus, we propose Adaptive Advantage-guided Policy Regularization (A2PR), which obtains high-advantage actions from an augmented behavior policy combined with a VAE to guide the learned policy. A2PR can select high-advantage actions that differ from those present in the dataset, while still effectively maintaining conservatism toward OOD actions. This is achieved by harnessing the VAE's capacity to generate samples that match the distribution of the data points. We theoretically prove that policy improvement is guaranteed. Besides, A2PR effectively mitigates value overestimation with a bounded performance gap. Empirically, we conduct a series of experiments on the D4RL benchmark, where A2PR demonstrates state-of-the-art performance. Furthermore, experimental results on additional suboptimal mixed datasets reveal that A2PR exhibits superior performance. Code is available at https://github.com/ltlhuuu/A2PR.
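The sketch below illustrates the core idea described in the abstract: regularize the actor toward a high-advantage guidance action chosen between the dataset action and a VAE-generated candidate, rather than toward every dataset action indiscriminately. This is a minimal, hedged illustration, not the authors' implementation; the names `actor`, `q_net`, `v_net`, `vae.decode`, and the weight `alpha` are assumptions, and the actor update follows a TD3+BC-style objective for concreteness.

```python
# Minimal sketch of advantage-guided policy regularization (assumed interfaces, not the authors' code).
# Assumes: q_net(s, a) -> Q-value, v_net(s) -> state value, vae.decode(s) -> in-distribution
# candidate action, actor(s) -> deterministic action. All return torch tensors.
import torch

def advantage_guided_actor_loss(actor, q_net, v_net, vae, states, dataset_actions, alpha=2.5):
    """One actor update: maximize Q while regularizing toward the higher-advantage action."""
    with torch.no_grad():
        # Candidate actions generated by the VAE, intended to match the data distribution.
        vae_actions = vae.decode(states)
        # Advantage A(s, a) = Q(s, a) - V(s) for dataset actions and VAE candidates.
        adv_data = q_net(states, dataset_actions) - v_net(states)
        adv_vae = q_net(states, vae_actions) - v_net(states)
        # Guidance action: whichever candidate has the higher advantage.
        use_vae = (adv_vae > adv_data).float().reshape(-1, 1)
        guide_actions = use_vae * vae_actions + (1.0 - use_vae) * dataset_actions

    pi_actions = actor(states)
    q_pi = q_net(states, pi_actions)
    # Scale the Q term (TD3+BC-style) so the regularization weight is scale-invariant.
    lmbda = alpha / q_pi.abs().mean().detach()
    reg_term = ((pi_actions - guide_actions) ** 2).mean()
    return -lmbda * q_pi.mean() + reg_term
```

In this reading, conservatism is preserved because the guidance action always comes from either the dataset or a VAE trained on it, while unnecessary conservativeness is reduced because low-advantage dataset actions no longer constrain the policy.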