Model-Free Trajectory-based Policy Optimization with Monotonic Improvement
KEYWORDS
Reinforcement Learning
Policy Optimization
Policy Search
Trajectory Optimization
Machine Learning (cs.LG)
Robotics (cs.RO)
DOI: 10.5555/3291125.3291139
Publication Date: 2016-01-01
AUTHORS (5)
ABSTRACT
Many recent trajectory optimization algorithms alternate between a linear approximation of the system dynamics around the mean trajectory and a conservative policy update. One way of constraining the policy change is to bound the Kullback-Leibler (KL) divergence between successive policies. These approaches have already demonstrated great experimental success on challenging problems such as end-to-end control of physical systems. However, the linear approximation of the system dynamics can introduce a bias in the policy update and prevent convergence to the optimal policy. In this article, we propose a new model-free trajectory-based policy optimization algorithm with guaranteed monotonic improvement. Instead of a model of the system dynamics, the algorithm backpropagates a local, quadratic and time-dependent Q-function learned from trajectory data. Our policy update ensures exact satisfaction of the KL constraint without simplifying assumptions on the system dynamics. On highly non-linear control tasks, we experimentally demonstrate that our algorithm outperforms approaches that linearize the system dynamics. To establish the monotonic improvement of our algorithm, we additionally conduct a theoretical analysis of our policy update scheme and derive a lower bound on the change in policy return between successive iterations.
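The core computational step described in the abstract — updating a Gaussian policy against a local quadratic Q-model under an exact KL bound — admits a closed-form solution up to a scalar temperature. The following is a minimal sketch under our own simplifying assumptions, not the paper's implementation: it considers a single time step, a state-independent Gaussian action distribution, and a hypothetical quadratic model Q(u) = -1/2 u^T A u + u^T b, and finds the temperature eta by bisection so that the KL constraint holds (approximately) with equality.

```python
import numpy as np

def kl_gauss(mu_p, cov_p, mu_q, cov_q):
    """KL( N(mu_p, cov_p) || N(mu_q, cov_q) ) between two Gaussians."""
    k = mu_p.shape[0]
    cov_q_inv = np.linalg.inv(cov_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(cov_q_inv @ cov_p)
                  + diff @ cov_q_inv @ diff
                  - k
                  + np.log(np.linalg.det(cov_q) / np.linalg.det(cov_p)))

def kl_constrained_update(mu, cov, A, b, epsilon, eta_lo=1e-6, eta_hi=1e6):
    """Illustrative sketch (not the paper's code): reweight a Gaussian policy
    N(mu, cov) by exp(Q(u)/eta) for the quadratic Q(u) = -0.5 u'Au + u'b,
    which stays Gaussian in closed form, then bisect on the temperature eta
    until KL(new || old) matches the bound epsilon."""
    prec = np.linalg.inv(cov)

    def update(eta):
        prec_new = prec + A / eta                 # new precision matrix
        cov_new = np.linalg.inv(prec_new)
        mu_new = cov_new @ (prec @ mu + b / eta)  # new mean from linear term
        return mu_new, cov_new

    for _ in range(100):                          # bisection on eta
        eta = np.sqrt(eta_lo * eta_hi)            # geometric midpoint
        mu_new, cov_new = update(eta)
        kl = kl_gauss(mu_new, cov_new, mu, cov)
        if kl > epsilon:
            eta_lo = eta                          # step too large: raise eta
        else:
            eta_hi = eta
    return update(eta_hi)                         # eta_hi keeps KL <= epsilon

# Usage with made-up numbers: a 2-D action space and an epsilon of 0.1.
mu, cov = np.zeros(2), np.eye(2)
A = np.array([[2.0, 0.3], [0.3, 1.0]])            # curvature of the local Q-model
b = np.array([1.0, -0.5])
mu_new, cov_new = kl_constrained_update(mu, cov, A, b, epsilon=0.1)
print(kl_gauss(mu_new, cov_new, mu, cov))         # close to 0.1
```

With a time-dependent quadratic Q-model per step and a linear-Gaussian policy conditioned on the state, as in the article, the same structure presumably carries over: the update remains in closed form and only the scalar temperature must be found numerically.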