Off-policy Maximum Entropy Reinforcement Learning: Soft Actor-Critic with Advantage Weighted Mixture Policy (SAC-AWMP)

DOI: 10.48550/arxiv.2002.02829 Publication Date: 2020-01-01
ABSTRACT
The optimal policy of a reinforcement learning problem is often discontinuous and non-smooth; i.e., for two states with similar representations, their optimal policies can be significantly different. In this case, representing the entire policy with a function approximator (FA) with shared parameters for all states may not be desirable, as the generalization ability of parameter sharing makes representing a discontinuous, non-smooth policy difficult. A common way to solve this problem, known as Mixture-of-Experts, is to represent the policy as a weighted sum of multiple components, where different components perform well on different parts of the state space. Following this idea and inspired by recent work on advantage-weighted information maximization, we propose to learn, for each state, the weights of these components so that they entail the information of the state itself and also the preferred action learned so far for that state. The action preference is characterized via the advantage function. As a result, the weight of a component is only large for certain groups of states whose representations, and whose preferred actions, are similar, so each component is easy to represent. We call a policy parameterized in this way an Advantage Weighted Mixture Policy (AWMP) and apply it to improve soft actor-critic (SAC), one of the most competitive continuous control algorithms. Experimental results demonstrate that SAC with AWMP clearly outperforms SAC on four commonly used continuous control tasks and achieves stable performance across different random seeds.
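
For illustration only, the sketch below shows what a mixture policy head of this general kind might look like in PyTorch. It is not the authors' implementation: the network sizes, the shared standard deviation, the tanh squashing, and the gating network are assumptions, and the advantage-weighted information maximization objective used in AWMP to train the gate weights is omitted.

    # Minimal sketch (assumptions, not the paper's code): a per-state gating
    # network produces soft weights over K Gaussian component policies, and the
    # final action distribution uses the weighted sum of component means.
    import torch
    import torch.nn as nn

    class MixturePolicy(nn.Module):
        def __init__(self, state_dim, action_dim, num_components=4, hidden=256):
            super().__init__()
            # Gating network: soft weights over the K components for each state.
            self.gate = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, num_components),
            )
            # Each component is a small Gaussian policy head of its own.
            self.means = nn.ModuleList(
                nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU(),
                              nn.Linear(hidden, action_dim))
                for _ in range(num_components)
            )
            # Shared log-std per component (illustrative simplification).
            self.log_std = nn.Parameter(torch.zeros(num_components, action_dim))

        def forward(self, state):
            # weights: (batch, K), one soft assignment per state.
            weights = torch.softmax(self.gate(state), dim=-1)
            # Component means stacked to (batch, K, action_dim).
            mu = torch.stack([m(state) for m in self.means], dim=1)
            # Mixture mean as the gate-weighted sum of component means.
            mean = (weights.unsqueeze(-1) * mu).sum(dim=1)
            std = torch.exp(self.log_std).mean(dim=0)
            return torch.distributions.Normal(mean, std)

    # Usage: sample a squashed action, as SAC does with a Gaussian policy.
    policy = MixturePolicy(state_dim=17, action_dim=6)
    dist = policy(torch.randn(32, 17))
    action = torch.tanh(dist.rsample())

In the paper's setting the gate weights would additionally be shaped by the advantage function, so that states sharing similar preferred actions are routed to the same component; here the gate is trained only through the policy loss, which is the main simplification of this sketch.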
SUPPLEMENTAL MATERIAL
Coming soon ....