Improving proximal policy optimization with alpha divergence.
- Author
- Xu, Haotian, Yan, Zheng, Xuan, Junyu, Zhang, Guangquan, and Lu, Jie
- Subjects
- *ARTIFICIAL neural networks, *REINFORCEMENT learning
- Abstract
- • A linearly combined form of the objective is reformulated to control the trade-off between the return and the divergence more effectively. • An improved proximal policy optimization method (i.e., alphaPPO) is proposed, with a more elaborate alpha divergence for two adjacent policies. • The effectiveness of our alphaPPO is validated through detailed experimental comparisons and analysis on six benchmark environments. Proximal policy optimization (PPO) is a recent advancement in reinforcement learning, formulated as an unconstrained optimization problem with two terms: the accumulative discounted return and a Kullback–Leibler (KL) divergence. Currently, there are three PPO versions: primary, adaptive, and clipping. The most widely used PPO algorithm is the clipping version, in which the KL divergence is replaced by a clipping function that measures the difference between two policies indirectly. In this paper, we revisit the primary PPO and improve it in two respects. First, we reformulate it as a linearly combined form to control the trade-off between the two terms. Second, we substitute a parametric alpha divergence for the KL divergence to measure the difference between two policies more effectively. This novel PPO variant is referred to as alphaPPO in this paper. Experiments on six benchmark environments verify the effectiveness of our alphaPPO compared with clipping and combined PPOs. [ABSTRACT FROM AUTHOR]
- Published
- 2023
- Full Text
- View/download PDF
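
The abstract describes replacing PPO's KL penalty with a parametric alpha divergence inside a linearly combined objective. A minimal numerical sketch of that idea is below; it is not the authors' implementation, and the function names, the discrete-distribution setting, and the exact form of the combined surrogate (`beta` as the trade-off weight) are illustrative assumptions. The alpha divergence here is Amari's form, which recovers KL(p‖q) in the limit alpha → 1.

```python
import numpy as np

def alpha_divergence(p, q, alpha):
    """Amari alpha divergence between discrete distributions p and q.

    D_alpha(p || q) = (1 - sum_x p(x)^alpha * q(x)^(1 - alpha)) / (alpha * (1 - alpha))

    Approaches KL(p || q) as alpha -> 1 and KL(q || p) as alpha -> 0,
    so the formula's singular endpoints are handled by their KL limits.
    """
    if np.isclose(alpha, 1.0):
        return float(np.sum(p * np.log(p / q)))   # KL(p || q) limit
    if np.isclose(alpha, 0.0):
        return float(np.sum(q * np.log(q / p)))   # KL(q || p) limit
    return float((1.0 - np.sum(p**alpha * q**(1.0 - alpha)))
                 / (alpha * (1.0 - alpha)))

def combined_objective(pi_new, pi_old, advantages, beta, alpha):
    """Hypothetical linearly combined surrogate, as the abstract sketches it:
    expected ratio-weighted advantage minus a beta-weighted divergence penalty."""
    ratio = pi_new / pi_old
    surrogate = float(np.sum(pi_old * ratio * advantages))  # E_old[ratio * A]
    return surrogate - beta * alpha_divergence(pi_old, pi_new, alpha)
```

Sweeping `alpha` interpolates between mode-seeking and mass-covering penalties on the new policy, which is the extra degree of freedom the paper exploits relative to a fixed KL term.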