Deterministic policy optimization with clipped value expansion and long-horizon planning.

Authors :: Gao, Shiqing
Shi, Haibo
Wang, Fang
Wang, Zijian
Zhang, Siyu
Li, Yunxia
Sun, Yaoru
Source :: Neurocomputing. Apr2022, Vol. 483, p299-310. 12p.
Publication Year :: 2022
Abstract:
• Mixture Gaussian Network equipped with memory modules for long-horizon prediction.
• Learning from successful experience to guide the early exploration.
• Clipped value expansion in the model to learn a stable critic with low overestimation bias.
• Back-propagating multi-step gradients along imaginary trajectories to optimize the policy end-to-end.
• Model-based deterministic policy gradient improves the sample efficiency.

Model-based reinforcement learning (MBRL) approaches have demonstrated great potential in handling complex tasks with high sample efficiency. However, MBRL struggles with asymptotic performance compared to model-free reinforcement learning (MFRL). In this paper, we present a long-horizon policy optimization method, namely model-based deterministic policy gradient (MBDPG), for efficient exploitation of the learned dynamics model through multi-step gradient information. First, we approximate the dynamics of the environment with a parameterized linear combination of an ensemble of Gaussian distributions. Moreover, the dynamics model is equipped with a memory module and trained on a multi-step prediction task to reduce cumulative error. Second, successful experience is used to guide the policy at the early stage of training to avoid ineffective exploration. Third, a clipped double value network is expanded in the learned dynamics to reduce overestimation bias. Finally, we present a deterministic policy gradient approach in the model that backpropagates multi-step gradients along the imagined trajectories. Our method shows higher sample efficiency than the state-of-the-art MFRL methods while maintaining better convergence performance and time efficiency compared to the state-of-the-art (SOTA) MBRL methods. [ABSTRACT FROM AUTHOR]
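
Note: the abstract's "clipped double value network ... expanded in the learned dynamics" can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the common combination of an H-step model rollout with a clipped double-Q bootstrap (the minimum of two critics), and every name below (`clipped_value_expansion_target`, `toy_policy`, `toy_model`, `toy_q1`, `toy_q2`) is hypothetical. States and actions are plain floats to keep the example self-contained.

```python
from typing import Callable, Tuple

# Hypothetical signatures, chosen for illustration only.
Policy = Callable[[float], float]                      # state -> action
Model = Callable[[float, float], Tuple[float, float]]  # (state, action) -> (next_state, reward)
Critic = Callable[[float, float], float]               # (state, action) -> value estimate


def clipped_value_expansion_target(
    state: float,
    policy: Policy,
    model: Model,
    q1: Critic,
    q2: Critic,
    horizon: int,
    gamma: float = 0.99,
) -> float:
    """Roll the learned model forward `horizon` steps, sum discounted
    imagined rewards, then bootstrap with the smaller of two critics."""
    target = 0.0
    discount = 1.0
    for _ in range(horizon):
        action = policy(state)
        state, reward = model(state, action)  # imagined transition
        target += discount * reward
        discount *= gamma
    # Clipped double-Q bootstrap: taking the minimum of the two critics
    # is the assumed mechanism for limiting overestimation.
    action = policy(state)
    target += discount * min(q1(state, action), q2(state, action))
    return target


# Toy stand-ins (assumptions, not learned networks).
def toy_policy(state: float) -> float:
    return 1.0


def toy_model(state: float, action: float) -> Tuple[float, float]:
    return state + action, 1.0  # constant reward of 1 per step


def toy_q1(state: float, action: float) -> float:
    return 10.0


def toy_q2(state: float, action: float) -> float:
    return 8.0


if __name__ == "__main__":
    # 1 + 0.5 * 1 + 0.25 * min(10, 8) = 3.5
    print(clipped_value_expansion_target(
        0.0, toy_policy, toy_model, toy_q1, toy_q2, horizon=2, gamma=0.5))
```

With `horizon=0` the function reduces to the plain clipped double-Q estimate at the starting state. In a differentiable framework the same rollout could carry gradients back through the model to the policy, which is the multi-step gradient idea the abstract mentions.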

Subjects :: *REINFORCEMENT learning
*MNEMONICS
*GAUSSIAN distribution
*HORIZON

Details

Language :: English
ISSN :: 09252312
Volume :: 483
Database :: Academic Search Index
Journal :: Neurocomputing
Publication Type :: Academic Journal
Accession number :: 155655309
Full Text :: https://doi.org/10.1016/j.neucom.2022.02.022
