
Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

Authors :
Fei Zhu
Jiaqing Cao
Shan Zhong
Quan Liu
Qiming Fu
Source :
Information Sciences. 580:311-330
Publication Year :
2021
Publisher :
Elsevier BV, 2021.

Abstract

The problem of off-policy evaluation (OPE) has long been regarded as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms make up the major part of off-policy TD learning methods. In this work, we investigate the derivation of efficient OPE algorithms from a novel perspective that combines the advantages of these two categories: we adopt the gradient-based framework and use the emphatic approach to improve convergence. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of the DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, we propose the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm. Under standard off-policy and stochastic-approximation conditions, we provide a convergence analysis of the ETD-LVC algorithm in the case of linear function approximation, and we further generalize the algorithm to smooth nonlinear function approximation. Finally, we empirically demonstrate the improved performance of ETD-LVC on off-policy benchmarks. Taken together, we hope that this work can guide the future discovery of better alternatives in the off-policy TD learning algorithm family.
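To make the algorithmic idea in the abstract more concrete, below is a minimal illustrative sketch (not the authors' published pseudocode) of one step of an emphatic TD update with a TDC-style gradient-correction term under linear function approximation. The follow-on trace F, emphasis M, auxiliary weights w, step sizes alpha/beta, and the symbol names phi and rho are assumptions drawn from the standard ETD and gradient-TD literature; the exact ETD-LVC update rules and its lower-variance correction are defined in the paper itself.

```python
# Illustrative sketch only: a generic emphatic-weighted TD update with a
# TDC-style gradient-correction term under linear function approximation.
# Not the paper's ETD-LVC algorithm; names and recursions follow the
# standard ETD / gradient-TD literature and are assumptions.
import numpy as np

def etd_gc_step(theta, w, F, phi, phi_next, reward, rho, gamma,
                alpha=0.01, beta=0.05, interest=1.0):
    """One emphatic TD step with a gradient-correction term.

    theta : main weight vector, value estimate v(s) ~ theta @ phi
    w     : auxiliary weights used by the correction term
    F     : follow-on trace carried between steps
    rho   : importance-sampling ratio pi(a|s) / mu(a|s)
    """
    # Follow-on trace and emphasis (lambda = 0 case of ETD).
    # Note: the full ETD recursion multiplies F by the *previous* step's
    # ratio; a single ratio argument is used here for brevity.
    F = gamma * rho * F + interest
    M = F

    # TD error under the current value estimate.
    delta = reward + gamma * theta @ phi_next - theta @ phi

    # Emphasis-weighted, TDC-style updates of the main and auxiliary weights.
    theta = theta + alpha * rho * M * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * M * (delta - w @ phi) * phi
    return theta, w, F

# Example usage with hypothetical 8-dimensional features:
# theta = np.zeros(8); w = np.zeros(8); F = 0.0
# theta, w, F = etd_gc_step(theta, w, F, phi=np.random.rand(8),
#                           phi_next=np.random.rand(8), reward=1.0,
#                           rho=1.2, gamma=0.99)
```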

Details

ISSN :
0020-0255
Volume :
580
Database :
OpenAIRE
Journal :
Information Sciences
Accession number :
edsair.doi...........d52d0d261b3c9ddfa1d03bc02e5b785d