
Gradient temporal-difference learning for off-policy evaluation using emphatic weightings

Authors :
Fei Zhu
Jiaqing Cao
Shan Zhong
Quan Liu
Qiming Fu
Source :
Information Sciences. 580:311-330
Publication Year :
2021
Publisher :
Elsevier BV, 2021.

Abstract

The problem of off-policy evaluation (OPE) has long been regarded as one of the foremost challenges in reinforcement learning. Gradient-based and emphasis-based temporal-difference (TD) learning algorithms make up the major part of off-policy TD learning methods. In this work, we investigate the derivation of efficient OPE algorithms from a novel perspective that combines the advantages of these two categories: we adopt the gradient-based framework and use the emphatic approach to improve convergence. We begin by proposing a new analogue of the on-policy objective, called the distribution-correction-based mean square projected Bellman error (DC-MSPBE). The key to the construction of the DC-MSPBE is the use of emphatic weightings on the representable subspace of the original MSPBE. Based on this objective function, we propose the emphatic TD with lower-variance gradient correction (ETD-LVC) algorithm. Under standard off-policy and stochastic-approximation conditions, we provide a convergence analysis of the ETD-LVC algorithm in the case of linear function approximation, and we further generalize the algorithm to smooth nonlinear function approximation. Finally, we empirically demonstrate the improved performance of ETD-LVC on off-policy benchmarks. Taken together, we hope that this work can guide the future discovery of better alternatives in the off-policy TD learning algorithm family.
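To make the algorithmic idea in the abstract more concrete, below is a minimal illustrative sketch (not the authors' published pseudocode) of one step of an emphatic TD update with a TDC-style gradient-correction term under linear function approximation. The follow-on trace F, emphasis M, auxiliary weights w, step sizes alpha/beta, and the symbol names phi and rho are assumptions drawn from the standard ETD and gradient-TD literature; the exact ETD-LVC update rules and its lower-variance correction are defined in the paper itself.

```python
# Illustrative sketch only: a generic emphatic-weighted TD update with a
# TDC-style gradient-correction term under linear function approximation.
# Not the paper's ETD-LVC algorithm; names and recursions follow the
# standard ETD / gradient-TD literature and are assumptions.
import numpy as np

def etd_gc_step(theta, w, F, phi, phi_next, reward, rho, gamma,
                alpha=0.01, beta=0.05, interest=1.0):
    """One emphatic TD step with a gradient-correction term.

    theta : main weight vector, value estimate v(s) ~ theta @ phi
    w     : auxiliary weights used by the correction term
    F     : follow-on trace carried between steps
    rho   : importance-sampling ratio pi(a|s) / mu(a|s)
    """
    # Follow-on trace and emphasis (lambda = 0 case of ETD).
    # Note: the full ETD recursion multiplies F by the *previous* step's
    # ratio; a single ratio argument is used here for brevity.
    F = gamma * rho * F + interest
    M = F

    # TD error under the current value estimate.
    delta = reward + gamma * theta @ phi_next - theta @ phi

    # Emphasis-weighted, TDC-style updates of the main and auxiliary weights.
    theta = theta + alpha * rho * M * (delta * phi - gamma * (w @ phi) * phi_next)
    w = w + beta * rho * M * (delta - w @ phi) * phi
    return theta, w, F

# Example usage with hypothetical 8-dimensional features:
# theta = np.zeros(8); w = np.zeros(8); F = 0.0
# theta, w, F = etd_gc_step(theta, w, F, phi=np.random.rand(8),
#                           phi_next=np.random.rand(8), reward=1.0,
#                           rho=1.2, gamma=0.99)
```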

Details

ISSN :
0020-0255
Volume :
580
Database :
OpenAIRE
Journal :
Information Sciences
Accession number :
edsair.doi...........d52d0d261b3c9ddfa1d03bc02e5b785d