1. Sample complexity of variance-reduced policy gradient: weaker assumptions and lower bounds.
- Author
-
Paczolay, Gabor, Papini, Matteo, Metelli, Alberto Maria, Harmati, Istvan, and Restelli, Marcello
- Subjects
REINFORCEMENT learning ,COMPUTER simulation ,ALGORITHMS - Abstract
Several variance-reduced versions of REINFORCE based on importance sampling achieve an improved O (ϵ - 3) sample complexity to find an ϵ -stationary point, under an unrealistic assumption on the variance of the importance weights. In this paper, we propose the Defensive Policy Gradient (DEF-PG) algorithm, based on defensive importance sampling, achieving the same result without any assumption on the variance of the importance weights. We also show that this is not improvable by establishing a matching Ω (ϵ - 3) lower bound, and that REINFORCE with its O (ϵ - 4) sample complexity is actually optimal under weaker assumptions on the policy class. Numerical simulations show promising results for the proposed technique compared to similar algorithms based on vanilla importance sampling. [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF