A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation.
- Author
- Chen, Zaiwei, Maguluri, Siva T., Shakkottai, Sanjay, and Shanmugam, Karthikeyan
- Subjects
- STOCHASTIC approximation, STOCHASTIC analysis, REINFORCEMENT learning, SMOOTHNESS of functions, CONTRACTION operators, STATISTICAL learning
- Abstract
- The stochastic approximation (SA) method is a foundational mathematical tool for modern large-scale optimization and machine learning, so a theoretical understanding of SA algorithms is of fundamental interest. In their paper titled "A Lyapunov Theory for Finite-Sample Guarantees of Markovian Stochastic Approximation," Chen et al. present a unified Lyapunov framework for the finite-sample analysis of a Markovian SA algorithm under an operator that is contractive with respect to an arbitrary norm. The key novelty lies in the construction of a smooth Lyapunov function called the generalized Moreau envelope. The authors demonstrate the effectiveness of their SA results in the context of reinforcement learning (RL), specifically through popular algorithms such as variants of temporal difference (TD) learning and Q-learning. As byproducts, the results provide theoretical insight into the efficiency of bootstrapping in TD learning with eligibility traces and the bias-variance tradeoff in off-policy learning.

This paper develops a unified Lyapunov framework for finite-sample analysis of a Markovian stochastic approximation (SA) algorithm under a contractive operator with respect to an arbitrary norm. The main novelty lies in the construction of a valid Lyapunov function called the generalized Moreau envelope. The smoothness and an approximation property of the generalized Moreau envelope enable us to derive a one-step Lyapunov drift inequality, which is the key to establishing the finite-sample bounds. Our SA result has wide applications, especially in the context of reinforcement learning (RL). Specifically, we show that a large class of value-based RL algorithms can be modeled in the exact form of our Markovian SA algorithm. Therefore, our SA results immediately imply finite-sample guarantees for popular RL algorithms such as n-step temporal difference (TD) learning, TD(λ), off-policy V-trace, and Q-learning. As byproducts, by analyzing the convergence bounds of n-step TD and TD(λ), we provide theoretical insight into the efficiency of bootstrapping. Moreover, our finite-sample bounds for off-policy V-trace explicitly capture the tradeoff between the variance of the stochastic iterates and the bias in the limit.

Funding: This work was supported by RTX, the National Science Foundation [Grants 2019844, 2107037, 211247, 2112533, 2144316, and 2240982], and the Machine Learning Laboratory at the University of Texas at Austin. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.0249. [ABSTRACT FROM AUTHOR]
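For orientation, a minimal sketch of the two objects named in the abstract, written in notation assumed here rather than copied from the paper: the Markovian SA recursion, and the generalized Moreau envelope used as the Lyapunov function.

% Markovian SA recursion (assumed notation): iterate x_k, step sizes alpha_k,
% noise Y_k driven by an ergodic Markov chain, and an operator F whose
% stationary average \bar{F}(x) = \mathbb{E}_{Y \sim \mu}[F(x, Y)] is a
% contraction with respect to some norm \|\cdot\|_c:
x_{k+1} = x_k + \alpha_k \bigl( F(x_k, Y_k) - x_k \bigr).

% Generalized Moreau envelope of f(x) = \tfrac{1}{2}\|x\|_c^2, smoothed through
% a norm \|\cdot\|_s whose square is smooth (assumed form, smoothing parameter \mu > 0):
M_\mu(x) = \min_{u \in \mathbb{R}^d} \Bigl\{ \tfrac{1}{2}\|u\|_c^2 + \tfrac{1}{2\mu}\,\|x - u\|_s^2 \Bigr\}.

Because \|\cdot\|_c may not have a smooth square (e.g., the max norm arising in Q-learning), M_\mu trades a small approximation error for the smoothness needed in a one-step drift inequality.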
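To make the "exact form" claim concrete, here is a minimal runnable sketch showing tabular TD(0) written as the SA recursion above. The five-state Markov reward process, synthetic rewards, and step sizes are all hypothetical and illustrative, not taken from the paper.

import numpy as np

# TD(0) in the Markovian SA form x_{k+1} = x_k + alpha_k * (F(x_k, Y_k) - x_k),
# where Y_k = (s_k, s_{k+1}) is the Markovian noise. The MDP below is synthetic.
rng = np.random.default_rng(0)
n_states, gamma = 5, 0.9
P = rng.dirichlet(np.ones(n_states), size=n_states)  # transition matrix under the policy
r = rng.standard_normal(n_states)                    # expected one-step rewards

def F(V, s, s_next):
    """Asynchronous Bellman update: F(V, Y) agrees with V at unvisited states,
    so F(V, Y) - V is supported on the current state only."""
    out = V.copy()
    out[s] = r[s] + gamma * V[s_next]
    return out

V = np.zeros(n_states)
s = 0
for k in range(1, 50_001):
    s_next = rng.choice(n_states, p=P[s])   # one step of the underlying Markov chain
    alpha = 10.0 / (k + 100)                # diminishing step size (illustrative)
    V += alpha * (F(V, s, s_next) - V)      # the SA recursion
    s = s_next

# Compare against the exact solution V* = (I - gamma * P)^{-1} r.
print(np.linalg.norm(V - np.linalg.solve(np.eye(n_states) - gamma * P, r)))

Since the increment is supported on the visited state, this is the asynchronous, Markovian-noise setting the abstract describes; n-step TD, TD(λ), V-trace, and Q-learning swap in different choices of F.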
- Published
- 2024