Author: "Sutton, Richard" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Sutton, Richard"' showing total 2,690 results

Start Over Author "Sutton, Richard"

2,690 results on '"Sutton, Richard"'

1. Asynchronous Stochastic Approximation and Average-Reward Reinforcement Learning

Author: Yu, Huizhen, Wan, Yi, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, 93E20, 62L20, 90C40
Abstract: This paper studies asynchronous stochastic approximation (SA) algorithms and their application to reinforcement learning in semi-Markov decision processes (SMDPs) with an average-reward criterion. We first extend Borkar and Meyn's stability proof method to accommodate more general noise conditions, leading to broader convergence guarantees for asynchronous SA algorithms. Leveraging these results, we establish the convergence of an asynchronous SA analogue of Schweitzer's classical relative value iteration algorithm, RVI Q-learning, for finite-space, weakly communicating SMDPs. Furthermore, to fully utilize the SA results in this application, we introduce new monotonicity conditions for estimating the optimal reward rate in RVI Q-learning. These conditions substantially expand the previously considered algorithmic framework, and we address them with novel proof arguments in the stability and convergence analysis of RVI Q-learning., Comment: The materials in this paper extend the authors' results from 2023, reported in arXiv:2408.16262 and arXiv:2312.15091. This paper incorporates and subsumes the results of arXiv:2312.15091 and serves as Part II of arXiv:2408.16262
Published: 2024

2. On Convergence of Average-Reward Q-Learning in Weakly Communicating Markov Decision Processes

Author: Wan, Yi, Yu, Huizhen, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control
Abstract: This paper analyzes reinforcement learning (RL) algorithms for Markov decision processes (MDPs) under the average-reward criterion. We focus on Q-learning algorithms based on relative value iteration (RVI), which are model-free stochastic analogues of the classical RVI method for average-reward MDPs. These algorithms have low per-iteration complexity, making them well-suited for large state space problems. We extend the almost-sure convergence analysis of RVI Q-learning algorithms developed by Abounadi, Bertsekas, and Borkar (2001) from unichain to weakly communicating MDPs. This extension is important both practically and theoretically: weakly communicating MDPs cover a much broader range of applications compared to unichain MDPs, and their optimality equations have a richer solution structure (with multiple degrees of freedom), introducing additional complexity in proving algorithmic convergence. We also characterize the sets to which RVI Q-learning algorithms converge, showing that they are compact, connected, potentially nonconvex, and comprised of solutions to the average-reward optimality equation, with exactly one less degree of freedom than the general solution set of this equation. Furthermore, we extend our analysis to two RVI-based hierarchical average-reward RL algorithms using the options framework, proving their almost-sure convergence and characterizing their sets of convergence under the assumption that the underlying semi-Markov decision process is weakly communicating.
Published: 2024

3. An Idiosyncrasy of Time-discretization in Reinforcement Learning

Author: De Asis, Kris and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, I.2.6, I.2.9
Abstract: Many reinforcement learning algorithms are built on an assumption that an agent interacts with an environment over fixed-duration, discrete time steps. However, physical systems are continuous in time, requiring a choice of time-discretization granularity when digitally controlling them. Furthermore, such systems do not wait for decisions to be made before advancing the environment state, necessitating the study of how the choice of discretization may affect a reinforcement learning algorithm. In this work, we consider the relationship between the definitions of the continuous-time and discrete-time returns. Specifically, we acknowledge an idiosyncrasy with naively applying a discrete-time algorithm to a discretized continuous-time environment, and note how a simple modification can better align the return definitions. This observation is of practical consideration when dealing with environments where time-discretization granularity is a choice, or situations where such granularity is inherently stochastic., Comment: RLC 2024
Published: 2024

4. Reward Centering

Author: Naik, Abhishek, Wan, Yi, Tomar, Manan, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We show that discounted methods for solving continuing reinforcement learning problems can perform significantly better if they center their rewards by subtracting out the rewards' empirical average. The improvement is substantial at commonly used discount factors and increases further as the discount factor approaches one. In addition, we show that if a problem's rewards are shifted by a constant, then standard methods perform much worse, whereas methods with reward centering are unaffected. Estimating the average reward is straightforward in the on-policy setting; we propose a slightly more sophisticated method for the off-policy setting. Reward centering is a general idea, so we expect almost every reinforcement-learning algorithm to benefit by the addition of reward centering., Comment: In Proceedings of RLC 2024
Published: 2024

5. MetaOptimize: A Framework for Optimizing Step Sizes and Other Meta-parameters

Author: Sharifnassab, Arsalan, Salehkaleybar, Saber, and Sutton, Richard
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control
Abstract: This paper addresses the challenge of optimizing meta-parameters (i.e., hyperparameters) in machine learning algorithms, a critical factor influencing training efficiency and model performance. Moving away from the computationally expensive traditional meta-parameter search methods, we introduce MetaOptimize framework that dynamically adjusts meta-parameters, particularly step sizes (also known as learning rates), during training. More specifically, MetaOptimize can wrap around any first-order optimization algorithm, tuning step sizes on the fly to minimize a specific form of regret that accounts for long-term effect of step sizes on training, through a discounted sum of future losses. We also introduce low complexity variants of MetaOptimize that, in conjunction with its adaptability to multiple optimization algorithms, demonstrate performance competitive to those of best hand-crafted learning rate schedules across various machine learning applications.
Published: 2024

6. Loss of plasticity in deep continual learning

Author: Dohare, Shibhansh, Hernandez-Garcia, J. Fernando, Lan, Qingfeng, Rahman, Parash, Mahmood, A. Rupam, and Sutton, Richard S.
Published: 2024
Full Text: View/download PDF

7. Step-size Optimization for Continual Learning

Author: Degris, Thomas, Javed, Khurram, Sharifnassab, Arsalan, Liu, Yuxin, and Sutton, Richard
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In continual learning, a learner has to keep learning from the data over its whole life time. A key issue is to decide what knowledge to keep and what knowledge to let go. In a neural network, this can be implemented by using a step-size vector to scale how much gradient samples change network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. On the other hand, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD is able to consistently improve step-size vectors, where RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction to improve the performance of neural networks in continual learning.
Published: 2024

8. A Note on Stability in Asynchronous Stochastic Approximation without Communication Delays

Author: Yu, Huizhen, Wan, Yi, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, 62L20 (Primary) 93E35, 90C40 (Secondary)
Abstract: In this paper, we study asynchronous stochastic approximation algorithms without communication delays. Our main contribution is a stability proof for these algorithms that extends a method of Borkar and Meyn by accommodating more general noise conditions. We also derive convergence results from this stability result and discuss their application in important average-reward reinforcement learning problems., Comment: Corrected typos and a minor error; parts of this material will be included in a separate future arXiv preprint
Published: 2023

9. Cardiovascular autonomic dysfunction in post-COVID-19 syndrome: a major health-care burden

Author: Fedorowski, Artur, Fanciulli, Alessandra, Raj, Satish R., Sheldon, Robert, Shibao, Cyndya A., and Sutton, Richard
Published: 2024
Full Text: View/download PDF

10. Iterative Option Discovery for Planning, by Planning

Author: Young, Kenny and Sutton, Richard S.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Discovering useful temporal abstractions, in the form of options, is widely thought to be key to applying reinforcement learning and planning to increasingly complex domains. Building on the empirical success of the Expert Iteration approach to policy learning used in AlphaZero, we propose Option Iteration, an analogous approach to option discovery. Rather than learning a single strong policy that is trained to match the search results everywhere, Option Iteration learns a set of option policies trained such that for each state encountered, at least one policy in the set matches the search results for some horizon into the future. Intuitively, this may be significantly easier as it allows the algorithm to hedge its bets compared to learning a single globally strong policy, which may have complex dependencies on the details of the current state. Having learned such a set of locally strong policies, we can use them to guide the search algorithm resulting in a virtuous cycle where better options lead to better search results which allows for training of better options. We demonstrate experimentally that planning using options learned with Option Iteration leads to a significant benefit in challenging planning environments compared to an analogous planning algorithm operating in the space of primitive actions and learning a single rollout policy with Expert Iteration., Comment: Fixed incorrect arrows on some figures in the appendix
Published: 2023

11. Reinforcement learning : an introduction.

Author: Sutton, Richard S. and Barto, Andrew G.
Subjects: Reinforcement learning
Abstract: Summary: "Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms."-- Provided by publisher.
Published: 2018

12. Reinforcement learning : an introduction.

Author: Sutton, Richard S. and Barto, Andrew G.
Subjects: Reinforcement learning
Abstract: Summary: "Reinforcement learning, one of the most active research areas in artificial intelligence, is a computational approach to learning whereby an agent tries to maximize the total amount of reward it receives while interacting with a complex, uncertain environment. In Reinforcement Learning, Richard Sutton and Andrew Barto provide a clear and simple account of the field's key ideas and algorithms."-- Provided by publisher.
Published: 2018

13. Value-aware Importance Weighting for Off-policy Reinforcement Learning

Author: De Asis, Kristopher, Graves, Eric, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, I.2
Abstract: Importance sampling is a central idea underlying off-policy prediction in reinforcement learning. It provides a strategy for re-weighting samples from a distribution to obtain unbiased estimates under another distribution. However, importance sampling weights tend to exhibit extreme variance, often leading to stability issues in practice. In this work, we consider a broader class of importance weights to correct samples in off-policy learning. We propose the use of $\textit{value-aware importance weights}$ which take into account the sample space to provide lower variance, but still unbiased, estimates under a target distribution. We derive how such weights can be computed, and detail key properties of the resulting importance weights. We then extend several reinforcement learning prediction algorithms to the off-policy setting with these weights, and evaluate them empirically., Comment: CoLLAs 2023
Published: 2023

14. Maintaining Plasticity in Deep Continual Learning

Author: Dohare, Shibhansh, Hernandez-Garcia, J. Fernando, Rahman, Parash, Mahmood, A. Rupam, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning
Abstract: Modern deep-learning systems are specialized to problem settings in which training occurs once and then never again, as opposed to continual-learning settings in which training occurs continually. If deep-learning systems are applied in a continual learning setting, then it is well known that they may fail to remember earlier examples. More fundamental, but less well known, is that they may also lose their ability to learn on new examples, a phenomenon called loss of plasticity. We provide direct demonstrations of loss of plasticity using the MNIST and ImageNet datasets repurposed for continual learning as sequences of tasks. In ImageNet, binary classification performance dropped from 89% accuracy on an early task down to 77%, about the level of a linear network, on the 2000th task. Loss of plasticity occurred with a wide range of deep network architectures, optimizers, activation functions, batch normalization, dropout, but was substantially eased by L2-regularization, particularly when combined with weight perturbation. Further, we introduce a new algorithm -- continual backpropagation -- which slightly modifies conventional backpropagation to reinitialize a small fraction of less-used units after each example and appears to maintain plasticity indefinitely.
Published: 2023

15. Tradition Serving Modernity? The Musical Lives of a Makassarese Drummer

Author: Sutton, Richard Anderson
Published: 2006
Full Text: View/download PDF

16. Toward Efficient Gradient-Based Value Estimation

Author: Sharifnassab, Arsalan and Sutton, Richard
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Gradient-based methods for value estimation in reinforcement learning have favorable stability properties, but they are typically much slower than Temporal Difference (TD) learning methods. We study the root causes of this slowness and show that Mean Square Bellman Error (MSBE) is an ill-conditioned loss function in the sense that its Hessian has large condition-number. To resolve the adverse effect of poor conditioning of MSBE on gradient based methods, we propose a low complexity batch-free proximal method that approximately follows the Gauss-Newton direction and is asymptotically robust to parameterization. Our main algorithm, called RANS, is efficient in the sense that it is significantly faster than the residual gradient methods while having almost the same computational complexity, and is competitive with TD on the classic problems that we tested.
Published: 2023

17. Thomas Young, Natural Philosopher, 1773-1829 (review)

Author: Sutton, Richard M.
Published: 2012

18. Auxiliary task discovery through generate-and-test

Author: Rafiee, Banafsheh, Ghiassian, Sina, Jin, Jun, Sutton, Richard, Luo, Jun, and White, Adam
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In this paper, we explore an approach to auxiliary task discovery in reinforcement learning based on ideas from representation learning. Auxiliary tasks tend to improve data efficiency by forcing the agent to learn auxiliary prediction and control objectives in addition to the main task of maximizing reward, and thus producing better representations. Typically these tasks are designed by people. Meta-learning offers a promising avenue for automatic task discovery; however, these methods are computationally expensive and challenging to tune in practice. In this paper, we explore a complementary approach to the auxiliary task discovery: continually generating new auxiliary tasks and preserving only those with high utility. We also introduce a new measure of auxiliary tasks' usefulness based on how useful the features induced by them are for the main task. Our discovery algorithm significantly outperforms random tasks and learning without auxiliary tasks across a suite of environments.
Published: 2022

19. On Convergence of Average-Reward Off-Policy Control Algorithms in Weakly Communicating MDPs

Author: Wan, Yi and Sutton, Richard S.
Subjects: Computer Science - Machine Learning
Abstract: We show two average-reward off-policy control algorithms, Differential Q-learning (Wan, Naik, & Sutton 2021a) and RVI Q-learning (Abounadi Bertsekas & Borkar 2001), converge in weakly communicating MDPs. Weakly communicating MDPs are the most general MDPs that can be solved by a learning algorithm with a single stream of experience. The original convergence proofs of the two algorithms require that the solution set of the average-reward optimality equation only has one degree of freedom, which is not necessarily true for weakly communicating MDPs. To the best of our knowledge, our results are the first showing average-reward off-policy control algorithms converge in weakly communicating MDPs. As a direct extension, we show that average-reward options algorithms for temporal abstraction introduced by Wan, Naik, & Sutton (2021b) converge if the Semi-MDP induced by options is weakly communicating.
Published: 2022

20. The Alberta Plan for AI Research

Author: Sutton, Richard S., Bowling, Michael, and Pilarski, Patrick M.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Herein we describe our approach to artificial intelligence research, which we call the Alberta Plan. The Alberta Plan is pursued within our research groups in Alberta and by others who are like minded throughout the world. We welcome all who would join us in this pursuit.
Published: 2022

21. A framework to simplify paediatric syncope diagnosis

Author: Stewart, Julian M., van Dijk, J. Gert, Balaji, Seshadri, and Sutton, Richard
Published: 2023
Full Text: View/download PDF

22. Sustainable Compromises: A Yurt, a Straw Bale House, and Ecological Living by Alan Boye (review)

Author: Sutton, Richard K.
Published: 2015
Full Text: View/download PDF

23. Doubly-Asynchronous Value Iteration: Making Value Iteration Asynchronous in Actions

Author: Tian, Tian, Young, Kenny, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning
Abstract: Value iteration (VI) is a foundational dynamic programming method, important for learning and planning in optimal control and reinforcement learning. VI proceeds in batches, where the update to the value of each state must be completed before the next batch of updates can begin. Completing a single batch is prohibitively expensive if the state space is large, rendering VI impractical for many applications. Asynchronous VI helps to address the large state space problem by updating one state at a time, in-place and in an arbitrary order. However, Asynchronous VI still requires a maximization over the entire action space, making it impractical for domains with large action space. To address this issue, we propose doubly-asynchronous value iteration (DAVI), a new algorithm that generalizes the idea of asynchrony from states to states and actions. More concretely, DAVI maximizes over a sampled subset of actions that can be of any user-defined size. This simple approach of using sampling to reduce computation maintains similarly appealing theoretical properties to VI without the need to wait for a full sweep through the entire action space in each update. In this paper, we show DAVI converges to the optimal value function with probability one, converges at a near-geometric rate with probability 1-delta, and returns a near-optimal policy in computation time that nearly matches a previously established bound for VI. We also empirically demonstrate DAVI's effectiveness in several experiments.
Published: 2022

24. Toward Discovering Options that Achieve Faster Planning

Author: Wan, Yi and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We propose a new objective for option discovery that emphasizes the computational advantage of using options in planning. In a sequential machine, the speed of planning is proportional to the number of elementary operations used to achieve a good policy. For episodic tasks, the number of elementary operations depends on the number of options composed by the policy in an episode and the number of options being considered at each decision point. To reduce the amount of computation in planning, for a given set of episodic tasks and a given number of options, our objective prefers options with which it is possible to achieve a high return by composing few options, and also prefers a smaller set of options to choose from at each decision point. We develop an algorithm that optimizes the proposed objective. In a variant of the classic four-room domain, we show that 1) a higher objective value is typically associated with fewer number of elementary planning operations used by the option-value iteration algorithm to obtain a near-optimal value function, 2) our algorithm achieves an objective value that matches it achieved by two human-designed options 3) the amount of computation used by option-value iteration with options discovered by our algorithm matches it with the human-designed options, 4) the options produced by our algorithm also make intuitive sense--they seem to move to and terminate at the entrances of rooms.
Published: 2022

25. The Quest for a Common Model of the Intelligent Decision Maker

Author: Sutton, Richard S.
Subjects: Computer Science - Artificial Intelligence
Abstract: The premise of the Multi-disciplinary Conference on Reinforcement Learning and Decision Making is that multiple disciplines share an interest in goal-directed decision making over time. The idea of this paper is to sharpen and deepen this premise by proposing a perspective on the decision maker that is substantive and widely held across psychology, artificial intelligence, economics, control theory, and neuroscience, which I call the "common model of the intelligent agent". The common model does not include anything specific to any organism, world, or application domain. The common model does include aspects of the decision maker's interaction with its world (there must be input and output, and a goal) and internal components of the decision maker (for perception, decision-making, internal evaluation, and a world model). I identify these aspects and components, note that they are given different names in different disciplines but refer essentially to the same ideas, and discuss the challenges and benefits of devising a neutral terminology that can be used across disciplines. It is time to recognize and build on the convergence of multiple diverse disciplines on a substantive common model of the intelligent agent., Comment: Will appear as an extended abstract at the fifth Multi-disciplinary Conference on Reinforcement Learning and Decision Making, held in Providence, Rhode Island, June 8-11, 2022
Published: 2022

26. A History of Meta-gradient: Gradient Methods for Meta-learning

Author: Sutton, Richard S.
Subjects: Computer Science - Machine Learning
Abstract: The history of meta-learning methods based on gradient descent is reviewed, focusing primarily on methods that adapt step-size (learning rate) meta-parameters., Comment: 3 pages of text, 54 references
Published: 2022

27. Reward-Respecting Subtasks for Model-Based Reinforcement Learning

Author: Sutton, Richard S., Machado, Marlos C., Holland, G. Zacharias, Szepesvari, David, Timbers, Finbarr, Tanner, Brian, and White, Adam
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: To achieve the ambitious goals of artificial intelligence, reinforcement learning must include planning with a model of the world that is abstract in state and time. Deep learning has made progress with state abstraction, but temporal abstraction has rarely been used, despite extensively developed theory based on the options framework. One reason for this is that the space of possible options is immense, and the methods previously proposed for option discovery do not take into account how the option models will be used in planning. Options are typically discovered by posing subsidiary tasks, such as reaching a bottleneck state or maximizing the cumulative sum of a sensory signal other than reward. Each subtask is solved to produce an option, and then a model of the option is learned and made available to the planning process. In most previous work, the subtasks ignore the reward on the original problem, whereas we propose subtasks that use the original reward plus a bonus based on a feature of the state at the time the option terminates. We show that option models obtained from such reward-respecting subtasks are much more likely to be useful in planning than eigenoptions, shortest path options based on bottleneck states, or reward-respecting options generated by the option-critic. Reward respecting subtasks strongly constrain the space of options and thereby also provide a partial solution to the problem of option discovery. Finally, we show how values, policies, options, and models can all be learned online and off-policy using standard algorithms and general value functions.
Published: 2022
Full Text: View/download PDF

28. Reply to ‘Graded exercise therapy should not be recommended for patients with post-exertional malaise’

Author: Fedorowski, Artur, Fanciulli, Alessandra, Raj, Satish R., Sheldon, Robert, Shibao, Cyndya A., and Sutton, Richard
Published: 2024
Full Text: View/download PDF

29. Will Artificial Intelligence Be “Better” Than Humans in the Management of Syncope?

Author: Dipaola, Franca, Gebska, Milena A., Gatti, Mauro, Levra, Alessandro Giaj, Parker, William H., Menè, Roberto, Lee, Sangil, Costantino, Giorgio, Barsotti, E. John, Shiffer, Dana, Johnston, Samuel L., Sutton, Richard, Olshansky, Brian, and Furlan, Raffaello
Published: 2024
Full Text: View/download PDF

30. Communicative capital: a key resource for human–machine shared agency and collaborative capacity

Author: Mathewson, Kory W., Parker, Adam S. R., Sherstan, Craig, Edwards, Ann L., Sutton, Richard S., and Pilarski, Patrick M.
Published: 2023
Full Text: View/download PDF

31. Learning Agent State Online with Recurrent Generate-and-Test

Author: Samani, Amir and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Learning continually and online from a continuous stream of data is challenging, especially for a reinforcement learning agent with sequential data. When the environment only provides observations giving partial information about the state of the environment, the agent must learn the agent state based on the data stream of experience. We refer to the state learned directly from the data stream of experience as the agent state. Recurrent neural networks can learn the agent state, but the training methods are computationally expensive and sensitive to the hyper-parameters, making them unideal for online learning. This work introduces methods based on the generate-and-test approach to learn the agent state. A generate-and-test algorithm searches for state features by generating features and testing their usefulness. In this process, features useful for the agent's performance on the task are preserved, and the least useful features get replaced with newly generated features. We study the effectiveness of our methods on two online multi-step prediction problems. The first problem, trace conditioning, focuses on the agent's ability to remember a cue for a prediction multiple steps into the future. In the second problem, trace patterning, the agent needs to learn patterns in the observation signals and remember them for future predictions. We show that our proposed methods can effectively learn the agent state online and produce accurate predictions.
Published: 2021

32. Average-Reward Learning and Planning with Options

Author: Wan, Yi, Naik, Abhishek, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning
Abstract: We extend the options framework for temporal abstraction in reinforcement learning from discounted Markov decision processes (MDPs) to average-reward MDPs. Our contributions include general convergent off-policy inter-option learning algorithms, intra-option algorithms for learning values and models, as well as sample-based planning variants of our learning algorithms. Our algorithms and convergence proofs extend those recently developed by Wan, Naik, and Sutton. We also extend the notion of option-interrupting behavior from the discounted to the average-reward formulation. We show the efficacy of the proposed algorithms with experiments on a continuing version of the Four-Room domain.
Published: 2021

33. An Empirical Comparison of Off-policy Prediction Learning Algorithms in the Four Rooms Environment

Author: Ghiassian, Sina and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Many off-policy prediction learning algorithms have been proposed in the past decade, but it remains unclear which algorithms learn faster than others. We empirically compare 11 off-policy prediction learning algorithms with linear function approximation on two small tasks: the Rooms task, and the High Variance Rooms task. The tasks are designed such that learning fast in them is challenging. In the Rooms task, the product of importance sampling ratios can be as large as $2^{14}$ and can sometimes be two. To control the high variance caused by the product of the importance sampling ratios, step size should be set small, which in turn slows down learning. The High Variance Rooms task is more extreme in that the product of the ratios can become as large as $2^{14}\times 25$. This paper builds upon the empirical study of off-policy prediction learning algorithms by Ghiassian and Sutton (2021). We consider the same set of algorithms as theirs and employ the same experimental methodology. The algorithms considered are: Off-policy TD($\lambda$), five Gradient-TD algorithms, two Emphatic-TD algorithms, Tree Backup($\lambda$), Vtrace($\lambda$), and ABTD($\zeta$). We found that the algorithms' performance is highly affected by the variance induced by the importance sampling ratios. The data shows that Tree Backup($\lambda$), Vtrace($\lambda$), and ABTD($\zeta$) are not affected by the high variance as much as other algorithms but they restrict the effective bootstrapping parameter in a way that is too limiting for tasks where high variance is not present. We observed that Emphatic TD($\lambda$) tends to have lower asymptotic error than other algorithms, but might learn more slowly in some cases. We suggest algorithms for practitioners based on their problem of interest, and suggest approaches that can be applied to specific algorithms that might result in substantially improved algorithms., Comment: 13 pages
Published: 2021

34. Continual Backprop: Stochastic Gradient Descent with Persistent Randomness

Author: Dohare, Shibhansh, Sutton, Richard S., and Mahmood, A. Rupam
Subjects: Computer Science - Machine Learning
Abstract: The Backprop algorithm for learning in neural networks utilizes two mechanisms: first, stochastic gradient descent and second, initialization with small random weights, where the latter is essential to the effectiveness of the former. We show that in continual learning setups, Backprop performs well initially, but over time its performance degrades. Stochastic gradient descent alone is insufficient to learn continually; the initial randomness enables only initial learning but not continual learning. To the best of our knowledge, ours is the first result showing this degradation in Backprop's ability to learn. To address this degradation in Backprop's plasticity, we propose an algorithm that continually injects random features alongside gradient descent using a new generate-and-test process. We call this the \textit{Continual Backprop} algorithm. We show that, unlike Backprop, Continual Backprop is able to continually adapt in both supervised and reinforcement learning (RL) problems. Continual Backprop has the same computational complexity as Backprop and can be seen as a natural extension of Backprop for continual learning.
Published: 2021

35. An Empirical Comparison of Off-policy Prediction Learning Algorithms on the Collision Task

Author: Ghiassian, Sina and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Off-policy prediction -- learning the value function for one policy from data generated while following another policy -- is one of the most challenging subproblems in reinforcement learning. This paper presents empirical results with eleven prominent off-policy learning algorithms that use linear function approximation: five Gradient-TD methods, two Emphatic-TD methods, Off-policy TD($\lambda$), Vtrace, and versions of Tree Backup and ABQ modified to apply to a prediction setting. Our experiments used the Collision task, a small idealized off-policy problem analogous to that of an autonomous car trying to predict whether it will collide with an obstacle. We assessed the performance of the algorithms according to their learning rate, asymptotic error level, and sensitivity to step-size and bootstrapping parameters. By these measures, the eleven algorithms can be partially ordered on the Collision task. In the top tier, the two Emphatic-TD algorithms learned the fastest, reached the lowest errors, and were robust to parameter settings. In the middle tier, the five Gradient-TD algorithms and Off-policy TD($\lambda$) were more sensitive to the bootstrapping parameter. The bottom tier comprised Vtrace, Tree Backup, and ABQ; these algorithms were no faster and had higher asymptotic error than the others. Our results are definitive for this task, though of course experiments with more tasks are needed before an overall assessment of the algorithms' merits can be made.
Published: 2021

36. Planning with Expectation Models for Control

Author: Kudashkina, Katya, Wan, Yi, Naik, Abhishek, and Sutton, Richard S.
Subjects: Computer Science - Artificial Intelligence
Abstract: In model-based reinforcement learning (MBRL), Wan et al. (2019) showed conditions under which the environment model could produce the expectation of the next feature vector rather than the full distribution, or a sample thereof, with no loss in planning performance. Such expectation models are of interest when the environment is stochastic and non-stationary, and the model is approximate, such as when it is learned using function approximation. In these cases a full distribution model may be impractical and a sample model may be either more expensive computationally or of high variance. Wan et al. considered only planning for prediction to evaluate a fixed policy. In this paper, we treat the control case - planning to improve and find a good approximate policy. We prove that planning with an expectation model must update a state-value function, not an action-value function as previously suggested (e.g., Sorg & Singh, 2010). This opens the question of how planning influences action selections. We consider three strategies for this and present general MBRL algorithms for each. We identify the strengths and weaknesses of these algorithms in computational experiments. Our algorithms and experiments are the first to treat MBRL with expectation models in a general setting.
Published: 2021

37. Does the Adam Optimizer Exacerbate Catastrophic Forgetting?

Author: Ashley, Dylan R., Ghiassian, Sina, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning, I.2.6
Abstract: Catastrophic forgetting remains a severe hindrance to the broad application of artificial neural networks (ANNs), however, it continues to be a poorly understood phenomenon. Despite the extensive amount of work on catastrophic forgetting, we argue that it is still unclear how exactly the phenomenon should be quantified, and, moreover, to what degree all of the choices we make when designing learning systems affect the amount of catastrophic forgetting. We use various testbeds from the reinforcement learning and supervised learning literature to (1) provide evidence that the choice of which modern gradient-based optimization algorithm is used to train an ANN has a significant impact on the amount of catastrophic forgetting and show that-surprisingly-in many instances classical algorithms such as vanilla SGD experience less catastrophic forgetting than the more modern algorithms such as Adam. We empirically compare four different existing metrics for quantifying catastrophic forgetting and (2) show that the degree to which the learning systems experience catastrophic forgetting is sufficiently sensitive to the metric used that a change from one principled metric to another is enough to change the conclusions of a study dramatically. Our results suggest that a much more rigorous experimental methodology is required when looking at catastrophic forgetting. Based on our results, we recommend inter-task forgetting in supervised learning must be measured with both retention and relearning metrics concurrently, and intra-task forgetting in reinforcement learning must-at the very least-be measured with pairwise interference., Comment: 9 pages in main text + 3 pages of references + 16 pages of appendices, 6 figures in main text + 21 figures in appendices, 6 tables in appendices; source code available at https://github.com/dylanashley/catastrophic-forgetting/tree/arxiv
Published: 2021

38. Average-Reward Off-Policy Policy Evaluation with Function Approximation

Author: Zhang, Shangtong, Wan, Yi, Sutton, Richard S., and Whiteson, Shimon
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We consider off-policy policy evaluation with function approximation (FA) in average-reward MDPs, where the goal is to estimate both the reward rate and the differential value function. For this problem, bootstrapping is necessary and, along with off-policy learning and FA, results in the deadly triad (Sutton & Barto, 2018). To address the deadly triad, we propose two novel algorithms, reproducing the celebrated success of Gradient TD algorithms in the average-reward setting. In terms of estimating the differential value function, the algorithms are the first convergent off-policy linear function approximation algorithms. In terms of estimating the reward rate, the algorithms are the first convergent off-policy linear function approximation algorithms that do not require estimating the density ratio. We demonstrate empirically the advantage of the proposed algorithms, as well as their nonlinear variants, over a competitive density-ratio-based approach, in a simple domain as well as challenging robot simulation tasks., Comment: ICML 2021
Published: 2021

39. From Eye-blinks to State Construction: Diagnostic Benchmarks for Online Representation Learning

Author: Rafiee, Banafsheh, Abbas, Zaheer, Ghiassian, Sina, Kumaraswamy, Raksha, Sutton, Richard, Ludvig, Elliot, and White, Adam
Subjects: Computer Science - Artificial Intelligence
Abstract: We present three new diagnostic prediction problems inspired by classical-conditioning experiments to facilitate research in online prediction learning. Experiments in classical conditioning show that animals such as rabbits, pigeons, and dogs can make long temporal associations that enable multi-step prediction. To replicate this remarkable ability, an agent must construct an internal state representation that summarizes its interaction history. Recurrent neural networks can automatically construct state and learn temporal associations. However, the current training methods are prohibitively expensive for online prediction -- continual learning on every time step -- which is the focus of this paper. Our proposed problems test the learning capabilities that animals readily exhibit and highlight the limitations of the current recurrent learning methods. While the proposed problems are nontrivial, they are still amenable to extensive testing and analysis in the small-compute regime, thereby enabling researchers to study issues in isolation, ultimately accelerating progress towards scalable online representation learning methods.
Published: 2020

40. Understanding the Pathologies of Approximate Policy Evaluation when Combined with Greedification in Reinforcement Learning

Author: Young, Kenny and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Despite empirical success, the theory of reinforcement learning (RL) with value function approximation remains fundamentally incomplete. Prior work has identified a variety of pathological behaviours that arise in RL algorithms that combine approximate on-policy evaluation and greedification. One prominent example is policy oscillation, wherein an algorithm may cycle indefinitely between policies, rather than converging to a fixed point. What is not well understood however is the quality of the policies in the region of oscillation. In this paper we present simple examples illustrating that in addition to policy oscillation and multiple fixed points -- the same basic issue can lead to convergence to the worst possible policy for a given approximation. Such behaviours can arise when algorithms optimize evaluation accuracy weighted by the distribution of states that occur under the current policy, but greedify based on the value of states which are rare or nonexistent under this distribution. This means the values used for greedification are unreliable and can steer the policy in undesirable directions. Our observation that this can lead to the worst possible policy shows that in a general sense such algorithms are unreliable. The existence of such examples helps to narrow the kind of theoretical guarantees that are possible and the kind of algorithmic ideas that are likely to be helpful. We demonstrate analytically and experimentally that such pathological behaviours can impact a wide range of RL and dynamic programming algorithms; such behaviours can arise both with and without bootstrapping, and with linear function approximation as well as with more complex parameterized functions like neural networks.
Published: 2020

41. Document-editing Assistants and Model-based Reinforcement Learning as a Path to Conversational AI

Author: Kudashkina, Katya, Pilarski, Patrick M., and Sutton, Richard S.
Subjects: Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Intelligent assistants that follow commands or answer simple questions, such as Siri and Google search, are among the most economically important applications of AI. Future conversational AI assistants promise even greater capabilities and a better user experience through a deeper understanding of the domain, the user, or the user's purposes. But what domain and what methods are best suited to researching and realizing this promise? In this article we argue for the domain of voice document editing and for the methods of model-based reinforcement learning. The primary advantages of voice document editing are that the domain is tightly scoped and that it provides something for the conversation to be about (the document) that is delimited and fully accessible to the intelligent assistant. The advantages of reinforcement learning in general are that its methods are designed to learn from interaction without explicit instruction and that it formalizes the purposes of the assistant. Model-based reinforcement learning is needed in order to genuinely understand the domain of discourse and thereby work efficiently with the user to achieve their goals. Together, voice document editing and model-based reinforcement learning comprise a promising research direction for achieving conversational AI., Comment: Currently under review
Published: 2020

42. Inverse Policy Evaluation for Value-based Sequential Decision-making

Author: Chan, Alan, de Asis, Kris, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Value-based methods for reinforcement learning lack generally applicable ways to derive behavior from a value function. Many approaches involve approximate value iteration (e.g., $Q$-learning), and acting greedily with respect to the estimates with an arbitrary degree of entropy to ensure that the state-space is sufficiently explored. Behavior based on explicit greedification assumes that the values reflect those of \textit{some} policy, over which the greedy policy will be an improvement. However, value-iteration can produce value functions that do not correspond to \textit{any} policy. This is especially relevant in the function-approximation regime, when the true value function can't be perfectly represented. In this work, we explore the use of \textit{inverse policy evaluation}, the process of solving for a likely policy given a value function, for deriving behavior from a value function. We provide theoretical and empirical results to show that inverse policy evaluation, combined with an approximate value iteration algorithm, is a feasible method for value-based control., Comment: Submitted to NeurIPS 2020
Published: 2020

43. Learning and Planning in Average-Reward Markov Decision Processes

Author: Wan, Yi, Naik, Abhishek, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use., Comment: In Proceedings of ICML 2021
Published: 2020

44. Learning Sparse Representations Incrementally in Deep Reinforcement Learning

Author: Hernandez-Garcia, J. Fernando and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Sparse representations have been shown to be useful in deep reinforcement learning for mitigating catastrophic interference and improving the performance of agents in terms of cumulative reward. Previous results were based on a two step process were the representation was learned offline and the action-value function was learned online afterwards. In this paper, we investigate if it is possible to learn a sparse representation and the action-value function simultaneously and incrementally. We investigate this question by employing several regularization techniques and observing how they affect sparsity of the representation learned by a DQN agent in two different benchmark domains. Our results show that with appropriate regularization it is possible to increase the sparsity of the representations learned by DQN agents. Moreover, we found that learning sparse representations also resulted in improved performance in terms of cumulative reward. Finally, we found that the performance of the agents that learned a sparse representation was more robust to the size of the experience replay buffer. This last finding supports the long standing hypothesis that the overlap in representations learned by deep neural networks is the leading cause of catastrophic interference.
Published: 2019

45. Reward-respecting subtasks for model-based reinforcement learning

Author: Sutton, Richard S., Machado, Marlos C., Holland, G. Zacharias, Szepesvari, David, Timbers, Finbarr, Tanner, Brian, and White, Adam
Published: 2023
Full Text: View/download PDF

46. Discounted Reinforcement Learning Is Not an Optimization Problem

Author: Naik, Abhishek, Shariff, Roshan, Yasui, Niko, Yao, Hengshuai, and Sutton, Richard S.
Subjects: Computer Science - Artificial Intelligence
Abstract: Discounted reinforcement learning is fundamentally incompatible with function approximation for control in continuing tasks. It is not an optimization problem in its usual formulation, so when using function approximation there is no optimal policy. We substantiate these claims, then go on to address some misconceptions about discounting and its connection to the average reward formulation. We encourage researchers to adopt rigorous optimization approaches, such as maximizing average reward, for reinforcement learning in continuing tasks., Comment: Accepted for presentation at the Optimization Foundations of Reinforcement Learning Workshop at NeurIPS 2019
Published: 2019

47. Fixed-Horizon Temporal Difference Methods for Stable Reinforcement Learning

Author: De Asis, Kristopher, Chan, Alan, Pitis, Silviu, Sutton, Richard S., and Graves, Daniel
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, I.2
Abstract: We explore fixed-horizon temporal difference (TD) methods, reinforcement learning algorithms for a new kind of value function that predicts the sum of rewards over a $\textit{fixed}$ number of future time steps. To learn the value function for horizon $h$, these algorithms bootstrap from the value function for horizon $h-1$, or some shorter horizon. Because no value function bootstraps from itself, fixed-horizon methods are immune to the stability problems that plague other off-policy TD methods using function approximation (also known as "the deadly triad"). Although fixed-horizon methods require the storage of additional value functions, this gives the agent additional predictive power, while the added complexity can be substantially reduced via parallel updates, shared weights, and $n$-step bootstrapping. We show how to use fixed-horizon value functions to solve reinforcement learning problems competitively with methods such as Q-learning that learn conventional value functions. We also prove convergence of fixed-horizon temporal difference methods with linear and general function approximation. Taken together, our results establish fixed-horizon TD methods as a viable new way of avoiding the stability problems of the deadly triad., Comment: AAAI 2020
Published: 2019

48. Behaviour Suite for Reinforcement Learning

Author: Osband, Ian, Doron, Yotam, Hessel, Matteo, Aslanides, John, Sezener, Eren, Saraiva, Andre, McKinney, Katrina, Lattimore, Tor, Szepesvari, Csaba, Singh, Satinder, Van Roy, Benjamin, Sutton, Richard, Silver, David, and Van Hasselt, Hado
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: This paper introduces the Behaviour Suite for Reinforcement Learning, or bsuite for short. bsuite is a collection of carefully-designed experiments that investigate core capabilities of reinforcement learning (RL) agents with two objectives. First, to collect clear, informative and scalable problems that capture key issues in the design of general and efficient learning algorithms. Second, to study agent behaviour through their performance on these shared benchmarks. To complement this effort, we open source github.com/deepmind/bsuite, which automates evaluation and analysis of any agent on bsuite. This library facilitates reproducible and accessible research on the core issues in RL, and ultimately the design of superior learning algorithms. Our code is Python, and easy to use within existing projects. We include examples with OpenAI Baselines, Dopamine as well as new reference implementations. Going forward, we hope to incorporate more excellent experiments from the research community, and commit to a periodic review of bsuite from a committee of prominent researchers.
Published: 2019

49. Planning with Expectation Models

Author: Wan, Yi, Abbas, Zaheer, White, Adam, White, Martha, and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Distribution and sample models are two popular model choices in model-based reinforcement learning (MBRL). However, learning these models can be intractable, particularly when the state and action spaces are large. Expectation models, on the other hand, are relatively easier to learn due to their compactness and have also been widely used for deterministic environments. For stochastic environments, it is not obvious how expectation models can be used for planning as they only partially characterize a distribution. In this paper, we propose a sound way of using approximate expectation models for MBRL. In particular, we 1) show that planning with an expectation model is equivalent to planning with a distribution model if the state value function is linear in state features, 2) analyze two common parametrization choices for approximating the expectation: linear and non-linear expectation models, 3) propose a sound model-based policy evaluation algorithm and present its convergence results, and 4) empirically demonstrate the effectiveness of the proposed planning algorithm.
Published: 2019

50. Learning Feature Relevance Through Step Size Adaptation in Temporal-Difference Learning

Author: Kearney, Alex, Veeriah, Vivek, Travnik, Jaden, Pilarski, Patrick M., and Sutton, Richard S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: There is a long history of using meta learning as representation learning, specifically for determining the relevance of inputs. In this paper, we examine an instance of meta-learning in which feature relevance is learned by adapting step size parameters of stochastic gradient descent---building on a variety of prior work in stochastic approximation, machine learning, and artificial neural networks. In particular, we focus on stochastic meta-descent introduced in the Incremental Delta-Bar-Delta (IDBD) algorithm for setting individual step sizes for each feature of a linear function approximator. Using IDBD, a feature with large or small step sizes will have a large or small impact on generalization from training examples. As a main contribution of this work, we extend IDBD to temporal-difference (TD) learning---a form of learning which is effective in sequential, non i.i.d. problems. We derive a variety of IDBD generalizations for TD learning, demonstrating that they are able to distinguish which features are relevant and which are not. We demonstrate that TD IDBD is effective at learning feature relevance in both an idealized gridworld and a real-world robotic prediction task.
Published: 2019

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Region

Database

Publisher

2,690 results on '"Sutton, Richard"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources