Author: "Lazaric, Alessandro" / Language: undetermined - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Lazaric, Alessandro"' showing total 27 results

1. Learning Goal-Conditioned Policies Offline with Self-Supervised Reward Shaping

Author: Mezghani, Lina, Sukhbaatar, Sainbayar, Bojanowski, Piotr, Lazaric, Alessandro, and Alahari, Karteek
Affiliation: Apprentissage de modèles à partir de données massives (Thoth); Inria Grenoble - Rhône-Alpes; Institut National de Recherche en Informatique et en Automatique (Inria); Laboratoire Jean Kuntzmann (LJK); Centre National de la Recherche Scientifique (CNRS); Université Grenoble Alpes (UGA); Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP); Meta AI
Funding: ANR-18-CE23-0011 (AVENUE, Réseau de mémoire visuelle pour l'interprétation de scènes, 2018)
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Machine Learning, Self-Supervised Learning, Artificial Intelligence (cs.AI), [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Computer Science - Artificial Intelligence, Goal-Conditioned RL, [INFO.INFO-CV]Computer Science [cs]/Computer Vision and Pattern Recognition [cs.CV], Offline RL, Robotics (cs.RO), [INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI], Machine Learning (cs.LG)
Abstract: Developing agents that can execute multiple skills by learning from pre-collected datasets is an important problem in robotics, where online interaction with the environment is extremely time-consuming. Moreover, manually designing reward functions for every single desired skill is prohibitive. Prior works targeted these challenges by learning goal-conditioned policies from offline datasets without manually specified rewards, through hindsight relabelling. These methods suffer from the issue of sparsity of rewards, and fail at long-horizon tasks. In this work, we propose a novel self-supervised learning phase on the pre-collected dataset to understand the structure and the dynamics of the model, and shape a dense reward function for learning policies offline. We evaluate our method on three continuous control tasks, and show that our model significantly outperforms existing approaches, especially on tasks that involve long-term planning.
Comment: Code: https://github.com/facebookresearch/go-fresh
Published: 2023

2. Linear Convergence of Natural Policy Gradient Methods with Log-Linear Policies

Author: Yuan, Rui, Du, Simon S., Gower, Robert M., Lazaric, Alessandro, and Xiao, Lin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We consider infinite-horizon discounted Markov decision processes and study the convergence rates of the natural policy gradient (NPG) and the Q-NPG methods with the log-linear policy class. Using the compatible function approximation framework, both methods with log-linear policies can be written as inexact versions of the policy mirror descent (PMD) method. We show that both methods attain linear convergence rates and $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexities using a simple, non-adaptive geometrically increasing step size, without resorting to entropy or other strongly convex regularization. Lastly, as a byproduct, we obtain sublinear convergence rates for both methods with arbitrary constant step size.
Comment: This version adds a table of comparison for the literature review. The paper is published as a conference paper at ICLR 2023.
Published: 2022
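
The abstract of result 2 describes NPG with log-linear policies as an inexact policy mirror descent (PMD) method run with a geometrically increasing step size. A minimal sketch of that update is given below for the tabular softmax policy, which is the special case of a log-linear policy with one-hot features. Everything specific here is an assumption for illustration and is not taken from the paper: exact policy evaluation replaces the sampled, inexact one, the growth factor of the step size is set to 1/gamma, and the names `policy_evaluation`, `softmax`, and `pmd_softmax` are hypothetical.

```python
import numpy as np


def policy_evaluation(P, r, pi, gamma):
    """Exact Q^pi for a tabular MDP. Shapes: P (S, A, S), r (S, A), pi (S, A)."""
    S, _ = r.shape
    P_pi = np.einsum("sa,sat->st", pi, P)   # state-to-state transitions under pi
    r_pi = np.sum(pi * r, axis=1)           # expected one-step reward under pi
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return r + gamma * P @ v                # Q(s, a) = r(s, a) + gamma * E[V(s')]


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def pmd_softmax(P, r, gamma, iters=100, eta0=1.0):
    """PMD-style update pi_{k+1} ∝ pi_k * exp(eta_k * Q^{pi_k}), kept in log space."""
    logits = np.zeros_like(r)
    eta = eta0
    for _ in range(iters):
        q = policy_evaluation(P, r, softmax(logits), gamma)
        logits = logits + eta * q   # multiplicative update expressed on the logits
        eta /= gamma                # assumed geometric growth: eta_{k+1} = eta_k / gamma
    return softmax(logits)
```

Keeping the policy as logits avoids overflow once the step size has grown by many orders of magnitude; with very large steps each update is close to a greedy policy-improvement step.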

3. Reaching Goals is Hard: Settling the Sample Complexity of the Stochastic Shortest Path

Author: Chen, Liyu, Tirinzoni, Andrea, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the sample complexity of learning an $\epsilon$-optimal policy in the Stochastic Shortest Path (SSP) problem. We first derive sample complexity bounds when the learner has access to a generative model. We show that there exists a worst-case SSP instance with $S$ states, $A$ actions, minimum cost $c_{\min}$, and maximum expected cost of the optimal policy over all states $B_{\star}$, where any algorithm requires at least $\Omega(SAB_{\star}^3/(c_{\min}\epsilon^2))$ samples to return an $\epsilon$-optimal policy with high probability. Surprisingly, this implies that whenever $c_{\min}=0$ an SSP problem may not be learnable, thus revealing that learning in SSPs is strictly harder than in the finite-horizon and discounted settings. We complement this result with lower bounds when prior knowledge of the hitting time of the optimal policy is available and when we restrict optimality by competing against policies with bounded hitting time. Finally, we design an algorithm with matching upper bounds in these cases. This settles the sample complexity of learning $\epsilon$-optimal policies in SSP with generative models. We also initiate the study of learning $\epsilon$-optimal policies without access to a generative model (i.e., the so-called best-policy identification problem), and show that sample-efficient learning is impossible in general. On the other hand, efficient learning can be made possible if we assume the agent can directly reach the goal state from any state by paying a fixed cost. We then establish the first upper and lower bounds under this assumption. Finally, using similar analytic tools, we prove that horizon-free regret is impossible in SSPs under general costs, resolving an open problem in (Tarbouriech et al., 2021c).
Published: 2022

4. Contextual bandits with concave rewards, and an application to fair ranking

Author: Do, Virginie, Dohmatob, Elvis, Pirotta, Matteo, Lazaric, Alessandro, and Usunier, Nicolas
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computers and Society, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Computers and Society (cs.CY), Machine Learning (stat.ML), Information Retrieval (cs.IR), Computer Science - Information Retrieval, Machine Learning (cs.LG)
Abstract: We consider Contextual Bandits with Concave Rewards (CBCR), a multi-objective bandit problem where the desired trade-off between the rewards is defined by a known concave objective function, and the reward vector depends on an observed stochastic context. We present the first algorithm with provably vanishing regret for CBCR without restrictions on the policy space, whereas prior works were restricted to finite policy spaces or tabular representations. Our solution is based on a geometric interpretation of CBCR algorithms as optimization algorithms over the convex set of expected rewards spanned by all stochastic policies. Building on Frank-Wolfe analyses in constrained convex optimization, we derive a novel reduction from the CBCR regret to the regret of a scalar-reward bandit problem. We illustrate how to apply the reduction off-the-shelf to obtain algorithms for CBCR with both linear and general reward functions, in the case of non-combinatorial actions. Motivated by fairness in recommendation, we describe a special case of CBCR with rankings and fairness-aware objectives, leading to the first algorithm with regret guarantees for contextual combinatorial bandits with fairness of exposure.
Comment: ICLR 2023
Published: 2022

5. On the Complexity of Representation Learning in Contextual Linear Bandits

Author: Tirinzoni, Andrea, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In contextual linear bandits, the reward function is assumed to be a linear combination of an unknown reward vector and a given embedding of context-arm pairs. In practice, the embedding is often learned at the same time as the reward vector, thus leading to an online representation learning problem. Existing approaches to representation learning in contextual bandits are either very generic (e.g., model-selection techniques or algorithms for learning with arbitrary function classes) or specialized to particular structures (e.g., nested features or representations with certain spectral properties). As a result, the understanding of the cost of representation learning in contextual linear bandits is still limited. In this paper, we take a systematic approach to the problem and provide a comprehensive study through an instance-dependent perspective. We show that representation learning is fundamentally more complex than linear bandits (i.e., learning with a given representation). In particular, learning with a given set of representations is never simpler than learning with the worst realizable representation in the set, while we show cases where it can be arbitrarily harder. We complement this result with an extensive discussion of how it relates to existing literature and we illustrate positive instances where representation learning is as complex as learning with a fixed representation and where sub-logarithmic regret is achievable.
Published: 2022

6. Scalable Representation Learning in Linear Contextual Bandits with Constant Regret Guarantees

Author: Tirinzoni, Andrea, Papini, Matteo, Touati, Ahmed, Lazaric, Alessandro, and Pirotta, Matteo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We study the problem of representation learning in stochastic contextual linear bandits. While the primary concern in this domain is usually to find realizable representations (i.e., those that allow predicting the reward function at any context-action pair exactly), it has been recently shown that representations with certain spectral properties (called HLS) may be more effective for the exploration-exploitation task, enabling LinUCB to achieve constant (i.e., horizon-independent) regret. In this paper, we propose BanditSRL, a representation learning algorithm that combines a novel constrained optimization problem to learn a realizable representation with good spectral properties with a generalized likelihood ratio test to exploit the recovered representation and avoid excessive exploration. We prove that BanditSRL can be paired with any no-regret algorithm and achieve constant regret whenever an HLS representation is available. Furthermore, BanditSRL can be easily combined with deep neural networks and we show how regularizing towards HLS representations is beneficial in standard benchmarks.
Comment: Accepted at NeurIPS 2022
Published: 2022

7. Scaling Gaussian Process Optimization by Evaluating a Few Unique Candidates Multiple Times

Author: Calandriello, Daniele, Carratino, Luigi, Lazaric, Alessandro, Valko, Michal, and Rosasco, Lorenzo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Computing a Gaussian process (GP) posterior has a computational cost cubic in the number of historical points. A reformulation of the same GP posterior highlights that this complexity mainly depends on how many \emph{unique} historical points are considered. This can have important implications in active learning settings, where the set of historical points is constructed sequentially by the learner. We show that sequential black-box optimization based on GPs (GP-Opt) can be made efficient by sticking to a candidate solution for multiple evaluation steps and switching only when necessary. Limiting the number of switches also limits the number of unique points in the history of the GP. Thus, the efficient GP reformulation can be used to exactly and cheaply compute the posteriors required to run the GP-Opt algorithms. This approach is especially useful in real-world applications of GP-Opt with high switch costs (e.g., switching chemicals in wet labs, data/model loading in hyperparameter optimization). As examples of this meta-approach, we modify two well-established GP-Opt algorithms, GP-UCB and GP-EI, to switch candidates as infrequently as possible, adapting rules from batched GP-Opt. These versions preserve all the theoretical no-regret guarantees while improving practical aspects of the algorithms such as runtime, memory complexity, and the ability to batch candidates and evaluate them in parallel.
Published: 2022
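
The abstract of result 7 states that the cost of a GP posterior depends mainly on the number of unique historical points. One standard way to see this, sketched below under assumptions that are not taken from the paper, is that repeated evaluations of the same point with i.i.d. Gaussian noise can be replaced by their average with the noise variance divided by the number of repetitions. The squared-exponential kernel, the 1-D inputs, and the function names `rbf`, `gp_posterior`, and `gp_posterior_unique` are all hypothetical choices for illustration.

```python
import numpy as np


def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)


def gp_posterior(x_train, y_train, noise_var, x_test):
    """GP posterior mean and variance; noise_var is a scalar or a per-point vector."""
    noise = np.broadcast_to(noise_var, x_train.shape)
    K = rbf(x_train, x_train) + np.diag(noise)
    k_star = rbf(x_test, x_train)
    mean = k_star @ np.linalg.solve(K, y_train)
    # Prior variance is 1 because rbf(x, x) = 1 for this kernel.
    var = 1.0 - np.sum(k_star * np.linalg.solve(K, k_star.T).T, axis=1)
    return mean, var


def gp_posterior_unique(x_train, y_train, noise_var, x_test):
    """Same posterior, computed from the unique inputs only.

    Observations sharing an input are averaged, and the noise variance of the
    averaged observation is noise_var divided by the repetition count.
    """
    x_unique, inverse, counts = np.unique(
        x_train, return_inverse=True, return_counts=True
    )
    y_mean = np.bincount(inverse, weights=y_train) / counts
    return gp_posterior(x_unique, y_mean, noise_var / counts, x_test)
```

With 40 evaluations spread over 4 candidates, `gp_posterior` solves a 40 x 40 system while `gp_posterior_unique` solves a 4 x 4 one, which is the saving the abstract attributes to limiting the number of switches.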

8. Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Author: Tarbouriech, Jean, Zhou, Runlong, Du, Simon, Pirotta, Matteo, Valko, Michal, and Lazaric, Alessandro
Affiliation: Facebook AI Research [Paris] (FAIR), Facebook; Scool (Scool), Inria Lille - Nord Europe; Institut National de Recherche en Informatique et en Automatique (Inria); Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL); Centrale Lille; Université de Lille; Centre National de la Recherche Scientifique (CNRS); Tsinghua University [Beijing] (THU); Paul G. Allen School of Computer Science and Engineering, University of Washington [Seattle]; DeepMind [Paris]
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Machine Learning (cs.LG)
Abstract: We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to induce an optimistic SSP problem whose associated value iteration scheme is guaranteed to converge. We prove that EB-SSP achieves the minimax regret rate $\tilde{O}(B_{\star} \sqrt{S A K})$, where $K$ is the number of episodes, $S$ is the number of states, $A$ is the number of actions, and $B_{\star}$ bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of $B_{\star}$, nor of $T_{\star}$, which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of $T_{\star}$ is available) where the regret only contains a logarithmic dependence on $T_{\star}$, thus yielding the first (nearly) horizon-free regret bound beyond the finite-horizon MDP setting.
Comment: NeurIPS 2021
Published: 2021

9. A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs

Author: Tirinzoni, Andrea, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: Computer Science::Machine Learning, FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts for the dynamics of the MDP. We provide a characterization of our lower-bound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a "difficult" MDP instance, where the novel constraint based on the dynamics leads to a larger lower-bound (i.e., a larger regret) compared to the classical analysis. 2) We then show that our lower-bound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all. We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
Published: 2021

10. Direct then Diffuse: Incremental Unsupervised Skill Discovery for State Covering and Goal Reaching

Author: Kamienny, Pierre-Alexandre, Tarbouriech, Jean, Lamprier, Sylvain, Lazaric, Alessandro, and Denoyer, Ludovic
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Learning meaningful behaviors in the absence of reward is a difficult problem in reinforcement learning. A desirable and challenging unsupervised objective is to learn a set of diverse skills that provide a thorough coverage of the state space while being directed, i.e., reliably reaching distinct regions of the environment. In this paper, we build on the mutual information framework for skill discovery and introduce UPSIDE, which addresses the coverage-directedness trade-off in the following ways: 1) We design policies with a decoupled structure of a directed skill, trained to reach a specific region, followed by a diffusing part that induces a local coverage. 2) We optimize policies by maximizing their number under the constraint that each of them reaches distinct regions of the environment (i.e., they are sufficiently discriminable) and prove that this serves as a lower bound to the original mutual information objective. 3) Finally, we compose the learned directed skills into a growing tree that adaptively covers the environment. We illustrate in several navigation and control environments how the skills learned by UPSIDE solve sparse-reward downstream tasks better than existing baselines.
Comment: ICLR 2022
Published: 2021

11. A general sample complexity analysis of vanilla policy gradient

Author: Yuan, Rui, Gower, Robert M., and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We adapt recent tools developed for the analysis of Stochastic Gradient Descent (SGD) in non-convex optimization to obtain convergence and sample complexity guarantees for the vanilla policy gradient (PG). Our only assumptions are that the expected return is smooth w.r.t. the policy parameters, that its $H$-step truncated gradient is close to the exact gradient, and a certain ABC assumption. This assumption requires the second moment of the estimated gradient to be bounded by $A\geq 0$ times the suboptimality gap, $B \geq 0$ times the norm of the full batch gradient and an additive constant $C \geq 0$, or any combination of the aforementioned. We show that the ABC assumption is more general than the commonly used assumptions on the policy space to prove convergence to a stationary point. We provide a single convergence theorem that recovers the $\widetilde{\mathcal{O}}(\epsilon^{-4})$ sample complexity of PG to a stationary point. Our results also afford greater flexibility in the choice of hyperparameters such as the step size and the batch size $m$, including the single trajectory case (i.e., $m=1$). When an additional relaxed weak gradient domination assumption is available, we establish a novel global optimum convergence theory of PG with $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity. We then instantiate our theorems in different settings, where we both recover existing results and obtain improved sample complexity, e.g., $\widetilde{\mathcal{O}}(\epsilon^{-3})$ sample complexity for the convergence to the global optimum for Fisher-non-degenerate parametrized policies.
Comment: Accepted at AISTATS 2022. This version updates references and adds acknowledgement to Matteo Papini who greatly improved our work before the submission.
Published: 2021

12. Mastering Visual Continuous Control: Improved Data-Augmented Reinforcement Learning

Author: Yarats, Denis, Fergus, Rob, Lazaric, Alessandro, and Pinto, Lerrel
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: We present DrQ-v2, a model-free reinforcement learning (RL) algorithm for visual continuous control. DrQ-v2 builds on DrQ, an off-policy actor-critic approach that uses data augmentation to learn directly from pixels. We introduce several improvements that yield state-of-the-art results on the DeepMind Control Suite. Notably, DrQ-v2 is able to solve complex humanoid locomotion tasks directly from pixel observations, previously unattained by model-free RL. DrQ-v2 is conceptually simple, easy to implement, and provides significantly better computational footprint compared to prior work, with the majority of tasks taking just 8 hours to train on a single GPU. Finally, we publicly release DrQ-v2's implementation to provide RL practitioners with a strong and computationally efficient baseline.
Published: 2021

13. Top $K$ Ranking for Multi-Armed Bandit with Noisy Evaluations

Author: Garcelon, Evrard, Avadhanula, Vashist, Lazaric, Alessandro, and Pirotta, Matteo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider a multi-armed bandit setting where, at the beginning of each round, the learner receives noisy, independent, and possibly biased \emph{evaluations} of the true reward of each arm and it selects $K$ arms with the objective of accumulating as much reward as possible over $T$ rounds. Under the assumption that at each round the true reward of each arm is drawn from a fixed distribution, we derive different algorithmic approaches and theoretical guarantees depending on how the evaluations are generated. First, we show a $\widetilde{O}(T^{2/3})$ regret in the general case when the observation functions are a generalized linear function of the true rewards. On the other hand, we show that an improved $\widetilde{O}(\sqrt{T})$ regret can be derived when the observation functions are noisy linear functions of the true rewards. Finally, we report an empirical validation that confirms our theoretical findings, provides a thorough comparison to alternative approaches, and further supports the interest of this setting in practice.
Published: 2021

14. Meta-learning with Stochastic Linear Bandits

Author: Cella, Leonardo, Lazaric, Alessandro, and Pontil, Massimiliano
Subjects: Computer Science::Machine Learning, FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We investigate meta-learning procedures in the setting of stochastic linear bandit tasks. The goal is to select a learning algorithm which works well on average over a class of bandit tasks that are sampled from a task-distribution. Inspired by recent work on learning-to-learn linear regression, we consider a class of bandit algorithms that implement a regularized version of the well-known OFUL algorithm, where the regularization is a squared Euclidean distance to a bias vector. We first study the benefit of the biased OFUL algorithm in terms of regret minimization. We then propose two strategies to estimate the bias within the learning-to-learn setting. We show, both theoretically and experimentally, that when the number of tasks grows and the variance of the task-distribution is small, our strategies have a significant advantage over learning the tasks in isolation.
Published: 2020
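
The abstract of result 14 describes a regularized OFUL variant whose regularizer is a squared Euclidean distance to a bias vector. The sketch below shows only the point estimate such a regularizer leads to, namely least squares shrunk toward a bias vector `h` instead of toward zero. It does not reproduce the confidence sets or the arm-selection rule of the algorithm, and the name `biased_ridge` and its signature are hypothetical.

```python
import numpy as np


def biased_ridge(X, y, lam, h):
    """Minimize ||X @ theta - y||^2 + lam * ||theta - h||^2 over theta.

    Setting the gradient to zero gives the closed form
    theta = (X^T X + lam * I)^{-1} (X^T y + lam * h).
    """
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ y + lam * h)
```

With no observations the estimate equals `h`, and with `h` set to zero it reduces to ordinary ridge regression, which is why a bias vector close to the true task parameters can help when tasks are similar.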

15. A Provably Efficient Sample Collection Strategy for Reinforcement Learning

Author: Tarbouriech, Jean, Pirotta, Matteo, Valko, Michal, and Lazaric, Alessandro
Affiliation: Facebook AI Research [Paris] (FAIR), Facebook; Scool (Scool), Inria Lille - Nord Europe; Institut National de Recherche en Informatique et en Automatique (Inria); Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL); Centrale Lille; Université de Lille; Centre National de la Recherche Scientifique (CNRS); DeepMind [Paris]
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: One of the challenges in online reinforcement learning (RL) is that the agent needs to trade off the exploration of the environment and the exploitation of the samples to optimize its behavior. Whether we optimize for regret, sample complexity, state-space coverage or model estimation, we need to strike a different exploration-exploitation trade-off. In this paper, we propose to tackle the exploration-exploitation problem following a decoupled approach composed of: 1) An "objective-specific" algorithm that (adaptively) prescribes how many samples to collect at which states, as if it has access to a generative model (i.e., a simulator of the environment); 2) An "objective-agnostic" sample collection exploration strategy responsible for generating the prescribed samples as fast as possible. Building on recent methods for exploration in the stochastic shortest path problem, we first provide an algorithm that, given as input the number of samples $b(s,a)$ needed in each state-action pair, requires $\tilde{O}(B D + D^{3/2} S^2 A)$ time steps to collect the $B=\sum_{s,a} b(s,a)$ desired samples, in any unknown communicating MDP with $S$ states, $A$ actions and diameter $D$. Then we show how this general-purpose exploration algorithm can be paired with "objective-specific" strategies that prescribe the sample requirements to tackle a variety of settings -- e.g., model estimation, sparse reward discovery, goal-free cost-free exploration in communicating MDPs -- for which we obtain improved or novel sample complexity guarantees.
Comment: NeurIPS 2021
Published: 2020

16. Learning Near Optimal Policies with Low Inherent Bellman Error

Author: Zanette, Andrea, Lazaric, Alessandro, Kochenderfer, Mykel, and Brunskill, Emma
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: We study the exploration problem with approximate linear action-value functions in episodic reinforcement learning under the notion of low inherent Bellman error, a condition normally employed to show convergence of approximate value iteration. First we relate this condition to other common frameworks and show that it is strictly more general than the low rank (or linear) MDP assumption of prior work. Second we provide an algorithm with a high probability regret bound $\widetilde O(\sum_{t=1}^H d_t \sqrt{K} + \sum_{t=1}^H \sqrt{d_t} \mathcal{I} K)$ where $H$ is the horizon, $K$ is the number of episodes, $\mathcal{I}$ is the value of the inherent Bellman error and $d_t$ is the feature dimension at timestep $t$. In addition, we show that the result is unimprovable beyond constants and logs by showing a matching lower bound. This has two important consequences: 1) it shows that exploration is possible using only \emph{batch assumptions} with an algorithm that achieves the optimal statistical rate for the setting we consider, which is more general than prior work on low-rank MDPs; 2) the lack of closedness (measured by the inherent Bellman error) is only amplified by $\sqrt{d_t}$ despite working in the online setting. Finally, the algorithm reduces to the celebrated LinUCB when $H=1$ but with a different choice of the exploration parameter that allows handling misspecified contextual linear bandits. While computational tractability questions remain open for the MDP setting, this enriches the class of MDPs with a linear representation for the action-value function where statistically efficient reinforcement learning is possible.
Comment: Bug fixes in appendix; appears in ICML 2020
Published: 2020

17. Learning Adaptive Exploration Strategies in Dynamic Environments Through Informed Policy Regularization

Author: Kamienny, Pierre-Alexandre, Pirotta, Matteo, Lazaric, Alessandro, Lavril, Thibault, Usunier, Nicolas, and Denoyer, Ludovic
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, 68T99, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the problem of learning exploration-exploitation strategies that effectively adapt to dynamic environments, where the task may change over time. While RNN-based policies could in principle represent such strategies, in practice their training time is prohibitive and the learning process often converges to poor solutions. In this paper, we consider the case where the agent has access to a description of the task (e.g., a task id or task parameters) at training time, but not at test time. We propose a novel algorithm that regularizes the training of an RNN-based policy using informed policies trained to maximize the reward in each task. This dramatically reduces the sample complexity of training RNN-based policies, without losing their representational power. As a result, our method learns exploration strategies that efficiently balance between gathering information about the unknown and changing task and maximizing the reward over time. We test the performance of our algorithm in a variety of environments where tasks may vary within each episode.
Comment: 18 pages
Published: 2020
Full Text: View/download PDF

18. Concentration Inequalities for Multinoulli Random Variables

Author: Qian, Jian, Fruit, Ronan, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics::Applications, Statistics - Machine Learning, Computer Science::Sound, Statistics::Methodology, Machine Learning (stat.ML), Mathematics::Spectral Theory, Statistics::Computation, Machine Learning (cs.LG)
Abstract: We investigate concentration inequalities for Dirichlet and Multinomial random variables., Comment: Tutorial at ALT'19 on Regret Minimization in Infinite-Horizon Finite Markov Decision Processes
Published: 2020
Full Text: View/download PDF

19. Active Model Estimation in Markov Decision Processes

Author: Tarbouriech, Jean, Shekhar, Shubhanshu, Pirotta, Matteo, Ghavamzadeh, Mohammad, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample complexity analysis. While this algorithm enjoys strong guarantees in the large-sample regime, it tends to perform poorly in the early stages of exploration. To address this issue, we propose an algorithm that is based on maximum weighted entropy, a heuristic that stems from common sense and our theoretical analysis. The main idea here is to cover the entire state-action space with the weight proportional to the noise in the transitions. Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small-sample regime, while achieving asymptotic performance similar to that of the original algorithm.
Published: 2020
Full Text: View/download PDF

20. Improved Analysis of UCRL2 with Empirical Bernstein Inequality

Author: Fruit, Ronan, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider the problem of exploration-exploitation in communicating Markov Decision Processes. We provide an analysis of UCRL2 with Empirical Bernstein inequalities (UCRL2B). For any MDP with $S$ states, $A$ actions, $\Gamma \leq S$ next states and diameter $D$, the regret of UCRL2B is bounded as $\widetilde{O}(\sqrt{D\Gamma S A T})$., Comment: Document in support of the tutorial at ALT 2019
Published: 2020
Full Text: View/download PDF

21. A Structured Prediction Approach for Generalization in Cooperative Multi-Agent Reinforcement Learning

Author: Carion, Nicolas, Synnaeve, Gabriel, Lazaric, Alessandro, and Usunier, Nicolas
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Multiagent Systems, Machine Learning (stat.ML), Machine Learning (cs.LG), Multiagent Systems (cs.MA)
Abstract: Effective coordination is crucial to solve multi-agent collaborative (MAC) problems. While centralized reinforcement learning methods can optimally solve small MAC instances, they do not scale to large problems and they fail to generalize to scenarios different from those seen during training. In this paper, we consider MAC problems with some intrinsic notion of locality (e.g., geographic proximity) such that interactions between agents and tasks are locally limited. By leveraging this property, we introduce a novel structured prediction approach to assign agents to tasks. At each step, the assignment is obtained by solving a centralized optimization problem (the inference procedure) whose objective function is parameterized by a learned scoring model. We propose different combinations of inference procedures and scoring models able to represent coordination patterns of increasing complexity. The resulting assignment policy can be efficiently learned on small problem instances and readily reused in problems with more agents and tasks (i.e., zero-shot generalization). We report experimental results on a toy search and rescue problem and on several target selection scenarios in StarCraft: Brood War, in which our model significantly outperforms strong rule-based baselines on instances with 5 times more agents and tasks than those seen during training.
Published: 2019
Full Text: View/download PDF

22. Frequentist Regret Bounds for Randomized Least-Squares Value Iteration

Author: Zanette, Andrea, Brandfonbrener, David, Brunskill, Emma, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider the exploration-exploitation dilemma in finite-horizon reinforcement learning (RL). When the state space is large or continuous, traditional tabular approaches are infeasible and some form of function approximation is mandatory. In this paper, we introduce an optimistically-initialized variant of the popular randomized least-squares value iteration (RLSVI), a model-free algorithm where exploration is induced by perturbing the least-squares approximation of the action-value function. Under the assumption that the Markov decision process has low-rank transition dynamics, we prove that the frequentist regret of RLSVI is upper-bounded by $\widetilde O(d^2 H^2 \sqrt{T})$ where $d$ is the feature dimension, $H$ is the horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first frequentist regret analysis for randomized exploration with function approximation., Comment: AISTATS 2020; minor bug fix
Published: 2019
Full Text: View/download PDF

23. Gaussian Process Optimization with Adaptive Sketching: Scalable and No Regret

Author: Calandriello, Daniele, Carratino, Luigi, Lazaric, Alessandro, Valko, Michal, Rosasco, Lorenzo, Istituto Italiano di Tecnologia (IIT), Dipartimento di Informatica, Bioingegneria, Robotica e Ingegneria dei Sistemi [Genova] (DIBRIS), Università degli studi di Genova = University of Genoa (UniGe), Facebook, Sequential Learning (SEQUEL), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), and Universita degli studi di Genova
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, sparse Gaussian process optimization, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Statistics - Machine Learning, kernelized linear bandits, variance starvation, sketching, regret, Machine Learning (stat.ML), black-box optimization, Bayesian optimization, Machine Learning (cs.LG)
Abstract: Gaussian processes (GP) are a well-studied Bayesian approach for the optimization of black-box functions. Despite their effectiveness in simple problems, GP-based algorithms hardly scale to high-dimensional functions, as their per-iteration time and space cost is at least quadratic in the number of dimensions $d$ and iterations $t$. Given a set of $A$ alternatives to choose from, the overall runtime $O(t^3A)$ is prohibitive. In this paper we introduce BKB (budgeted kernelized bandit), a new approximate GP algorithm for optimization under bandit feedback that achieves near-optimal regret (and hence near-optimal convergence rate) with near-constant per-iteration complexity and remarkably no assumption on the input space or covariance of the GP. We combine a kernelized linear bandit algorithm (GP-UCB) with randomized matrix sketching based on leverage score sampling, and we prove that randomly sampling inducing points based on their posterior variance gives an accurate low-rank approximation of the GP, preserving variance estimates and confidence intervals. As a consequence, BKB does not suffer from variance starvation, an important problem faced by many previous sparse GP approximations. Moreover, we show that our procedure selects at most $\tilde{O}(d_{eff})$ points, where $d_{eff}$ is the effective dimension of the explored space, which is typically much smaller than both $d$ and $t$. This greatly reduces the dimensionality of the problem, thus leading to an $O(TAd_{eff}^2)$ runtime and $O(A d_{eff})$ space complexity., Comment: Accepted at COLT 2019. Corrected typos and improved comparison with existing methods
Published: 2019
Full Text: View/download PDF

24. Exploration Bonus for Regret Minimization in Undiscounted Discrete and Continuous Markov Decision Processes

Author: Qian, Jian, Fruit, Ronan, Pirotta, Matteo, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We introduce and analyse two algorithms for exploration-exploitation in discrete and continuous Markov Decision Processes (MDPs) based on exploration bonuses. SCAL$^+$ is a variant of SCAL (Fruit et al., 2018) that performs efficient exploration-exploitation in any unknown weakly-communicating MDP for which an upper bound $C$ on the span of the optimal bias function is known. For an MDP with $S$ states, $A$ actions and $\Gamma \leq S$ possible next states, we prove that SCAL$^+$ achieves the same theoretical guarantees as SCAL (i.e., a high-probability regret bound of $\widetilde{O}(C\sqrt{\Gamma SAT})$), with a much smaller computational complexity. Similarly, C-SCAL$^+$ exploits an exploration bonus to achieve sublinear regret in any undiscounted MDP with continuous state space. We show that C-SCAL$^+$ achieves the same regret bound as UCCRL (Ortner and Ryabko, 2012) while being the first implementable algorithm with regret guarantees in this setting. While optimistic algorithms such as UCRL, SCAL or UCCRL maintain a high-confidence set of plausible MDPs around the true unknown MDP, SCAL$^+$ and C-SCAL$^+$ leverage an exploration bonus to plan directly on the empirically estimated MDP, thus being more computationally efficient.
Published: 2018
Full Text: View/download PDF

25. Thompson Sampling for Linear-Quadratic Control Problems

Author: Abeille, Marc, Lazaric, Alessandro, Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Sequential Learning (SEQUEL), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), and ANR-14-CE24-0010,ExTra-Learn,Extraction et transfert de connaissances dans l'apprentissage par renforcement(2014)
Subjects: FOS: Computer and information sciences, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Statistics - Machine Learning, Machine Learning (stat.ML)
Abstract: We consider the exploration-exploitation tradeoff in linear quadratic (LQ) control problems, where the state dynamics are linear and the cost function is quadratic in states and controls. We analyze the regret of Thompson sampling (TS) (a.k.a. posterior sampling for reinforcement learning) in the frequentist setting, i.e., when the parameters characterizing the LQ dynamics are fixed. Despite its empirical and theoretical success in a wide range of problems from multi-armed bandit to linear bandit, we show that when studying the frequentist regret of TS in control problems, we need to trade off the frequency of sampling optimistic parameters and the frequency of switches in the control policy. This results in an overall regret of $O(T^{2/3})$, which is significantly worse than the regret $O(\sqrt{T})$ achieved by the optimism-in-face-of-uncertainty algorithm in LQ control problems.
Published: 2017
Full Text: View/download PDF

26. Reinforcement Learning of POMDPs using Spectral Methods

Author: Azizzadenesheli, Kamyar, Lazaric, Alessandro, Anandkumar, Animashree, University of California [Irvine] (UC Irvine), University of California (UC), Sequential Learning (SEQUEL), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), ANR-14-CE24-0010,ExTra-Learn,Extraction et transfert de connaissances dans l'apprentissage par renforcement(2014), University of California [Irvine] (UCI), and University of California
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, math.OC, Computer Science - Artificial Intelligence, cs.LG, Computer Science - Numerical Analysis, Machine Learning (stat.ML), Numerical Analysis (math.NA), cs.AI, Latent Variable Model, stat.ML, Machine Learning (cs.LG), Computer Science - Learning, Artificial Intelligence (cs.AI), [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Spectral Methods, Upper Confidence Reinforcement Learning, Mathematics - Optimization and Control, cs.NA, Method of Moments, Partially Observable Markov Decision Process
Abstract: We propose a new reinforcement learning algorithm for partially observable Markov decision processes (POMDPs) based on spectral decomposition methods. While spectral methods have been previously employed for consistent learning of (passive) latent variable models such as hidden Markov models, POMDPs are more challenging since the learner interacts with the environment and possibly changes the future observations in the process. We devise a learning algorithm running through episodes; in each episode we employ spectral techniques to learn the POMDP parameters from a trajectory generated by a fixed policy. At the end of the episode, an optimization oracle returns the optimal memoryless planning policy, which maximizes the expected reward based on the estimated POMDP model. We prove an order-optimal regret bound w.r.t. the optimal memoryless policy and efficient scaling with respect to the dimensionality of observation and action spaces.
Published: 2016

27. Analysis of Kelner and Levin graph sparsification algorithm for a streaming setting

Author: Calandriello, Daniele, Lazaric, Alessandro, and Valko, Michal
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Machine Learning (stat.ML), Data Structures and Algorithms (cs.DS), Machine Learning (cs.LG)
Abstract: We derive a new proof to show that the incremental resparsification algorithm proposed by Kelner and Levin (2013) produces a spectral sparsifier in high probability. We rigorously take into account the dependencies across subsequent resparsifications using martingale inequalities, fixing a flaw in the original analysis.
Published: 2016
Full Text: View/download PDF
