Author: "Ghavamzadeh, Mohammad" / Topic: machine learning (stat.ml) - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Ghavamzadeh, Mohammad"' showing total 35 results

Start Over Author "Ghavamzadeh, Mohammad" Topic machine learning (stat.ml)

35 results on '"Ghavamzadeh, Mohammad"'

1. A Convex Relaxation Approach to Bayesian Regret Minimization in Offline Bandits

Author: Ghavamzadeh, Mohammad, Petrik, Marek, and Tennenholtz, Guy
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Algorithms for offline bandits must optimize decisions in uncertain environments using only offline data. A compelling and increasingly popular objective in offline bandits is to learn a policy which achieves low Bayesian regret with high confidence. An appealing approach to this problem, inspired by recent offline reinforcement learning results, is to maximize a form of lower confidence bound (LCB). This paper proposes a new approach that directly minimizes upper bounds on Bayesian regret using efficient conic optimization solvers. Our bounds build on connections among Bayesian regret, Value-at-Risk (VaR), and chance-constrained optimization. Compared to prior work, our algorithm attains superior theoretical offline regret bounds and better results in numerical simulations. Finally, we provide some evidence that popular LCB-style algorithms may be unsuitable for minimizing Bayesian regret in offline bandits.
Published: 2023

2. Robust Reinforcement Learning using Offline Data

Author: Panaganti, Kishan, Xu, Zaiyan, Kalathil, Dileep, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The goal of robust reinforcement learning (RL) is to learn a policy that is robust against the uncertainty in model parameters. Parameter uncertainty commonly occurs in many real-world RL applications due to simulator modeling errors, changes in the real-world system dynamics over time, and adversarial disturbances. Robust RL is typically formulated as a max-min problem, where the objective is to learn the policy that maximizes the value against the worst possible models that lie in an uncertainty set. In this work, we propose a robust RL algorithm called Robust Fitted Q-Iteration (RFQI), which uses only an offline dataset to learn the optimal robust policy. Robust RL with offline data is significantly more challenging than its non-robust counterpart because of the minimization over all models present in the robust Bellman operator. This poses challenges in offline data collection, optimization over the models, and unbiased estimation. In this work, we propose a systematic approach to overcome these challenges, resulting in our RFQI algorithm. We prove that RFQI learns a near-optimal robust policy under standard assumptions and demonstrate its superior performance on standard benchmark problems., Appeared in Neural Information Processing Systems (NeurIPS) 2022
Published: 2022

3. Meta-Learning for Simple Regret Minimization

Author: Azizi, Mohammadjavad, Kveton, Branislav, Ghavamzadeh, Mohammad, and Katariya, Sumeet
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Statistics::Machine Learning, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, Machine Learning (stat.ML), ComputingMethodologies_ARTIFICIALINTELLIGENCE, Machine Learning (cs.LG)
Abstract: We develop a meta-learning framework for simple regret minimization in bandits. In this framework, a learning agent interacts with a sequence of bandit tasks, which are sampled i.i.d.\ from an unknown prior distribution, and learns its meta-parameters to perform better on future tasks. We propose the first Bayesian and frequentist meta-learning algorithms for this setting. The Bayesian algorithm has access to a prior distribution over the meta-parameters and its meta simple regret over $m$ bandit tasks with horizon $n$ is mere $\tilde{O}(m / \sqrt{n})$. On the other hand, the meta simple regret of the frequentist algorithm is $\tilde{O}(\sqrt{m} n + m/ \sqrt{n})$. While its regret is worse, the frequentist algorithm is more general because it does not need a prior distribution over the meta-parameters. It can also be analyzed in more settings. We instantiate our algorithms for several classes of bandit problems. Our algorithms are general and we complement our theory by evaluating them empirically in several environments.
Published: 2022

4. Deep Hierarchy in Bandits

Author: Hong, Joey, Kveton, Branislav, Katariya, Sumeet, Zaheer, Manzil, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Mean rewards of actions are often correlated. The form of these correlations may be complex and unknown a priori, such as the preferences of a user for recommended products and their categories. To maximize statistical efficiency, it is important to leverage these correlations when learning. We formulate a bandit variant of this problem where the correlations of mean action rewards are represented by a hierarchical Bayesian model with latent variables. Since the hierarchy can have multiple layers, we call it deep. We propose a hierarchical Thompson sampling algorithm (HierTS) for this problem, and show how to implement it efficiently for Gaussian hierarchies. The efficient implementation is possible due to a novel exact hierarchical representation of the posterior, which itself is of independent interest. We use this exact posterior to analyze the Bayes regret of HierTS in Gaussian bandits. Our analysis reflects the structure of the problem, that the regret decreases with the prior width, and also shows that hierarchies reduce the regret by non-constant factors in the number of actions. We confirm these theoretical findings empirically, in both synthetic and real-world experiments.
Published: 2022

5. Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

Author: Azizi, MohammadJavad, Duong, Thang, Abbasi-Yadkori, Yasin, György, András, Vernade, Claire, and Ghavamzadeh, Mohammad
Subjects: Computer Science::Machine Learning, FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics::Machine Learning, Statistics - Machine Learning, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, Machine Learning (stat.ML), ComputingMethodologies_ARTIFICIALINTELLIGENCE, Machine Learning (cs.LG)
Abstract: We study a sequential decision problem where the learner faces a sequence of $K$-armed bandit tasks. The task boundaries might be known (the bandit meta-learning setting), or unknown (the non-stationary bandit setting). For a given integer $M\le K$, the learner aims to compete with the best subset of arms of size $M$. We design an algorithm based on a reduction to bandit submodular maximization, and show that, for $T$ rounds comprised of $N$ tasks, in the regime of large number of tasks and small number of optimal arms $M$, its regret in both settings is smaller than the simple baseline of $\tilde{O}(\sqrt{KNT})$ that can be obtained by using standard algorithms designed for non-stationary bandit problems. For the bandit meta-learning problem with fixed task length $\tau$, we show that the regret of the algorithm is bounded as $\tilde{O}(NM\sqrt{M \tau}+N^{2/3}M\tau)$. Under additional assumptions on the identifiability of the optimal arms in each task, we show a bandit meta-learning algorithm with an improved $\tilde{O}(N\sqrt{M \tau}+N^{1/2}\sqrt{M K \tau})$ regret.
Published: 2022
Full Text: View/download PDF

6. Operator Splitting Value Iteration

Author: Rakhsha, Amin, Wang, Andrew, Ghavamzadeh, Mohammad, and Farahmand, Amir-massoud
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Mathematics, Machine Learning (stat.ML), Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We introduce new planning and reinforcement learning algorithms for discounted MDPs that utilize an approximate model of the environment to accelerate the convergence of the value function. Inspired by the splitting approach in numerical linear algebra, we introduce Operator Splitting Value Iteration (OS-VI) for both Policy Evaluation and Control problems. OS-VI achieves a much faster convergence rate when the model is accurate enough. We also introduce a sample-based version of the algorithm called OS-Dyna. Unlike the traditional Dyna architecture, OS-Dyna still converges to the correct value function in presence of model approximation error., Comment: Accepted to NeurIPS2022
Published: 2022
Full Text: View/download PDF

7. Soft-Robust Algorithms for Batch Reinforcement Learning

Author: Lobo, Elita A., Ghavamzadeh, Mohammad, and Petrik, Marek
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In reinforcement learning, robust policies for high-stakes decision-making problems with limited data are usually computed by optimizing the percentile criterion, which minimizes the probability of a catastrophic failure. Unfortunately, such policies are typically overly conservative as the percentile criterion is non-convex, difficult to optimize, and ignores the mean performance. To overcome these shortcomings, we study the soft-robust criterion, which uses risk measures to balance the mean and percentile criterion better. In this paper, we establish the soft-robust criterion's fundamental properties, show that it is NP-hard to optimize, and propose and analyze two algorithms to approximately optimize it. Our theoretical analyses and empirical evaluations demonstrate that our algorithms compute much less conservative solutions than the existing approximate methods for optimizing the percentile-criterion.
Published: 2020

8. Control-Aware Representations for Model-based Reinforcement Learning

Author: Cui, Brandon, Chow, Yinlam, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: A major challenge in modern reinforcement learning (RL) is efficient control of dynamical systems from high-dimensional sensory observations. Learning controllable embedding (LCE) is a promising approach that addresses this challenge by embedding the observations into a lower-dimensional latent space, estimating the latent dynamics, and utilizing it to perform control in the latent space. Two important questions in this area are how to learn a representation that is amenable to the control problem at hand, and how to achieve an end-to-end framework for representation learning and control. In this paper, we take a few steps towards addressing these questions. We first formulate a LCE model to learn representations that are suitable to be used by a policy iteration style algorithm in the latent space. We call this model control-aware representation learning (CARL). We derive a loss function for CARL that has close connection to the prediction, consistency, and curvature (PCC) principle for representation learning. We derive three implementations of CARL. In the offline implementation, we replace the locally-linear control algorithm (e.g.,~iLQR) used by the existing LCE methods with a RL algorithm, namely model-based soft actor-critic, and show that it results in significant improvement. In online CARL, we interleave representation learning and control, and demonstrate further gain in performance. Finally, we propose value-guided CARL, a variation in which we optimize a weighted version of the CARL loss function, where the weights depend on the TD-error of the current policy. We evaluate the proposed algorithms by extensive experiments on benchmark tasks and compare them with several LCE baselines.
Published: 2020

9. Stochastic Bandits with Linear Constraints

Author: Pacchiano, Aldo, Ghavamzadeh, Mohammad, Bartlett, Peter, and Jiang, Heinrich
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Statistics::Machine Learning, Computer Science - Machine Learning, Statistics - Machine Learning, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, Machine Learning (stat.ML), ComputingMethodologies_ARTIFICIALINTELLIGENCE, Machine Learning (cs.LG)
Abstract: We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of $T$ rounds is maximum, and each has an expected cost below a certain threshold $\tau$. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove an $\widetilde{\mathcal{O}}(\frac{d\sqrt{T}}{\tau-c_0})$ bound on its $T$-round regret, where the denominator is the difference between the constraint threshold and the cost of a known feasible action. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting. We prove a regret bound of $\widetilde{\mathcal{O}}(\frac{\sqrt{KT}}{\tau - c_0})$ for this algorithm in $K$-armed bandits, which is a $\sqrt{K}$ improvement over the regret bound we obtain by simply casting multi-armed bandits as an instance of contextual linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results., Comment: 9 pages
Published: 2020

10. Finite-Sample Analysis of Proximal Gradient TD Algorithms

Author: Liu, Bo, Liu, Ji, Ghavamzadeh, Mohammad, Mahadevan, Sridhar, and Petrik, Marek
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In this paper, we analyze the convergence rate of the gradient temporal difference learning (GTD) family of algorithms. Previous analyses of this class of algorithms use ODE techniques to prove asymptotic convergence, and to the best of our knowledge, no finite-sample analysis has been done. Moreover, there has been not much work on finite-sample analysis for convergent off-policy reinforcement learning algorithms. In this paper, we formulate GTD methods as stochastic gradient algorithms w.r.t.~a primal-dual saddle-point objective function, and then conduct a saddle-point error analysis to obtain finite-sample bounds on their performance. Two revised algorithms are also proposed, namely projected GTD2 and GTD2-MP, which offer improved convergence guarantees and acceleration, respectively. The results of our theoretical analysis show that the GTD family of algorithms are indeed comparable to the existing LSTD methods in off-policy learning scenarios., 31st Conference on Uncertainty in Artificial Intelligence (UAI). arXiv admin note: substantial text overlap with arXiv:2006.03976
Published: 2020

11. Neural Lyapunov Redesign

Author: Mehrjou, Arash, Ghavamzadeh, Mohammad, and Schölkopf, Bernhard
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Machine Learning (stat.ML), Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Machine Learning (cs.LG)
Abstract: Learning controllers merely based on a performance metric has been proven effective in many physical and non-physical tasks in both control theory and reinforcement learning. However, in practice, the controller must guarantee some notion of safety to ensure that it does not harm either the agent or the environment. Stability is a crucial notion of safety, whose violation can certainly cause unsafe behaviors. Lyapunov functions are effective tools to assess stability in nonlinear dynamical systems. In this paper, we combine an improving Lyapunov function with automatic controller synthesis in an iterative fashion to obtain control policies with large safe regions. We propose a two-player collaborative algorithm that alternates between estimating a Lyapunov function and deriving a controller that gradually enlarges the stability region of the closed-loop system. We provide theoretical results on the class of systems that can be treated with the proposed algorithm and empirically evaluate the effectiveness of our method using an exemplary dynamical system., 27 pages
Published: 2020

12. Mirror Descent Policy Optimization

Author: Tomar, Manan, Shani, Lior, Efroni, Yonathan, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Mirror descent (MD), a well-known first-order method in constrained convex optimization, has recently been shown as an important tool to analyze trust-region algorithms in reinforcement learning (RL). However, there remains a considerable gap between such theoretically analyzed algorithms and the ones used in practice. Inspired by this, we propose an efficient RL algorithm, called {\em mirror descent policy optimization} (MDPO). MDPO iteratively updates the policy by {\em approximately} solving a trust-region problem, whose objective function consists of two terms: a linearization of the standard RL objective and a proximity term that restricts two consecutive policies to be close to each other. Each update performs this approximation by taking multiple gradient steps on this objective function. We derive {\em on-policy} and {\em off-policy} variants of MDPO, while emphasizing important design choices motivated by the existing theory of MD in RL. We highlight the connections between on-policy MDPO and two popular trust-region RL algorithms: TRPO and PPO, and show that explicitly enforcing the trust-region constraint is in fact {\em not} a necessity for high performance gains in TRPO. We then show how the popular soft actor-critic (SAC) algorithm can be derived by slight modifications of off-policy MDPO. Overall, MDPO is derived from the MD principles, offers a unified approach to viewing a number of popular RL algorithms, and performs better than or on-par with TRPO, PPO, and SAC in a number of continuous control tasks. Code is available at \url{https://github.com/manantomar/Mirror-Descent-Policy-Optimization}.
Published: 2020

13. Active Model Estimation in Markov Decision Processes

Author: Tarbouriech, Jean, Shekhar, Shubhanshu, Pirotta, Matteo, Ghavamzadeh, Mohammad, and Lazaric, Alessandro
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the problem of efficient exploration in order to learn an accurate model of an environment, modeled as a Markov decision process (MDP). Efficient exploration in this problem requires the agent to identify the regions in which estimating the model is more difficult and then exploit this knowledge to collect more samples there. In this paper, we formalize this problem, introduce the first algorithm to learn an $\epsilon$-accurate estimate of the dynamics, and provide its sample complexity analysis. While this algorithm enjoys strong guarantees in the large-sample regime, it tends to have a poor performance in early stages of exploration. To address this issue, we propose an algorithm that is based on maximum weighted entropy, a heuristic that stems from common sense and our theoretical analysis. The main idea here is to cover the entire state-action space with the weight proportional to the noise in the transitions. Using a number of simple domains with heterogeneous noise in their transitions, we show that our heuristic-based algorithm outperforms both our original algorithm and the maximum entropy algorithm in the small sample regime, while achieving similar asymptotic performance as that of the original algorithm.
Published: 2020
Full Text: View/download PDF

14. Predictive Coding for Locally-Linear Control

Author: Shu, Rui, Nguyen, Tung, Chow, Yinlam, Pham, Tuan, Than, Khoat, Ghavamzadeh, Mohammad, Ermon, Stefano, and Bui, Hung H.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Electrical engineering, electronic engineering, information engineering, Machine Learning (stat.ML), Systems and Control (eess.SY), Electrical Engineering and Systems Science - Systems and Control, Machine Learning (cs.LG)
Abstract: High-dimensional observations and unknown dynamics are major challenges when applying optimal control to many real-world decision making tasks. The Learning Controllable Embedding (LCE) framework addresses these challenges by embedding the observations into a lower dimensional latent space, estimating the latent dynamics, and then performing control directly in the latent space. To ensure the learned latent dynamics are predictive of next-observations, all existing LCE approaches decode back into the observation space and explicitly perform next-observation prediction---a challenging high-dimensional task that furthermore introduces a large number of nuisance parameters (i.e., the decoder) which are discarded during control. In this paper, we propose a novel information-theoretic LCE approach and show theoretically that explicit next-observation prediction can be replaced with predictive coding. We then use predictive coding to develop a decoder-free LCE model whose latent dynamics are amenable to locally-linear control. Extensive experiments on benchmark tasks show that our model reliably learns a controllable latent space that leads to superior performance when compared with state-of-the-art LCE baselines.
Published: 2020
Full Text: View/download PDF

15. Online Planning with Lookahead Policies

Author: Efroni, Yonathan, Ghavamzadeh, Mohammad, and Mannor, Shie
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Real Time Dynamic Programming (RTDP) is an online algorithm based on Dynamic Programming (DP) that acts by 1-step greedy planning. Unlike DP, RTDP does not require access to the entire state space, i.e., it explicitly handles the exploration. This fact makes RTDP particularly appealing when the state space is large and it is not possible to update all states simultaneously. In this we devise a multi-step greedy RTDP algorithm, which we call $h$-RTDP, that replaces the 1-step greedy policy with a $h$-step lookahead policy. We analyze $h$-RTDP in its exact form and establish that increasing the lookahead horizon, $h$, results in an improved sample complexity, with the cost of additional computations. This is the first work that proves improved sample complexity as a result of {\em increasing} the lookahead horizon in online planning. We then analyze the performance of $h$-RTDP in three approximate settings: approximate model, approximate value updates, and approximate state representation. For these cases, we prove that the asymptotic performance of $h$-RTDP remains the same as that of a corresponding approximate DP algorithm, the best one can hope for without further assumptions on the approximation errors., NeurIPS 2020
Published: 2019

16. Lyapunov-based Safe Policy Optimization for Continuous Control

Author: Chow, Yinlam, Nachum, Ofir, Faust, Aleksandra, Duenez-Guzman, Edgar, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study continuous action reinforcement learning problems in which it is crucial that the agent interacts with the environment only through safe policies, i.e.,~policies that do not take the agent to undesirable situations. We formulate these problems as constrained Markov decision processes (CMDPs) and present safe policy optimization algorithms that are based on a Lyapunov approach to solve them. Our algorithms can use any standard policy gradient (PG) method, such as deep deterministic policy gradient (DDPG) or proximal policy optimization (PPO), to train a neural network policy, while guaranteeing near-constraint satisfaction for every policy update by projecting either the policy parameter or the action onto the set of feasible solutions induced by the state-dependent linearized Lyapunov constraints. Compared to the existing constrained PG algorithms, ours are more data efficient as they are able to utilize both on-policy and off-policy data. Moreover, our action-projection algorithm often leads to less conservative policy updates and allows for natural integration into an end-to-end PG training pipeline. We evaluate our algorithms and compare them with the state-of-the-art baselines on several simulated (MuJoCo) tasks, as well as a real-world indoor robot navigation problem, demonstrating their effectiveness in terms of balancing performance and constraint satisfaction. Videos of the experiments can be found in the following link: https://drive.google.com/file/d/1pzuzFqWIE710bE2U6DmS59AfRzqK2Kek/view?usp=sharing.
Published: 2019
Full Text: View/download PDF

17. Randomized Exploration in Generalized Linear Bandits

Author: Kveton, Branislav, Zaheer, Manzil, Szepesvari, Csaba, Li, Lihong, Ghavamzadeh, Mohammad, and Boutilier, Craig
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study two randomized algorithms for generalized linear bandits. The first, GLM-TSL, samples a generalized linear model (GLM) from the Laplace approximation to the posterior distribution. The second, GLM-FPL, fits a GLM to a randomly perturbed history of past rewards. We analyze both algorithms and derive $\tilde{O}(d \sqrt{n \log K})$ upper bounds on their $n$-round regret, where $d$ is the number of features and $K$ is the number of arms. The former improves on prior work while the latter is the first for Gaussian noise perturbations in non-linear models. We empirically evaluate both GLM-TSL and GLM-FPL in logistic bandits, and apply GLM-FPL to neural network bandits. Our work showcases the role of randomization, beyond posterior sampling, in exploration., Comment: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistic
Published: 2019
Full Text: View/download PDF

18. Benchmarking Batch Deep Reinforcement Learning Algorithms

Author: Fujimoto, Scott, Conti, Edoardo, Ghavamzadeh, Mohammad, and Pineau, Joelle
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Widely-used deep reinforcement learning algorithms have been shown to fail in the batch setting--learning from a fixed data set without interaction with the environment. Following this result, there have been several papers showing reasonable performances under a variety of environments and batch settings. In this paper, we benchmark the performance of recent off-policy and batch reinforcement learning algorithms under unified settings on the Atari domain, with data generated by a single partially-trained behavioral policy. We find that under these conditions, many of these algorithms underperform DQN trained online with the same amount of data, as well as the partially-trained behavioral policy. To introduce a strong baseline, we adapt the Batch-Constrained Q-learning algorithm to a discrete-action setting, and show it outperforms all existing algorithms at this task., Comment: Deep RL Workshop NeurIPS 2019
Published: 2019
Full Text: View/download PDF

19. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

Author: Kveton, Branislav, Szepesvari, Csaba, Vaswani, Sharan, Wen, Zheng, Ghavamzadeh, Mohammad, and Lattimore, Tor
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science::Computer Science and Game Theory, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), ComputingMilieux_MISCELLANEOUS, Machine Learning (cs.LG)
Abstract: We propose a bandit algorithm that explores by randomizing its history of rewards. Specifically, it pulls the arm with the highest mean reward in a non-parametric bootstrap sample of its history with pseudo rewards. We design the pseudo rewards such that the bootstrap mean is optimistic with a sufficiently high probability. We call our algorithm Giro, which stands for garbage in, reward out. We analyze Giro in a Bernoulli bandit and derive a $O(K \Delta^{-1} \log n)$ bound on its $n$-round regret, where $\Delta$ is the difference in the expected rewards of the optimal and the best suboptimal arms, and $K$ is the number of arms. The main advantage of our exploration design is that it easily generalizes to structured problems. To show this, we propose contextual Giro with an arbitrary reward generalization model. We evaluate Giro and its contextual variant on multiple synthetic and real-world problems, and observe that it performs well., Comment: Proceedings of the 36th International Conference on Machine Learning
Published: 2018

20. Risk-Sensitive Generative Adversarial Imitation Learning

Author: Lacotte, Jonathan, Ghavamzadeh, Mohammad, Chow, Yinlam, and Pavone, Marco
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study risk-sensitive imitation learning where the agent's goal is to perform at least as well as the expert in terms of a risk profile. We first formulate our risk-sensitive imitation learning setting. We consider the generative adversarial approach to imitation learning (GAIL) and derive an optimization problem for our formulation, which we call it risk-sensitive GAIL (RS-GAIL). We then derive two different versions of our RS-GAIL optimization problem that aim at matching the risk profiles of the agent and the expert w.r.t. Jensen-Shannon (JS) divergence and Wasserstein distance, and develop risk-sensitive generative adversarial imitation learning algorithms based on these optimization problems. We evaluate the performance of our algorithms and compare them with GAIL and the risk-averse imitation learning (RAIL) algorithms in two MuJoCo and two OpenAI classical control tasks.
Published: 2018

21. A Lyapunov-based Approach to Safe Reinforcement Learning

Author: Chow, Yinlam, Nachum, Ofir, Duenez-Guzman, Edgar, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In many real-world reinforcement learning (RL) problems, besides optimizing the main objective function, an agent must concurrently avoid violating a number of constraints. In particular, besides optimizing performance it is crucial to guarantee the safety of an agent during training as well as deployment (e.g. a robot should avoid taking actions - exploratory or not - which irrevocably harm its hardware). To incorporate safety in RL, we derive algorithms under the framework of constrained Markov decision problems (CMDPs), an extension of the standard Markov decision problems (MDPs) augmented with constraints on expected cumulative costs. Our approach hinges on a novel \emph{Lyapunov} method. We define and present a method for constructing Lyapunov functions, which provide an effective way to guarantee the global safety of a behavior policy during training via a set of local, linear constraints. Leveraging these theoretical underpinnings, we show how to use the Lyapunov approach to systematically transform dynamic programming (DP) and RL algorithms into their safe counterparts. To illustrate their effectiveness, we evaluate these algorithms in several CMDP planning and decision-making tasks on a safety benchmark domain. Our results show that our proposed method significantly outperforms existing baselines in balancing constraint satisfaction and performance.
Published: 2018

22. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

Author: Liu, Bo, Xie, Tengyang, Xu, Yangyang, Ghavamzadeh, Mohammad, Chow, Yinlam, Lyu, Daoming, and Yoon, Daesub
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Risk management in dynamic decision problems is a primary concern in many fields, including financial investment, autonomous driving, and healthcare. The mean-variance function is one of the most widely used objective functions in risk management due to its simplicity and interpretability. Existing algorithms for mean-variance optimization are based on multi-time-scale stochastic approximation, whose learning rate schedules are often hard to tune, and have only asymptotic convergence proof. In this paper, we develop a model-free policy search framework for mean-variance optimization with finite-sample error bound analysis (to local optima). Our starting point is a reformulation of the original mean-variance function with its Fenchel dual, from which we propose a stochastic block coordinate ascent policy search algorithm. Both the asymptotic convergence guarantee of the last iteration's solution and the convergence rate of the randomly picked solution are provided, and their applicability is demonstrated on several benchmark domains., Comment: Accepted by NIPS 2018
Published: 2018
Full Text: View/download PDF

23. Path Consistency Learning in Tsallis Entropy Regularized MDPs

Author: Nachum, Ofir, Chow, Yinlam, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the sparse entropy-regularized reinforcement learning (ERL) problem in which the entropy term is a special form of the Tsallis entropy. The optimal policy of this formulation is sparse, i.e.,~at each state, it has non-zero probability for only a small number of actions. This addresses the main drawback of the standard Shannon entropy-regularized RL (soft ERL) formulation, in which the optimal policy is softmax, and thus, may assign a non-negligible probability mass to non-optimal actions. This problem is aggravated as the number of actions is increased. In this paper, we follow the work of Nachum et al. (2017) in the soft ERL setting, and propose a class of novel path consistency learning (PCL) algorithms, called {\em sparse PCL}, for the sparse ERL problem that can work with both on-policy and off-policy data. We first derive a {\em sparse consistency} equation that specifies a relationship between the optimal value function and policy of the sparse ERL along any system trajectory. Crucially, a weak form of the converse is also true, and we quantify the sub-optimality of a policy which satisfies sparse consistency, and show that as we increase the number of actions, this sub-optimality is better than that of the soft ERL optimal policy. We then use this result to derive the sparse PCL algorithms. We empirically compare sparse PCL with its soft counterpart, and show its advantage, especially in problems with a large number of actions.
Published: 2018
Full Text: View/download PDF

24. Optimizing over a Restricted Policy Class in Markov Decision Processes

Author: Banijamali, Ershad, Abbasi-Yadkori, Yasin, Ghavamzadeh, Mohammad, and Vlassis, Nikos
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We address the problem of finding an optimal policy in a Markov decision process under a restricted policy class defined by the convex hull of a set of base policies. This problem is of great interest in applications in which a number of reasonably good (or safe) policies are already known and we are only interested in optimizing in their convex hull. We show that this problem is NP-hard to solve exactly as well as to approximate to arbitrary accuracy. However, under a condition that is akin to the occupancy measures of the base policies having large overlap, we show that there exists an efficient algorithm that finds a policy that is almost as good as the best convex combination of the base policies. The running time of the proposed algorithm is linear in the number of states and polynomial in the number of base policies. In practice, we demonstrate an efficient implementation for large state problems. Compared to traditional policy gradient methods, the proposed approach has the advantage that, apart from the computation of occupancy measures of some base policies, the iterative method need not interact with the environment during the optimization process. This is especially important in complex systems where estimating the value of a policy can be a time consuming process., Comment: 14 pages
Published: 2018
Full Text: View/download PDF

25. Active Learning for Accurate Estimation of Linear Models

Author: Riquelme, Carlos, Ghavamzadeh, Mohammad, Lazaric, Alessandro, Stanford University, Sequential Learning (SEQUEL), Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre de Recherche en Informatique, Signal et Automatique de Lille - UMR 9189 (CRIStAL), Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Université de Lille-Centre National de la Recherche Scientifique (CNRS), Adobe Systems Inc. (Adobe Advanced Technology Labs), Adobe Systems Inc., and ANR-14-CE24-0010,ExTra-Learn,Extraction et transfert de connaissances dans l'apprentissage par renforcement(2014)
Subjects: FOS: Computer and information sciences, Computer Science - Learning, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We explore the sequential decision making problem where the goal is to estimate uniformly well a number of linear models, given a shared budget of random contexts independently sampled from a known distribution. The decision maker must query one of the linear models for each incoming context, and receives an observation corrupted by noise levels that are unknown, and depend on the model instance. We present Trace-UCB, an adaptive allocation algorithm that learns the noise levels while balancing contexts accordingly across the different linear functions, and derive guarantees for simple regret in both expectation and high-probability. Finally, we extend the algorithm and its guarantees to high dimensional settings, where the number of linear models times the dimension of the contextual space is higher than the total budget of samples. Simulations with real data suggest that Trace-UCB is remarkably robust, outperforming a number of baselines even when its assumptions are violated., 37 pages, 8 figures, International Conference on Machine Learning, ICML 2017
Published: 2017

26. Online Learning to Rank in Stochastic Click Models

Author: Zoghi, Masrour, Tunys, Tomas, Ghavamzadeh, Mohammad, Kveton, Branislav, Szepesvari, Csaba, and Wen, Zheng
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Online learning to rank is a core problem in information retrieval and machine learning. Many provably efficient algorithms have been recently proposed for this problem in specific click models. The click model is a model of how the user interacts with a list of documents. Though these results are significant, their impact on practice is limited, because all proposed algorithms are designed for specific click models and lack convergence guarantees in other models. In this work, we propose BatchRank, the first online learning to rank algorithm for a broad class of click models. The class encompasses two most fundamental click models, the cascade and position-based models. We derive a gap-dependent upper bound on the $T$-step regret of BatchRank and evaluate it on a range of web search queries. We observe that BatchRank outperforms ranked bandits and is more robust than CascadeKL-UCB, an existing algorithm for the cascade model., Proceedings of the 34th International Conference on Machine Learning
Published: 2017

27. Bottleneck Conditional Density Estimation

Author: Shu, Rui, Bui, Hung H., and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We introduce a new framework for training deep generative models for high-dimensional conditional density estimation. The Bottleneck Conditional Density Estimator (BCDE) is a variant of the conditional variational autoencoder (CVAE) that employs layer(s) of stochastic variables as the bottleneck between the input $x$ and target $y$, where both are high-dimensional. Crucially, we propose a new hybrid training method that blends the conditional generative model with a joint generative model. Hybrid blending is the key to effective training of the BCDE, which avoids overfitting and provides a novel mechanism for leveraging unlabeled data. We show that our hybrid training procedure enables models to achieve competitive results in the MNIST quadrant prediction task in the fully-supervised setting, and sets new benchmarks in the semi-supervised regime for MNIST, SVHN, and CelebA.
Published: 2016

28. Graphical Model Sketch

Author: Kveton, Branislav, Bui, Hung, Ghavamzadeh, Mohammad, Theocharous, Georgios, Muthukrishnan, S., and Sun, Siqi
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Structured high-cardinality data arises in many domains, and poses a major challenge for both modeling and inference. Graphical models are a popular approach to modeling structured data but they are unsuitable for high-cardinality variables. The count-min (CM) sketch is a popular approach to estimating probabilities in high-cardinality data but it does not scale well beyond a few variables. In this work, we bring together the ideas of graphical models and count sketches; and propose and analyze several approaches to estimating probabilities in structured high-cardinality streams of data. The key idea of our approximations is to use the structure of a graphical model and approximately estimate its factors by "sketches", which hash high-cardinality variables using random projections. Our approximations are computationally efficient and their space complexity is independent of the cardinality of variables. Our error bounds are multiplicative and significantly improve upon those of the CM sketch, a state-of-the-art approach to estimating probabilities in streams. We evaluate our approximations on synthetic and real-world problems, and report an order of magnitude improvements over the CM sketch., Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases
Published: 2016

29. Conservative Contextual Linear Bandits

Author: Kazerouni, Abbas, Ghavamzadeh, Mohammad, Abbasi-Yadkori, Yasin, and Van Roy, Benjamin
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Safety is a desirable property that can immensely increase the applicability of learning algorithms in real-world decision-making problems. It is much easier for a company to deploy an algorithm that is safe, i.e., guaranteed to perform at least as well as a baseline. In this paper, we study the issue of safety in contextual linear bandits that have application in many different fields including personalized ad recommendation in online marketing. We formulate a notion of safety for this class of algorithms. We develop a safe contextual linear bandit algorithm, called conservative linear UCB (CLUCB), that simultaneously minimizes its regret and satisfies the safety constraint, i.e., maintains its performance above a fixed percentage of the performance of a baseline strategy, uniformly over time. We prove an upper-bound on the regret of CLUCB and show that it can be decomposed into two terms: 1) an upper-bound for the regret of the standard linear UCB algorithm that grows with the time horizon and 2) a constant (does not grow with the time horizon) term that accounts for the loss of being conservative in order to satisfy the safety constraint. We empirically show that our algorithm is safe and validate our theoretical analysis.
Published: 2016
Full Text: View/download PDF

30. Bayesian Reinforcement Learning: A Survey

Author: Ghavamzadeh, Mohammad, Mannor, Shie, Pineau, Joelle, and Tamar, Aviv
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Bayesian methods for machine learning have been widely investigated, yielding principled methods for incorporating prior information into inference algorithms. In this survey, we provide an in-depth review of the role of Bayesian methods for the reinforcement learning (RL) paradigm. The major incentives for incorporating Bayesian reasoning in RL are: 1) it provides an elegant approach to action-selection (exploration/exploitation) as a function of the uncertainty in learning; and 2) it provides a machinery to incorporate prior knowledge into the algorithms. We first discuss models and methods for Bayesian inference in the simple single-step Bandit model. We then review the extensive recent literature on Bayesian methods for model-based RL, where prior information can be expressed on the parameters of the Markov model. We also present Bayesian methods for model-free RL, where priors are expressed over the value function or policy class. The objective of the paper is to provide a comprehensive survey on Bayesian RL algorithms and their theoretical and empirical properties.
Published: 2016
Full Text: View/download PDF

31. Safe Policy Improvement by Minimizing Robust Baseline Regret

Author: Petrik, Marek, Chow, Yinlam, and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Statistics - Machine Learning, Machine Learning (stat.ML)
Abstract: An important problem in sequential decision-making under uncertainty is to use limited data to compute a safe policy, i.e., a policy that is guaranteed to perform at least as well as a given baseline strategy. In this paper, we develop and analyze a new model-based approach to compute a safe policy when we have access to an inaccurate dynamics model of the system with known accuracy guarantees. Our proposed robust method uses this (inaccurate) model to directly minimize the (negative) regret w.r.t. the baseline policy. Contrary to the existing approaches, minimizing the regret allows one to improve the baseline policy in states with accurate dynamics and seamlessly fall back to the baseline policy, otherwise. We show that our formulation is NP-hard and propose an approximate algorithm. Our empirical results on several domains show that even this relatively simple approximate algorithm can significantly outperform standard approaches.
Published: 2016
Full Text: View/download PDF

32. Classification-based Approximate Policy Iteration: Experiments and Extended Discussions

Author: Farahmand, Amir-massoud, Precup, Doina, Barreto, André M. S., and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, I.2.6, I.2.8, Machine Learning (stat.ML), Systems and Control (eess.SY), Machine Learning (cs.LG), Computer Science - Learning, Statistics - Machine Learning, Optimization and Control (math.OC), 68T05 (Primary), 93E35, 93E20, 90C40, 49L20 (Secondary), FOS: Electrical engineering, electronic engineering, information engineering, FOS: Mathematics, Computer Science - Systems and Control, Mathematics - Optimization and Control
Abstract: Tackling large approximate dynamic programming or reinforcement learning problems requires methods that can exploit regularities, or intrinsic structure, of the problem in hand. Most current methods are geared towards exploiting the regularities of either the value function or the policy. We introduce a general classification-based approximate policy iteration (CAPI) framework, which encompasses a large class of algorithms that can exploit regularities of both the value function and the policy space, depending on what is advantageous. This framework has two main components: a generic value function estimator and a classifier that learns a policy based on the estimated value function. We establish theoretical guarantees for the sample complexity of CAPI-style algorithms, which allow the policy evaluation step to be performed by a wide variety of algorithms (including temporal-difference-style methods), and can handle nonparametric representations of policies. Our bounds on the estimation error of the performance loss are tighter than existing results. We also illustrate this approach empirically on several problems, including a large HIV control task.
Published: 2014
Full Text: View/download PDF

33. Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs

Author: A., Prashanth L. and Ghavamzadeh, Mohammad
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in rewards in addition to maximizing a standard criterion. Variance related risk measures are among the most common risk-sensitive criteria in finance and operations research. However, optimizing many such criteria is known to be a hard problem. In this paper, we consider both discounted and average reward Markov decision processes. For each formulation, we first define a measure of variability for a policy, which in turn gives us a set of risk-sensitive criteria to optimize. For each of these criteria, we derive a formula for computing its gradient. We then devise actor-critic algorithms that operate on three timescales - a TD critic on the fastest timescale, a policy gradient (actor) on the intermediate timescale, and a dual ascent for Lagrange multipliers on the slowest timescale. In the discounted setting, we point out the difficulty in estimating the gradient of the variance of the return and incorporate simultaneous perturbation approaches to alleviate this. The average setting, on the other hand, allows for an actor update using compatible features to estimate the gradient of the variance. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in a traffic signal control application.
Published: 2014
Full Text: View/download PDF

34. A Generalized Kernel Approach to Structured Output Learning

Author: Kadri, Hachem, Ghavamzadeh, Mohammad, Preux, Philippe, éQuipe AppRentissage et MultimediA [Marseille] (QARMA), Laboratoire d'informatique Fondamentale de Marseille (LIF), Centre National de la Recherche Scientifique (CNRS)-École Centrale de Marseille (ECM)-Aix Marseille Université (AMU)-Centre National de la Recherche Scientifique (CNRS)-École Centrale de Marseille (ECM)-Aix Marseille Université (AMU), Sequential Learning (SEQUEL), Laboratoire d'Informatique Fondamentale de Lille (LIFL), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Laboratoire d'Automatique, Génie Informatique et Signal (LAGIS), Université de Lille, Sciences et Technologies-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria), Aix Marseille Université (AMU)-École Centrale de Marseille (ECM)-Centre National de la Recherche Scientifique (CNRS)-Aix Marseille Université (AMU)-École Centrale de Marseille (ECM)-Centre National de la Recherche Scientifique (CNRS), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Automatique, Génie Informatique et Signal (LAGIS), and Université de Lille, Sciences et Technologies-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)
Subjects: FOS: Computer and information sciences, Computer Science - Learning, [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], Statistics - Machine Learning, operator-valued kernel, kernel dependency estimation, Machine Learning (stat.ML), structured outputs, function-valued RKHS, Machine Learning (cs.LG)
Abstract: We study the problem of structured output learning from a regression perspective. We first provide a general formulation of the kernel dependency estimation (KDE) problem using operator-valued kernels. We show that some of the existing formulations of this problem are special cases of our framework. We then propose a covariance-based operator-valued kernel that allows us to take into account the structure of the kernel feature space. This kernel operates on the output space and encodes the interactions between the outputs without any reference to the input space. To address this issue, we introduce a variant of our KDE method based on the conditional covariance operator that in addition to the correlation between the outputs takes into account the effects of the input variables. Finally, we evaluate the performance of our KDE approach using both covariance and conditional covariance kernels on two structured output problems, and compare it to the state-of-the-art kernel-based structured output regression methods., Comment: in International Conference on Machine Learning (ICML), Jun 2013, Atlanta, United States. 2013
Published: 2013

35. A Dantzig Selector Approach to Temporal Difference Learning

Author: Geist, Matthieu, Scherrer, Bruno, Lazaric, Alessandro, Ghavamzadeh, Mohammad, IMS : Information, Multimodalité & Signal, SUPELEC-Campus Metz, Ecole Supérieure d'Electricité - SUPELEC (FRANCE)-Ecole Supérieure d'Electricité - SUPELEC (FRANCE), Autonomous intelligent machine (MAIA), Inria Nancy - Grand Est, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Department of Complex Systems, Artificial Intelligence & Robotics (LORIA - AIS), Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Laboratoire Lorrain de Recherche en Informatique et ses Applications (LORIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS)-Université de Lorraine (UL)-Centre National de la Recherche Scientifique (CNRS), Sequential Learning (SEQUEL), Laboratoire d'Informatique Fondamentale de Lille (LIFL), Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Inria Lille - Nord Europe, Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire d'Automatique, Génie Informatique et Signal (LAGIS), Université de Lille, Sciences et Technologies-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Centre National de la Recherche Scientifique (CNRS), John Langford and Joelle Pineau, Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Université de Lille, Sciences et Technologies-Institut National de Recherche en Informatique et en Automatique (Inria)-Université de Lille, Sciences Humaines et Sociales-Centre National de la Recherche Scientifique (CNRS)-Laboratoire d'Automatique, Génie Informatique et Signal (LAGIS), Université de Lille, Sciences et Technologies-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Centrale Lille-Centre National de la Recherche Scientifique (CNRS)-Inria Lille - Nord Europe, and Institut National de Recherche en Informatique et en Automatique (Inria)
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Statistics::Machine Learning, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: LSTD is a popular algorithm for value function approximation. Whenever the number of features is larger than the number of samples, it must be paired with some form of regularization. In particular, L1-regularization methods tend to perform feature selection by promoting sparsity, and thus, are well-suited for high-dimensional problems. However, since LSTD is not a simple regression algorithm, but it solves a fixed--point problem, its integration with L1-regularization is not straightforward and might come with some drawbacks (e.g., the P-matrix assumption for LASSO-TD). In this paper, we introduce a novel algorithm obtained by integrating LSTD with the Dantzig Selector. We investigate the performance of the proposed algorithm and its relationship with the existing regularized approaches, and show how it addresses some of their drawbacks., Appears in Proceedings of the 29th International Conference on Machine Learning (ICML 2012)
Published: 2012

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

35 results on '"Ghavamzadeh, Mohammad"'

1. A Convex Relaxation Approach to Bayesian Regret Minimization in Offline Bandits

2. Robust Reinforcement Learning using Offline Data

3. Meta-Learning for Simple Regret Minimization

4. Deep Hierarchy in Bandits

5. Non-stationary Bandits and Meta-Learning with a Small Set of Optimal Arms

6. Operator Splitting Value Iteration

7. Soft-Robust Algorithms for Batch Reinforcement Learning

8. Control-Aware Representations for Model-based Reinforcement Learning

9. Stochastic Bandits with Linear Constraints

10. Finite-Sample Analysis of Proximal Gradient TD Algorithms

11. Neural Lyapunov Redesign

12. Mirror Descent Policy Optimization

13. Active Model Estimation in Markov Decision Processes

14. Predictive Coding for Locally-Linear Control

15. Online Planning with Lookahead Policies

16. Lyapunov-based Safe Policy Optimization for Continuous Control

17. Randomized Exploration in Generalized Linear Bandits

18. Benchmarking Batch Deep Reinforcement Learning Algorithms

19. Garbage In, Reward Out: Bootstrapping Exploration in Multi-Armed Bandits

20. Risk-Sensitive Generative Adversarial Imitation Learning

21. A Lyapunov-based Approach to Safe Reinforcement Learning

22. A Block Coordinate Ascent Algorithm for Mean-Variance Optimization

23. Path Consistency Learning in Tsallis Entropy Regularized MDPs

24. Optimizing over a Restricted Policy Class in Markov Decision Processes

25. Active Learning for Accurate Estimation of Linear Models

26. Online Learning to Rank in Stochastic Click Models

27. Bottleneck Conditional Density Estimation

28. Graphical Model Sketch

29. Conservative Contextual Linear Bandits

30. Bayesian Reinforcement Learning: A Survey

31. Safe Policy Improvement by Minimizing Robust Baseline Regret

32. Classification-based Approximate Policy Iteration: Experiments and Extended Discussions

33. Variance-Constrained Actor-Critic Algorithms for Discounted and Average Reward MDPs

34. A Generalized Kernel Approach to Structured Output Learning

35. A Dantzig Selector Approach to Temporal Difference Learning

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

35 results on '"Ghavamzadeh, Mohammad"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources