565 results for "Bartlett, Peter L."
Search Results
52. Large-Scale Markov Decision Problems via the Linear Programming Dual
- Author
- Abbasi-Yadkori, Yasin, Bartlett, Peter L., Chen, Xi, and Malek, Alan
- Subjects
Mathematics - Optimization and Control, Computer Science - Machine Learning
- Abstract
We consider the problem of controlling a fully specified Markov decision process (MDP), also known as the planning problem, when the state space is very large and calculating the optimal policy is intractable. Instead, we pursue the more modest goal of optimizing over some small family of policies. Specifically, we show that the family of policies associated with a low-dimensional approximation of occupancy measures yields a tractable optimization. Moreover, we propose an efficient algorithm, scaling with the size of the subspace but not the state space, that is able to find a policy with low excess loss relative to the best policy in this class. To the best of our knowledge, such results did not exist in the literature previously. We bound excess loss in the average cost and discounted cost cases, which are treated separately. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application., Comment: 53 pages. arXiv admin note: text overlap with arXiv:1402.6763
- Published
- 2019
53. Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems
- Author
- Malik, Dhruv, Pananjady, Ashwin, Bhatia, Kush, Khamaru, Koulik, Bartlett, Peter L., and Wainwright, Martin J.
- Subjects
Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
- Abstract
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates for stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems., Comment: Version v3 consistent with paper appearing in JMLR
- Published
- 2018
54. Gen-Oja: A Two-time-scale approach for Streaming CCA
- Author
- Bhatia, Kush, Pacchiano, Aldo, Flammarion, Nicolas, Bartlett, Peter L., and Jordan, Michael I.
- Subjects
Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
- Abstract
In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm, Gen-Oja, for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest., Comment: Accepted at NeurIPS 2018
- Published
- 2018
55. A simple parameter-free and adaptive approach to optimization under a minimal local smoothness assumption
- Author
- Bartlett, Peter L., Gabillon, Victor, and Valko, Michal
- Subjects
Computer Science - Machine Learning, Statistics - Machine Learning
- Abstract
We study the problem of optimizing a function under a \emph{budgeted number of evaluations}. We only assume that the function is \emph{locally} smooth around one of its global optima. The difficulty of optimization is measured in terms of 1) the amount of \emph{noise} $b$ of the function evaluation and 2) the local smoothness, $d$, of the function. A smaller $d$ results in smaller optimization error. We introduce a new, simple, and parameter-free approach. First, for all values of $b$ and $d$, this approach recovers at least the state-of-the-art regret guarantees. Second, our approach additionally obtains these results while being \textit{agnostic} to the values of both $b$ and $d$. This leads to the first algorithm that naturally adapts to an \textit{unknown} range of noise $b$ and yields significant improvements in the moderate- and low-noise regimes. Third, our approach also obtains a remarkable improvement over the state-of-the-art SOO algorithm when the noise is very low, which includes the case of optimization under deterministic feedback ($b=0$). There, under our minimal local smoothness assumption, this improvement is of exponential magnitude and holds for a class of functions that covers the vast majority of functions that practitioners optimize ($d=0$). Our algorithmic improvement is borne out in experiments, where we empirically show faster convergence on common benchmarks.
- Published
- 2018
56. Best of many worlds: Robust model selection for online supervised learning
- Author
- Muthukumar, Vidya, Ray, Mitas, Sahai, Anant, and Bartlett, Peter L.
- Subjects
Computer Science - Learning, Statistics - Machine Learning
- Abstract
We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to $\mathcal{O}(\sqrt{T})$ regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity., Comment: 33 pages, 5 figures
- Published
- 2018
57. Sharp convergence rates for Langevin dynamics in the nonconvex setting
- Author
- Cheng, Xiang, Chatterji, Niladri S., Abbasi-Yadkori, Yasin, Bartlett, Peter L., and Jordan, Michael I.
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning, Mathematics - Probability, Statistics - Computation
- Abstract
We study the problem of sampling from a distribution $p^*(x) \propto \exp\left(-U(x)\right)$, where the function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, but potentially nonconvex inside this ball. We study both overdamped and underdamped Langevin MCMC and establish upper bounds on the number of steps required to obtain a sample from a distribution that is within $\epsilon$ of $p^*$ in $1$-Wasserstein distance. For the first-order method (overdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}d/\epsilon^2\right)$, where $d$ is the dimension of the underlying space. For the second-order method (underdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}\sqrt{d}/\epsilon\right)$ for an explicit positive constant $c$. Surprisingly, the iteration complexity for both these algorithms is only polynomial in the dimension $d$ and the target accuracy $\epsilon$. It is exponential, however, in the problem parameter $LR^2$, which is a measure of non-log-concavity of the target distribution., Comment: 78 pages, 2 figures
- Published
- 2018
58. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization
- Author
- Bartlett, Peter L., Evans, Steven N., and Long, Philip M.
- Subjects
Computer Science - Learning, Computer Science - Artificial Intelligence, Computer Science - Neural and Evolutionary Computing, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fr\'echet derivatives with respect to the $h_1,...,h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
- Published
- 2018
59. Online learning with kernel losses
- Author
- Pacchiano, Aldo, Chatterji, Niladri S., and Bartlett, Peter L.
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigendecay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigendecay $\mu_j \le \mathcal{O}(j^{-\beta})$, we find that the regret is bounded by $\mathcal{R}_n \le \mathcal{O}(n^{\beta/(2(\beta-1))})$; while under the assumption of exponential eigendecay $\mu_j \le \mathcal{O}(e^{-\beta j })$, we get an even tighter bound on the regret $\mathcal{R}_n \le \mathcal{O}(n^{1/2}\log(n)^{1/2})$. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm., Comment: 40 pages, 4 figures
- Published
- 2018
60. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks
- Author
- Bartlett, Peter L., Helmbold, David P., and Long, Philip M.
- Subjects
Computer Science - Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Mathematics - Statistics Theory, Statistics - Machine Learning
- Abstract
We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $\Theta_1,...,\Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} ... \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = ... = \Theta_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} \Theta_L \Theta_{L-1} ... \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, ..., \Theta_L$ so that they have the same singular values.
- Published
- 2018
61. On the Theory of Variance Reduction for Stochastic Gradient Monte Carlo
- Author
- Chatterji, Niladri S., Flammarion, Nicolas, Ma, Yi-An, Bartlett, Peter L., and Jordan, Michael I.
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
We provide convergence guarantees in Wasserstein distance for a variety of variance-reduction methods: SAGA Langevin diffusion, SVRG Langevin diffusion and control-variate underdamped Langevin diffusion. We analyze these methods under a uniform set of assumptions on the log-posterior distribution, assuming it to be smooth, strongly convex and Hessian Lipschitz. This is achieved by a new proof technique combining ideas from finite-sum optimization and the analysis of sampling methods. Our sharp theoretical bounds allow us to identify regimes of interest where each method performs better than the others. Our theory is verified with experiments on real-world and synthetic datasets., Comment: 37 pages; 4 figures
- Published
- 2018
62. Alternating minimization for dictionary learning: Local Convergence Guarantees
- Author
- Chatterji, Niladri S. and Bartlett, Peter L.
- Subjects
Statistics - Machine Learning, Computer Science - Machine Learning
- Abstract
We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1},y^{2},\ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*},\ldots,x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning and specifically alternating minimization algorithms for dictionary learning are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term). Our guarantees are under a reasonable generative model that allows for dictionaries with growing operator norms, and can handle an arbitrary level of overcompleteness, while having sparsity that is information theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm., Comment: Erratum: An earlier version of this paper appeared in NIPS 2017 which had an erroneous claim about convergence guarantees with random initialization. The main result -- Theorem 3 -- has been corrected by adding an assumption about the initialization (Assumption B1)
- Published
- 2017
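The abstract above describes a scheme that alternates between $\ell_1$ minimization and gradient descent. A minimal sketch of that pattern (an illustrative instance with parameter choices of our own, not the paper's exact algorithm, initialization, or assumptions):

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def alt_min_dictionary(Y, A0, lam=0.1, outer=20, inner=50, lr=0.5):
    """Alternate an ISTA pass (l1-regularized least squares) for the sparse
    codes X with a gradient step on the dictionary A."""
    A = A0.copy()
    n = Y.shape[1]
    X = np.zeros((A.shape[1], n))
    for _ in range(outer):
        # Sparse-coding step: ISTA on 0.5*||Y - A X||_F^2 + lam*||X||_1
        L = np.linalg.norm(A, 2) ** 2  # Lipschitz constant of the smooth part
        for _ in range(inner):
            X = soft_threshold(X - A.T @ (A @ X - Y) / L, lam / L)
        # Dictionary step: gradient descent on 0.5*||Y - A X||_F^2
        A = A - lr * (A @ X - Y) @ X.T / n
    return A, X

# Toy run: recover a near-identity dictionary from sparse synthetic samples
rng = np.random.default_rng(0)
d, k, n = 10, 10, 200
X_true = rng.standard_normal((k, n)) * (rng.random((k, n)) < 0.2)
Y = np.eye(d) @ X_true
A_hat, X_hat = alt_min_dictionary(Y, np.eye(d) + 0.1 * rng.standard_normal((d, k)))
err = np.linalg.norm(Y - A_hat @ X_hat) / np.linalg.norm(Y)
```

The $\ell_1$ step here is plain ISTA; any lasso solver would do in its place.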
63. Acceleration and Averaging in Stochastic Mirror Descent Dynamics
- Author
- Krichene, Walid and Bartlett, Peter L.
- Subjects
Mathematics - Optimization and Control, Statistics - Machine Learning
- Abstract
We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (a.s. and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging weights) and the covariation of the noise process, and show, in particular, how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.
- Published
- 2017
64. Underdamped Langevin MCMC: A non-asymptotic analysis
- Author
- Cheng, Xiang, Chatterji, Niladri S., Bartlett, Peter L., and Jordan, Michael I.
- Subjects
Statistics - Machine Learning, Computer Science - Learning, Statistics - Computation
- Abstract
We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present an MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $\mathcal{O}(d/\varepsilon^2)$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC), which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom., Comment: 23 pages; Correction to Corollary 7
- Published
- 2017
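The underdamped dynamics behind the entry above are $dv = -\gamma v\,dt - \nabla U(x)\,dt + \sqrt{2\gamma}\,dB_t$, $dx = v\,dt$. The sketch below discretizes them with a plain Euler-Maruyama step; the paper analyzes a sharper discretization, so this is only an illustration, with step size and friction chosen by us:

```python
import numpy as np

def underdamped_langevin(grad_U, x0, steps, h=0.01, gamma=1.0, seed=0):
    """Euler-Maruyama discretization of underdamped Langevin dynamics.
    Runs all chains stacked along x0's leading axis in parallel."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(steps):
        noise = rng.standard_normal(x.shape)
        v = v - h * (gamma * v + grad_U(x)) + np.sqrt(2.0 * gamma * h) * noise
        x = x + h * v
    return x

# 500 parallel chains targeting a standard 2-d Gaussian, U(x) = ||x||^2 / 2
samples = underdamped_langevin(lambda x: x, np.zeros((500, 2)), steps=2000)
```

For this target the empirical mean and variance of the samples should be close to 0 and 1, up to O(h) discretization bias.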
65. Recovery Guarantees for One-hidden-layer Neural Networks
- Author
- Zhong, Kai, Song, Zhao, Jain, Prateek, Bartlett, Peter L., and Dhillon, Inderjit S.
- Subjects
Computer Science - Learning, Computer Science - Data Structures and Algorithms, Statistics - Machine Learning
- Abstract
In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to $\mathit{local~strong~convexity}$ in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective. Most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show $\mathit{local~linear~convergence}$ guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity $ d \cdot \log(1/\epsilon) \cdot \mathrm{poly}(k,\lambda )$ and computational complexity $n\cdot d \cdot \mathrm{poly}(k,\lambda) $ for smooth homogeneous activations with high probability, where $d$ is the dimension of the input, $k$ ($k\leq d$) is the number of hidden nodes, $\lambda$ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, $\epsilon$ is the targeted precision and $n$ is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision., Comment: ICML 2017
- Published
- 2017
66. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks
- Author
- Bartlett, Peter L., Harvey, Nick, Liaw, Chris, and Mehrabian, Abbas
- Subjects
Computer Science - Machine Learning
- Abstract
We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$, and provide examples with VC-dimension $\Omega( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $\Theta(W U)$ on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial., Comment: Extended abstract appeared in COLT 2017; the upper bound was presented at the 2016 ACM Conference on Data Science. This version includes all the proofs and a refinement of the upper bound, Theorem 6. 16 pages, 2 figures
- Published
- 2017
67. RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning
- Author
- Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter
- Subjects
Computer Science - Artificial Intelligence, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
- Abstract
Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems., Comment: 14 pages. Under review as a conference paper at ICLR 2017
- Published
- 2016
68. Corrigendum to “Prediction, learning, uniform convergence, and scale-sensitive dimensions” [J. Comput. Syst. Sci. 56 (2) (1998) 174–190]
- Author
- Bartlett, Peter L. and Long, Philip M.
- Published
- 2024
69. Hit-and-Run for Sampling and Planning in Non-Convex Spaces
- Author
- Abbasi-Yadkori, Yasin, Bartlett, Peter L., Gabillon, Victor, and Malek, Alan
- Subjects
Statistics - Computation, Computer Science - Artificial Intelligence, Mathematics - Combinatorics, Mathematics - Probability
- Abstract
We propose the Hit-and-Run algorithm for planning and sampling problems in non-convex spaces. For sampling, we show the first analysis of the Hit-and-Run algorithm in non-convex spaces and show that it mixes fast as long as certain smoothness conditions are satisfied. In particular, our analysis reveals an intriguing connection between fast mixing and the existence of smooth measure-preserving mappings from a convex space to the non-convex space. For planning, we show advantages of Hit-and-Run compared to state-of-the-art planning methods such as Rapidly-Exploring Random Trees.
- Published
- 2016
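The basic Hit-and-Run kernel underlying the entry above is easy to state; the paper's contribution is its analysis on non-convex spaces. A sketch of the kernel on the Euclidean unit ball (a toy convex setting of our own, not the paper's):

```python
import numpy as np

def hit_and_run_ball(x0, steps, seed=0):
    """Hit-and-Run on the Euclidean unit ball: draw a uniform direction,
    then a uniform point on the chord through the current point."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        u = rng.standard_normal(x.shape)
        u /= np.linalg.norm(u)
        # Chord endpoints solve ||x + t u||^2 = 1, i.e. t^2 + 2 b t + c = 0
        b, c = x @ u, x @ x - 1.0
        disc = np.sqrt(b * b - c)
        x = x + rng.uniform(-b - disc, -b + disc) * u
    return x

# Stationary distribution is uniform on the ball, where E||x||^2 = d/(d+2)
xs = np.array([hit_and_run_ball(np.zeros(2), steps=60, seed=s) for s in range(300)])
```

On a non-convex set, the chord through $x$ may be a union of segments, which is exactly where the smoothness conditions of the paper come into play.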
70. FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods
- Author
- Cheng, Xiang, Roosta-Khorasani, Farbod, Palombo, Stefan, Bartlett, Peter L., and Mahoney, Michael W.
- Subjects
Mathematics - Optimization and Control, Computer Science - Learning, Statistics - Machine Learning
- Abstract
We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications.
- Published
- 2016
71. Bayesian Robustness: A Nonasymptotic Viewpoint.
- Author
- Bhatia, Kush, Ma, Yi-An, Dragan, Anca D., Bartlett, Peter L., and Jordan, Michael I.
- Subjects
ROBUST statistics, ALGORITHMS
- Abstract
We study the problem of robustly estimating the posterior distribution for the setting where observed data can be contaminated with potentially adversarial outliers. We propose Rob-ULA, a robust variant of the Unadjusted Langevin Algorithm (ULA), and provide a finite-sample analysis of its sampling distribution. In particular, we show that after $T = \tilde{\mathcal{O}}(d/\varepsilon_{\mathrm{acc}})$ iterations, we can sample from $p_T$ such that $\mathrm{dist}(p_T, p^*) \leq \varepsilon_{\mathrm{acc}} + \tilde{\mathcal{O}}(\epsilon)$, where $\epsilon$ is the fraction of corruptions and dist denotes the squared 2-Wasserstein distance metric. Our results hold for the class of posteriors $p^*$ which satisfy log-concavity and smoothness assumptions. We corroborate our theoretical analysis with experiments on both synthetic and real-world datasets for mean estimation, regression and binary classification.
- Published
- 2024
72. Linear Programming for Large-Scale Markov Decision Problems
- Author
- Abbasi-Yadkori, Yasin, Bartlett, Peter L., and Malek, Alan
- Subjects
Mathematics - Optimization and Control, Computer Science - Artificial Intelligence, Computer Science - Numerical Analysis
- Abstract
We consider the problem of controlling a Markov decision process (MDP) with a large state space, so as to minimize average cost. Since it is intractable to compete with the optimal policy for large scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. We use the dual linear programming formulation of the MDP average cost problem, in which the variable is a stationary distribution over state-action pairs, and we consider a neighborhood of a low-dimensional subset of the set of stationary distributions (defined in terms of state-action features) as the comparison class. We propose two techniques, one based on stochastic convex optimization, and one based on constraint sampling. In both cases, we give bounds that show that the performance of our algorithms approaches the best achievable by any policy in the comparison class. Most importantly, these results depend on the size of the comparison class, but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queuing application., Comment: 27 pages, 3 figures
- Published
- 2014
73. Bounding Embeddings of VC Classes into Maximum Classes
- Author
- Rubinstein, J. Hyam, Rubinstein, Benjamin I. P., and Bartlett, Peter L.
- Subjects
Computer Science - Learning, Mathematics - Combinatorics, Statistics - Machine Learning
- Abstract
One of the earliest conjectures in computational learning theory, the Sample Compression conjecture, asserts that concept classes (equivalently, set systems) admit compression schemes of size linear in their VC dimension. To date, this statement is known to be true for maximum classes, those that possess maximum cardinality for their VC dimension. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without a super-linear increase to their VC dimensions, as such embeddings would extend the known compression schemes to all VC classes. We show that maximum classes can be characterised by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterisation of maximum VC classes is applied to prove a negative embedding result, which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we show that every VC-d class C embeds in a VC-(d+D) maximum class, where D is the deficiency of C, i.e., the difference between the cardinalities of a maximum VC-d class and of C. For VC-2 classes in binary n-cubes for 4 <= n <= 6, we give best possible results on embedding into maximum classes. For some special classes of Boolean functions, relationships with maximum classes are investigated. Finally, we give a general recursive procedure for embedding VC-d classes into VC-(d+k) maximum classes for the smallest k., Comment: 22 pages, 2 figures
- Published
- 2014
74. Accelerated Mirror Descent in Continuous and Discrete Time
- Author
- Krichene, Walid, Bayen, Alexandre M., and Bartlett, Peter L.
- Subjects
Psychology, Cognitive Sciences, Machine learning
- Abstract
We study accelerated mirror descent dynamics in continuous and discrete time. Combining the original continuous-time motivation of mirror descent with a recent ODE interpretation of Nesterov's accelerated method, we propose a family of continuous-time descent dynamics for convex functions with Lipschitz gradients, such that the solution trajectories converge to the optimum at a $O(1/t^2)$ rate. We then show that a large family of first-order accelerated methods can be obtained as a discretization of the ODE, and these methods converge at a $O(1/k^2)$ rate. This connection between accelerated mirror descent and the ODE provides an intuitive approach to the design and analysis of accelerated first-order algorithms.
- Published
- 2015
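In the Euclidean special case (identity mirror map), the ODE discretization described in the entry above recovers Nesterov's accelerated gradient method. A minimal sketch on a toy objective of our own:

```python
import numpy as np

def nesterov(grad, x0, L, steps):
    """Accelerated gradient method with the standard (k-1)/(k+2) momentum,
    one discretization of the accelerated-descent ODE; attains an O(1/k^2)
    rate on smooth convex objectives with L-Lipschitz gradient."""
    x = np.array(x0, dtype=float)
    y = x.copy()
    for k in range(1, steps + 1):
        x_new = y - grad(y) / L                           # gradient step at the extrapolated point
        y = x_new + (k - 1.0) / (k + 2.0) * (x_new - x)   # momentum extrapolation
        x = x_new
    return x

# Minimize the ill-conditioned quadratic f(x) = 0.5 * x^T Q x
Q = np.diag([1.0, 100.0])
x = nesterov(lambda v: Q @ v, [1.0, 1.0], L=100.0, steps=300)
```

Other choices of mirror map and averaging weights give the wider family of methods the paper derives from the same ODE.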
75. Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions
- Author
- Abbasi-Yadkori, Yasin, Bartlett, Peter L., and Szepesvari, Csaba
- Subjects
Computer Science - Learning, Statistics - Machine Learning
- Abstract
We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.
- Published
- 2013
76. Oracle inequalities for computationally adaptive model selection
- Author
- Agarwal, Alekh, Bartlett, Peter L., and Duchi, John C.
- Subjects
Statistics - Machine Learning, Computer Science - Learning
- Abstract
We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our computational budget to the optimal function class.
- Published
- 2012
77. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs
- Author
-
Bartlett, Peter L. and Tewari, Ambuj
- Subjects
Computer Science - Learning - Abstract
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of ~O(HS√(AT)). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds., Comment: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)
- Published
- 2012
78. Infinite-Horizon Policy-Gradient Estimation
- Author
-
Baxter, Jonathan and Bartlett, Peter L.
- Subjects
Computer Science - Artificial Intelligence - Abstract
Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a {\em biased} estimate of the gradient of the {\em average reward} in Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. A similar algorithm was proposed by Kimura, Yamamura, and Kobayashi (1995). The algorithm's chief advantages are that it requires storage of only twice the number of policy parameters, uses one free parameter $\beta\in [0,1)$ (which has a natural interpretation in terms of bias-variance trade-off), and requires no knowledge of the underlying state. We prove convergence of GPOMDP, and show how the correct choice of the parameter $\beta$ is related to the {\em mixing time} of the controlled POMDP. We briefly describe extensions of GPOMDP to controlled Markov chains, continuous state, observation and control spaces, multiple agents, higher-order derivatives, and a version for training stochastic policies with internal states. In a companion paper (Baxter, Bartlett, & Weaver, 2001) we show how the gradient estimates generated by GPOMDP can be used in both a traditional stochastic gradient algorithm and a conjugate-gradient procedure to find local optima of the average reward.
- Published
- 2011
- Full Text
- View/download PDF
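A minimal sketch of the GPOMDP-style estimator on a toy state-free two-action problem (a degenerate POMDP, so the eligibility trace carries no useful history and only the bias-variance role of β survives). The softmax policy, reward structure, and β value are illustrative choices, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_probs(theta):
    """Softmax policy over two actions."""
    e = np.exp(theta - theta.max())
    return e / e.sum()

def grad_log_policy(theta, a):
    """Gradient of log pi(a | theta) for the softmax policy."""
    g = -policy_probs(theta)
    g[a] += 1.0
    return g

def gpomdp(theta, beta=0.9, T=20000):
    """GPOMDP-style average-reward gradient estimate.

    z is the eligibility trace discounted by beta in [0, 1); Delta is the
    running average of r_t * z_t, a biased gradient estimate whose bias
    shrinks as beta -> 1 (at the price of higher variance)."""
    z = np.zeros_like(theta)
    Delta = np.zeros_like(theta)
    for t in range(1, T + 1):
        a = rng.choice(2, p=policy_probs(theta))
        r = 1.0 if a == 0 else 0.0      # illustrative reward: action 0 pays 1
        z = beta * z + grad_log_policy(theta, a)
        Delta += (r * z - Delta) / t    # running average of r_t * z_t
    return Delta

g = gpomdp(np.zeros(2))
# The estimate should push probability toward the rewarding action 0.
```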
79. Randomized Smoothing for Stochastic Optimization
- Author
-
Duchi, John C., Bartlett, Peter L., and Wainwright, Martin J.
- Subjects
Mathematics - Optimization and Control ,Statistics - Machine Learning - Abstract
We analyze convergence rates of stochastic optimization procedures for non-smooth convex optimization problems. By combining randomized smoothing techniques with accelerated gradient methods, we obtain convergence rates of stochastic optimization procedures, both in expectation and with high probability, that have optimal dependence on the variance of the gradient estimates. To the best of our knowledge, these are the first variance-based rates for non-smooth optimization. We give several applications of our results to statistical estimation problems, and provide experimental results that demonstrate the effectiveness of the proposed algorithms. We also describe how a combination of our algorithm with recent work on decentralized optimization yields a distributed stochastic optimization algorithm that is order-optimal., Comment: 39 pages, 3 figures
- Published
- 2011
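The smoothing device can be sketched as follows: replace a non-smooth f by its Gaussian smoothing f_mu(x) = E[f(x + mu Z)], whose gradient admits an unbiased sampled estimate g(x + mu Z) for any subgradient g of f. For brevity this runs plain stochastic subgradient descent on the smoothed l1 norm rather than the paper's accelerated scheme; all constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def subgrad_l1(x):
    """A subgradient of f(x) = ||x||_1."""
    return np.sign(x)

def smoothed_sgd(x0, mu=0.1, step=0.02, iters=5000):
    """SGD on the Gaussian-smoothed objective f_mu(x) = E[f(x + mu Z)]."""
    x = x0.copy()
    for _ in range(iters):
        z = rng.standard_normal(x.shape)
        x -= step * subgrad_l1(x + mu * z)  # unbiased gradient of f_mu at x
    return x

x = smoothed_sgd(np.full(3, 5.0))
# x should end up near 0, the minimizer of ||x||_1.
```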
80. Blackwell Approachability and Low-Regret Learning are Equivalent
- Author
-
Abernethy, Jacob, Bartlett, Peter L., and Hazan, Elad
- Subjects
Computer Science - Learning ,Computer Science - Computer Science and Game Theory - Abstract
We consider the celebrated Blackwell Approachability Theorem for two-player games with vector payoffs. We show that Blackwell's result is equivalent, via efficient reductions, to the existence of "no-regret" algorithms for Online Linear Optimization. Indeed, we show that any algorithm for one such problem can be efficiently converted into an algorithm for the other. We provide a useful application of this reduction: the first efficient algorithm for calibrated forecasting.
- Published
- 2010
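The Online Linear Optimization side of the equivalence can be sketched with projected online gradient descent over the unit ball, a standard no-regret algorithm with regret O(sqrt(T)) against any fixed comparator. The random linear losses below are illustrative; the paper's reductions, not shown here, would convert such an algorithm into an approachability strategy.

```python
import numpy as np

rng = np.random.default_rng(5)

def ogd_regret(T):
    """Projected online gradient descent on linear losses g_t . x over the
    Euclidean unit ball; returns regret against the best fixed action."""
    x = np.zeros(2)
    losses = rng.standard_normal((T, 2))
    alg_loss = 0.0
    for t, g in enumerate(losses, start=1):
        alg_loss += float(g @ x)
        x = x - g / np.sqrt(t)       # step size ~ 1/sqrt(t)
        nrm = np.linalg.norm(x)
        if nrm > 1.0:
            x /= nrm                 # project back onto the unit ball
    best = -np.linalg.norm(losses.sum(axis=0))  # best fixed point in hindsight
    return alg_loss - best

T = 20000
reg = ogd_regret(T)
# reg / T -> 0, reflecting the O(sqrt(T)) regret guarantee.
```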
81. Information-theoretic lower bounds on the oracle complexity of stochastic convex optimization
- Author
-
Agarwal, Alekh, Bartlett, Peter L., Ravikumar, Pradeep, and Wainwright, Martin J.
- Subjects
Statistics - Machine Learning ,Computer Science - Systems and Control ,Mathematics - Optimization and Control - Abstract
Relative to the large literature on upper bounds on complexity of convex optimization, lesser attention has been paid to the fundamental hardness of these problems. Given the extensive use of convex optimization in machine learning and statistics, gaining an understanding of these complexity-theoretic issues is important. In this paper, we study the complexity of stochastic convex optimization in an oracle model of computation. We improve upon known results and obtain tight minimax complexity estimates for various function classes.
- Published
- 2010
82. A Unifying View of Multiple Kernel Learning
- Author
-
Kloft, Marius, Rückert, Ulrich, and Bartlett, Peter L.
- Subjects
Statistics - Machine Learning ,Computer Science - Learning - Abstract
Recent research on multiple kernel learning has led to a number of approaches for combining kernels in regularized risk minimization. The proposed approaches include different formulations of objectives and varying regularization strategies. In this paper we present a unifying general optimization criterion for multiple kernel learning and show how existing formulations are subsumed as special cases. We also derive the criterion's dual representation, which is suitable for general smooth optimization algorithms. Finally, we evaluate multiple kernel learning in this framework analytically using a Rademacher complexity bound on the generalization error and empirically in a set of experiments.
- Published
- 2010
83. A Learning-Based Approach to Reactive Security
- Author
-
Barth, Adam, Rubinstein, Benjamin I. P., Sundararajan, Mukund, Mitchell, John C., Song, Dawn, and Bartlett, Peter L.
- Subjects
Computer Science - Cryptography and Security ,Computer Science - Computer Science and Game Theory ,Computer Science - Learning - Abstract
Despite the conventional wisdom that proactive security is superior to reactive security, we show that reactive security can be competitive with proactive security as long as the reactive defender learns from past attacks instead of myopically overreacting to the last attack. Our game-theoretic model follows common practice in the security literature by making worst-case assumptions about the attacker: we grant the attacker complete knowledge of the defender's strategy and do not require the attacker to act rationally. In this model, we bound the competitive ratio between a reactive defense algorithm (which is inspired by online learning theory) and the best fixed proactive defense. Additionally, we show that, unlike proactive defenses, this reactive strategy is robust to a lack of information about the attacker's incentives and knowledge., Comment: 22 pages, 4 figures; full version of paper to be published in Financial Cryptography and Data Security 2010 (FC'10)
- Published
- 2009
- Full Text
- View/download PDF
84. Learning in a Large Function Space: Privacy-Preserving Mechanisms for SVM Learning
- Author
-
Rubinstein, Benjamin I. P., Bartlett, Peter L., Huang, Ling, and Taft, Nina
- Subjects
Computer Science - Learning ,Computer Science - Cryptography and Security ,Computer Science - Databases - Abstract
Several recent studies in privacy-preserving learning have considered the trade-off between utility or risk and the level of differential privacy guaranteed by mechanisms for statistical query processing. In this paper we study this trade-off in private Support Vector Machine (SVM) learning. We present two efficient mechanisms, one for the case of finite-dimensional feature mappings and one for potentially infinite-dimensional feature mappings with translation-invariant kernels. For the case of translation-invariant kernels, the proposed mechanism minimizes regularized empirical risk in a random Reproducing Kernel Hilbert Space whose kernel uniformly approximates the desired kernel with high probability. This technique, borrowed from large-scale learning, allows the mechanism to respond with a finite encoding of the classifier, even when the function class is of infinite VC dimension. Differential privacy is established using a proof technique from algorithmic stability. Utility--the mechanism's response function is pointwise epsilon-close to non-private SVM with probability 1-delta--is proven by appealing to the smoothness of regularized empirical risk minimization with respect to small perturbations to the feature mapping. We conclude with a lower bound on the optimal differential privacy of the SVM. This negative result states that for any delta, no mechanism can be simultaneously (epsilon,delta)-useful and beta-differentially private for small epsilon and small beta., Comment: 21 pages, 1 figure
- Published
- 2009
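The large-scale-learning ingredient this abstract borrows (a random feature space whose kernel uniformly approximates a translation-invariant kernel) can be sketched with random Fourier features. The Gaussian kernel, dimensions, and test points below are illustrative, and no privacy mechanism is implemented in this sketch.

```python
import numpy as np

rng = np.random.default_rng(2)

def rff(x, W, b):
    """Random Fourier feature map: z(x) . z(y) approximates the
    translation-invariant kernel exp(-||x - y||^2 / 2)."""
    D = W.shape[0]
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

d, D = 3, 5000
W = rng.standard_normal((D, d))            # frequencies sampled from N(0, I)
b = rng.uniform(0.0, 2.0 * np.pi, size=D)  # random phases

x = np.array([0.3, -0.1, 0.4])
y = np.array([0.0, 0.2, -0.5])
exact = np.exp(-0.5 * float((x - y) @ (x - y)))
approx = float(rff(x, W, b) @ rff(y, W, b))
# approx converges to exact at rate O(1/sqrt(D)).
```

Because the mechanism releases a classifier in this finite random feature space, its response has a finite encoding even when the original kernel class has infinite VC dimension.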
85. A Stochastic View of Optimal Regret through Minimax Duality
- Author
-
Abernethy, Jacob, Agarwal, Alekh, Bartlett, Peter L., and Rakhlin, Alexander
- Subjects
Computer Science - Learning ,Statistics - Machine Learning - Abstract
We study the regret of optimal strategies for online convex optimization games. Using von Neumann's minimax theorem, we show that the optimal regret in this adversarial setting is closely related to the behavior of the empirical minimization algorithm in a stochastic process setting: it is equal to the maximum, over joint distributions of the adversary's action sequence, of the difference between a sum of minimal expected losses and the minimal empirical loss. We show that the optimal regret has a natural geometric interpretation, since it can be viewed as the gap in Jensen's inequality for a concave functional--the minimizer over the player's actions of expected loss--defined on a set of probability distributions. We use this expression to obtain upper and lower bounds on the regret of an optimal strategy for a variety of online learning problems. Our method provides upper bounds without the need to construct a learning algorithm; the lower bounds provide explicit optimal strategies for the adversary.
- Published
- 2009
86. Margin-adaptive model selection in statistical learning
- Author
-
Arlot, Sylvain and Bartlett, Peter L.
- Subjects
Mathematics - Statistics Theory ,Statistics - Machine Learning - Abstract
A classical condition for fast learning rates is the margin condition, first introduced by Mammen and Tsybakov. We tackle in this paper the problem of adaptivity to this condition in the context of model selection, in a general learning framework. Actually, we consider a weaker version of this condition that allows one to take into account that learning within a small model can be much easier than within a large one. Requiring this "strong margin adaptivity" makes the model selection problem more challenging. We first prove, in a general framework, that some penalization procedures (including local Rademacher complexities) exhibit this adaptivity when the models are nested. Contrary to previous results, this holds with penalties that only depend on the data. Our second main result is that strong margin adaptivity is not always possible when the models are not nested: for every model selection procedure (even a randomized one), there is a problem for which it does not demonstrate strong margin adaptivity., Comment: Published in at http://dx.doi.org/10.3150/10-BEJ288 the Bernoulli (http://isi.cbs.nl/bernoulli/) by the International Statistical Institute/Bernoulli Society (http://isi.cbs.nl/BS/bshome.htm)
- Published
- 2008
- Full Text
- View/download PDF
87. Discussion of '2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization' by V. Koltchinskii
- Author
-
Bartlett, Peter L. and Mendelson, Shahar
- Subjects
Quantitative Finance - Risk Management ,Mathematics - Statistics Theory - Abstract
Discussion of "2004 IMS Medallion Lecture: Local Rademacher complexities and oracle inequalities in risk minimization" by V. Koltchinskii [arXiv:0708.0083], Comment: Published at http://dx.doi.org/10.1214/009053606000001028 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2007
- Full Text
- View/download PDF
88. Comment on 'Support Vector Machines with Applications'
- Author
-
Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D.
- Subjects
Mathematics - Statistics Theory - Abstract
Comment on "Support Vector Machines with Applications" [math.ST/0612817], Comment: Published at http://dx.doi.org/10.1214/088342306000000475 in the Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2006
- Full Text
- View/download PDF
89. Local Rademacher complexities
- Author
-
Bartlett, Peter L., Bousquet, Olivier, and Mendelson, Shahar
- Subjects
Mathematics - Statistics Theory ,62G08, 68Q32 (Primary) - Abstract
We propose new bounds on the error of learning algorithms in terms of a data-dependent notion of complexity. The estimates we establish give optimal rates and are based on a local and empirical version of Rademacher averages, in the sense that the Rademacher averages are computed from the data, on a subset of functions with small empirical error. We present some applications to classification and prediction with convex function classes, and with kernel classes in particular., Comment: Published at http://dx.doi.org/10.1214/009053605000000282 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
- Published
- 2005
- Full Text
- View/download PDF
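The data-dependent quantity in this abstract can be sketched by a Monte Carlo estimate of the (global) empirical Rademacher average for a tiny finite class; the local version studied in the paper additionally restricts the supremum to functions with small empirical error. The class and sample below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def empirical_rademacher(values, n_draws=20000):
    """Monte Carlo estimate of E_sigma sup_{f in F} (1/n) sum_i sigma_i f(x_i).

    values: array of shape (|F|, n) holding f(x_i) for each f in the class."""
    n = values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)  # Rademacher signs
        total += float(np.max(values @ sigma)) / n
    return total / n_draws

# Two-function class {+1, -1} (a constant and its negation) on n = 50 points,
# for which the average equals E|sum_i sigma_i| / n ~ sqrt(2 / (pi * n)).
n = 50
values = np.vstack([np.ones(n), -np.ones(n)])
R = empirical_rademacher(values)
```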
90. [Consistency in Boosting]: Discussion
- Author
-
Bartlett, Peter L., Jordan, Michael I., and McAuliffe, Jon D.
- Published
- 2004
91. Optimal and instance-dependent guarantees for Markovian linear stochastic approximation.
- Author
-
Wenlong Mou, Pananjady, Ashwin, Wainwright, Martin J., and Bartlett, Peter L.
- Subjects
STOCHASTIC approximation ,MARKOV processes ,AUTOREGRESSIVE models ,LINEAR equations ,REINFORCEMENT learning - Abstract
We study stochastic approximation procedures for approximately solving a d-dimensional linear fixed-point equation based on observing a trajectory of length n from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order t_mix d/n on the squared error of the last iterate of a standard scheme, where t_mix is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters (d, t_mix) in the higher-order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise--covering the TD(λ) family of algorithms for all λ ∈ [0, 1)--and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of λ when running the TD(λ) algorithm). [ABSTRACT FROM AUTHOR]
- Published
- 2024
- Full Text
- View/download PDF
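The setting can be sketched in one dimension: find theta* solving bar(a) theta* = bar(b), where the pairs (a_s, b_s) are observed along an ergodic Markov chain and bars denote stationary averages. The chain, coefficients, and step size below are hypothetical toy choices, and the tail averaging of a constant-step scheme is in the spirit of (not identical to) the paper's averaged SA estimator.

```python
import numpy as np

rng = np.random.default_rng(4)

P = np.array([[0.9, 0.1],
              [0.2, 0.8]])      # ergodic two-state chain, stationary pi = (2/3, 1/3)
a = np.array([1.0, 2.0])
b = np.array([1.0, 4.0])
# Target: theta* = (sum_s pi_s b_s) / (sum_s pi_s a_s) = 2 / (4/3) = 1.5.
theta_star = (2/3 * b[0] + 1/3 * b[1]) / (2/3 * a[0] + 1/3 * a[1])

def averaged_sa(step=0.01, T=100_000):
    """Constant-step linear SA along the chain, with tail averaging."""
    s, theta = 0, 0.0
    tail = []
    for t in range(T):
        theta += step * (b[s] - a[s] * theta)  # noisy fixed-point update
        if t >= T // 2:
            tail.append(theta)                 # average the second half
        s = rng.choice(2, p=P[s])              # Markovian (not i.i.d.) noise
    return np.mean(tail)

theta_bar = averaged_sa()
# theta_bar should be close to theta_star = 1.5.
```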
92. Local complexities for empirical risk minimization
- Author
-
Bartlett, Peter L, Mendelson, Shahar, and Philips, Petra
- Subjects
statistical learning theory ,empirical risk minimization ,generalization bounds ,concentration inequalities ,isomorphic coordinate projections ,data-dependent complexity - Abstract
We present sharp bounds on the risk of the empirical minimization algorithm under mild assumptions on the class. We introduce the notion of isomorphic coordinate projections and show that this leads to a sharper error bound than the best previously known. The quantity which governs this bound on the empirical minimizer is the largest fixed point of the function ξ_n(r) = E sup {Ef - E_n f : f ∈ F, Ef = r}. We prove that this is the best estimate one can obtain using "structural results", and that it is possible to estimate the error rate from data. We then prove that the bound on the empirical minimization algorithm can be improved further by a direct analysis, and that the correct error rate is the maximizer of ξ'_n(r) - r, where ξ'_n(r) = E sup {Ef - E_n f : f ∈ F, Ef = r}.
- Published
- 2004
93. Bounding Embeddings of VC Classes into Maximum Classes
- Author
-
Rubinstein, J. Hyam, Rubinstein, Benjamin I. P., Bartlett, Peter L., Vovk, Vladimir, editor, Papadopoulos, Harris, editor, and Gammerman, Alexander, editor
- Published
- 2015
- Full Text
- View/download PDF
94. Bayesian Robustness: A Nonasymptotic Viewpoint
- Author
-
Bhatia, Kush, primary, Ma, Yi-An, additional, Dragan, Anca D., additional, Bartlett, Peter L., additional, and Jordan, Michael I., additional
- Published
- 2023
- Full Text
- View/download PDF
95. Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity
- Author
-
Mou, Wenlong, primary, Flammarion, Nicolas, additional, Wainwright, Martin J., additional, and Bartlett, Peter L., additional
- Published
- 2022
- Full Text
- View/download PDF
96. A Regularization Approach to Metrical Task Systems
- Author
-
Abernethy, Jacob, Bartlett, Peter L., Buchbinder, Niv, Stanton, Isabelle, Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Goebel, Randy, editor, Siekmann, Jörg, editor, Wahlster, Wolfgang, editor, Hutter, Marcus, editor, Stephan, Frank, editor, Vovk, Vladimir, editor, and Zeugmann, Thomas, editor
- Published
- 2010
- Full Text
- View/download PDF
97. A Learning-Based Approach to Reactive Security
- Author
-
Barth, Adam, Rubinstein, Benjamin I. P., Sundararajan, Mukund, Mitchell, John C., Song, Dawn, Bartlett, Peter L., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Doug, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, and Sion, Radu, editor
- Published
- 2010
- Full Text
- View/download PDF
98. Bounded Parameter Markov Decision Processes with Average Reward Criterion
- Author
-
Tewari, Ambuj, Bartlett, Peter L., Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Bshouty, Nader H., editor, and Gentile, Claudio, editor
- Published
- 2007
- Full Text
- View/download PDF
99. On the Consistency of Multiclass Classification Methods
- Author
-
Tewari, Ambuj, Bartlett, Peter L., Hutchison, David, editor, Kanade, Takeo, editor, Kittler, Josef, editor, Kleinberg, Jon M., editor, Mattern, Friedemann, editor, Mitchell, John C., editor, Naor, Moni, editor, Nierstrasz, Oscar, editor, Pandu Rangan, C., editor, Steffen, Bernhard, editor, Sudan, Madhu, editor, Terzopoulos, Demetri, editor, Tygar, Dough, editor, Vardi, Moshe Y., editor, Weikum, Gerhard, editor, Carbonell, Jaime G., editor, Siekmann, Jörg, editor, Auer, Peter, editor, and Meir, Ron, editor
- Published
- 2005
- Full Text
- View/download PDF
100. Oracle lower bounds for stochastic gradient sampling algorithms
- Author
-
Chatterji, Niladri S., primary, Bartlett, Peter L., additional, and Long, Philip M., additional
- Published
- 2022
- Full Text
- View/download PDF