Author: "Bartlett, Peter L." / Topic: machine learning (cs.lg) - Searchworks@Jio Institute Digital Library Search Results

1. The Double-Edged Sword of Implicit Bias: Generalization vs. Robustness in ReLU Networks

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., and Srebro, Nathan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In this work, we study the implications of the implicit bias of gradient flow on generalization and adversarial robustness in ReLU networks. We focus on a setting where the data consists of clusters and the correlations between cluster means are small, and show that in two-layer ReLU networks gradient flow is biased towards solutions that generalize well, but are highly vulnerable to adversarial examples. Our results hold even in cases where the network has many more parameters than training examples. Despite the potential for harmful overfitting in such overparameterized settings, we prove that the implicit bias of gradient flow prevents it. However, the implicit bias also leads to non-robust solutions (susceptible to small adversarial $\ell_2$-perturbations), even though robust networks that fit the data exist., 41 pages
Published: 2023

2. Benign Overfitting in Linear Classifiers and Leaky ReLU Networks from KKT Conditions for Margin Maximization

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., and Srebro, Nathan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Linear classifiers and leaky ReLU networks trained by gradient flow on the logistic loss have an implicit bias towards solutions which satisfy the Karush--Kuhn--Tucker (KKT) conditions for margin maximization. In this work we establish a number of settings where the satisfaction of these KKT conditions implies benign overfitting in linear classifiers and in two-layer leaky ReLU networks: the estimators interpolate noisy training data and simultaneously generalize well to test data. The settings include variants of the noisy class-conditional Gaussians considered in previous work as well as new distributional settings where benign overfitting has not been previously observed. The key ingredient to our proof is the observation that when the training data is nearly-orthogonal, both linear classifiers and leaky ReLU networks satisfying the KKT conditions for their respective margin maximization problems behave like a nearly uniform average of the training examples., 53 pages
Published: 2023

3. Trained Transformers Learn Linear Models In-Context

Author: Zhang, Ruiqi, Frei, Spencer, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Statistics - Machine Learning, Computer Science - Artificial Intelligence, Machine Learning (stat.ML), Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Attention-based neural networks such as transformers have demonstrated a remarkable ability to exhibit in-context learning (ICL): Given a short prompt sequence of tokens from an unseen task, they can formulate relevant per-token and next-token predictions without any parameter updates. By embedding a sequence of labeled training data and unlabeled test data as a prompt, this allows for transformers to behave like supervised learning algorithms. Indeed, recent work has shown that when training transformer architectures over random instances of linear regression problems, these models' predictions mimic those of ordinary least squares. Towards understanding the mechanisms underlying this phenomenon, we investigate the dynamics of ICL in transformers with a single linear self-attention layer trained by gradient flow on linear regression tasks. We show that despite non-convexity, gradient flow with a suitable random initialization finds a global minimum of the objective function. At this global minimum, when given a test prompt of labeled examples from a new prediction task, the transformer achieves prediction error competitive with the best linear predictor over the test prompt distribution. We additionally characterize the robustness of the trained transformer to a variety of distribution shifts and show that although a number of shifts are tolerated, shifts in the covariate distribution of the prompts are not. Motivated by this, we consider a generalized ICL setting where the covariate distributions can vary across prompts. We show that although gradient flow succeeds at finding a global minimum in this setting, the trained transformer is still brittle under mild covariate shifts., Comment: 46 pages
Published: 2023
Full Text: View/download PDF

4. Optimal variance-reduced stochastic approximation in Banach spaces

Author: Mou, Wenlong, Khamaru, Koulik, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study the problem of estimating the fixed point of a contractive operator defined on a separable Banach space. Focusing on a stochastic query model that provides noisy evaluations of the operator, we analyze a variance-reduced stochastic approximation scheme, and establish non-asymptotic bounds for both the operator defect and the estimation error, measured in an arbitrary semi-norm. In contrast to worst-case guarantees, our bounds are instance-dependent, and achieve the local asymptotic minimax risk non-asymptotically. For linear operators, contractivity can be relaxed to multi-step contractivity, so that the theory can be applied to problems like average reward policy evaluation problem in reinforcement learning. We illustrate the theory via applications to stochastic shortest path problems, two-player zero-sum Markov games, as well as policy evaluation and $Q$-learning for tabular Markov decision processes.
Published: 2022

5. Benign Overfitting without Linearity: Neural Network Classifiers Trained by Gradient Descent for Noisy Linear Data

Author: Frei, Spencer, Chatterji, Niladri S., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: Benign overfitting, the phenomenon where interpolating models generalize well in the presence of noisy data, was first observed in neural network models trained with gradient descent. To better understand this empirical observation, we consider the generalization error of two-layer neural networks trained to interpolation by gradient descent on the logistic loss following random initialization. We assume the data comes from well-separated class-conditional log-concave distributions and allow for a constant fraction of the training labels to be corrupted by an adversary. We show that in this setting, neural networks exhibit benign overfitting: they can be driven to zero training error, perfectly fitting any noisy training labels, and simultaneously achieve minimax optimal test error. In contrast to previous work on benign overfitting that require linear or kernel-based predictors, our analysis holds in a setting where both the model and learning dynamics are fundamentally nonlinear., Comment: 39 pages; updated proof of loss ratio bound
Published: 2022
Full Text: View/download PDF

6. Random Feature Amplification: Feature Learning and Generalization in Neural Networks

Author: Frei, Spencer, Chatterji, Niladri S., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features., Comment: 42 pages; minor corrections
Published: 2022
Full Text: View/download PDF

7. Implicit Bias in Leaky ReLU Networks Trained on High-Dimensional Data

Author: Frei, Spencer, Vardi, Gal, Bartlett, Peter L., Srebro, Nathan, and Hu, Wei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The implicit biases of gradient-based optimization algorithms are conjectured to be a major factor in the success of modern deep learning. In this work, we investigate the implicit bias of gradient flow and gradient descent in two-layer fully-connected neural networks with leaky ReLU activations when the training data are nearly-orthogonal, a common property of high-dimensional data. For gradient flow, we leverage recent work on the implicit bias for homogeneous neural networks to show that asymptotically, gradient flow produces a neural network with rank at most two. Moreover, this network is an $\ell_2$-max-margin solution (in parameter space), and has a linear decision boundary that corresponds to an approximate-max-margin linear predictor. For gradient descent, provided the random initialization variance is small enough, we show that a single step of gradient descent suffices to drastically reduce the rank of the network, and that the rank remains small throughout training. We provide experiments which suggest that a small initialization scale is important for finding low-rank neural networks with gradient descent., Comment: 54 pages
Published: 2022
Full Text: View/download PDF

8. On the Theory of Reinforcement Learning with Once-per-Episode Feedback

Author: Chatterji, Niladri S., Pacchiano, Aldo, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study a theory of reinforcement learning (RL) in which the learner receives binary feedback only once at the end of an episode. While this is an extreme test case for theory, it is also arguably more representative of real-world applications than the traditional requirement in RL practice that the learner receive feedback at every time step. Indeed, in many real-world applications of reinforcement learning, such as self-driving cars and robotics, it is easier to evaluate whether a learner's complete trajectory was either "good" or "bad," but harder to provide a reward signal at each step. To show that learning is possible in this more challenging setting, we study the case where trajectory labels are generated by an unknown parametric model, and provide a statistically and computationally efficient algorithm that achieves sublinear regret., Published at NeurIPS 2022
Published: 2021

9. Agnostic learning with unknown utilities

Author: Bhatia, Kush, Bartlett, Peter L., Dragan, Anca D., and Steinhardt, Jacob
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, agnostic learning, learning by comparisons, Computing methodologies → Active learning settings, Statistics - Machine Learning, active learning, Computing methodologies → Machine learning, Machine Learning (stat.ML), utility estimation, Machine Learning (cs.LG)
Abstract: Traditional learning approaches for classification implicitly assume that each mistake has the same cost. In many real-world problems though, the utility of a decision depends on the underlying context x and decision y; for instance, misclassifying a stop sign is worse than misclassifying a road-side postbox. However, directly incorporating these utilities into the learning objective is often infeasible since these can be quite complex and difficult for humans to specify. We formally study this as agnostic learning with unknown utilities: given a dataset S = {x_1, …, x_n} where each data point x_i ∼ 𝒟_x from some unknown distribution 𝒟_x, the objective of the learner is to output a function f in some class of decision functions ℱ with small excess risk. This risk measures the performance of the output predictor f with respect to the best predictor in the class ℱ on the unknown underlying utility u^*:𝒳×𝒴↦ [0,1]. This utility u^* is not assumed to have any specific structure and is allowed to be any bounded function. This raises an interesting question whether learning is even possible in our setup, given that obtaining a generalizable estimate of utility u^* might not be possible from finitely many samples. Surprisingly, we show that estimating the utilities of only the sampled points S suffices to learn a decision function which generalizes well. With this insight, we study mechanisms for eliciting information from human experts which allow a learner to estimate the utilities u^* on the set S. While humans find it difficult to directly provide utility values reliably, it is often easier for them to provide comparison feedback based on these utilities. We show that, unlike in the realizable setup, the vanilla comparison queries where humans compare a pair of decisions for a single input x are insufficient. We introduce a family of elicitation mechanisms by generalizing comparisons, called the k-comparison oracle, which enables the learner to ask for comparisons across k different inputs x at once. We show that the excess risk in our agnostic learning framework decreases at a rate of O (1/k) with such queries. This result brings out an interesting accuracy-elicitation trade-off - as the order k of the oracle increases, the comparative queries become harder to elicit from humans but allow for more accurate learning., LIPIcs, Vol. 185, 12th Innovations in Theoretical Computer Science Conference (ITCS 2021), pages 55:1-55:20
Published: 2021

10. When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?

Author: Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Statistics - Machine Learning, Computer Science - Artificial Intelligence, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
Published: 2021

11. Preference learning along multiple criteria: A game-theoretic perspective

Author: Bhatia, Kush, Pananjady, Ashwin, Bartlett, Peter L., Dragan, Anca D., and Wainwright, Martin J.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science::Computer Science and Game Theory, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The literature on ranking from ordinal data is vast, and there are several ways to aggregate overall preferences from pairwise comparisons between objects. In particular, it is well known that any Nash equilibrium of the zero sum game induced by the preference matrix defines a natural solution concept (winning distribution over objects) known as a von Neumann winner. Many real-world problems, however, are inevitably multi-criteria, with different pairwise preferences governing the different criteria. In this work, we generalize the notion of a von Neumann winner to the multi-criteria setting by taking inspiration from Blackwell's approachability. Our framework allows for non-linear aggregation of preferences across criteria, and generalizes the linearization-based approach from multi-objective optimization. From a theoretical standpoint, we show that the Blackwell winner of a multi-criteria problem instance can be computed as the solution to a convex optimization problem. Furthermore, given random samples of pairwise comparisons, we show that a simple plug-in estimator achieves near-optimal minimax sample complexity. Finally, we showcase the practical utility of our framework in a user study on autonomous driving, where we find that the Blackwell winner outperforms the von Neumann winner for the overall preferences., Comment: 47 pages; published as a conference paper at NeurIPS 2020
Published: 2021
Full Text: View/download PDF

12. Optimal and instance-dependent guarantees for Markovian linear stochastic approximation

Author: Mou, Wenlong, Pananjady, Ashwin, Wainwright, Martin J., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), Probability (math.PR), FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Mathematics - Optimization and Control, Mathematics - Probability, Machine Learning (cs.LG)
Abstract: We study stochastic approximation procedures for approximately solving a $d$-dimensional linear fixed point equation based on observing a trajectory of length $n$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $t_{\mathrm{mix}} \tfrac{d}{n}$ on the squared error of the last iterate of a standard scheme, where $t_{\mathrm{mix}}$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $(d, t_{\mathrm{mix}})$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise -- covering the TD($\lambda$) family of algorithms for all $\lambda \in [0, 1)$ -- and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $\lambda$ when running the TD($\lambda$) algorithm).
Published: 2021
Full Text: View/download PDF

13. The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks

Author: Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of $\textit{benign overfitting}$ has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk., Comment: Accepted for publication at JMLR
Published: 2021
Full Text: View/download PDF

14. Adversarial Examples in Multi-Layer Random ReLU Networks

Author: Bartlett, Peter L., Bubeck, Sébastien, and Cherapanamjeri, Yeshwanth
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
Published: 2021
Full Text: View/download PDF

15. Infinite-Horizon Offline Reinforcement Learning with Linear Function Approximation: Curse of Dimensionality and Algorithm

Author: Chen, Lin, Scherrer, Bruno, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In this paper, we investigate the sample complexity of policy evaluation in infinite-horizon offline reinforcement learning (also known as the off-policy evaluation problem) with linear function approximation. We identify a hard regime $d\gamma^{2}>1$, where $d$ is the dimension of the feature vector and $\gamma$ is the discount rate. In this regime, for any $q\in[\gamma^{2},1]$, we can construct a hard instance such that the smallest eigenvalue of its feature covariance matrix is $q/d$ and it requires $\Omega\left(\frac{d}{\gamma^{2}\left(q-\gamma^{2}\right)\varepsilon^{2}}\exp\left(\Theta\left(d\gamma^{2}\right)\right)\right)$ samples to approximate the value function up to an additive error $\varepsilon$. Note that the lower bound of the sample complexity is exponential in $d$. If $q=\gamma^{2}$, even infinite data cannot suffice. Under the low distribution shift assumption, we show that there is an algorithm that needs at most $O\left(\max\left\{ \frac{\left\Vert \theta^{\pi}\right\Vert _{2}^{4}}{\varepsilon^{4}}\log\frac{d}{\delta},\frac{1}{\varepsilon^{2}}\left(d+\log\frac{1}{\delta}\right)\right\} \right)$ samples ($\theta^{\pi}$ is the parameter of the policy in linear function approximation) and guarantees approximation to the value function up to an additive error of $\varepsilon$ with probability at least $1-\delta$.
Published: 2021
Full Text: View/download PDF

16. When does gradient descent with logistic loss find interpolating two-layer networks?

Author: Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
Published: 2020

17. Optimal Mean Estimation without a Variance

Author: Cherapanamjeri, Yeshwanth, Tripuraneni, Nilesh, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: We study the problem of heavy-tailed mean estimation in settings where the variance of the data-generating distribution does not exist. Concretely, given a sample $\mathbf{X} = \{X_i\}_{i = 1}^n$ from a distribution $\mathcal{D}$ over $\mathbb{R}^d$ with mean $\mu$ which satisfies the following \emph{weak-moment} assumption for some ${\alpha \in [0, 1]}$: \begin{equation*} \forall \|v\| = 1: \mathbb{E}_{X \thicksim \mathcal{D}}[\lvert \langle X - \mu, v\rangle \rvert^{1 + \alpha}] \leq 1, \end{equation*} and given a target failure probability, $\delta$, our goal is to design an estimator which attains the smallest possible confidence interval as a function of $n,d,\delta$. For the specific case of $\alpha = 1$, foundational work of Lugosi and Mendelson exhibits an estimator achieving subgaussian confidence intervals, and subsequent work has led to computationally efficient versions of this estimator. Here, we study the case of general $\alpha$, and establish the following information-theoretic lower bound on the optimal attainable confidence interval: \begin{equation*} \Omega \left(\sqrt{\frac{d}{n}} + \left(\frac{d}{n}\right)^{\frac{\alpha}{(1 + \alpha)}} + \left(\frac{\log 1 / \delta}{n}\right)^{\frac{\alpha}{(1 + \alpha)}}\right). \end{equation*} Moreover, we devise a computationally-efficient estimator which achieves this lower bound., Comment: Fixed typographical errors in Theorem 1.2, Lemmas 4.3 and C.8
Published: 2020

18. Failures of model-dependent generalization bounds for least-norm interpolation

Author: Bartlett, Peter L. and Long, Philip M.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose -- it can be bounded below by a constant when the true excess risk goes to zero.
Published: 2020

19. Optimal Robust Linear Regression in Nearly Linear Time

Author: Cherapanamjeri, Yeshwanth, Aras, Efe, Tripuraneni, Nilesh, Jordan, Michael I., Flammarion, Nicolas, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Machine Learning (stat.ML), Data Structures and Algorithms (cs.DS), Machine Learning (cs.LG)
Abstract: We study the problem of high-dimensional robust linear regression where a learner is given access to $n$ samples from the generative model $Y = \langle X,w^* \rangle + \epsilon$ (with $X \in \mathbb{R}^d$ and $\epsilon$ independent), in which an $\eta$ fraction of the samples have been adversarially corrupted. We propose estimators for this problem under two settings: (i) $X$ is L4-L2 hypercontractive, $\mathbb{E} [XX^\top]$ has bounded condition number and $\epsilon$ has bounded variance and (ii) $X$ is sub-Gaussian with identity second moment and $\epsilon$ is sub-Gaussian. In both settings, our estimators: (a) Achieve optimal sample complexities and recovery guarantees up to log factors and (b) Run in near linear time ($\tilde{O}(nd / \eta^6)$). Prior to our work, polynomial time algorithms achieving near optimal sample complexities were only known in the setting where $X$ is Gaussian with identity covariance and $\epsilon$ is Gaussian, and no linear time estimators were known for robust linear regression in any setting. Our estimators and their analysis leverage recent developments in the construction of faster algorithms for robust mean estimation to improve runtimes, and refined concentration of measure arguments alongside Gaussian rounding techniques to improve statistical sample complexities.
Published: 2020

20. On Thompson Sampling with Langevin Algorithms

Author: Mazumdar, Eric, Pacchiano, Aldo, Ma, Yi-an, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Thompson sampling for multi-armed bandit problems is known to enjoy favorable performance in both theory and practice. However, it suffers from a significant limitation computationally, arising from the need for samples from posterior distributions at every iteration. We propose two Markov Chain Monte Carlo (MCMC) methods tailored to Thompson sampling to address this issue. We construct quickly converging Langevin algorithms to generate approximate samples that have accuracy guarantees, and we leverage novel posterior concentration rates to analyze the regret of the resulting approximate Thompson sampling algorithm. Further, we specify the necessary hyperparameters for the MCMC procedure to guarantee optimal instance-dependent frequentist regret while having low computational complexity. In particular, our algorithms take advantage of both posterior concentration and a sample reuse mechanism to ensure that only a constant number of iterations and a constant amount of data is needed in each round. The resulting approximate Thompson sampling algorithm has logarithmic regret and its computational complexity does not scale with the time horizon of the algorithm.
Published: 2020

21. Self-Distillation Amplifies Regularization in Hilbert Space

Author: Mobahi, Hossein, Farajtabar, Mehrdad, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Quantum Physics, Machine Learning (cs.LG)
Abstract: Knowledge distillation introduced in the deep learning context is a method to transfer knowledge from one architecture to another. In particular, when the architectures are identical, this is called self-distillation. The idea is to feed in predictions of the trained model as new target values for retraining (and iterate this loop possibly a few times). It has been empirically observed that the self-distilled model often achieves higher accuracy on held out data. Why this happens, however, has been a mystery: the self-distillation dynamics does not receive any new information about the task and solely evolves by looping over training. To the best of our knowledge, there is no rigorous understanding of this phenomenon. This work provides the first theoretical analysis of self-distillation. We focus on fitting a nonlinear function to training data, where the model space is Hilbert space and fitting is subject to $\ell_2$ regularization in this function space. We show that self-distillation iterations modify regularization by progressively limiting the number of basis functions that can be used to represent the solution. This implies (as we also verify empirically) that while a few rounds of self-distillation may reduce over-fitting, further rounds may lead to under-fitting and thus worse performance.
Published: 2020

22. On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration

Author: Mou, Wenlong, Li, Chris Junchi, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Statistics Theory, Machine Learning (stat.ML), Statistics Theory (math.ST), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system $\bar{A} \theta = \bar{b}$. When the matrix $\bar{A}$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in CLT in any direction, up to universal constants. When the matrix $\bar{A}$ is not Hurwitz but only has non-negative real parts in its eigenvalues, we prove that the averaged LSA procedure actually achieves an $O(1/T)$ rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also show various applications of the main results, including the study of momentum-based stochastic gradient methods as well as temporal difference algorithms in reinforcement learning.
Published: 2020
Full Text: View/download PDF

23. Hebbian Synaptic Modifications in Spiking Neurons that Learn

Author: Bartlett, Peter L. and Baxter, Jonathan
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science - Machine Learning, Quantitative Biology::Neurons and Cognition, Statistics - Machine Learning, Computer Science - Neural and Evolutionary Computing, Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Machine Learning (cs.LG)
Abstract: In this paper, we derive a new model of synaptic plasticity, based on recent algorithms for reinforcement learning (in which an agent attempts to learn appropriate actions to maximize its long-term average reward). We show that these direct reinforcement learning algorithms also give locally optimal performance for the problem of reinforcement learning with multiple agents, without any explicit communication between agents. By considering a network of spiking neurons as a collection of agents attempting to maximize the long-term average of a reward signal, we derive a synaptic update rule that is qualitatively similar to Hebb's postulate. This rule requires only simple computations, such as addition and leaky integration, and involves only quantities that are available in the vicinity of the synapse. Furthermore, it leads to synaptic connection strengths that give locally optimal values of the long term average reward. The reinforcement learning paradigm is sufficiently broad to encompass many learning problems that are solved by the brain. We illustrate, with simulations, that the approach is effective for simple pattern classification and motor learning tasks.
Published: 2019

24. High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm

Author: Mou, Wenlong, Ma, Yi-An, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Machine Learning (stat.ML), Data Structures and Algorithms (cs.DS), Mathematics - Optimization and Control, Statistics - Computation, Computation (stat.CO), Machine Learning (cs.LG)
Abstract: We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with log-concave and smooth densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of $d$-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most $\varepsilon > 0$ in Wasserstein distance from the target distribution in $O\left(\frac{d^{1/4}}{ \varepsilon^{1/2}} \right)$ steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with $\alpha$-th order smoothness, we prove that the mixing time scales as $O \left(\frac{d^{1/4}}{\varepsilon^{1/2}} + \frac{d^{1/2}}{\varepsilon^{1/(\alpha - 1)}} \right)$., Comment: Changes from v1: improved algorithm with $O (d^{1/4} / \varepsilon^{1/2})$ mixing time
Published: 2019

25. Stochastic Gradient and Langevin Processes

Author: Cheng, Xiang, Yin, Dong, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We prove quantitative convergence rates at which discrete Langevin-like processes converge to the invariant distribution of a related stochastic differential equation. We study the setup where the additive noise can be non-Gaussian and state-dependent and the potential function can be non-convex. We show that the key properties of these processes depend on the potential function and the second moment of the additive noise. We apply our theoretical findings to studying the convergence of Stochastic Gradient Descent (SGD) for non-convex problems and corroborate them with experiments using SGD to train deep neural networks on the CIFAR-10 dataset., ICML 2020, code available at https://github.com/dongyin92/noise_covariance
Published: 2019

26. OSOM: A simultaneously optimal algorithm for multi-armed and linear contextual bandits

Author: Chatterji, Niladri S., Muthukumar, Vidya, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Statistics::Machine Learning, Computer Science - Machine Learning, Statistics - Machine Learning, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, Machine Learning (stat.ML), ComputingMethodologies_ARTIFICIALINTELLIGENCE, Machine Learning (cs.LG)
Abstract: We consider the stochastic linear (multi-armed) contextual bandit problem with the possibility of hidden simple multi-armed bandit structure in which the rewards are independent of the contextual information. Algorithms that are designed solely for one of the regimes are known to be sub-optimal for the alternate regime. We design a single computationally efficient algorithm that simultaneously obtains problem-dependent optimal regret rates in the simple multi-armed bandit regime and minimax optimal regret rates in the linear contextual bandit regime, without knowing a priori which of the two models generates the rewards. These results are proved under the condition of stochasticity of contextual information over multiple rounds. Our results should be viewed as a step towards principled data-dependent policy class selection for contextual bandits.
Published: 2019

27. Testing Markov Chains without Hitting

Author: Cherapanamjeri, Yeshwanth and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: We study the problem of identity testing of markov chains. In this setting, we are given access to a single trajectory from a markov chain with unknown transition matrix $Q$ and the goal is to determine whether $Q = P$ for some known matrix $P$ or $\text{Dist}(P, Q) \geq \epsilon$ where $\text{Dist}$ is suitably defined. In recent work by Daskalakis, Dikkala and Gravin, 2018, it was shown that it is possible to distinguish between the two cases provided the length of the observed trajectory is at least super-linear in the hitting time of $P$ which may be arbitrarily large. In this paper, we propose an algorithm that avoids this dependence on hitting time thus enabling efficient testing of markov chains even in cases where it is infeasible to observe every state in the chain. Our algorithm is based on combining classical ideas from approximation algorithms with techniques for the spectral analysis of markov chains.
Published: 2019

28. Quantitative Weak Convergence for Discrete Stochastic Processes

Author: Cheng, Xiang, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: In this paper, we quantitative convergence in $W_2$ for a family of Langevin-like stochastic processes that includes stochastic gradient descent and related gradient-based algorithms. Under certain regularity assumptions, we show that the iterates of these stochastic processes converge to an invariant distribution at a rate of $\tilde{O}\lrp{1/\sqrt{k}}$ where $k$ is the number of steps; this rate is provably tight up to log factors. Our result reduces to a quantitative form of the classical Central Limit Theorem in the special case when the potential is quadratic.
Published: 2019

29. Large-Scale Markov Decision Problems via the Linear Programming Dual

Author: Abbasi-Yadkori, Yasin, Bartlett, Peter L., Chen, Xi, and Malek, Alan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We consider the problem of controlling a fully specified Markov decision process (MDP), also known as the planning problem, when the state space is very large and calculating the optimal policy is intractable. Instead, we pursue the more modest goal of optimizing over some small family of policies. Specifically, we show that the family of policies associated with a low-dimensional approximation of occupancy measures yields a tractable optimization. Moreover, we propose an efficient algorithm, scaling with the size of the subspace but not the state space, that is able to find a policy with low excess loss relative to the best policy in this class. To the best of our knowledge, such results did not exist in the literature previously. We bound excess loss in the average cost and discounted cost cases, which are treated separately. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application., 53 pages. arXiv admin note: text overlap with arXiv:1402.6763
Published: 2019

30. Sampling for Bayesian Mixture Models: MCMC with Polynomial-Time Mixing

Author: Mou, Wenlong, Ho, Nhat, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Probability (math.PR), FOS: Mathematics, Machine Learning (stat.ML), Data Structures and Algorithms (cs.DS), Statistics - Computation, Mathematics - Probability, Computation (stat.CO), Statistics::Computation, Machine Learning (cs.LG)
Abstract: We study the problem of sampling from the power posterior distribution in Bayesian Gaussian mixture models, a robust version of the classical posterior. This power posterior is known to be non-log-concave and multi-modal, which leads to exponential mixing times for some standard MCMC algorithms. We introduce and study the Reflected Metropolis-Hastings Random Walk (RMRW) algorithm for sampling. For symmetric two-component Gaussian mixtures, we prove that its mixing time is bounded as $d^{1.5}(d + \Vert \theta_{0} \Vert^2)^{4.5}$ as long as the sample size $n$ is of the order $d (d + \Vert \theta_{0} \Vert^2)$. Notably, this result requires no conditions on the separation of the two means. En route to proving this bound, we establish some new results of possible independent interest that allow for combining Poincar\'{e} inequalities for conditional and marginal densities.
Published: 2019
Full Text: View/download PDF

31. Langevin Monte Carlo without smoothness

Author: Chatterji, Niladri S., Diakonikolas, Jelena, Jordan, Michael I., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Statistics - Computation, Computation (stat.CO), Machine Learning (cs.LG)
Abstract: Langevin Monte Carlo (LMC) is an iterative algorithm used to generate samples from a distribution that is known only up to a normalizing constant. The nonasymptotic dependence of its mixing time on the dimension and target accuracy is understood mainly in the setting of smooth (gradient-Lipschitz) log-densities, a serious limitation for applications in machine learning. In this paper, we remove this limitation, providing polynomial-time convergence guarantees for a variant of LMC in the setting of nonsmooth log-concave distributions. At a high level, our results follow by leveraging the implicit smoothing of the log-density that comes from a small Gaussian perturbation that we add to the iterates of the algorithm and controlling the bias and variance that are induced by this perturbation., Comment: Updated to match the AISTATS 2020 camera ready version. Some example applications added and typos corrected
Published: 2019
Full Text: View/download PDF

32. Fast Mean Estimation with Sub-Gaussian Rates

Author: Cherapanamjeri, Yeshwanth, Flammarion, Nicolas, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, FOS: Mathematics, Mathematics - Statistics Theory, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: We propose an estimator for the mean of a random vector in $\mathbb{R}^d$ that can be computed in time $O(n^4+n^2d)$ for $n$ i.i.d.~samples and that has error bounds matching the sub-Gaussian case. The only assumptions we make about the data distribution are that it has finite mean and covariance; in particular, we make no assumptions about higher-order moments. Like the polynomial time estimator introduced by Hopkins, 2018, which is based on the sum-of-squares hierarchy, our estimator achieves optimal statistical efficiency in this challenging setting, but it has a significantly faster runtime and a simpler analysis.
Published: 2019
Full Text: View/download PDF

33. Bayesian Robustness: A Nonasymptotic Viewpoint

Author: Bhatia, Kush, Ma, Yi-An, Dragan, Anca D., Bartlett, Peter L., and Jordan, Michael I.
Subjects: Statistics and Probability, FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Statistics, Probability and Uncertainty, Statistics - Computation, Computation (stat.CO), Machine Learning (cs.LG)
Abstract: We study the problem of robustly estimating the posterior distribution for the setting where observed data can be contaminated with potentially adversarial outliers. We propose Rob-ULA, a robust variant of the Unadjusted Langevin Algorithm (ULA), and provide a finite-sample analysis of its sampling distribution. In particular, we show that after $T= \tilde{\mathcal{O}}(d/\varepsilon_{\textsf{acc}})$ iterations, we can sample from $p_T$ such that $\text{dist}(p_T, p^*) \leq \varepsilon_{\textsf{acc}} + \tilde{\mathcal{O}}(\epsilon)$, where $\epsilon$ is the fraction of corruptions. We corroborate our theoretical analysis with experiments on both synthetic and real-world data sets for mean estimation, regression and binary classification., Comment: 30 pages, 5 figures
Published: 2019
Full Text: View/download PDF

34. Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems

Author: Malik, Dhruv, Pananjady, Ashwin, Bhatia, Kush, Khamaru, Koulik, Bartlett, Peter L., and Wainwright, Martin J.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. We show that these methods provably converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by extensive simulations of derivative-free methods on these systems. Along the way, we derive convergence rates for stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems., Version v3 consistent with paper appearing in JMLR
Published: 2018

35. Gen-Oja: A Two-time-scale approach for Streaming CCA

Author: Bhatia, Kush, Pacchiano, Aldo, Flammarion, Nicolas, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Machine Learning (cs.LG)
Abstract: In this paper, we study the problems of principal Generalized Eigenvector computation and Canonical Correlation Analysis in the stochastic setting. We propose a simple and efficient algorithm, Gen-Oja, for these problems. We prove the global convergence of our algorithm, borrowing ideas from the theory of fast-mixing Markov chains and two-time-scale stochastic approximation, showing that it achieves the optimal rate of convergence. In the process, we develop tools for understanding stochastic processes with Markovian noise which might be of independent interest., Accepted at NeurIPS 2018
Published: 2018

36. Best of many worlds: Robust model selection for online supervised learning

Author: Muthukumar, Vidya, Ray, Mitas, Sahai, Anant, and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We introduce algorithms for online, full-information prediction that are competitive with contextual tree experts of unknown complexity, in both probabilistic and adversarial settings. We show that by incorporating a probabilistic framework of structural risk minimization into existing adaptive algorithms, we can robustly learn not only the presence of stochastic structure when it exists (leading to constant as opposed to $\mathcal{O}(\sqrt{T})$ regret), but also the correct model order. We thus obtain regret bounds that are competitive with the regret of an optimal algorithm that possesses strong side information about both the complexity of the optimal contextual tree expert and whether the process generating the data is stochastic or adversarial. These are the first constructive guarantees on simultaneous adaptivity to the model and the presence of stochasticity., 33 pages, 5 figures
Published: 2018

37. Sharp convergence rates for Langevin dynamics in the nonconvex setting

Author: Cheng, Xiang, Chatterji, Niladri S., Abbasi-Yadkori, Yasin, Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Probability (math.PR), FOS: Mathematics, Machine Learning (stat.ML), Statistics - Computation, Computation (stat.CO), Mathematics - Probability, Machine Learning (cs.LG)
Abstract: We study the problem of sampling from a distribution $p^*(x) \propto \exp\left(-U(x)\right)$, where the function $U$ is $L$-smooth everywhere and $m$-strongly convex outside a ball of radius $R$, but potentially nonconvex inside this ball. We study both overdamped and underdamped Langevin MCMC and establish upper bounds on the number of steps required to obtain a sample from a distribution that is within $\epsilon$ of $p^*$ in $1$-Wasserstein distance. For the first-order method (overdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}d/\epsilon^2\right)$, where $d$ is the dimension of the underlying space. For the second-order method (underdamped Langevin MCMC), the iteration complexity is $\tilde{\mathcal{O}}\left(e^{cLR^2}\sqrt{d}/\epsilon\right)$ for an explicit positive constant $c$. Surprisingly, the iteration complexity for both these algorithms is only polynomial in the dimension $d$ and the target accuracy $\epsilon$. It is exponential, however, in the problem parameter $LR^2$, which is a measure of non-log-concavity of the target distribution., Comment: 78 pages, 2 figures
Published: 2018

38. Online learning with kernel losses

Author: Pacchiano, Aldo, Chatterji, Niladri S., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We present a generalization of the adversarial linear bandits framework, where the underlying losses are kernel functions (with an associated reproducing kernel Hilbert space) rather than linear functions. We study a version of the exponential weights algorithm and bound its regret in this setting. Under conditions on the eigendecay of the kernel we provide a sharp characterization of the regret for this algorithm. When we have polynomial eigendecay $\mu_j \le \mathcal{O}(j^{-\beta})$, we find that the regret is bounded by $\mathcal{R}_n \le \mathcal{O}(n^{\beta/(2(\beta-1))})$; while under the assumption of exponential eigendecay $\mu_j \le \mathcal{O}(e^{-\beta j })$, we get an even tighter bound on the regret $\mathcal{R}_n \le \mathcal{O}(n^{1/2}\log(n)^{1/2})$. We also study the full information setting when the underlying losses are kernel functions and present an adapted exponential weights algorithm and a conditional gradient descent algorithm., Comment: 40 pages, 4 figures
Published: 2018

39. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks

Author: Bartlett, Peter L., Helmbold, David P., and Long, Philip M.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Computer Science - Neural and Evolutionary Computing, Mathematics - Statistics Theory, Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Statistics Theory (math.ST), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We analyze algorithms for approximating a function $f(x) = \Phi x$ mapping $\Re^d$ to $\Re^d$ using deep linear neural networks, i.e. that learn a function $h$ parameterized by matrices $\Theta_1,...,\Theta_L$ and defined by $h(x) = \Theta_L \Theta_{L-1} ... \Theta_1 x$. We focus on algorithms that learn through gradient descent on the population quadratic loss in the case that the distribution over the inputs is isotropic. We provide polynomial bounds on the number of iterations for gradient descent to approximate the least squares matrix $\Phi$, in the case where the initial hypothesis $\Theta_1 = ... = \Theta_L = I$ has excess loss bounded by a small enough constant. On the other hand, we show that gradient descent fails to converge for $\Phi$ whose distance from the identity is a larger constant, and we show that some forms of regularization toward the identity in each layer do not help. If $\Phi$ is symmetric positive definite, we show that an algorithm that initializes $\Theta_i = I$ learns an $\epsilon$-approximation of $f$ using a number of updates polynomial in $L$, the condition number of $\Phi$, and $\log(d/\epsilon)$. In contrast, we show that if the least squares matrix $\Phi$ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge. We analyze an algorithm for the case that $\Phi$ satisfies $u^{\top} \Phi u > 0$ for all $u$, but may not be symmetric. This algorithm uses two regularizers: one that maintains the invariant $u^{\top} \Theta_L \Theta_{L-1} ... \Theta_1 u > 0$ for all $u$, and another that "balances" $\Theta_1, ..., \Theta_L$ so that they have the same singular values.
Published: 2018
Full Text: View/download PDF

40. Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Author: Bartlett, Peter L., Evans, Steven N., and Long, Philip M.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, FOS: Mathematics, Computer Science - Neural and Evolutionary Computing, Mathematics - Statistics Theory, Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ ... \circ h_1$ of functions $h_1,...,h_m$ that are close to the identity in the sense that each $\left(h_i-\mathrm{Id}\right)$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fr\'echet derivatives with respect to the $h_1,...,h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
Published: 2018
Full Text: View/download PDF

41. Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks

Author: Bartlett, Peter L., Harvey, Nick, Liaw, Chris, and Mehrabian, Abbas
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting $W$ be the number of weights and $L$ be the number of layers, we prove that the VC-dimension is $O(W L \log(W))$, and provide examples with VC-dimension $\Omega( W L \log(W/L) )$. This improves both the previously known upper bounds and lower bounds. In terms of the number $U$ of non-linear units, we prove a tight bound $\Theta(W U)$ on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial., Comment: Extended abstract appeared in COLT 2017; the upper bound was presented at the 2016 ACM Conference on Data Science. This version includes all the proofs and a refinement of the upper bound, Theorem 6. 16 pages, 2 figures
Published: 2017

42. Alternating minimization for dictionary learning: Local Convergence Guarantees

Author: Chatterji, Niladri S. and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, MathematicsofComputing_NUMERICALANALYSIS, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We present theoretical guarantees for an alternating minimization algorithm for the dictionary learning/sparse coding problem. The dictionary learning problem is to factorize vector samples $y^{1},y^{2},\ldots, y^{n}$ into an appropriate basis (dictionary) $A^*$ and sparse vectors $x^{1*},\ldots,x^{n*}$. Our algorithm is a simple alternating minimization procedure that switches between $\ell_1$ minimization and gradient descent in alternate steps. Dictionary learning and specifically alternating minimization algorithms for dictionary learning are well studied both theoretically and empirically. However, in contrast to previous theoretical analyses for this problem, we replace a condition on the operator norm (that is, the largest magnitude singular value) of the true underlying dictionary $A^*$ with a condition on the matrix infinity norm (that is, the largest magnitude term). Our guarantees are under a reasonable generative model that allows for dictionaries with growing operator norms, and can handle an arbitrary level of overcompleteness, while having sparsity that is information theoretically optimal. We also establish upper bounds on the sample complexity of our algorithm., Comment: Erratum: An earlier version of this paper appeared in NIPS 2017 which had an erroneous claim about convergence guarantees with random initialization. The main result -- Theorem 3 -- has been corrected by adding an assumption about the initialization (Assumption B1)
Published: 2017
Full Text: View/download PDF

43. Underdamped Langevin MCMC: A non-asymptotic analysis

Author: Cheng, Xiang, Chatterji, Niladri S., Bartlett, Peter L., and Jordan, Michael I.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Statistics - Computation, Computation (stat.CO), Statistics::Computation, Machine Learning (cs.LG)
Abstract: We study the underdamped Langevin diffusion when the log of the target distribution is smooth and strongly concave. We present a MCMC algorithm based on its discretization and show that it achieves $\varepsilon$ error (in 2-Wasserstein distance) in $\mathcal{O}(\sqrt{d}/\varepsilon)$ steps. This is a significant improvement over the best known rate for overdamped Langevin MCMC, which is $\mathcal{O}(d/\varepsilon^2)$ steps under the same smoothness/concavity assumptions. The underdamped Langevin MCMC scheme can be viewed as a version of Hamiltonian Monte Carlo (HMC) which has been observed to outperform overdamped Langevin MCMC methods in a number of application areas. We provide quantitative rates that support this empirical wisdom., Comment: 23 pages; Correction to Corollary 7
Published: 2017
Full Text: View/download PDF

44. Recovery Guarantees for One-hidden-layer Neural Networks

Author: Zhong, Kai, Song, Zhao, Jain, Prateek, Bartlett, Peter L., and Dhillon, Inderjit S.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Computer Science - Data Structures and Algorithms, Data Structures and Algorithms (cs.DS), Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: In this paper, we consider regression problems with one-hidden-layer neural networks (1NNs). We distill some properties of activation functions that lead to $\mathit{local~strong~convexity}$ in the neighborhood of the ground-truth parameters for the 1NN squared-loss objective. Most popular nonlinear activation functions satisfy the distilled properties, including rectified linear units (ReLUs), leaky ReLUs, squared ReLUs and sigmoids. For activation functions that are also smooth, we show $\mathit{local~linear~convergence}$ guarantees of gradient descent under a resampling rule. For homogeneous activations, we show tensor methods are able to initialize the parameters to fall into the local strong convexity region. As a result, tensor initialization followed by gradient descent is guaranteed to recover the ground truth with sample complexity $ d \cdot \log(1/\epsilon) \cdot \mathrm{poly}(k,\lambda )$ and computational complexity $n\cdot d \cdot \mathrm{poly}(k,\lambda) $ for smooth homogeneous activations with high probability, where $d$ is the dimension of the input, $k$ ($k\leq d$) is the number of hidden nodes, $\lambda$ is a conditioning property of the ground-truth parameter matrix between the input layer and the hidden layer, $\epsilon$ is the targeted precision and $n$ is the number of samples. To the best of our knowledge, this is the first work that provides recovery guarantees for 1NNs with both sample complexity and computational complexity $\mathit{linear}$ in the input dimension and $\mathit{logarithmic}$ in the precision., Comment: ICML 2017
Published: 2017
Full Text: View/download PDF

45. RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning

Author: Duan, Yan, Schulman, John, Chen, Xi, Bartlett, Peter L., Sutskever, Ilya, and Abbeel, Pieter
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Computer Science - Neural and Evolutionary Computing, Machine Learning (stat.ML), Neural and Evolutionary Computing (cs.NE), Machine Learning (cs.LG)
Abstract: Deep reinforcement learning (deep RL) has been successful in learning sophisticated behaviors automatically; however, the learning process requires a huge number of trials. In contrast, animals can learn new tasks in just a few trials, benefiting from their prior knowledge about the world. This paper seeks to bridge this gap. Rather than designing a "fast" reinforcement learning algorithm, we propose to represent it as a recurrent neural network (RNN) and learn it from data. In our proposed method, RL$^2$, the algorithm is encoded in the weights of the RNN, which are learned slowly through a general-purpose ("slow") RL algorithm. The RNN receives all information a typical RL algorithm would receive, including observations, actions, rewards, and termination flags; and it retains its state across episodes in a given Markov Decision Process (MDP). The activations of the RNN store the state of the "fast" RL algorithm on the current (previously unseen) MDP. We evaluate RL$^2$ experimentally on both small-scale and large-scale problems. On the small-scale side, we train it to solve randomly generated multi-arm bandit problems and finite MDPs. After RL$^2$ is trained, its performance on new MDPs is close to human-designed algorithms with optimality guarantees. On the large-scale side, we test RL$^2$ on a vision-based navigation task and show that it scales up to high-dimensional problems., 14 pages. Under review as a conference paper at ICLR 2017
Published: 2016

46. FLAG n' FLARE: Fast Linearly-Coupled Adaptive Gradient Methods

Author: Cheng, Xiang, Roosta-Khorasani, Farbod, Palombo, Stefan, Bartlett, Peter L., and Mahoney, Michael W.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Computer Science::Databases, Machine Learning (cs.LG)
Abstract: We consider first order gradient methods for effectively optimizing a composite objective in the form of a sum of smooth and, potentially, non-smooth functions. We present accelerated and adaptive gradient methods, called FLAG and FLARE, which can offer the best of both worlds. They can achieve the optimal convergence rate by attaining the optimal first-order oracle complexity for smooth convex optimization. Additionally, they can adaptively and non-uniformly re-scale the gradient direction to adapt to the limited curvature available and conform to the geometry of the domain. We show theoretically and empirically that, through the compounding effects of acceleration and adaptivity, FLAG and FLARE can be highly effective for many data fitting and machine learning applications.
Published: 2016

47. Bounding Embeddings of VC Classes into Maximum Classes

Author: Rubinstein, J. Hyam, Rubinstein, Benjamin I. P., and Bartlett, Peter L.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, FOS: Mathematics, Mathematics - Combinatorics, Machine Learning (stat.ML), Combinatorics (math.CO), Machine Learning (cs.LG)
Abstract: One of the earliest conjectures in computational learning theory-the Sample Compression conjecture-asserts that concept classes (equivalently set systems) admit compression schemes of size linear in their VC dimension. To-date this statement is known to be true for maximum classes---those that possess maximum cardinality for their VC dimension. The most promising approach to positively resolving the conjecture is by embedding general VC classes into maximum classes without super-linear increase to their VC dimensions, as such embeddings would extend the known compression schemes to all VC classes. We show that maximum classes can be characterised by a local-connectivity property of the graph obtained by viewing the class as a cubical complex. This geometric characterisation of maximum VC classes is applied to prove a negative embedding result which demonstrates VC-d classes that cannot be embedded in any maximum class of VC dimension lower than 2d. On the other hand, we show that every VC-d class C embeds in a VC-(d+D) maximum class where D is the deficiency of C, i.e., the difference between the cardinalities of a maximum VC-d class and of C. For VC-2 classes in binary n-cubes for 4, 22 pages, 2 figures
Published: 2014
Full Text: View/download PDF

48. Online Learning in Markov Decision Processes with Adversarially Chosen Transition Probability Distributions

Author: Abbasi-Yadkori, Yasin, Bartlett, Peter L., and Szepesvari, Csaba
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We study the problem of learning Markov decision processes with finite state and action spaces when the transition probability distributions and loss functions are chosen adversarially and are allowed to change with time. We introduce an algorithm whose regret with respect to any policy in a comparison class grows as the square root of the number of rounds of the game, provided the transition probabilities satisfy a uniform mixing condition. Our approach is efficient as long as the comparison class is polynomial and we can compute expectations over sample paths for each policy. Designing an efficient algorithm with small regret for the general case remains an open problem.
Published: 2013

49. Oracle inequalities for computationally adaptive model selection

Author: Agarwal, Alekh, Bartlett, Peter L., and Duchi, John C.
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We analyze general model selection procedures using penalized empirical loss minimization under computational constraints. While classical model selection approaches do not consider computational aspects of performing model selection, we argue that any practical model selection procedure must not only trade off estimation and approximation error, but also the computational effort required to compute empirical minimizers for different function classes. We provide a framework for analyzing such problems, and we give algorithms for model selection under a computational budget. These algorithms satisfy oracle inequalities that show that the risk of the selected model is not much worse than if we had devoted all of our omputational budget to the optimal function class.
Published: 2012

50. REGAL: A Regularization based Algorithm for Reinforcement Learning in Weakly Communicating MDPs

Author: Bartlett, Peter L. and Tewari, Ambuj
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Machine Learning (cs.LG)
Abstract: We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with S states and A actions whose optimal bias vector has span bounded by H, we show a regret bound of ~O(HSpAT). We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds., Comment: Appears in Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence (UAI2009)
Published: 2012
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

53 results on '"Bartlett, Peter L."'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources