17 results for "Bartlett, Peter L."
Search Results
2. Benign overfitting in ridge regression.
- Author
- Tsigler, Alexander and Bartlett, Peter L.
- Subjects
- INDEPENDENT component analysis; DEEP learning; NOISE
- Abstract
In many modern applications of deep learning the neural network has many more parameters than the data points used for its training. Motivated by those practices, a large body of recent theoretical research has been devoted to studying overparameterized models. One of the central phenomena in this regime is the ability of the model to interpolate noisy data, but still have test error lower than the amount of noise in that data. Bartlett et al. (2020) characterized for which covariance structure of the data such a phenomenon can happen in linear regression if one considers the interpolating solution with minimum ℓ2-norm and the data has independent components: they gave a sharp bound on the variance term and showed that it can be small if and only if the data covariance has high effective rank in a subspace of small co-dimension. We strengthen and complete their results by eliminating the independence assumption and providing sharp bounds for the bias term. Thus, our results apply in a much more general setting than those of Bartlett et al. (2020), e.g., kernel regression, and not only characterize how the noise is damped but also which part of the true signal is learned. Moreover, we extend the result to the setting of ridge regression, which allows us to explain another interesting phenomenon: we give general sufficient conditions under which the optimal regularization is negative. [ABSTRACT FROM AUTHOR]
- Published
- 2023
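As an illustrative aside (not code from the paper), the two estimators this abstract compares can be sketched in a few lines: the minimum-ℓ2-norm interpolant of the training data, and its ridge-regularized counterpart. All dimensions and variable names below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 500                      # overparameterized regime: d >> n
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:5] = 1.0
y = X @ w_true + 0.1 * rng.normal(size=n)   # noisy labels

# Minimum-l2-norm interpolating solution: w = X^T (X X^T)^{-1} y
w_min = X.T @ np.linalg.solve(X @ X.T, y)

# Ridge solution; as lam -> 0 this recovers the interpolant above
lam = 1e-3
w_ridge = X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y)

train_resid = np.linalg.norm(X @ w_min - y)  # interpolation: essentially 0
```

The kernel-trick form used here (solving an n-by-n system rather than a d-by-d one) is what makes the abstract's extension to kernel regression natural.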
3. Failures of Model-dependent Generalization Bounds for Least-norm Interpolation.
- Author
- Bartlett, Peter L. and Long, Philip M.
- Subjects
- STATISTICAL learning; GENERALIZATION; INTERPOLATION; MACHINE learning
- Abstract
We consider bounds on the generalization performance of the least-norm linear regressor, in the over-parameterized regime where it can interpolate the data. We describe a sense in which any generalization bound of a type that is commonly proved in statistical learning theory must sometimes be very loose when applied to analyze the least-norm interpolant. In particular, for a variety of natural joint distributions on training examples, any valid generalization bound that depends only on the output of the learning algorithm, the number of training examples, and the confidence parameter, and that satisfies a mild condition (substantially weaker than monotonicity in sample size), must sometimes be very loose: it can be bounded below by a constant when the true excess risk goes to zero. [ABSTRACT FROM AUTHOR]
- Published
- 2021
4. Nearly-tight VC-dimension and Pseudodimension Bounds for Piecewise Linear Neural Networks.
- Author
- Bartlett, Peter L., Harvey, Nick, Liaw, Christopher, and Mehrabian, Abbas
- Subjects
- ARTIFICIAL neural networks; STATISTICAL learning
- Abstract
We prove new upper and lower bounds on the VC-dimension of deep neural networks with the ReLU activation function. These bounds are tight for almost the entire range of parameters. Letting W be the number of weights and L be the number of layers, we prove that the VC-dimension is O(WL log(W)), and provide examples with VC-dimension Ω(WL log(W/L)). This improves both the previously known upper bounds and lower bounds. In terms of the number U of non-linear units, we prove a tight bound O(WU) on the VC-dimension. All of these bounds generalize to arbitrary piecewise linear activation functions, and also hold for the pseudodimensions of these function classes. Combined with previous results, this gives an intriguing range of dependencies of the VC-dimension on depth for networks with different non-linearities: there is no dependence for piecewise-constant, linear dependence for piecewise-linear, and no more than quadratic dependence for general piecewise-polynomial. [ABSTRACT FROM AUTHOR]
- Published
- 2019
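To make the O(WL log W) upper bound concrete, here is a worked arithmetic sketch for a small fully connected ReLU net; the `mlp_weight_count` helper and the layer widths are hypothetical, chosen only for illustration.

```python
import math

def mlp_weight_count(layer_widths):
    """Number of parameters (weights plus biases) in a fully connected net."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_widths, layer_widths[1:]))

widths = [10, 64, 64, 1]            # input, two hidden layers, output
W = mlp_weight_count(widths)        # (10+1)*64 + (64+1)*64 + (64+1)*1 = 4929
L = len(widths) - 1                 # number of weight layers: 3
vc_upper_order = W * L * math.log(W)   # order of the O(WL log W) upper bound
```

The constant hidden by the O(·) is not shown here; the point is only how the bound scales with the parameter count and depth.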
5. Classification with a Reject Option using a Hinge Loss.
- Author
- Bartlett, Peter L. and Wegkamp, Marten H.
- Subjects
- CLASSIFICATION; COST; MATHEMATICAL optimization; SUPPORT vector machines; CONVEX functions
- Abstract
We consider the problem of binary classification where the classifier can, for a particular cost, choose not to classify an observation. Just as in the conventional classification problem, minimization of the sample average of the cost is a difficult optimization problem. As an alternative, we propose the optimization of a certain convex loss function φ, analogous to the hinge loss used in support vector machines (SVMs). Its convexity ensures that the sample average of this surrogate loss can be efficiently minimized. We study its statistical properties. We show that minimizing the expected surrogate loss--the φ-risk--also minimizes the risk. We also study the rate at which the φ-risk approaches its minimum value. We show that fast rates are possible when the conditional probability P(Y = 1|X) is unlikely to be close to certain critical values. [ABSTRACT FROM AUTHOR]
- Published
- 2008
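As background for this abstract (a sketch of the underlying risk, not the paper's surrogate loss φ): when rejection costs d, the pointwise Bayes-optimal rule (Chow's rule) predicts the more likely label unless both labels are risky, so the optimal pointwise risk is min(d, η, 1 − η), where η = P(Y = 1 | X). Rejection only ever helps near η = 1/2, the "critical value" the abstract refers to.

```python
import numpy as np

def bayes_risk_with_reject(eta, d):
    """Pointwise optimal risk with reject cost d (Chow's rule):
    pick the cheapest of predicting +1, predicting -1, or rejecting."""
    eta = np.asarray(eta, dtype=float)
    return np.minimum(d, np.minimum(eta, 1.0 - eta))

# Rejection is only used for d < 1/2, and only where eta is near 1/2.
etas = np.linspace(0.0, 1.0, 101)
risk = bayes_risk_with_reject(etas, d=0.2)
```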
6. AdaBoost is Consistent.
- Author
- Bartlett, Peter L. and Traskin, Mikhail
- Subjects
- ALGORITHMS; RISK; ERRORS; ALGEBRA; FOUNDATIONS of arithmetic
- Abstract
The risk, or probability of error, of the classifier produced by the AdaBoost algorithm is investigated. In particular, we consider the stopping strategy to be used in AdaBoost to achieve universal consistency. We show that, provided AdaBoost is stopped after n^(1-ε) iterations, for sample size n and ε ∈ (0, 1), the sequence of risks of the classifiers it produces approaches the Bayes risk. [ABSTRACT FROM AUTHOR]
- Published
- 2007
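A minimal AdaBoost sketch with 1-D threshold stumps, stopped after roughly n^(1-ε) rounds as in the abstract. Everything here (the data distribution, the stump learner, the round cap) is illustrative, not the paper's construction.

```python
import numpy as np

def stump_predict(x, thr, sign):
    return sign * np.where(x > thr, 1.0, -1.0)

def adaboost(x, y, n_rounds):
    """Minimal AdaBoost over exhaustively searched threshold stumps."""
    n = len(x)
    w = np.full(n, 1.0 / n)
    ensemble = []
    thresholds = np.unique(x)
    for _ in range(n_rounds):
        best = None
        for thr in thresholds:
            for sign in (1.0, -1.0):
                err = w[stump_predict(x, thr, sign) != y].sum()
                if best is None or err < best[0]:
                    best = (err, thr, sign)
        err, thr, sign = best
        err = min(max(err, 1e-12), 1 - 1e-12)      # numerical guard
        alpha = 0.5 * np.log((1 - err) / err)
        w *= np.exp(-alpha * y * stump_predict(x, thr, sign))
        w /= w.sum()
        ensemble.append((alpha, thr, sign))
    return ensemble

def predict(ensemble, x):
    score = sum(a * stump_predict(x, t, s) for a, t, s in ensemble)
    return np.sign(score)

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, size=n)
y = np.where(np.abs(x) > 0.5, 1.0, -1.0)      # not learnable by one stump
T = int(n ** 0.9)                             # stop after ~n^(1-eps) rounds
model = adaboost(x, y, n_rounds=min(T, 50))   # capped here for speed
train_acc = np.mean(predict(model, x) == y)
```

The consistency result in the paper concerns the test risk under this kind of early stopping; the sketch only shows the mechanics of the stopped algorithm.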
7. On the Consistency of Multiclass Classification Methods.
- Author
- Tewari, Ambuj and Bartlett, Peter L.
- Subjects
- BINARY number system; CLASSIFICATION; BAYESIAN analysis; PROBABILITY theory; MACHINE learning
- Abstract
Binary classification is a well studied special case of the classification problem. Statistical properties of binary classifiers, such as consistency, have been investigated in a variety of settings. Binary classification methods can be generalized in many ways to handle multiple classes. It turns out that one can lose consistency in generalizing a binary classification method to deal with multiple classes. We study a rich family of multiclass methods and provide a necessary and sufficient condition for their consistency. We illustrate our approach by applying it to some multiclass methods proposed in the literature. [ABSTRACT FROM AUTHOR]
- Published
- 2007
8. Sparseness vs Estimating Conditional Probabilities: Some Asymptotic Results.
- Author
- Bartlett, Peter L. and Tewari, Ambuj
- Subjects
- SPARSE matrices; PROBABILITY theory; CONDITIONAL expectations; KERNEL functions; MATHEMATICAL functions; ASYMPTOTIC efficiencies
- Abstract
One of the nice properties of kernel classifiers such as SVMs is that they often produce sparse solutions. However, the decision functions of these classifiers cannot always be used to estimate the conditional probability of the class label. We investigate the relationship between these two properties and show that these are intimately related: sparseness does not occur when the conditional probabilities can be unambiguously estimated. We consider a family of convex loss functions and derive sharp asymptotic results for the fraction of data that becomes support vectors. This enables us to characterize the exact trade-off between sparseness and the ability to estimate conditional probabilities for these loss functions. [ABSTRACT FROM AUTHOR]
- Published
- 2007
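The trade-off in this abstract can be seen numerically from the pointwise minimizers of the conditional φ-risk (a standard fact, sketched here with a grid search): the hinge loss minimizer is sign(2η − 1), which destroys η, while the logistic loss minimizer is log(η/(1 − η)), from which η can be recovered.

```python
import numpy as np

def pointwise_minimizer(loss, eta, grid):
    """argmin_f  eta*loss(f) + (1-eta)*loss(-f), searched over a grid."""
    risks = eta * loss(grid) + (1 - eta) * loss(-grid)
    return grid[np.argmin(risks)]

hinge = lambda f: np.maximum(0.0, 1.0 - f)
logistic = lambda f: np.log1p(np.exp(-f))

grid = np.linspace(-5, 5, 2001)
eta = 0.8
f_hinge = pointwise_minimizer(hinge, eta, grid)      # ~ 1: eta is lost
f_logit = pointwise_minimizer(logistic, eta, grid)   # ~ log(eta/(1-eta))
eta_recovered = 1.0 / (1.0 + np.exp(-f_logit))       # invert the link: ~ 0.8
```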
9. Variance Reduction Techniques for Gradient Estimates in Reinforcement Learning.
- Author
- Greensmith, Evan, Bartlett, Peter L., and Baxter, Jonathan
- Subjects
- REINFORCEMENT learning; VARIANCES; ESTIMATES; ALGORITHMS; MACHINE learning
- Abstract
Policy gradient methods for reinforcement learning avoid some of the undesirable properties of the value function approaches, such as policy degradation (Baxter and Bartlett, 2001). However, the variance of the performance gradient estimates obtained from the simulation is sometimes excessive. In this paper, we consider variance reduction methods that were developed for Monte Carlo estimates of integrals. We study two commonly used policy gradient techniques, the baseline and actor-critic methods, from this perspective. Both can be interpreted as additive control variate variance reduction methods. We consider the expected average reward performance measure, and we focus on the GPOMDP algorithm for estimating performance gradients in partially observable Markov decision processes controlled by stochastic reactive policies. We give bounds for the estimation error of the gradient estimates for both baseline and actor-critic algorithms, in terms of the sample size and mixing properties of the controlled system. For the baseline technique, we compute the optimal baseline, and show that the popular approach of using the average reward to define the baseline can be suboptimal. For actor-critic algorithms, we show that using the true value function as the critic can be suboptimal. We also discuss algorithms for estimating the optimal baseline and approximate value function. [ABSTRACT FROM AUTHOR]
- Published
- 2004
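A minimal sketch of the control-variate view described in this abstract, on a one-parameter Gaussian "policy" with a quadratic reward (a toy stand-in, not the GPOMDP setting of the paper): subtracting a baseline from the reward leaves the score-function gradient estimate unbiased but reduces its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Policy a ~ N(theta, 1); reward r(a) = -(a - 3)^2.
# Score-function gradient: d/dtheta E[r] = E[(r(a) - b) * (a - theta)],
# which is unbiased for any constant baseline b.
theta, n = 0.0, 100_000
a = theta + rng.normal(size=n)
r = -(a - 3.0) ** 2
score = a - theta

g_no_baseline = r * score
g_baseline = (r - r.mean()) * score   # average reward as the baseline

var_no_b = g_no_baseline.var()
var_b = g_baseline.var()              # strictly smaller here
grad_est = g_baseline.mean()          # true gradient is 2*(3 - theta) = 6
```

The paper shows the average-reward baseline used here can itself be suboptimal; the optimal baseline weights rewards by squared score.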
10. Rademacher and Gaussian Complexities: Risk Bounds and Structural Results.
- Author
- Bartlett, Peter L. and Mendelson, Shahar
- Subjects
- ERROR analysis in mathematics; COMPUTATIONAL learning theory
- Abstract
We investigate the use of certain data-dependent estimates of the complexity of a function class, called Rademacher and Gaussian complexities. In a decision theoretic setting, we prove general risk bounds in terms of these complexities. We consider function classes that can be expressed as combinations of functions from basis classes and show how the Rademacher and Gaussian complexities of such a function class can be bounded in terms of the complexity of the basis classes. We give examples of the application of these techniques in finding data-dependent risk bounds for decision trees, neural networks and support vector machines. [ABSTRACT FROM AUTHOR]
- Published
- 2003
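The data-dependence that makes these complexities attractive is easy to see for a linear class: for {x ↦ ⟨w, x⟩ : ‖w‖₂ ≤ B}, the sup over the ball has a closed form, so the empirical Rademacher complexity can be estimated directly from the sample (a standard computation, sketched with hypothetical data):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, B = 200, 10, 1.0
X = rng.normal(size=(n, d))

def empirical_rademacher(X, B, n_draws=2000, rng=rng):
    """Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(x_i)
    for the l2-bounded linear class; the sup equals (B/n)*||sum sigma_i x_i||."""
    n = X.shape[0]
    sigma = rng.choice([-1.0, 1.0], size=(n_draws, n))
    return B / n * np.linalg.norm(sigma @ X, axis=1).mean()

R_hat = empirical_rademacher(X, B)
# Standard (Jensen) upper bound: R_hat <= (B/n) * sqrt(sum_i ||x_i||^2)
upper = B / n * np.sqrt((X ** 2).sum())
```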
11. Random Feature Amplification: Feature Learning and Generalization in Neural Networks.
- Author
- Frei, Spencer, Chatterji, Niladri S., and Bartlett, Peter L.
- Subjects
- GENERALIZATION; GENERATING functions
- Abstract
In this work, we provide a characterization of the feature-learning process in two-layer ReLU networks trained by gradient descent on the logistic loss following random initialization. We consider data with binary labels that are generated by an XOR-like function of the input features. We permit a constant fraction of the training labels to be corrupted by an adversary. We show that, although linear classifiers are no better than random guessing for the distribution we consider, two-layer ReLU networks trained by gradient descent achieve generalization error close to the label noise rate. We develop a novel proof technique that shows that at initialization, the vast majority of neurons function as random features that are only weakly correlated with useful features, and the gradient descent dynamics 'amplify' these weak, random features to strong, useful features. [ABSTRACT FROM AUTHOR]
- Published
- 2023
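A tiny illustration of the separation the abstract describes (not the paper's trained network): on XOR-like labels, the best linear fit is essentially trivial, while a small hand-built two-layer ReLU network computes the labels exactly via |x1 + x2| − 1. The data and network here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
X = rng.choice([-1.0, 1.0], size=(n, 2))
y = X[:, 0] * X[:, 1]               # XOR-like label: product of the features

# Least-squares linear fit (with intercept) has near-zero coefficients,
# since each feature and the constant are uncorrelated with y.
A = np.hstack([X, np.ones((n, 1))])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# A fixed two-layer ReLU network computing XOR on {-1, 1}^2:
# |x1 + x2| - 1 = relu(x1 + x2) + relu(-x1 - x2) - 1.
def two_layer_relu(X):
    s = X[:, 0] + X[:, 1]
    return np.maximum(s, 0.0) + np.maximum(-s, 0.0) - 1.0

relu_acc = np.mean(np.sign(two_layer_relu(X)) == y)   # exact on this data
```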
12. The Interplay Between Implicit Bias and Benign Overfitting in Two-Layer Linear Networks.
- Author
- Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
- Subjects
- IMPLICIT bias; ARTIFICIAL neural networks; COVARIANCE matrices; STATISTICAL models
- Abstract
The recent success of neural network models has shone light on a rather surprising statistical phenomenon: statistical models that perfectly fit noisy data can generalize well to unseen test data. Understanding this phenomenon of benign overfitting has attracted intense theoretical and empirical study. In this paper, we consider interpolating two-layer linear neural networks trained with gradient flow on the squared loss and derive bounds on the excess risk when the covariates satisfy sub-Gaussianity and anti-concentration properties, and the noise is independent and sub-Gaussian. By leveraging recent results that characterize the implicit bias of this estimator, our bounds emphasize the role of both the quality of the initialization as well as the properties of the data covariance matrix in achieving low excess risk. [ABSTRACT FROM AUTHOR]
- Published
- 2022
13. When Does Gradient Descent with Logistic Loss Find Interpolating Two-layer Networks?
- Author
- Chatterji, Niladri S., Long, Philip M., and Bartlett, Peter L.
- Subjects
- DEEP learning; CONJUGATE gradient methods
- Abstract
We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies. [ABSTRACT FROM AUTHOR]
- Published
- 2021
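A simplified sketch of the "loss driven to zero" phenomenon, using a plain linear model on separable data rather than the paper's two-layer smoothed ReLU network: gradient descent on the logistic loss monotonically decreases the training loss and ends up classifying every training point correctly. All data and constants are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
w_star = rng.normal(size=d)
margin = X @ w_star
keep = np.abs(margin) > 0.5          # enforce a separation margin
X, y = X[keep], np.sign(margin[keep])

def logistic_loss(w):
    return np.logaddexp(0.0, -y * (X @ w)).mean()

w = np.zeros(d)
losses = [logistic_loss(w)]
eta = 0.5
for _ in range(2000):
    ym = y * (X @ w)
    p = np.exp(-np.logaddexp(0.0, ym))      # sigmoid(-ym), overflow-safe
    w += eta * (X.T @ (y * p)) / len(y)     # gradient descent step
    losses.append(logistic_loss(w))

train_err = np.mean(np.sign(X @ w) != y)
```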
14. An Efficient Sampling Algorithm for Non-smooth Composite Potentials.
- Author
- Mou, Wenlong, Flammarion, Nicolas, Wainwright, Martin J., and Bartlett, Peter L.
- Subjects
- ISOPERIMETRIC inequalities; MARKOV chain Monte Carlo; SMOOTHNESS of functions
- Abstract
We consider the problem of sampling from a density of the form p(x) ∝ exp(−f(x) − g(x)), where f: R^d → R is a smooth function and g: R^d → R is a convex and Lipschitz function. We propose a new algorithm based on the Metropolis-Hastings framework. Under certain isoperimetric inequalities on the target density, we prove that the algorithm mixes to within total variation (TV) distance ε of the target density in at most O(d log(d/ε)) iterations. This guarantee extends previous results on sampling from distributions with smooth log densities (g = 0) to the more general composite non-smooth case, with the same mixing time up to a multiple of the condition number. Our method is based on a novel proximal-based proposal distribution that can be efficiently computed for a large class of non-smooth functions g. Simulation results on posterior sampling problems that arise from the Bayesian Lasso show empirical advantage over previous proposal distributions. [ABSTRACT FROM AUTHOR]
- Published
- 2022
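The proposal distribution in this abstract is proximal-based; while its exact construction is in the paper, the proximal operator itself is standard. For the Bayesian Lasso case g(x) = λ‖x‖₁, it is the familiar soft-thresholding map (sketched here with illustrative values):

```python
import numpy as np

def prox_l1(x, step, lam):
    """Proximal operator of g(x) = lam*||x||_1 with step size `step`:
    argmin_z  lam*||z||_1 + ||z - x||^2 / (2*step), i.e. soft-thresholding."""
    t = lam * step
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

x = np.array([-2.0, -0.3, 0.0, 0.3, 2.0])
z = prox_l1(x, step=1.0, lam=0.5)    # entries inside [-0.5, 0.5] are zeroed
```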
15. High-Order Langevin Diffusion Yields an Accelerated MCMC Algorithm.
- Author
- Mou, Wenlong, Ma, Yi-An, Wainwright, Martin J., Bartlett, Peter L., and Jordan, Michael I.
- Subjects
- MARKOV chain Monte Carlo; ALGORITHMS
- Abstract
We propose a Markov chain Monte Carlo (MCMC) algorithm based on third-order Langevin dynamics for sampling from distributions with smooth, log-concave densities. The higher-order dynamics allow for more flexible discretization schemes, and we develop a specific method that combines splitting with more accurate integration. For a broad class of d-dimensional distributions arising from generalized linear models, we prove that the resulting third-order algorithm produces samples from a distribution that is at most ε > 0 in Wasserstein distance from the target distribution, with an explicit bound on the number of steps. This result requires only Lipschitz conditions on the gradient. For general strongly convex potentials with higher-order smoothness, we prove a corresponding bound on how the mixing time scales. [ABSTRACT FROM AUTHOR]
- Published
- 2021
16. Derivative-Free Methods for Policy Optimization: Guarantees for Linear Quadratic Systems.
- Author
- Malik, Dhruv, Pananjady, Ashwin, Bhatia, Kush, Khamaru, Koulik, Bartlett, Peter L., and Wainwright, Martin J.
- Subjects
- LINEAR systems; RANDOM noise theory; STOCHASTIC convergence; SURETYSHIP & guaranty; PROCESS optimization
- Abstract
We study derivative-free methods for policy optimization over the class of linear policies. We focus on characterizing the convergence rate of these methods when applied to linear-quadratic systems, and study various settings of driving noise and reward feedback. Our main theoretical result provides an explicit bound on the sample or evaluation complexity: we show that these methods are guaranteed to converge to within any pre-specified tolerance of the optimal policy with a number of zero-order evaluations that is an explicit polynomial of the error tolerance, dimension, and curvature properties of the problem. Our analysis reveals some interesting differences between the settings of additive driving noise and random initialization, as well as the settings of one-point and two-point reward feedback. Our theory is corroborated by simulations of derivative-free methods in application to these systems. Along the way, we derive convergence rates for stochastic zero-order optimization algorithms when applied to a certain class of non-convex problems. [ABSTRACT FROM AUTHOR]
- Published
- 2020
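A sketch of the kind of two-point zero-order gradient estimator studied in this line of work (checked on a toy quadratic, not the paper's LQR setting): perturb along a random direction u, difference two function evaluations, and scale by the dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_point_grad(f, x, delta, rng):
    """Two-point zero-order gradient estimate:
    g = d * (f(x + delta*u) - f(x - delta*u)) / (2*delta) * u,
    with u uniform on the unit sphere in R^d (so E[u u^T] = I/d)."""
    d = x.size
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    return d * (f(x + delta * u) - f(x - delta * u)) / (2 * delta) * u

# Sanity check on f(x) = ||x||^2 / 2, whose gradient is x itself:
f = lambda x: 0.5 * x @ x
x0 = np.array([1.0, -2.0, 3.0])
g = np.mean([two_point_grad(f, x0, 1e-3, rng) for _ in range(20000)], axis=0)
```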
17. Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks.
- Author
- Collins, Michael, Globerson, Amir, Koo, Terry, Carreras, Xavier, and Bartlett, Peter L.
- Subjects
- LINEAR statistical models; MACHINE learning; CONJUGATE gradient methods; ALGORITHMS; LOG-linear models
- Abstract
Log-linear and maximum-margin models are two commonly-used methods in supervised machine learning, and are frequently used in structured prediction problems. Efficient learning of parameters in these models is therefore an important problem, and becomes a key factor when learning from very large data sets. This paper describes exponentiated gradient (EG) algorithms for training such models, where EG updates are applied to the convex dual of either the log-linear or max-margin objective function; the dual in both the log-linear and max-margin cases corresponds to minimizing a convex function with simplex constraints. We study both batch and online variants of the algorithm, and provide rates of convergence for both cases. In the max-margin case, O(1/ε) EG updates are required to reach a given accuracy ε in the dual; in contrast, for log-linear models only O(log(1/ε)) updates are required. For both the max-margin and log-linear cases, our bounds suggest that the online EG algorithm requires a factor of n less computation to reach a desired accuracy than the batch EG algorithm, where n is the number of training examples. Our experiments confirm that the online algorithms are much faster than the batch algorithms in practice. We describe how the EG updates factor in a convenient way for structured prediction problems, allowing the algorithms to be efficiently applied to problems such as sequence learning or natural language parsing. We perform extensive evaluation of the algorithms, comparing them to L-BFGS and stochastic gradient descent for log-linear models, and to SVM-Struct for max-margin models. The algorithms are applied to a multi-class problem as well as to a more complex large-scale parsing task. In all these settings, the EG algorithms presented here outperform the other methods. [ABSTRACT FROM AUTHOR]
- Published
- 2008
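The basic EG update on the simplex that underlies these algorithms is short enough to sketch (the toy linear objective below is illustrative, not the paper's structured dual): multiply each coordinate by exp(−η · gradient) and renormalize, which keeps the iterate a probability vector by construction.

```python
import numpy as np

def eg_minimize(grad, w0, eta, n_steps):
    """Exponentiated gradient (mirror descent with the entropy regularizer)
    on the probability simplex: w_i <- w_i * exp(-eta * g_i), renormalized."""
    w = w0.copy()
    for _ in range(n_steps):
        w = w * np.exp(-eta * grad(w))
        w /= w.sum()
    return w

# Toy convex objective over the simplex: f(w) = <c, w>; the minimizer puts
# all mass on the coordinate with the smallest cost.
c = np.array([0.3, 0.1, 0.7])
w = eg_minimize(lambda w: c, np.full(3, 1 / 3), eta=0.5, n_steps=200)
```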
Discovery Service for Jio Institute Digital Library