Search for author "Telgarsky, Matus" returned 142 results.


Search Results

1. Benefits of Early Stopping in Gradient Descent for Overparameterized Logistic Regression

2. One-layer transformers fail to solve the induction heads task

3. Spectrum Extraction and Clipping for Implicitly Linear Layers

4. Large Stepsize Gradient Descent for Logistic Loss: Non-Monotonicity of the Loss Improves Optimization Efficiency

5. Transformers, parallel computation, and logarithmic depth

6. On Achieving Optimal Adversarial Test Error

7. Representational Strengths and Limitations of Transformers

8. Feature selection with gradient descent on two-layer networks in low-rotation regimes

9. Convex Analysis at Infinity: An Introduction to Astral Space

10. Stochastic linear optimization never overfits with quadratically-bounded losses on general data

11. Actor-critic is implicitly biased towards high entropy optimal policies

12. Fast Margin Maximization via Dual Acceleration

13. Early-stopped neural networks are consistent

14. Generalization bounds via distillation

15. Gradient descent follows the regularization path for general losses

16. Directional convergence and alignment in deep learning

17. Neural tangent kernels, transportation mappings, and universal approximation

18. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow ReLU networks

19. Approximation power of random neural networks

20. Characterizing the implicit bias via a primal-dual analysis

21. A gradual, semi-discrete approach to generative network training via explicit Wasserstein minimization

22. Size-Noise Tradeoffs in Generative Networks

23. Gradient descent aligns the layers of deep linear networks

24. Risk and parameter convergence of logistic regression

25. Social welfare and profit maximization from revealed preferences

26. Spectrally-normalized margin bounds for neural networks

27. Neural networks and rational functions

28. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis

29. Greedy bi-criteria approximations for $k$-medians and $k$-means

30. Benefits of depth in neural networks

31. Rate of Price Discovery in Iterative Combinatorial Auctions

32. Representation Benefits of Deep Feedforward Networks

33. Convex Risk Minimization and Conditional Probability Estimation

34. Scalable Nonlinear Learning with Adaptive Polynomial Expansions

35. Moment-based Uniform Deviation Bounds for $k$-means and Friends

36. Boosting with the Logistic Loss is Consistent

37. Margins, Shrinkage, and Boosting

38. Dirichlet draws are sparse with high probability

39. Tensor decompositions for learning latent variable models

40. Agglomerative Bregman Clustering

41. Statistical Consistency of Finite-dimensional Unregularized Linear Classification

42. Blackwell Approachability and Minimax Theory

43. A Primal-Dual Convergence Analysis of Boosting

44. Central Binomial Tail Bounds

45. Social Welfare and Profit Maximization from Revealed Preferences

46. Tensor Decompositions for Learning Latent Variable Models

47. Tensor Decompositions for Learning Latent Variable Models (A Survey for ALT)

49. Duality and Data Dependence in Boosting
