64 results for "Lacoste-Julien, Simon"
Search Results
2. Machine learning in airline crew pairing to construct initial clusters for dynamic constraint aggregation
- Author
-
Yaakoubi, Yassine, Soumis, François, and Lacoste-Julien, Simon
- Published
- 2020
- Full Text
- View/download PDF
3. Promoting Exploration in Memory-Augmented Adam using Critical Momenta
- Author
-
Malviya, Pranshu, Mordido, Gonçalo, Baratin, Aristide, Harikandeh, Reza Babanezhad, Huang, Jerry, Lacoste-Julien, Simon, Pascanu, Razvan, and Chandar, Sarath
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Machine Learning (cs.LG) - Abstract
Adaptive gradient-based optimizers, particularly Adam, have left their mark in training large-scale deep learning models. The strength of such optimizers is that they exhibit fast convergence while being more robust to hyperparameter choice. However, they often generalize worse than non-adaptive methods. Recent studies have tied this performance gap to flat minima selection: adaptive methods tend to find solutions in sharper basins of the loss landscape, which in turn hurts generalization. To overcome this issue, we propose a new memory-augmented version of Adam that promotes exploration towards flatter minima by using a buffer of critical momentum terms during training. Intuitively, the use of the buffer makes the optimizer overshoot outside the basin of attraction if it is not wide enough. We empirically show that our method improves the performance of several variants of Adam on standard supervised language modelling and image classification tasks.
- Published
- 2023
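Below is a minimal sketch of how a buffer of momentum terms could augment the Adam update described in the entry above. The storage rule (keep the most recent bias-corrected momenta and add their scaled average to the update) is an illustrative placeholder for the paper's notion of "critical" momenta, and the function and parameter names (`adam_with_momentum_buffer`, `buffer_scale`) are assumptions, not taken from the source.

```python
import numpy as np

def adam_with_momentum_buffer(grad_fn, x0, steps=1000, lr=1e-3,
                              beta1=0.9, beta2=0.999, eps=1e-8,
                              buffer_size=5, buffer_scale=0.1):
    """Standard Adam whose update is augmented by the scaled average of a
    small buffer of stored first-moment estimates.  The rule used here
    (keep the most recent momenta) is a stand-in for the paper's
    criterion for 'critical' momenta."""
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    v = np.zeros_like(x)
    buffer = []                                  # hypothetical momentum buffer
    for t in range(1, steps + 1):
        g = grad_fn(x)
        m = beta1 * m + (1 - beta1) * g          # first moment
        v = beta2 * v + (1 - beta2) * g * g      # second moment
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        update = m_hat / (np.sqrt(v_hat) + eps)
        if buffer:                               # extra push from buffered momenta
            update = update + buffer_scale * np.mean(buffer, axis=0)
        x = x - lr * update
        buffer.append(m_hat.copy())
        if len(buffer) > buffer_size:
            buffer.pop(0)
    return x

# usage: minimize the quadratic ||x - 3||^2
x_min = adam_with_momentum_buffer(lambda x: 2 * (x - 3.0), np.zeros(4))
```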
4. Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?
- Author
-
Knyazev, Boris, Hwang, Doha, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Computer Science - Artificial Intelligence ,Statistics - Machine Learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Pretraining a neural network on a large dataset is becoming a cornerstone in machine learning that is within the reach of only a few communities with large resources. We aim at an ambitious goal of democratizing pretraining. Towards that goal, we train and release a single neural network that can predict high quality ImageNet parameters of other neural networks. By using predicted parameters for initialization, we are able to boost training of diverse ImageNet models available in PyTorch. When transferred to other datasets, models initialized with predicted parameters also converge faster and reach competitive final performance., ICML 2023, camera ready (7 tables with extra results added), code and models are at https://github.com/SamsungSAILMontreal/ghn3
- Published
- 2023
5. Unlocking Slot Attention by Changing Optimal Transport Costs
- Author
-
Zhang, Yan, Zhang, David W., Lacoste-Julien, Simon, Burghouts, Gertjan J., and Snoek, Cees G. M.
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Machine Learning (cs.LG) - Abstract
Slot attention is a powerful method for object-centric modeling in images and videos. However, its set-equivariance limits its ability to handle videos with a dynamic number of objects because it cannot break ties. To overcome this limitation, we first establish a connection between slot attention and optimal transport. Based on this new perspective we propose MESH (Minimize Entropy of Sinkhorn): a cross-attention module that combines the tiebreaking properties of unregularized optimal transport with the speed of regularized optimal transport. We evaluate slot attention using MESH on multiple object-centric learning benchmarks and find significant improvements over slot attention in every setting., Published at International Conference on Machine Learning (ICML) 2023
- Published
- 2023
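The regularized optimal-transport building block mentioned in the entry above can be sketched with plain Sinkhorn iterations that turn a slot-to-feature cost matrix into a balanced attention map. This shows only the entropic-OT ingredient; MESH's additional entropy-minimization step is not reproduced, and the names `sinkhorn_attention` and `epsilon` are illustrative.

```python
import numpy as np

def sinkhorn_attention(cost, n_iters=50, epsilon=0.05):
    """Entropic-regularized optimal transport between slots (rows) and input
    features (columns) via Sinkhorn iterations, returning a coupling with
    uniform marginals that can play the role of cross-attention weights."""
    n_slots, n_feats = cost.shape
    K = np.exp(-cost / epsilon)                 # Gibbs kernel
    a = np.full(n_slots, 1.0 / n_slots)         # uniform slot marginal
    b = np.full(n_feats, 1.0 / n_feats)         # uniform feature marginal
    u = np.ones(n_slots)
    v = np.ones(n_feats)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return np.diag(u) @ K @ np.diag(v)          # transport plan / attention map

# usage: random slot-to-feature cost matrix
plan = sinkhorn_attention(np.random.default_rng(0).random((4, 16)))
```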
6. Controlled Sparsity via Constrained Optimization or: How I Learned to Stop Tuning Penalties and Love Constraints
- Author
-
Gallego-Posada, Jose, Ramirez, Juan, Erraqabi, Akram, Bengio, Yoshua, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Machine Learning (cs.LG) - Abstract
The performance of trained neural networks is robust to harsh levels of pruning. Coupled with the ever-growing size of deep learning models, this observation has motivated extensive research on learning sparse models. In this work, we focus on the task of controlling the level of sparsity when performing sparse learning. Existing methods based on sparsity-inducing penalties involve expensive trial-and-error tuning of the penalty factor, thus lacking direct control of the resulting model sparsity. In response, we adopt a constrained formulation: using the gate mechanism proposed by Louizos et al. (2018), we formulate a constrained optimization problem where sparsification is guided by the training objective and the desired sparsity target in an end-to-end fashion. Experiments on CIFAR-{10, 100}, TinyImageNet, and ImageNet using WideResNet and ResNet{18, 50} models validate the effectiveness of our proposal and demonstrate that we can reliably achieve pre-determined sparsity targets without compromising on predictive performance., NeurIPS 2022 - Code available at https://github.com/gallego-posada/constrained_sparsity
- Published
- 2022
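A toy sketch of the constrained formulation from the entry above, on linear regression: gradient descent on the weights and projected gradient ascent on a Lagrange multiplier that enforces a sparsity-style constraint. The stochastic L0 gates of Louizos et al. (2018) are replaced by a simple |w| proxy purely for illustration, and all names and hyperparameters are assumptions rather than the paper's setup.

```python
import numpy as np

def constrained_sparse_regression(X, y, target=0.2, steps=2000,
                                  lr_w=1e-2, lr_lam=1e-2):
    """Minimize a least-squares loss subject to mean(|w|) <= target by
    gradient descent on w and projected gradient ascent on a Lagrange
    multiplier.  A smooth |w| proxy replaces the stochastic L0 gates of
    Louizos et al. (2018) purely for illustration."""
    n, d = X.shape
    w = np.zeros(d)
    lam = 0.0                                    # Lagrange multiplier >= 0
    for _ in range(steps):
        grad_fit = X.T @ (X @ w - y) / n         # gradient of 0.5 * MSE
        grad_pen = np.sign(w) / d                # gradient of mean(|w|)
        w = w - lr_w * (grad_fit + lam * grad_pen)
        violation = np.mean(np.abs(w)) - target
        lam = max(0.0, lam + lr_lam * violation) # ascent, projected to >= 0
    return w, lam

# usage on synthetic sparse data
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
w_true = np.zeros(50); w_true[:5] = 1.0
w_hat, lam_hat = constrained_sparse_regression(X, X @ w_true)
```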
7. Data-Efficient Structured Pruning via Submodular Optimization
- Author
-
Halabi, Marwa El, Srinivas, Suraj, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Artificial Intelligence (cs.AI) ,Discrete Mathematics (cs.DM) ,Optimization and Control (math.OC) ,Computer Science - Artificial Intelligence ,FOS: Mathematics ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) ,Computer Science - Discrete Mathematics - Abstract
Structured pruning is an effective approach for compressing large pre-trained neural networks without significantly affecting their performance. However, most current structured pruning methods do not provide any performance guarantees, and often require fine-tuning, which makes them inapplicable in the limited-data regime. We propose a principled data-efficient structured pruning method based on submodular optimization. In particular, for a given layer, we select neurons/channels to prune and corresponding new weights for the next layer that minimize the change in the next layer's input induced by pruning. We show that this selection problem is a weakly submodular maximization problem, thus it can be provably approximated using an efficient greedy algorithm. Our method is guaranteed to have an exponentially decreasing error between the original model and the pruned model outputs w.r.t. the pruned size, under reasonable assumptions. It is also one of the few methods in the literature that uses only a limited number of training data and no labels. Our experimental results demonstrate that our method outperforms state-of-the-art methods in the limited-data regime.
- Published
- 2022
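The entry above describes a greedy selection of neurons plus a reweighting of the next layer. The sketch below mimics that idea at toy scale: greedily keep the activation columns whose least-squares reweighting best reconstructs a stand-in for the next layer's input. The exact objective and guarantees of the paper are not reproduced; the reconstruction target and the function name `greedy_prune` are illustrative.

```python
import numpy as np

def greedy_prune(A, k):
    """Greedily pick k columns (neurons) of the activation matrix A so that a
    least-squares reweighting of the kept columns best reconstructs a stand-in
    for the next layer's input (here, the sum over all neurons)."""
    target = A.sum(axis=1)                       # proxy for next layer's input
    kept = []
    for _ in range(k):
        best_j, best_err = None, np.inf
        for j in range(A.shape[1]):
            if j in kept:
                continue
            cols = A[:, kept + [j]]
            w, *_ = np.linalg.lstsq(cols, target, rcond=None)  # reweighting
            err = np.linalg.norm(cols @ w - target)
            if err < best_err:
                best_err, best_j = err, j
        kept.append(best_j)
    return kept

kept_neurons = greedy_prune(np.random.default_rng(0).random((100, 20)), k=5)
```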
8. Disentanglement via Mechanism Sparsity Regularization: A New Principle for Nonlinear ICA
- Author
-
Lachapelle, Sébastien, López, Pau Rodríguez, Sharma, Yash, Everett, Katie, Priol, Rémi Le, Lacoste, Alexandre, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science::Machine Learning ,Computer Science - Machine Learning ,I.5.1 ,Statistics - Machine Learning ,I.2.6 ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
This work introduces a novel principle we call disentanglement via mechanism sparsity regularization, which can be applied when the latent factors of interest depend sparsely on past latent factors and/or observed auxiliary variables. We propose a representation learning method that induces disentanglement by simultaneously learning the latent factors and the sparse causal graphical model that relates them. We develop a rigorous identifiability theory, building on recent nonlinear independent component analysis (ICA) results, that formalizes this principle and shows how the latent variables can be recovered up to permutation if one regularizes the latent mechanisms to be sparse and if some graph connectivity criterion is satisfied by the data generating process. As a special case of our framework, we show how one can leverage unknown-target interventions on the latent factors to disentangle them, thereby drawing further connections between ICA and causality. We propose a VAE-based method in which the latent mechanisms are learned and regularized via binary masks, and validate our theory by showing it learns disentangled representations in simulations., Appears in: 1st Conference on Causal Learning and Reasoning (CLeaR 2022). 57 pages
- Published
- 2021
9. Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity
- Author
-
Loizou, Nicolas, Berard, Hugo, Gidel, Gauthier, Mitliagkas, Ioannis, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Computer Science - Computer Science and Game Theory ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) ,Computer Science and Game Theory (cs.GT) - Abstract
Two of the most prominent algorithms for solving unconstrained smooth games are the classical stochastic gradient descent-ascent (SGDA) and the recently introduced stochastic consensus optimization (SCO) [Mescheder et al., 2017]. SGDA is known to converge to a stationary point for specific classes of games, but current convergence analyses require a bounded variance assumption. SCO is used successfully for solving large-scale adversarial problems, but its convergence guarantees are limited to its deterministic variant. In this work, we introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO under this condition for solving a class of stochastic variational inequality problems that are potentially non-monotone. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size, and we propose insightful stepsize-switching rules to guarantee convergence to the exact solution. In addition, our convergence guarantees hold under the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching., 35th Conference on Neural Information Processing Systems (NeurIPS 2021)
- Published
- 2021
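A bare-bones sketch of constant step-size stochastic gradient descent-ascent (SGDA), the first method analyzed in the entry above, on a strongly-convex-strongly-concave toy game; Gaussian noise stands in for minibatch sampling, so the iterates settle in a neighborhood of the solution, matching the abstract's description. The step-size and noise values are illustrative.

```python
import numpy as np

def sgda(grad_x, grad_y, x0, y0, step=0.05, n_iters=2000, noise=0.1, seed=0):
    """Constant step-size stochastic gradient descent-ascent on
    min_x max_y f(x, y); Gaussian noise stands in for minibatch sampling,
    so the iterates converge to a neighborhood of the solution."""
    rng = np.random.default_rng(seed)
    x, y = np.array(x0, float), np.array(y0, float)
    for _ in range(n_iters):
        gx = grad_x(x, y) + noise * rng.standard_normal(x.shape)
        gy = grad_y(x, y) + noise * rng.standard_normal(y.shape)
        x = x - step * gx                        # descent on x
        y = y + step * gy                        # ascent on y
    return x, y

# usage: f(x, y) = 0.5 x^2 + x y - 0.5 y^2, solution at the origin
x, y = sgda(lambda x, y: x + y, lambda x, y: x - y, [1.0], [1.0])
```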
10. Repurposing Pretrained Models for Robust Out-of-domain Few-Shot Learning
- Author
-
Kwon, Namyeong, Na, Hwidong, Huang, Gabriel, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer Vision and Pattern Recognition (cs.CV) ,Computer Science - Computer Vision and Pattern Recognition ,Machine Learning (cs.LG) - Abstract
Model-agnostic meta-learning (MAML) is a popular method for few-shot learning but assumes that we have access to the meta-training set. In practice, training on the meta-training set may not always be an option due to data privacy concerns, intellectual property issues, or merely lack of computing resources. In this paper, we consider the novel problem of repurposing pretrained MAML checkpoints to solve new few-shot classification tasks. Because of the potential distribution mismatch, the original MAML steps may no longer be optimal. Therefore, we propose an alternative meta-testing procedure and combine MAML gradient steps with adversarial training and uncertainty-based stepsize adaptation. Our method outperforms "vanilla" MAML on same-domain and cross-domain benchmarks using both SGD and Adam optimizers and shows improved robustness to the choice of base stepsize., Appears in: Proceedings of the Ninth International Conference on Learning Representations (ICLR 2021). 20 pages
- Published
- 2021
11. Online Adversarial Attacks
- Author
-
Mladenovic, Andjela, Bose, Avishek Joey, Berard, Hugo, Hamilton, William L., Lacoste-Julien, Simon, Vincent, Pascal, and Gidel, Gauthier
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Computer Science - Cryptography and Security ,Computer Science - Data Structures and Algorithms ,Data Structures and Algorithms (cs.DS) ,Cryptography and Security (cs.CR) ,Machine Learning (cs.LG) - Abstract
Adversarial attacks expose important vulnerabilities of deep learning models, yet little attention has been paid to settings where data arrives as a stream. In this paper, we formalize the online adversarial attack problem, emphasizing two key elements found in real-world use-cases: attackers must operate under partial knowledge of the target model, and the decisions made by the attacker are irrevocable since they operate on a transient data stream. We first rigorously analyze a deterministic variant of the online threat model by drawing parallels to the well-studied $k$-secretary problem in theoretical computer science and propose Virtual+, a simple yet practical online algorithm. Our main theoretical result shows Virtual+ yields provably the best competitive ratio over all single-threshold algorithms for $k, ICLR 2022
- Published
- 2021
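For context on the $k$-secretary connection drawn in the entry above, here is the classical single-secretary rule that single-threshold algorithms such as Virtual+ generalize: observe a prefix of the stream without committing, then irrevocably accept the first value that beats it. This is the textbook baseline, not Virtual+ itself; the function name is illustrative.

```python
import numpy as np

def classic_secretary(values):
    """Classical 1-secretary rule: observe the first ~n/e values without
    committing, then irrevocably accept the first later value that beats
    everything seen so far (or the last value if none does)."""
    n = len(values)
    cutoff = int(np.ceil(n / np.e))
    best_seen = max(values[:cutoff])
    for v in values[cutoff:]:
        if v > best_seen:
            return v                             # irrevocable acceptance
    return values[-1]                            # forced to take the last one

picked = classic_secretary(list(np.random.default_rng(0).random(20)))
```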
12. Geometry-Aware Universal Mirror-Prox
- Author
-
Babanezhad, Reza and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Machine Learning (cs.LG) - Abstract
Mirror-prox (MP) is a well-known algorithm to solve variational inequality (VI) problems. VI with a monotone operator covers a large group of settings such as convex minimization, min-max or saddle point problems. To get a convergent algorithm, the step-size of the classic MP algorithm relies heavily on problem-dependent knowledge of the operator, such as its smoothness parameter, which is hard to estimate. Recently, a universal variant of MP for smooth/bounded operators has been introduced that depends only on the norm of updates in MP. In this work, we relax this dependence from the norm of the updates to the Bregman divergence between updates. This relaxation allows us to extend the analysis of universal MP to settings where the operator is not smooth or bounded. Furthermore, we analyse the VI problem with a stochastic monotone operator in different settings and obtain an optimal rate up to a logarithmic factor.
- Published
- 2020
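A compact sketch of classical mirror-prox with the negative-entropy mirror map, applied to a matrix game over probability simplices; each prox step becomes a multiplicative update followed by renormalization. A fixed step-size `eta` is used here, whereas the universal variants discussed in the entry above adapt it from observed quantities; the function name and constants are illustrative.

```python
import numpy as np

def entropic_mirror_prox(A, steps=500, eta=0.1):
    """Mirror-prox for the matrix game min_x max_y x^T A y over probability
    simplices, with the negative-entropy mirror map: each prox step is a
    multiplicative update followed by renormalization.  Returns the averaged
    extrapolated iterates."""
    m, n = A.shape
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    x_avg, y_avg = np.zeros(m), np.zeros(n)
    for _ in range(steps):
        # extrapolation step (gradients at the current point)
        x_half = x * np.exp(-eta * (A @ y));    x_half /= x_half.sum()
        y_half = y * np.exp(eta * (A.T @ x));   y_half /= y_half.sum()
        # update step (gradients at the extrapolated point)
        x = x * np.exp(-eta * (A @ y_half));    x /= x.sum()
        y = y * np.exp(eta * (A.T @ x_half));   y /= y.sum()
        x_avg += x_half; y_avg += y_half
    return x_avg / steps, y_avg / steps

x_bar, y_bar = entropic_mirror_prox(np.random.default_rng(0).random((3, 4)))
```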
13. On the Convergence of Continuous Constrained Optimization for Structure Learning
- Author
-
Ng, Ignavier, Lachapelle, Sébastien, Ke, Nan Rosemary, Lacoste-Julien, Simon, and Zhang, Kun
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Recently, structure learning of directed acyclic graphs (DAGs) has been formulated as a continuous optimization problem by leveraging an algebraic characterization of acyclicity. The constrained problem is solved using the augmented Lagrangian method (ALM) which is often preferred to the quadratic penalty method (QPM) by virtue of its standard convergence result that does not require the penalty coefficient to go to infinity, hence avoiding ill-conditioning. However, the convergence properties of these methods for structure learning, including whether they are guaranteed to return a DAG solution, remain unclear, which might limit their practical applications. In this work, we examine the convergence of ALM and QPM for structure learning in the linear, nonlinear, and confounded cases. We show that the standard convergence result of ALM does not hold in these settings, and demonstrate empirically that its behavior is akin to that of the QPM which is prone to ill-conditioning. We further establish the convergence guarantee of QPM to a DAG solution, under mild conditions. Lastly, we connect our theoretical results with existing approaches to help resolve the convergence issue, and verify our findings in light of an empirical comparison of them., AISTATS 2022. A preliminary version of this paper was presented at the NeurIPS 2020 Workshop on Causal Discovery and Causality-Inspired Machine Learning. The code is available at https://github.com/ignavierng/notears-convergence
- Published
- 2020
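The continuous acyclicity constraint underlying the methods compared in the entry above is commonly written as h(W) = tr(exp(W∘W)) − d, which vanishes exactly on DAGs. The sketch below shows that measure, its standard analytic gradient, and a single quadratic-penalty (QPM) gradient step; the outer loop that increases the penalty coefficient is omitted and the function names are illustrative.

```python
import numpy as np
from scipy.linalg import expm

def acyclicity(W):
    """Continuous acyclicity measure h(W) = tr(exp(W * W)) - d (elementwise
    square), which is zero exactly when W encodes a DAG."""
    return np.trace(expm(W * W)) - W.shape[0]

def acyclicity_grad(W):
    """Standard analytic gradient of h: exp(W * W)^T * 2W (elementwise)."""
    return expm(W * W).T * (2 * W)

def qpm_step(grad_loss, W, rho, lr=1e-3):
    """One gradient step of the quadratic penalty method on
    loss(W) + (rho / 2) * h(W)^2; in practice rho is increased over an
    outer loop until h(W) is numerically zero."""
    return W - lr * (grad_loss(W) + rho * acyclicity(W) * acyclicity_grad(W))

W = np.array([[0.0, 1.0], [0.0, 0.0]])          # a DAG on two nodes
print(acyclicity(W))                            # 0.0
```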
14. Affine Invariant Analysis of Frank-Wolfe on Strongly Convex Sets
- Author
-
Kerdreux, Thomas, Liu, Lewis, Lacoste-Julien, Simon, and Scieur, Damien
- Subjects
Optimization and Control (math.OC) ,FOS: Mathematics ,Mathematics - Optimization and Control - Abstract
It is known that the Frank-Wolfe (FW) algorithm, which is affine-covariant, enjoys accelerated convergence rates when the constraint set is strongly convex. However, these results rely on norm-dependent assumptions, usually incurring non-affine invariant bounds, in contradiction with FW's affine-covariant property. In this work, we introduce new structural assumptions on the problem (such as the directional smoothness) and derive an affine invariant, norm-independent analysis of Frank-Wolfe. Based on our analysis, we propose an affine invariant backtracking line-search. Interestingly, we show that typical backtracking line-searches using smoothness of the objective function surprisingly converge to an affine invariant step size, despite using affine-dependent norms in the step size's computation. This indicates that we do not necessarily need to know the set's structure in advance to enjoy the affine-invariant accelerated rate.
- Published
- 2020
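A minimal Frank-Wolfe loop over a Euclidean ball, one of the simplest strongly convex sets, where the linear minimization oracle has a closed form. The standard 2/(t+2) step-size schedule is used instead of the affine-invariant backtracking line-search proposed in the entry above; names and constants are illustrative.

```python
import numpy as np

def frank_wolfe_ball(grad, x0, radius=1.0, steps=200):
    """Frank-Wolfe over the Euclidean ball of the given radius (a strongly
    convex set): the linear minimization oracle is s = -radius * g / ||g||,
    and a standard 2 / (t + 2) step-size schedule is used."""
    x = np.array(x0, float)
    for t in range(steps):
        g = grad(x)
        s = -radius * g / (np.linalg.norm(g) + 1e-12)   # LMO over the ball
        gamma = 2.0 / (t + 2.0)
        x = (1 - gamma) * x + gamma * s
    return x

# usage: minimize ||x - c||^2 over the unit ball (solution: c / ||c||)
c = np.array([2.0, 0.0])
x_opt = frank_wolfe_ball(lambda x: 2 * (x - c), np.zeros(2))
```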
15. Implicit Regularization via Neural Feature Alignment
- Author
-
Baratin, Aristide, George, Thomas, Laurent, César, Hjelm, R Devon, Lajoie, Guillaume, Vincent, Pascal, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
We approach the problem of implicit regularization in deep learning from a geometrical viewpoint. We highlight a regularization effect induced by a dynamical alignment of the neural tangent features introduced by Jacot et al., along a small number of task-relevant directions. This can be interpreted as a combined mechanism of feature selection and compression. By extrapolating a new analysis of Rademacher complexity bounds for linear models, we motivate and study a heuristic complexity measure that captures this phenomenon, in terms of sequences of tangent kernel classes along optimization paths., AISTATS 2021
- Published
- 2020
16. Stochastic Hamiltonian Gradient Methods for Smooth Games
- Author
-
Loizou, Nicolas, Berard, Hugo, Jolicoeur-Martineau, Alexia, Vincent, Pascal, Lacoste-Julien, Simon, and Mitliagkas, Ioannis
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Computer Science - Computer Science and Game Theory ,Statistics - Machine Learning ,FOS: Mathematics ,MathematicsofComputing_NUMERICALANALYSIS ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) ,Computer Science and Game Theory (cs.GT) - Abstract
The success of adversarial formulations in machine learning has brought renewed motivation for smooth games. In this work, we focus on the class of stochastic Hamiltonian methods and provide the first convergence guarantees for certain classes of stochastic smooth games. We propose a novel unbiased estimator for the stochastic Hamiltonian gradient descent (SHGD) and highlight its benefits. Using tools from the optimization literature we show that SHGD converges linearly to the neighbourhood of a stationary point. To guarantee convergence to the exact solution, we analyze SHGD with a decreasing step-size and we also present the first stochastic variance reduced Hamiltonian method. Our results provide the first global non-asymptotic last-iterate convergence guarantees for the class of stochastic unconstrained bilinear games and for the more general class of stochastic games that satisfy a "sufficiently bilinear" condition, notably including some non-convex non-concave problems. We supplement our analysis with experiments on stochastic bilinear and sufficiently bilinear games, where our theory is shown to be tight, and on simple adversarial machine learning formulations., ICML 2020 - Proceedings of the 37th International Conference on Machine Learning
- Published
- 2020
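For the bilinear games mentioned in the entry above, the Hamiltonian is simply the squared norm of the game's vector field, and deterministic Hamiltonian gradient descent reduces to the two matrix-vector updates below. The stochastic and variance-reduced versions analyzed in the paper are not reproduced; the step-size and names are illustrative.

```python
import numpy as np

def hamiltonian_gradient_descent(A, x0, y0, lr=0.1, steps=500):
    """Deterministic Hamiltonian gradient descent for the bilinear game
    min_x max_y x^T A y.  The Hamiltonian H = 0.5 ||A y||^2 + 0.5 ||A^T x||^2
    is the squared norm of the game's vector field, and descending on H
    drives the iterates toward the equilibrium."""
    x, y = np.array(x0, float), np.array(y0, float)
    for _ in range(steps):
        x = x - lr * (A @ (A.T @ x))             # grad_x H
        y = y - lr * (A.T @ (A @ y))             # grad_y H
    return x, y

x_eq, y_eq = hamiltonian_gradient_descent(np.eye(2), [1.0, -1.0], [0.5, 0.5])
```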
17. Differentiable Causal Discovery from Interventional Data
- Author
-
Brouillard, Philippe, Lachapelle, Sébastien, Lacoste, Alexandre, Lacoste-Julien, Simon, and Drouin, Alexandre
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,I.5.1 ,Statistics - Machine Learning ,I.2.6 ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Learning a causal directed acyclic graph from data is a challenging task that involves solving a combinatorial problem for which the solution is not always identifiable. A new line of work reformulates this problem as a continuous constrained optimization one, which is solved via the augmented Lagrangian method. However, most methods based on this idea do not make use of interventional data, which can significantly alleviate identifiability issues. This work constitutes a new step in this direction by proposing a theoretically-grounded method based on neural networks that can leverage interventional data. We illustrate the flexibility of the continuous-constrained framework by taking advantage of expressive neural architectures such as normalizing flows. We show that our approach compares favorably to the state of the art in a variety of settings, including perfect and imperfect interventions for which the targeted nodes may even be unknown., Appears in: Advances in Neural Information Processing Systems 34 (NeurIPS 2020). 46 pages
- Published
- 2020
18. To Each Optimizer a Norm, To Each Norm its Generalization
- Author
-
Vaswani, Sharan, Babanezhad, Reza, Gallego-Posada, Jose, Mishkin, Aaron, Lacoste-Julien, Simon, and Roux, Nicolas Le
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
We study the implicit regularization of optimization methods for linear models interpolating the training data in the under-parametrized and over-parametrized regimes. Since it is difficult to determine whether an optimizer converges to solutions that minimize a known norm, we flip the problem and investigate what is the corresponding norm minimized by an interpolating solution. Using this reasoning, we prove that for over-parameterized linear regression, projections onto linear spans can be used to move between different interpolating solutions. For under-parameterized linear classification, we prove that for any linear classifier separating the data, there exists a family of quadratic norms ||.||_P such that the classifier's direction is the same as that of the maximum P-margin solution. For linear classification, we argue that analyzing convergence to the standard maximum l2-margin is arbitrary and show that minimizing the norm induced by the data results in better generalization. Furthermore, for over-parameterized linear classification, projections onto the data-span enable us to use techniques from the under-parameterized setting. On the empirical side, we propose techniques to bias optimizers towards better generalizing solutions, improving their test performance. We validate our theoretical results via synthetic experiments, and use the neural tangent kernel to handle non-linear models.
- Published
- 2020
19. Adaptive Gradient Methods Converge Faster with Over-Parameterization (but you should do a line-search)
- Author
-
Vaswani, Sharan, Laradji, Issam, Kunstner, Frederik, Meng, Si Yi, Schmidt, Mark, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
Adaptive gradient methods are typically used for training over-parameterized models. To better understand their behaviour, we study a simplistic setting -- smooth, convex losses with models over-parameterized enough to interpolate the data. In this setting, we prove that AMSGrad with constant step-size and momentum converges to the minimizer at a faster $O(1/T)$ rate. When interpolation is only approximately satisfied, constant step-size AMSGrad converges to a neighbourhood of the solution at the same rate, while AdaGrad is robust to the violation of interpolation. However, even for simple convex problems satisfying interpolation, the empirical performance of both methods heavily depends on the step-size and requires tuning, questioning their adaptivity. We alleviate this problem by automatically determining the step-size using stochastic line-search or Polyak step-sizes. With these techniques, we prove that both AdaGrad and AMSGrad retain their convergence guarantees, without needing to know problem-dependent constants. Empirically, we demonstrate that these techniques improve the convergence and generalization of adaptive gradient methods across tasks, from binary classification with kernel mappings to multi-class classification with deep networks.
- Published
- 2020
20. An Analysis of the Adaptation Speed of Causal Models
- Author
-
Priol, Rémi Le, Harikandeh, Reza Babanezhad, Bengio, Yoshua, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Consider a collection of datasets generated by unknown interventions on an unknown structural causal model $G$. Recently, Bengio et al. (2020) conjectured, with promising supporting experiments, that among all candidate models, $G$ is the fastest to adapt from one dataset to another. Indeed, intuitively $G$ has fewer mechanisms to adapt, but this justification is incomplete. Our contribution is a more thorough analysis of this hypothesis. We investigate the adaptation speed of cause-effect SCMs. Using convergence rates from stochastic optimization, we justify that a relevant proxy for adaptation speed is distance in parameter space after intervention. Applying this proxy to categorical and normal cause-effect models, we show two results. When the intervention is on the cause variable, the SCM with the correct causal direction is advantaged by a large factor. When the intervention is on the effect variable, we characterize the relative adaptation speed. Surprisingly, we find situations where the anticausal model is advantaged, falsifying the initial hypothesis. Code to reproduce experiments is available at https://github.com/remilepriol/causal-adaptation-speed, Published at AISTATS 2021. 10 pages main articles, 19 pages supplement, 10 figures
- Published
- 2020
21. GAIT: A Geometric Approach to Information Theory
- Author
-
Gallego-Posada, Jose, Vani, Ankit, Schwarzer, Max, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
We advocate the use of a notion of entropy that reflects the relative abundances of the symbols in an alphabet, as well as the similarities between them. This concept was originally introduced in theoretical ecology to study the diversity of ecosystems. Based on this notion of entropy, we introduce geometry-aware counterparts for several concepts and theorems in information theory. Notably, our proposed divergence exhibits performance on par with state-of-the-art methods based on the Wasserstein distance, but enjoys a closed-form expression that can be computed efficiently. We demonstrate the versatility of our method via experiments on a broad range of domains: training generative models, computing image barycenters, approximating empirical measures and counting modes., Appears in: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020. 19 pages
- Published
- 2019
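The similarity-sensitive entropy referenced in the entry above (in the spirit of Leinster and Cobbold's ecological diversity) can be sketched in a few lines: each symbol's surprisal is computed from the probability mass of all symbols similar to it, so near-duplicate symbols add little diversity. With the identity similarity matrix it reduces to Shannon entropy; the function name and the small numerical epsilon are illustrative.

```python
import numpy as np

def similarity_sensitive_entropy(p, K):
    """Entropy of a distribution p over symbols whose pairwise similarities
    are given by the kernel matrix K: each symbol's surprisal is computed
    from the total probability mass of symbols similar to it.  With K = I
    this reduces to ordinary Shannon entropy."""
    return -np.sum(p * np.log(K @ p + 1e-12))

p = np.array([0.5, 0.5])
print(similarity_sensitive_entropy(p, np.eye(2)))               # ~ log 2
K_close = np.array([[1.0, 0.99], [0.99, 1.0]])                  # near-identical symbols
print(similarity_sensitive_entropy(p, K_close))                 # ~ 0
```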
22. A Tight and Unified Analysis of Gradient-Based Methods for a Whole Spectrum of Games
- Author
-
Azizian, Waïss, Mitliagkas, Ioannis, Lacoste-Julien, Simon, and Gidel, Gauthier
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,I.2.6 ,G.1.6 ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
We consider differentiable games where the goal is to find a Nash equilibrium. The machine learning community has recently started using variants of the gradient method (GD). Prime examples are extragradient (EG), the optimistic gradient method (OG) and consensus optimization (CO), which enjoy linear convergence in cases like bilinear games, where the standard GD fails. The full benefits of these relatively new methods are not known as there is no unified analysis for both strongly monotone and bilinear games. We provide new analyses of EG's local and global convergence properties and use them to get a tighter global convergence rate for OG and CO. Our analysis covers the whole range of settings between bilinear and strongly monotone games. It reveals that these methods converge via different mechanisms at these extremes; in between, they exploit the most favorable mechanism for the given problem. We then prove that EG achieves the optimal rate for a wide class of algorithms with any number of extrapolations. Our tight analysis of EG's convergence rate in games shows that, unlike in convex minimization, EG may be much faster than GD., Appears in: Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS 2020). 39 pages. Minor modification regarding prior work in comparison to the AISTATS Proceedings
- Published
- 2019
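A small sketch of the extragradient (EG) update analyzed in the entry above, on a bilinear game where plain simultaneous gradient descent-ascent cycles or diverges: take a gradient step to an extrapolated point, then update the original iterate with the gradients evaluated there. The step-size and names are illustrative.

```python
import numpy as np

def extragradient_bilinear(A, x0, y0, lr=0.1, steps=500):
    """Extragradient (EG) for min_x max_y x^T A y: take a gradient step to an
    extrapolated point, then update the original iterate with the gradients
    evaluated at that extrapolated point."""
    x, y = np.array(x0, float), np.array(y0, float)
    for _ in range(steps):
        x_e = x - lr * (A @ y)                   # extrapolation
        y_e = y + lr * (A.T @ x)
        x = x - lr * (A @ y_e)                   # update with extrapolated grads
        y = y + lr * (A.T @ x_e)
    return x, y

x_eq, y_eq = extragradient_bilinear(np.eye(2), [1.0, 1.0], [-1.0, 1.0])
```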
23. Gradient-Based Neural DAG Learning
- Author
-
Lachapelle, Sébastien, Brouillard, Philippe, Deleu, Tristan, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,I.5.1 ,Statistics - Machine Learning ,I.2.6 ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
We propose a novel score-based approach to learning a directed acyclic graph (DAG) from observational data. We adapt a recently proposed continuous constrained optimization formulation to allow for nonlinear relationships between variables using neural networks. This extension allows us to model complex interactions while avoiding the combinatorial nature of the problem. In addition to comparing our method to existing continuous optimization methods, we provide missing empirical comparisons to nonlinear greedy search methods. On both synthetic and real-world data sets, this new method outperforms current continuous methods on most tasks, while being competitive with existing greedy search methods on important metrics for causal inference., Appears in: Proceedings of the Eighth International Conference on Learning Representations (ICLR 2020). 23 pages
- Published
- 2019
24. Painless Stochastic Gradient: Interpolation, Line-Search, and Convergence Rates
- Author
-
Vaswani, Sharan, Mishkin, Aaron, Laradji, Issam, Schmidt, Mark, Gidel, Gauthier, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
Recent works have shown that stochastic gradient descent (SGD) achieves the fast convergence rates of full-batch gradient descent for over-parameterized models satisfying certain interpolation conditions. However, the step-size used in these works depends on unknown quantities and SGD's practical performance heavily relies on the choice of this step-size. We propose to use line-search techniques to automatically set the step-size when training models that can interpolate the data. In the interpolation setting, we prove that SGD with a stochastic variant of the classic Armijo line-search attains the deterministic convergence rates for both convex and strongly-convex functions. Under additional assumptions, SGD with Armijo line-search is shown to achieve fast convergence for non-convex functions. Furthermore, we show that stochastic extra-gradient with a Lipschitz line-search attains linear convergence for an important class of non-convex functions and saddle-point problems satisfying interpolation. To improve the proposed methods' practical performance, we give heuristics to use larger step-sizes and acceleration. We compare the proposed algorithms against numerous optimization methods on standard classification tasks using both kernel methods and deep networks. The proposed methods result in competitive performance across all models and datasets, while being robust to the precise choices of hyper-parameters. For multi-class classification using deep networks, SGD with Armijo line-search results in both faster convergence and better generalization., Added a citation to the related work of Paul Tseng, and citations to methods that had previously explored line-searches for deep learning empirically
- Published
- 2019
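A sketch of SGD with a stochastic Armijo backtracking line-search in the spirit of the entry above: the step-size is shrunk until the sufficient-decrease condition holds on the same minibatch (here a single example) used to compute the gradient. The initial step-size rule, the constants, and the function names are simplified assumptions.

```python
import numpy as np

def sgd_armijo(loss_i, grad_i, x0, n, iters=2000, eta_max=1.0, c=0.5,
               backtrack=0.7, seed=0):
    """SGD where each step-size is chosen by backtracking until the stochastic
    Armijo condition holds on the sampled example:
    f_i(x - eta g) <= f_i(x) - c * eta * ||g||^2."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, float)
    for _ in range(iters):
        i = rng.integers(n)
        g = grad_i(x, i)
        f0 = loss_i(x, i)
        eta = eta_max
        for _ in range(30):                      # backtrack until Armijo holds
            if loss_i(x - eta * g, i) <= f0 - c * eta * (g @ g):
                break
            eta *= backtrack
        x = x - eta * g
    return x

# usage: interpolating least-squares problem with per-example losses
rng = np.random.default_rng(1)
X, w_true = rng.standard_normal((50, 5)), np.arange(5.0)
y = X @ w_true
loss_i = lambda w, i: 0.5 * (X[i] @ w - y[i]) ** 2
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = sgd_armijo(loss_i, grad_i, np.zeros(5), n=50)
```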
25. Implicit Regularization of Discrete Gradient Dynamics in Linear Neural Networks
- Author
-
Gidel, Gauthier, Bach, Francis, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
When optimizing over-parameterized models, such as deep neural networks, a large set of parameters can achieve zero training error. In such cases, the choice of the optimization algorithm and its respective hyper-parameters introduces biases that will lead to convergence to specific minimizers of the objective. Consequently, this choice can be considered as an implicit regularization for the training of over-parametrized models. In this work, we push this idea further by studying the discrete gradient dynamics of the training of a two-layer linear network with the least-squares loss. Using a time rescaling, we show that, with a vanishing initialization and a small enough step size, this dynamics sequentially learns the solutions of a reduced-rank regression with a gradually increasing rank., 19 pages, to appear in NeurIPS 2019 proceedings
- Published
- 2019
26. Reducing Noise in GAN Training with Variance Reduced Extragradient
- Author
-
Chavdarova, Tatjana, Gidel, Gauthier, Fleuret, Francois, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Machine Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
We study the effect of the stochastic gradient noise on the training of generative adversarial networks (GANs) and show that it can prevent the convergence of standard game optimization methods, while the batch version converges. We address this issue with a novel stochastic variance-reduced extragradient (SVRE) optimization algorithm, which for a large class of games improves upon the previous convergence rates proposed in the literature. We observe empirically that SVRE performs similarly to a batch method on MNIST while being computationally cheaper, and that SVRE yields more stable GAN training on standard datasets., latest NeurIPS'19 version
- Published
- 2019
27. Predicting Tactical Solutions to Operational Planning Problems Under Imperfect Information.
- Author
-
Larsen, Eric, Lachapelle, Sébastien, Bengio, Yoshua, Frejinger, Emma, Lacoste-Julien, Simon, and Lodi, Andrea
- Subjects
DEEP learning ,STOCHASTIC programming ,SUPERVISED learning ,OPERATIONS research ,RAILROADS ,STOCHASTIC approximation ,MACHINE learning - Abstract
This paper offers a methodological contribution at the intersection of machine learning and operations research. Namely, we propose a methodology to quickly predict expected tactical descriptions of operational solutions (TDOSs). The problem we address occurs in the context of two-stage stochastic programming, where the second stage is demanding computationally. We aim to predict at a high speed the expected TDOS associated with the second-stage problem, conditionally on the first-stage variables. This may be used in support of the solution to the overall two-stage problem by avoiding the online generation of multiple second-stage scenarios and solutions. We formulate the tactical prediction problem as a stochastic optimal prediction program, whose solution we approximate with supervised machine learning. The training data set consists of a large number of deterministic operational problems generated by controlled probabilistic sampling. The labels are computed based on solutions to these problems (solved independently and offline), employing appropriate aggregation and subselection methods to address uncertainty. Results on our motivating application on load planning for rail transportation show that deep learning models produce accurate predictions in very short computing time (milliseconds or less). The predictive accuracy is close to the lower bounds calculated based on sample average approximation of the stochastic prediction programs.
- Published
- 2022
- Full Text
- View/download PDF
28. Negative Momentum for Improved Game Dynamics
- Author
-
Gidel, Gauthier, Hemmat, Reyhane Askari, Pezeshki, Mohammad, Lepriol, Remi, Huang, Gabriel, Lacoste-Julien, Simon, and Mitliagkas, Ioannis
- Subjects
FOS: Computer and information sciences ,Computer Science::Computer Science and Game Theory ,Computer Science - Machine Learning ,Statistics - Machine Learning ,I.2.6 ,G.1.6 ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Games generalize the single-objective optimization paradigm by introducing different objective functions for different players. Differentiable games often proceed by simultaneous or alternating gradient updates. In machine learning, games are gaining new importance through formulations like generative adversarial networks (GANs) and actor-critic systems. However, compared to single-objective optimization, game dynamics are more complex and less understood. In this paper, we analyze gradient-based methods with momentum on simple games. We prove that alternating updates are more stable than simultaneous updates. Next, we show both theoretically and empirically that alternating gradient updates with a negative momentum term achieve convergence not only on a difficult toy adversarial problem, but also on the notoriously difficult-to-train saturating GANs., Appears in: Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS 2019). Minor changes with respect to the AISTATS version: typo corrected in Thm. 6 (squared condition number instead of condition number; and small change in constant) and dependence in $\beta$ changed in Theorem 5 for the formal statement; not changing the conclusions. 28 pages
- Published
- 2018
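A sketch of the setting studied in the entry above: alternating gradient updates with a heavy-ball momentum term whose coefficient is negative, on a bilinear min-max game. The particular step-size and momentum values are chosen for illustration only and are not the tuned constants from the paper.

```python
import numpy as np

def alternating_negative_momentum(A, x0, y0, lr=0.1, beta=-0.5, steps=500):
    """Alternating gradient updates with a (negative) heavy-ball momentum term
    on the bilinear game min_x max_y x^T A y; the y-player reacts to the
    freshly updated x, which is what makes the updates 'alternating'.
    Step-size and momentum are illustrative values."""
    x, y = np.array(x0, float), np.array(y0, float)
    dx, dy = np.zeros_like(x), np.zeros_like(y)
    for _ in range(steps):
        dx = beta * dx - lr * (A @ y)            # momentum buffer for x
        x = x + dx                               # descent step on x
        dy = beta * dy + lr * (A.T @ x)          # uses the fresh x (alternating)
        y = y + dy                               # ascent step on y
    return x, y

x_eq, y_eq = alternating_negative_momentum(np.eye(2), [1.0, 0.5], [-0.5, 1.0])
```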
29. Frank-Wolfe Splitting via Augmented Lagrangian Method
- Author
-
Gidel, Gauthier, Pedregosa, Fabian, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Learning ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,I.2.6 ,G.1.6 ,FOS: Mathematics ,MathematicsofComputing_NUMERICALANALYSIS ,Machine Learning (stat.ML) ,Mathematics - Optimization and Control ,90C52, 90C90, 68T05 ,Machine Learning (cs.LG) - Abstract
Minimizing a function over an intersection of convex sets is an important task in optimization that is often much more challenging than minimizing it over each individual constraint set. While traditional methods such as Frank-Wolfe (FW) or proximal gradient descent assume access to a linear or quadratic oracle on the intersection, splitting techniques take advantage of the structure of each set, and only require access to the oracle on the individual constraints. In this work, we develop and analyze the Frank-Wolfe Augmented Lagrangian (FW-AL) algorithm, a method for minimizing a smooth function over convex compact sets related by a "linear consistency" constraint that only requires access to a linear minimization oracle over the individual constraints. It is based on the Augmented Lagrangian Method (ALM), also known as the Method of Multipliers, but unlike most existing splitting methods, it only requires access to linear (instead of quadratic) minimization oracles. We use recent advances in the analysis of Frank-Wolfe and the alternating direction method of multipliers algorithms to prove a sublinear convergence rate for FW-AL over general convex compact sets and a linear convergence rate for polytopes., Appears in: Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS 2018). 30 pages
- Published
- 2018
30. Improved asynchronous parallel optimization analysis for stochastic incremental methods
- Author
-
Leblond, Rémi, Pedregosa, Fabian, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Optimization ,Computer Science - Machine Learning ,Machine Learning (stat.ML) ,Large Scale ,Machine Learning (cs.LG) ,Machine Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,Optimization and Control (math.OC) ,FOS: Mathematics ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control ,Asynchronous Parallel ,Sparsity - Abstract
As datasets continue to increase in size and multi-core computer architectures are developed, asynchronous parallel optimization algorithms become more and more essential to the field of Machine Learning. Unfortunately, conducting the theoretical analysis of asynchronous methods is difficult, notably due to the introduction of delay and inconsistency in inherently sequential algorithms. Handling these issues often requires resorting to simplifying but unrealistic assumptions. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We demonstrate the usefulness of our new framework by analyzing three distinct asynchronous parallel incremental optimization algorithms: Hogwild (asynchronous SGD), KROMAGNON (asynchronous SVRG) and ASAGA, a novel asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. We are able to both remove problematic assumptions and obtain better theoretical results. Notably, we prove that ASAGA and KROMAGNON can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedups as well as the hardware overhead. Finally, we investigate the overlap constant, an ill-understood but central quantity for the theoretical analysis of asynchronous parallel algorithms. We find that it encompasses much more complexity than suggested in previous work, and is often an order of magnitude bigger than traditionally thought., Comment: 67 pages, published in JMLR, can be found online at http://jmlr.org/papers/v19/17-650.html. arXiv admin note: substantial text overlap with arXiv:1606.04809
- Published
- 2018
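ASAGA is an asynchronous, lock-free variant of SAGA, so the serial SAGA update is the natural reference point. The sketch below keeps a table of the last gradient seen for each example and uses it to form an unbiased variance-reduced step; the asynchronous, perturbed-iterate machinery analyzed in the entry above is not shown, and the names and step-size are illustrative.

```python
import numpy as np

def saga(grad_i, x0, n, steps=5000, lr=0.02, seed=0):
    """Serial SAGA: keep a table with the last gradient seen for each example
    and combine it with a running average to form an unbiased,
    variance-reduced update.  ASAGA runs these updates asynchronously and
    lock-free across cores, which is not shown here."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, float)
    table = np.array([grad_i(x, i) for i in range(n)])   # gradient memory
    avg = table.mean(axis=0)
    for _ in range(steps):
        i = rng.integers(n)
        g = grad_i(x, i)
        x = x - lr * (g - table[i] + avg)                # SAGA step
        avg = avg + (g - table[i]) / n                   # maintain the average
        table[i] = g
    return x

# usage: least squares with per-example gradients
rng = np.random.default_rng(0)
X, w_true = rng.standard_normal((100, 10)), np.ones(10)
y = X @ w_true
grad_i = lambda w, i: (X[i] @ w - y[i]) * X[i]
w_hat = saga(grad_i, np.zeros(10), n=100)
```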
31. SEARNN: Training RNNs with global-local losses
- Author
-
Leblond, Rémi, Alayrac, Jean-Baptiste, Osokin, Anton, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,[INFO.INFO-AI] Computer Science [cs]/Artificial Intelligence [cs.AI] ,Computer Science - Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG] ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,[INFO.INFO-AI]Computer Science [cs]/Artificial Intelligence [cs.AI] ,Machine Learning (cs.LG) - Abstract
We propose SEARNN, a novel training algorithm for recurrent neural networks (RNNs) inspired by the "learning to search" (L2S) approach to structured prediction. RNNs have been widely successful in structured prediction applications such as machine translation or parsing, and are commonly trained using maximum likelihood estimation (MLE). Unfortunately, this training loss is not always an appropriate surrogate for the test error: by only maximizing the ground truth probability, it fails to exploit the wealth of information offered by structured losses. Further, it introduces discrepancies between training and predicting (such as exposure bias) that may hurt test performance. Instead, SEARNN leverages test-alike search space exploration to introduce global-local losses that are closer to the test error. We first demonstrate improved performance over MLE on two different tasks: OCR and spelling correction. Then, we propose a subsampling strategy to enable SEARNN to scale to large vocabulary sizes. This allows us to validate the benefits of our approach on a machine translation task., Comment: Published as a conference paper at ICLR 2018, 16 pages
- Published
- 2017
32. Frank-Wolfe Algorithms for Saddle Point Problems
- Author
-
Gidel, Gauthier, Jebara, Tony, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Games theory ,I.2.6 ,G.1.6 ,Machine Learning (stat.ML) ,Saddle Points ,Machine Learning (cs.LG) ,Structured learning ,Convex optimization ,Computer Science - Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,ACM: G.: Mathematics of Computing/G.1: NUMERICAL ANALYSIS/G.1.6: Optimization ,FOS: Mathematics ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Constrained optimization ,Mathematics - Optimization and Control ,90C52, 90C90, 68T05 ,ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.6: Learning - Abstract
We extend the Frank-Wolfe (FW) optimization algorithm to solve constrained smooth convex-concave saddle point (SP) problems. Remarkably, the method only requires access to linear minimization oracles. Leveraging recent advances in FW optimization, we provide the first proof of convergence of a FW-type saddle point solver over polytopes, thereby partially answering a 30 year-old conjecture. We also survey other convergence results and highlight gaps in the theoretical underpinnings of FW-style algorithms. Motivating applications without known efficient alternatives are explored through structured prediction with combinatorial penalties as well as games over matching polytopes involving an exponential number of constraints., Appears in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017). 39 pages
- Published
- 2017
33. Asaga: Asynchronous Parallel Saga
- Author
-
Leblond, Rémi, Pedregosa, Fabian, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Computer Science - Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,Machine Learning (stat.ML) ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
We describe ASAGA, an asynchronous parallel version of the incremental gradient algorithm SAGA that enjoys fast linear convergence rates. Through a novel perspective, we revisit and clarify a subtle but important technical issue present in a large fraction of the recent convergence rate proofs for asynchronous parallel optimization algorithms, and propose a simplification of the recently introduced "perturbed iterate" framework that resolves it. We thereby prove that ASAGA can obtain a theoretical linear speedup on multi-core systems even without sparsity assumptions. We present results of an implementation on a 40-core architecture illustrating the practical speedup as well as the hardware overhead., Appears in: Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS 2017), 37 pages
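For intuition, the sequential SAGA update that ASAGA parallelizes is only a few lines; the numpy sketch below applies it to a synthetic least-squares problem (the data, step size, and iteration budget are illustrative assumptions, and the lock-free multi-core execution studied in the paper is not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 200, 10
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)    # synthetic least-squares data (assumption)

    w = np.zeros(d)
    memory = np.zeros((n, d))                        # table of last-seen per-sample gradients
    avg = memory.mean(axis=0)                        # running average of the table
    step = 1.0 / (3 * (X ** 2).sum(axis=1).max())    # ~1/(3 L_max), a standard SAGA step size

    for _ in range(20 * n):
        i = rng.integers(n)
        grad_i = (X[i] @ w - y[i]) * X[i]            # gradient of the i-th squared loss
        w -= step * (grad_i - memory[i] + avg)       # SAGA variance-reduced update
        avg += (grad_i - memory[i]) / n              # keep the table average up to date
        memory[i] = grad_i

    print("final mean squared error:", np.mean((X @ w - y) ** 2))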
- Published
- 2016
34. Convergence Rate of Frank-Wolfe for Non-Convex Objectives
- Author
-
Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,I.2.6 ,G.1.6 ,Computer Science - Numerical Analysis ,Machine Learning (stat.ML) ,Numerical Analysis (math.NA) ,[INFO.INFO-NA]Computer Science [cs]/Numerical Analysis [cs.NA] ,[STAT.OT]Statistics [stat]/Other Statistics [stat.ML] ,Machine Learning (cs.LG) ,Computer Science - Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,Optimization and Control (math.OC) ,FOS: Mathematics ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control ,90C52, 90C90, 68T05 - Abstract
We give a simple proof that the Frank-Wolfe algorithm obtains a stationary point at a rate of $O(1/\sqrt{t})$ on non-convex objectives with a Lipschitz continuous gradient. Our analysis is affine invariant and is, to the best of our knowledge, the first to give a rate similar to what was already proven for projected gradient methods (though on slightly different measures of stationarity)., Comment: 6 pages
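The quantity tracked in this analysis, the Frank-Wolfe gap, is easy to compute in practice; the numpy sketch below monitors it while running FW on a toy non-convex objective over the l1 ball (the objective, domain, and the 1/sqrt(t+1) step size are illustrative assumptions, not the paper's exact setting):

    import numpy as np

    rng = np.random.default_rng(0)
    d = 20
    a = rng.standard_normal(d)

    def f(x):                          # toy non-convex objective (illustrative assumption)
        return np.cos(a @ x) + 0.05 * np.sum(x ** 2)

    def grad(x):
        return -np.sin(a @ x) * a + 0.1 * x

    def lmo_l1(g, radius=1.0):         # linear minimization oracle over the l1 ball
        s = np.zeros_like(g)
        i = np.argmax(np.abs(g))
        s[i] = -radius * np.sign(g[i])
        return s

    x = np.zeros(d)
    gaps = []
    for t in range(500):
        g = grad(x)
        s = lmo_l1(g)
        gaps.append((x - s) @ g)       # Frank-Wolfe gap: the stationarity measure of the analysis
        gamma = 1.0 / np.sqrt(t + 1)   # one valid step size for the O(1/sqrt(t)) guarantee
        x = (1 - gamma) * x + gamma * s

    print("final objective:", f(x), "smallest FW gap seen:", min(gaps))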
- Published
- 2016
35. Beyond CCA: Moment Matching for Multi-View Models
- Author
-
Podosinnikova, Anastasia, Bach, Francis, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,I.2.6 ,I.2.7 ,G.1.3 ,G.1.6 ,G.3 ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) ,Computer Science - Learning ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,62H12 62H20 62H25 62F10 68T05 68T50 90C26 90C90 - Abstract
We introduce three novel semi-parametric extensions of probabilistic canonical correlation analysis with identifiability guarantees. We consider moment matching techniques for estimation in these models. To do so, we draw explicit links between the new models and a discrete version of independent component analysis (DICA), and first extend the DICA cumulant tensors to the new discrete version of CCA. By further using a close connection with independent component analysis, we introduce generalized covariance matrices, which can replace the cumulant tensors in the moment matching framework and therefore improve sample complexity and significantly simplify derivations and algorithms. As neither the tensor power method nor orthogonal joint diagonalization is applicable in the new setting, we use non-orthogonal joint diagonalization techniques for matching the cumulants. We demonstrate the performance of the proposed models and estimation techniques in experiments on both synthetic and real datasets., Appears in: Proceedings of the 33rd International Conference on Machine Learning (ICML 2016). 22 pages
- Published
- 2016
36. Rethinking LDA: Moment Matching for Discrete ICA
- Author
-
Podosinnikova, Anastasia, Bach, Francis, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,LDA ,topic modeling ,moment matching ,Machine Learning (stat.ML) ,latent Dirichlet allocation ,discrete ICA ,Machine Learning (cs.LG) ,Computer Science - Learning ,ComputingMethodologies_PATTERNRECOGNITION ,gamma-Poisson ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,discrete independent component analysis ,independent component analysis ,learning with provable guarantees ,GP ,joint diagonalization ,ICA ,DICA - Abstract
We consider moment matching techniques for estimation in Latent Dirichlet Allocation (LDA). By drawing explicit links between LDA and discrete versions of independent component analysis (ICA), we first derive a new set of cumulant-based tensors, with an improved sample complexity. Moreover, we reuse standard ICA techniques such as joint diagonalization of tensors to improve over existing methods based on the tensor power method. In an extensive set of experiments on both synthetic and real datasets, we show that our new combination of tensors and orthogonal joint diagonalization techniques outperforms existing moment matching methods., 30 pages; added plate diagrams and clarifications, changed style, corrected typos, updated figures. in Proceedings of the 29-th Conference on Neural Information Processing Systems (NIPS), 2015
- Published
- 2015
37. Barrier Frank-Wolfe for Marginal Inference
- Author
-
Krishnan, Rahul G., Lacoste-Julien, Simon, and Sontag, David
- Subjects
FOS: Computer and information sciences ,Computer Science - Learning ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,MathematicsofComputing_NUMERICALANALYSIS ,Machine Learning (stat.ML) ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control ,Machine Learning (cs.LG) - Abstract
We introduce a globally convergent algorithm for optimizing the tree-reweighted (TRW) variational objective over the marginal polytope. The algorithm is based on the conditional gradient method (Frank-Wolfe) and moves pseudomarginals within the marginal polytope through repeated maximum a posteriori (MAP) calls. This modular structure enables us to leverage black-box MAP solvers (both exact and approximate) for variational inference, and to obtain more accurate results than tree-reweighted algorithms that optimize over the local consistency relaxation. Theoretically, we bound the sub-optimality of the proposed algorithm despite the TRW objective having unbounded gradients at the boundary of the marginal polytope. Empirically, we demonstrate the increased quality of results found by tightening the relaxation over the marginal polytope as well as the spanning tree polytope on synthetic and real-world instances., 25 pages, 12 figures, To appear in Neural Information Processing Systems (NIPS) 2015, Corrected reference and cleaned up bibliography
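To convey the "MAP calls as linear oracles" structure (leaving out the TRW objective and the barrier analysis entirely), here is a toy numpy sketch: Frank-Wolfe over the convex hull of the indicator vectors of all joint configurations of a tiny model, where the linear minimization oracle is a brute-force stand-in for a MAP solver; the quadratic objective and the model size are illustrative assumptions:

    import numpy as np
    from itertools import product

    # All joint configurations of 3 binary variables; their indicator vectors are the
    # vertices of the (overcomplete, joint) marginal polytope of this tiny model.
    configs = list(product([0, 1], repeat=3))
    V = np.eye(len(configs))                         # vertex k = indicator of configuration k

    rng = np.random.default_rng(0)
    target = rng.dirichlet(np.ones(len(configs)))    # illustrative target distribution

    def map_oracle(grad):
        # Stand-in for a black-box MAP solver: best configuration for a linearized objective.
        return V[np.argmin(grad)]

    mu = np.full(len(configs), 1 / len(configs))     # start at the uniform distribution
    for t in range(200):
        grad = mu - target                           # gradient of the toy objective 0.5*||mu - target||^2
        s = map_oracle(grad)
        gamma = 2.0 / (t + 2)
        mu = (1 - gamma) * mu + gamma * s

    print("distance to target:", np.linalg.norm(mu - target))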
- Published
- 2015
38. Block-Coordinate Frank-Wolfe Optimization for Structural SVMs
- Author
-
Lacoste-Julien, Simon, Jaggi, Martin, Schmidt, Mark, and Pletscher, Patrick
- Subjects
FOS: Computer and information sciences ,I.2.6 ,G.1.6 ,MathematicsofComputing_NUMERICALANALYSIS ,Machine Learning (stat.ML) ,[MATH.MATH-OC] Mathematics [math]/Optimization and Control [math.OC] ,[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG] ,90C52, 90C90, 90C06, 68T05 ,[STAT.ML] Statistics [stat]/Machine Learning [stat.ML] ,Machine Learning (cs.LG) ,90C52 ,90C90 ,90C06 ,68T05 ,Computer Science - Learning ,ml-ai ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,Optimization and Control (math.OC) ,Statistics - Machine Learning ,FOS: Mathematics ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control - Abstract
We propose a randomized block-coordinate variant of the classic Frank-Wolfe algorithm for convex optimization with block-separable constraints. Despite its lower iteration cost, we show that it achieves a similar convergence rate in duality gap as the full Frank-Wolfe algorithm. We also show that, when applied to the dual structural support vector machine (SVM) objective, this yields an online algorithm that has the same low iteration complexity as primal stochastic subgradient methods. However, unlike stochastic subgradient methods, the block-coordinate Frank-Wolfe algorithm allows us to compute the optimal step-size and yields a computable duality gap guarantee. Our experiments indicate that this simple algorithm outperforms competing structural SVM solvers., Appears in Proceedings of the 30th International Conference on Machine Learning (ICML 2013). 9 pages main text + 22 pages appendix. Changes from v3 to v4: 1) Re-organized appendix; improved & clarified duality gap proofs; re-drew all plots; 2) Changed convention for Cf definition; 3) Added weighted averaging experiments + convergence results; 4) Clarified main text and relationship with appendix
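Setting the structural SVM application aside, the generic block-coordinate Frank-Wolfe step is short enough to sketch; the numpy example below applies it to a separable quadratic over a product of probability simplices, with exact line search on the sampled block (the objective and block structure are illustrative assumptions, not the structural SVM dual from the paper):

    import numpy as np

    rng = np.random.default_rng(0)
    n_blocks, k = 50, 6
    C = rng.standard_normal((n_blocks, k))        # per-block targets (illustrative assumption)
    X = np.full((n_blocks, k), 1 / k)             # each block lives in the probability simplex

    for it in range(5000):
        b = rng.integers(n_blocks)                # sample one block uniformly at random
        grad_b = X[b] - C[b]                      # block gradient of 0.5 * ||X - C||_F^2
        s = np.eye(k)[np.argmin(grad_b)]          # block linear minimization oracle (a vertex)
        d = s - X[b]
        denom = d @ d
        if denom == 0:
            continue
        gamma = np.clip(-(grad_b @ d) / denom, 0.0, 1.0)   # exact line search for the quadratic
        X[b] += gamma * d                         # only the sampled block is updated

    print("objective:", 0.5 * np.sum((X - C) ** 2))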
- Published
- 2013
39. Scattering Networks for Hybrid Representation Learning.
- Author
-
Oyallon, Edouard, Zagoruyko, Sergey, Huang, Gabriel, Komodakis, Nikos, Lacoste-Julien, Simon, Blaschko, Matthew, and Belilovsky, Eugene
- Subjects
BLENDED learning ,ARTIFICIAL neural networks ,SUPERVISED learning ,IMAGE reconstruction ,IMAGE reconstruction algorithms ,HYBRID power systems ,IMAGE representation - Abstract
Scattering networks are a class of designed Convolutional Neural Networks (CNNs) with fixed weights. We argue they can serve as generic representations for modelling images. In particular, by working in scattering space, we achieve competitive results both for supervised and unsupervised learning tasks, while making progress towards constructing more interpretable CNNs. For supervised learning, we demonstrate that the early layers of CNNs do not necessarily need to be learned, and can be replaced with a scattering network instead. Indeed, using hybrid architectures, we achieve the best results with predefined representations to date, while being competitive with end-to-end learned CNNs. Specifically, even applying a shallow cascade of small-windowed scattering coefficients followed by $1\times 1$ convolutions results in AlexNet accuracy on the ILSVRC2012 classification task. Moreover, by combining scattering networks with deep residual networks, we achieve a single-crop top-5 error of 11.4 percent on ILSVRC2012. Also, we show they can yield excellent performance in the small sample regime on CIFAR-10 and STL-10 datasets, exceeding their end-to-end counterparts, through their ability to incorporate geometrical priors. For unsupervised learning, scattering coefficients can be a competitive representation that permits image recovery. We use this fact to train hybrid GANs to generate images. Finally, we empirically analyze several properties related to stability and reconstruction of images from scattering coefficients. [ABSTRACT FROM AUTHOR]
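As a toy, library-free illustration of the fixed-filter principle (convolve with hand-designed filters, take a modulus, then average), here is a 1-D numpy sketch computing scattering-like first-order coefficients; real scattering networks use carefully designed 2-D wavelets and cascaded orders, which this sketch does not attempt to reproduce, and all filter parameters are illustrative assumptions:

    import numpy as np

    def gabor_filter(length, freq, sigma):
        # Hand-designed complex filter: a Gaussian-windowed complex exponential.
        t = np.arange(length) - length // 2
        return np.exp(-0.5 * (t / sigma) ** 2) * np.exp(2j * np.pi * freq * t)

    def scatter_1d(signal, freqs=(0.05, 0.1, 0.2), sigma=8.0, width=33):
        # First-order scattering-like coefficients: |signal * psi| averaged over time.
        feats = []
        for f in freqs:
            psi = gabor_filter(width, f, sigma)
            modulus = np.abs(np.convolve(signal, psi, mode="same"))  # fixed filter + modulus
            feats.append(modulus.mean())                             # crude low-pass (global average)
        return np.array(feats)

    rng = np.random.default_rng(0)
    x = np.sin(2 * np.pi * 0.1 * np.arange(256)) + 0.1 * rng.standard_normal(256)
    print("scattering-like features:", scatter_1d(x))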
- Published
- 2019
- Full Text
- View/download PDF
40. A simpler approach to obtaining an O(1/t) convergence rate for the projected stochastic subgradient method
- Author
-
Lacoste-Julien, Simon, Schmidt, Mark, and Bach, Francis
- Subjects
FOS: Computer and information sciences ,I.2.6 ,G.1.6 ,Machine Learning (stat.ML) ,[MATH.MATH-OC] Mathematics [math]/Optimization and Control [math.OC] ,[INFO.INFO-LG] Computer Science [cs]/Machine Learning [cs.LG] ,[STAT.ML] Statistics [stat]/Machine Learning [stat.ML] ,Machine Learning (cs.LG) ,Computer Science - Learning ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,Statistics - Machine Learning ,Optimization and Control (math.OC) ,FOS: Mathematics ,90C15, 68T05, 65K10 ,[MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC] ,Mathematics - Optimization and Control - Abstract
In this note, we present a new averaging technique for the projected stochastic subgradient method. By using a weighted average with a weight of $t+1$ for each iterate $w_t$ at iteration $t$, we obtain a convergence rate of $O(1/t)$ with both an easy proof and an easy implementation. The new scheme is compared empirically to existing techniques, with similar performance behavior., Comment: 8 pages, 6 figures. Changes with previous version: Added reference to concurrently submitted work arXiv:1212.1824v1; clarifications added; typos corrected; title changed to 'subgradient method' as 'subgradient descent' is a misnomer
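The weighting scheme amounts to a two-line change in any projected stochastic subgradient loop, since the (t+1)-weighted average can be maintained with the running update w_avg <- (1 - 2/(t+2)) * w_avg + (2/(t+2)) * w_t. The numpy sketch below uses it on a toy strongly convex SVM-style objective with projection onto a Euclidean ball; the data, ball radius, and step size are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, lam = 500, 20, 0.1
    X = rng.standard_normal((n, d))
    y = np.sign(X @ rng.standard_normal(d))              # synthetic labels (illustrative assumption)

    def project_ball(w, radius=10.0):                    # projection onto a Euclidean ball
        norm = np.linalg.norm(w)
        return w if norm <= radius else w * (radius / norm)

    w = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(5000):
        i = rng.integers(n)
        margin = y[i] * (X[i] @ w)
        subgrad = lam * w - (y[i] * X[i] if margin < 1 else 0)   # subgradient of the SVM objective
        w = project_ball(w - subgrad / (lam * (t + 1)))          # projected step of size 1/(lambda*(t+1))
        rho = 2.0 / (t + 2)
        w_avg = (1 - rho) * w_avg + rho * w               # running (t+1)-weighted average from the note

    obj = lam / 2 * (w_avg @ w_avg) + np.mean(np.maximum(0, 1 - y * (X @ w_avg)))
    print("objective at the weighted average:", obj)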
- Published
- 2012
41. Sequential Kernel Herding: Frank-Wolfe Optimization for Particle Filtering
- Author
-
Lacoste-Julien, Simon, Lindsten, Fredrik, and Bach, Francis
- Subjects
FOS: Computer and information sciences ,Computer Science - Learning ,[STAT.ML]Statistics [stat]/Machine Learning [stat.ML] ,[INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG] ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Machine Learning (cs.LG) - Abstract
Recently, the Frank-Wolfe optimization algorithm was suggested as a procedure to obtain adaptive quadrature rules for integrals of functions in a reproducing kernel Hilbert space (RKHS) with a potentially faster rate of convergence than Monte Carlo integration (and "kernel herding" was shown to be a special case of this procedure). In this paper, we propose to replace the random sampling step in a particle filter by Frank-Wolfe optimization. By optimizing the position of the particles, we can obtain better accuracy than random or quasi-Monte Carlo sampling. In applications where the evaluation of the emission probabilities is expensive (such as in robot localization), the additional computational cost to generate the particles through optimization can be justified. Experiments on standard synthetic examples as well as on a robot localization task indeed indicate an improvement in accuracy over random and quasi-Monte Carlo sampling., in 18th International Conference on Artificial Intelligence and Statistics (AISTATS), May 2015, San Diego, United States. 38, JMLR Workshop and Conference Proceedings
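For intuition about the herding/Frank-Wolfe connection, here is a plain kernel-herding sketch in numpy: points are picked greedily from a candidate pool so that the kernel mean embedding of the selected set tracks that of a target sample. The RBF kernel, bandwidth, pool, and target are illustrative assumptions; the paper's sequential, particle-filtering version builds on this basic idea:

    import numpy as np

    rng = np.random.default_rng(0)
    target = rng.normal(0.0, 1.0, size=(2000, 2))    # samples from the target distribution (assumption)
    pool = rng.uniform(-4, 4, size=(500, 2))         # candidate locations for herding points

    def rbf(A, B, bandwidth=1.0):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * bandwidth ** 2))

    mean_embedding = rbf(pool, target).mean(axis=1)  # estimate of E_p[k(x, X)] at each candidate

    selected = []
    running = np.zeros(len(pool))                    # sum of k(x, x_i) over selected points x_i
    for t in range(50):
        # Herding / Frank-Wolfe step: pick the candidate where the target embedding most
        # exceeds the embedding of the points selected so far.
        scores = mean_embedding - running / max(t, 1)
        j = int(np.argmax(scores))
        selected.append(pool[j])
        running += rbf(pool, pool[j:j + 1])[:, 0]

    print("picked", len(selected), "herding points; first three:", np.round(selected[:3], 2))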
- Published
- 2015
42. An Affine Invariant Linear Convergence Analysis for Frank-Wolfe Algorithms
- Author
-
Lacoste-Julien, Simon and Jaggi, Martin
- Subjects
Optimization and Control (math.OC) ,G.1.6 ,FOS: Mathematics ,Mathematics - Optimization and Control ,90C25 - Abstract
We study the linear convergence of variants of the Frank-Wolfe algorithm for some classes of strongly convex problems, using only affine-invariant quantities. As in Guelat & Marcotte (1986), we show the linear convergence of the standard Frank-Wolfe algorithm when the solution is in the interior of the domain, but with affine-invariant constants. We also show the linear convergence of the away-steps variant of the Frank-Wolfe algorithm, but with constants that depend only on the geometry of the domain, and not on any property of the location of the optimal solution. Running these algorithms does not require knowing any problem-specific parameters., appeared at the NIPS 2013 Workshop on Greedy Algorithms, Frank-Wolfe and Friends (v2: added acknowledgements)
- Published
- 2013
43. Learning from Narrated Instruction Videos.
- Author
-
Alayrac, Jean-Baptiste, Bojanowski, Piotr, Agrawal, Nishant, Sivic, Josef, Laptev, Ivan, and Lacoste-Julien, Simon
- Subjects
IMAGE processing ,MULTIPLE correspondence analysis (Statistics) ,DIGITAL image processing ,COMPUTER graphics ,IMAGE recognition (Computer vision) - Abstract
Automatic assistants could guide a person or a robot in performing new tasks, such as changing a car tire or repotting a plant. Creating such assistants, however, is non-trivial and requires understanding of visual and verbal content of a video. Towards this goal, we here address the problem of automatically learning the main steps of a task from a set of narrated instruction videos. We develop a new unsupervised learning approach that takes advantage of the complementary nature of the input video and the associated narration. The method sequentially clusters textual and visual representations of a task, where the two clustering problems are linked by joint constraints to obtain a single coherent sequence of steps in both modalities. To evaluate our method, we collect and annotate a new challenging dataset of real-world instruction videos from the Internet. The dataset contains videos for five different tasks with complex interactions between people and objects, captured in a variety of indoor and outdoor settings. We experimentally demonstrate that the proposed method can automatically discover, learn and localize the main steps of a task in input videos. [ABSTRACT FROM AUTHOR]
- Published
- 2018
- Full Text
- View/download PDF
44. Gaussian Probabilities and Expectation Propagation
- Author
-
Cunningham, John P., Hennig, Philipp, and Lacoste-Julien, Simon
- Subjects
FOS: Computer and information sciences ,Statistics - Machine Learning ,Machine Learning (stat.ML) ,Computer Science::Computational Geometry - Abstract
While Gaussian probability densities are omnipresent in applied mathematics, Gaussian cumulative probabilities are hard to calculate in any but the univariate case. We study the utility of Expectation Propagation (EP) as an approximate integration method for this problem. For rectangular integration regions, the approximation is highly accurate. We also extend the derivations to the more general case of polyhedral integration regions. However, we find that in this polyhedral case, EP's answer, though often accurate, can be almost arbitrarily wrong. We consider these unexpected results empirically and theoretically, both for the problem of Gaussian probabilities and for EP more generally. These results elucidate an interesting and non-obvious feature of EP not yet studied in detail.
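For context, the quantity being approximated here is a rectangular Gaussian probability. The numpy sketch below estimates one such probability with plain Monte Carlo (not with EP), simply to make the target of the approximation concrete; the covariance matrix and integration box are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 4
    A = rng.standard_normal((d, d))
    cov = A @ A.T + d * np.eye(d)                    # an arbitrary positive-definite covariance (assumption)
    lower, upper = -np.ones(d), 2 * np.ones(d)       # rectangular integration region (assumption)

    # Naive Monte Carlo estimate of P(lower <= X <= upper) for X ~ N(0, cov).
    samples = rng.multivariate_normal(np.zeros(d), cov, size=200_000)
    inside = np.all((samples >= lower) & (samples <= upper), axis=1)
    print("Monte Carlo estimate of the rectangular probability:", inside.mean())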
- Published
- 2011
45. Unsupervised Learning from Narrated Instruction Videos.
- Author
-
Alayrac, Jean-Baptiste, Bojanowski, Piotr, Agrawal, Nishant, Sivic, Josef, Laptev, Ivan, and Lacoste-Julien, Simon
- Published
- 2016
- Full Text
- View/download PDF
46. On pairwise costs for network flow multi-object tracking.
- Author
-
Chari, Visesh, Lacoste-Julien, Simon, Laptev, Ivan, and Sivic, Josef
- Published
- 2015
- Full Text
- View/download PDF
47. SIGMa.
- Author
-
Lacoste-Julien, Simon, Palla, Konstantina, Davies, Alex, Kasneci, Gjergji, Graepel, Thore, and Ghahramani, Zoubin
- Published
- 2013
- Full Text
- View/download PDF
48. Structured Prediction, Dual Extragradient and Bregman Projections.
- Author
-
Taskar, Ben, Lacoste-Julien, Simon, Jordan, Michael I., Bennett, Kristin P., and Parrado-Hernández, Emilio
- Subjects
*ALGORITHMS , *ESTIMATION theory , *MATHEMATICAL models , *MARKOV processes , *DYNAMIC programming - Abstract
We present a simple and scalable algorithm for maximum-margin estimation of structured output models, including an important class of Markov networks and combinatorial models. We formulate the estimation problem as a convex-concave saddle-point problem that allows us to use simple projection methods based on the dual extragradient algorithm (Nesterov, 2003). The projection step can be solved using dynamic programming or combinatorial algorithms for min-cost convex flow, depending on the structure of the problem. We show that this approach provides a memory-efficient alternative to formulations based on reductions to a quadratic program (QP). We analyze the convergence of the method and present experiments on two very different structured prediction tasks: 3D image segmentation and word alignment, illustrating the favorable scaling properties of our algorithm. [ABSTRACT FROM AUTHOR]
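The extrapolate-then-update idea at the heart of the method can be illustrated on a toy bilinear saddle point; the numpy sketch below runs a classic extragradient step for min_x max_y x^T A y over two probability simplices with Euclidean projections. The payoff matrix and step size are illustrative assumptions, and this plain extragradient step is a simplified stand-in for the dual extragradient scheme used in the paper:

    import numpy as np

    def project_simplex(v):
        # Euclidean projection onto the probability simplex (standard sort-based algorithm).
        u = np.sort(v)[::-1]
        css = np.cumsum(u)
        rho = np.nonzero(u * np.arange(1, len(v) + 1) > css - 1)[0][-1]
        theta = (css[rho] - 1) / (rho + 1)
        return np.maximum(v - theta, 0)

    rng = np.random.default_rng(0)
    A = rng.standard_normal((6, 5))          # illustrative payoff matrix for min_x max_y x^T A y
    x, y = np.full(6, 1 / 6), np.full(5, 1 / 5)
    eta = 0.1                                # step size (illustrative assumption)

    for _ in range(2000):
        # Extrapolation step using the game's gradient field F(x, y) = (A y, -A^T x).
        x_half = project_simplex(x - eta * (A @ y))
        y_half = project_simplex(y + eta * (A.T @ x))
        # Update step, evaluated at the extrapolated point.
        x = project_simplex(x - eta * (A @ y_half))
        y = project_simplex(y + eta * (A.T @ x_half))

    # Duality gap of the matrix game: zero exactly at a saddle point.
    print("equilibrium gap:", (A.T @ x).max() - (A @ y).min())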
- Published
- 2006
49. Adversarial games in machine learning : challenges and applications
- Author
-
Berard, Hugo, Vincent, Pascal, and Lacoste-Julien, Simon
- Subjects
Optimization ,Generative adversarial networks ,Extragradient ,Nash-equilibrium ,Apprentissage automatique ,Inéquation variationelle ,Théorie des jeux ,Visualisation ,Réseaux antagonistes génératifs ,Équilibre de Nash ,Machine learning ,Variational Inequality ,Generative modeling ,Adversarial Training ,Modèles génératifs ,Exemples antagonistes ,Réseaux de neurones ,Optimisation ,Neural networks ,Adversarial Attacks ,Game theory ,Visualization - Abstract
Many machine learning (ML) problems can be formulated as minimization problems, with a large optimization literature that provides algorithms and convergence guarantees for this type of problem. Recently, however, several ML models have been proposed that cannot be cast as the minimization of a single cost; instead, they require defining a game between several players, each with its own objective. A successful application of such games in ML is generative adversarial networks (GANs), where generative modeling is formulated as a game between a generator and a discriminator: the generator tries to fool the discriminator, while the discriminator tries to distinguish fake samples from real ones. However, due to the adversarial nature of the game, GANs are notoriously hard to train, requiring careful fine-tuning of the hyperparameters and often exhibiting unstable training dynamics.
While regularization techniques have been proposed to stabilize training, in this thesis we look at these instabilities from an optimization perspective. We start by bridging the gap between the machine learning and optimization literatures by casting GANs as an instance of the variational inequality problem (VIP), and we leverage the large literature on VIPs to derive more efficient and stable algorithms for training GANs. To better understand the challenges of training GANs, we then propose tools to study their optimization landscape. Using these tools, we show that GANs do suffer from rotation around their equilibrium and that they rarely converge to Nash equilibria, converging instead to locally stable stationary points (LSSPs). Finally, inspired by the success of GANs at generating images, we propose a new family of games, called Adversarial Example Games, in which a generator and a critic are trained simultaneously: the generator perturbs examples to mislead the critic, while the critic tries to be robust to the perturbations. We show that, at the equilibrium of this game, the generator produces adversarial examples that transfer across a whole family of models and architectures.
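The rotation phenomenon mentioned in the abstract is already visible on the simplest adversarial game, min_x max_y x*y: simultaneous gradient descent-ascent spirals away from the equilibrium at the origin, while adding an extrapolation (lookahead) step damps the rotation. The step size and iteration count in this numpy sketch are illustrative assumptions:

    import numpy as np

    eta, steps = 0.1, 200

    # Simultaneous gradient descent-ascent on the bilinear game min_x max_y x*y.
    x, y = 1.0, 1.0
    for _ in range(steps):
        x, y = x - eta * y, y + eta * x       # both players update using the old (x, y)
    print("simultaneous GDA distance from equilibrium:", np.hypot(x, y))   # grows: iterates rotate outward

    # Extragradient: look ahead before updating; the rotation is damped and iterates converge.
    x, y = 1.0, 1.0
    for _ in range(steps):
        xh, yh = x - eta * y, y + eta * x     # extrapolation (lookahead) point
        x, y = x - eta * yh, y + eta * xh     # update using the lookahead gradients
    print("extragradient distance from equilibrium:", np.hypot(x, y))      # shrinks toward 0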
- Published
- 2023
50. Acceleration and new analysis of convex optimization algorithms
- Author
-
Liu, Lewis, Lacoste-Julien, Simon, and Scieur, Damien
- Subjects
Conditional Gradient ,Broyden–Fletcher–Goldfarb–Shanno method ,Formule de Davidon-Fletcher-Powell ,Frank-Wolfe ,Dégradé Conditionnel ,Davidon–Fletcher–Powell formula ,Méthode Broyden-Fletcher-Goldfarb-Shanno - Abstract
Recent years have witnessed a resurgence of the Frank-Wolfe (FW) algorithm, also known as the conditional gradient method, in sparse optimization and large-scale machine learning problems with smooth convex objectives. Compared to projected or proximal gradient methods, such a projection-free method saves the computational cost of orthogonal projections onto the constraint set. Meanwhile, FW also gives solutions with sparse structure. Despite these promising properties, FW does not enjoy the optimal convergence rates achieved by projection-based accelerated methods.
On the other hand, the FW algorithm is affine-covariant and enjoys accelerated convergence rates when the constraint set is strongly convex. However, these results rely on norm-dependent assumptions, usually incurring non-affine-invariant bounds, in contradiction with FW's affine-covariant property. In this work, we introduce new structural assumptions on the problem (such as directional smoothness) and derive an affine-invariant, norm-independent analysis of Frank-Wolfe. Based on our analysis, we propose an affine-invariant backtracking line search. Interestingly, we show that typical backtracking line-search techniques using smoothness of the objective function surprisingly converge to an affine-invariant step size, despite using affine-dependent norms in the computation of step sizes. This indicates that we do not necessarily need to know the structure of the constraint sets in advance to enjoy the affine-invariant accelerated rate. We also survey recent attempts to accelerate FW in different settings and highlight where the difficulty lies when aiming for global linear rates in theory. Additionally, we provide a promising direction for accelerating FW over strongly convex sets using duality-gap techniques and a new notion of smoothness. In another line of research, we study algorithms beyond first-order methods. Quasi-Newton techniques approximate the Newton step by estimating the Hessian using the so-called secant equations. Some of these methods compute the Hessian using several secant equations but produce non-symmetric updates. Other quasi-Newton schemes, such as BFGS, enforce symmetry but cannot satisfy more than one secant equation. We propose a new type of symmetric quasi-Newton update that uses several secant equations in a least-squares sense. Our approach generalizes and unifies the design of quasi-Newton updates and satisfies provable robustness guarantees.
- Published
- 2022