Author: "Jaggi, Martin" / Publisher: arxiv - Searchworks@Jio Institute Digital Library Search Results

1. Provably Personalized and Robust Federated Learning

Author: Werner, Mariel, He, Lie, Karimireddy, Sai Praneeth, Jordan, Michael, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: Clustering clients with similar objectives and learning a model per cluster is an intuitive and interpretable approach to personalization in federated learning. However, doing so with provable and optimal guarantees has remained an open challenge. In this work, we formalize personalized federated learning as a stochastic optimization problem where the stochastic gradients on a client may correspond to one of $K$ distributions. In such a setting, we show that using i) a simple thresholding-based clustering algorithm, and ii) local client gradients obtains optimal convergence guarantees. In fact, our rates asymptotically match those obtained if we knew the true underlying clustering of the clients. Furthermore, our algorithms are provably robust in the Byzantine setting where some fraction of the gradients are corrupted.
Published: 2023
Full Text: View/download PDF

2. Rotational Optimizers: Simple & Robust DNN Training

Author: Kosson, Atli, Messmer, Bettina, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Machine Learning (cs.LG)
Abstract: The training dynamics of modern deep neural networks depend on complex interactions between the learning rate, weight decay, initialization, and other hyperparameters. These interactions can give rise to Spherical Motion Dynamics in scale-invariant layers (e.g., normalized layers), which converge to an equilibrium state, where the weight norm and the expected rotational update size are fixed. Our analysis of this equilibrium in AdamW, SGD with momentum, and Lion provides new insights into the effects of different hyperparameters and their interactions on the training process. We propose rotational variants (RVs) of these optimizers that force the expected angular update size to match the equilibrium value throughout training. This simplifies the training dynamics by removing the transient phase corresponding to the convergence to an equilibrium. Our rotational optimizers can match the performance of the original variants, often with minimal or no tuning of the baseline hyperparameters, showing that these transient phases are not needed. Furthermore, we find that the rotational optimizers have a reduced need for learning rate warmup and improve the optimization of poorly normalized networks., 23 pages, 9 figures
Published: 2023
Full Text: View/download PDF

3. Unified Convergence Theory of Stochastic and Variance-Reduced Cubic Newton Methods

Author: Chayti, El Mahdi, Doikov, Nikita, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We study the widely known Cubic-Newton method in the stochastic setting and propose a general framework to use variance reduction which we call the helper framework. In all previous work, these methods were proposed with very large batches (both in gradients and Hessians) and with various and often strong assumptions. In this work, we investigate the possibility of using such methods without large batches and use very simple assumptions that are sufficient for all our methods to work. In addition, we study these methods applied to gradient-dominated functions. In the general case, we show improved convergence (compared to first-order methods) to an approximate local minimum, and for gradient-dominated functions, we show convergence to approximate global minima.
Published: 2023
Full Text: View/download PDF

4. Landmark Attention: Random-Access Infinite Context Length for Transformers

Author: Mohtashami, Amirkeivan and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: While transformers have shown remarkable success in natural language processing, their attention mechanism's large memory requirements have limited their ability to handle longer contexts. Prior approaches, such as recurrent memory or retrieval-based augmentation, have either compromised the random-access flexibility of attention (i.e., the capability to select any token in the entire context) or relied on separate mechanisms for relevant context retrieval, which may not be compatible with the model's attention. In this paper, we present a novel approach that allows access to the complete context while retaining random-access flexibility, closely resembling running attention on the entire context. Our method uses a landmark token to represent each block of the input and trains the attention to use it for selecting relevant blocks, enabling retrieval of blocks directly through the attention mechanism instead of by relying on a separate mechanism. Our approach seamlessly integrates with specialized data structures and the system's memory hierarchy, enabling processing of arbitrarily long context lengths. We demonstrate that our method can obtain comparable performance with Transformer-XL while significantly reducing the number of retrieved tokens in each step. Finally, we show that fine-tuning LLaMA 7B with our method successfully extends its context length capacity up to 32k tokens, allowing for inference at the context lengths of GPT-4.
Published: 2023
Full Text: View/download PDF

5. Shuffle SGD is Always Better than SGD: Improved Analysis of SGD with Arbitrary Data Orders

Author: Koloskova, Anastasia, Doikov, Nikita, Stich, Sebastian U., and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Stochastic Gradient Descent (SGD) algorithms are widely used in optimizing neural networks, with Random Reshuffling (RR) and Single Shuffle (SS) being popular choices for cycling through random or single permutations of the training data. However, the convergence properties of these algorithms in the non-convex case are not fully understood. Existing results suggest that, in realistic training scenarios where the number of epochs is smaller than the training set size, RR may perform worse than SGD. In this paper, we analyze a general SGD algorithm that allows for arbitrary data orderings and show improved convergence rates for non-convex functions. Specifically, our analysis reveals that SGD with random and single shuffling is always faster or at least as good as classical SGD with replacement, regardless of the number of iterations. Overall, our study highlights the benefits of using SGD with random/single shuffling and provides new insights into its convergence properties for non-convex optimization.
Published: 2023
Full Text: View/download PDF

6. Layerwise Linear Mode Connectivity

Author: Adilova, Linara, Fischer, Asja, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: In the federated setup one performs an aggregation of separate local models multiple times during training in order to obtain a stronger global model; most often aggregation is a simple averaging of the parameters. Understanding when and why averaging works in a non-convex setup, such as federated deep learning, is an open challenge that hinders obtaining highly performant global models. On i.i.d.~datasets federated deep learning with frequent averaging is successful. The common understanding, however, is that during the independent training models are drifting away from each other and thus averaging may not work anymore after many local parameter updates. The problem can be seen from the perspective of the loss surface: for points on a non-convex surface the average can become arbitrarily bad. The assumption of local convexity, often used to explain the success of federated averaging, contradicts to the empirical evidence showing that high loss barriers exist between models from the very beginning of the learning, even when training on the same data. Based on the observation that the learning process evolves differently in different layers, we investigate the barrier between models in a layerwise fashion. Our conjecture is that barriers preventing from successful federated training are caused by a particular layer or group of layers., Comment: HLD 2023: 1st Workshop on High-dimensional Learning Dynamics, ICML 2023, Hawaii, USA
Published: 2023
Full Text: View/download PDF

7. Faster Causal Attention Over Large Sequences Through Sparse Flash Attention

Author: Pagliardini, Matteo, Paliotta, Daniele, Jaggi, Martin, and Fleuret, François
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Transformer-based language models have found many diverse applications requiring them to process sequences of increasing length. For these applications, the causal self-attention -- which is the only component scaling quadratically w.r.t. the sequence length -- becomes a central concern. While many works have proposed schemes to sparsify the attention patterns and reduce the computational overhead of self-attention, those are often limited by implementations concerns and end up imposing a simple and static structure over the attention matrix. Conversely, implementing more dynamic sparse attentions often results in runtimes significantly slower than computing the full attention using the Flash implementation from Dao et al. (2022). We extend FlashAttention to accommodate a large class of attention sparsity patterns that, in particular, encompass key/query dropping and hashing-based attention. This leads to implementations with no computational complexity overhead and a multi-fold runtime speedup on top of FlashAttention. Even with relatively low degrees of sparsity, our method improves visibly upon FlashAttention as the sequence length increases. Without sacrificing perplexity, we increase the training speed of a transformer language model by $2.0\times$ and $3.3\times$ for sequences of respectively $8k$ and $16k$ tokens.
Published: 2023
Full Text: View/download PDF

8. Collaborative Learning via Prediction Consensus

Author: Fan, Dongyang, Mendler-Dünner, Celestine, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We consider a collaborative learning setting where each agent's goal is to improve their own model by leveraging the expertise of collaborators, in addition to their own training data. To facilitate the exchange of expertise among agents, we propose a distillation-based method leveraging unlabeled auxiliary data, which is pseudo-labeled by the collective. Central to our method is a trust weighting scheme which serves to adaptively weigh the influence of each collaborator on the pseudo-labels until a consensus on how to label the auxiliary data is reached. We demonstrate that our collaboration scheme is able to significantly boost individual model's performance with respect to the global distribution, compared to local training. At the same time, the adaptive trust weights can effectively identify and mitigate the negative impact of bad models on the collective. We find that our method is particularly effective in the presence of heterogeneity among individual agents, both in terms of training data as well as model architectures.
Published: 2023
Full Text: View/download PDF

9. Beyond spectral gap (extended): The role of the topology in decentralized learning

Author: Vogels, Thijs, Hendrikx, Hadrien, Jaggi, Martin, Ecole Polytechnique Fédérale de Lausanne (EPFL), Apprentissage de modèles à partir de données massives (Thoth), Inria Grenoble - Rhône-Alpes, Institut National de Recherche en Informatique et en Automatique (Inria)-Institut National de Recherche en Informatique et en Automatique (Inria)-Laboratoire Jean Kuntzmann (LJK), Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), Université Grenoble Alpes (UGA)-Centre National de la Recherche Scientifique (CNRS)-Université Grenoble Alpes (UGA)-Institut polytechnique de Grenoble - Grenoble Institute of Technology (Grenoble INP ), and Université Grenoble Alpes (UGA)
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Decentralized Learning, Convex Optimization, Spectral Gap, Gossip Algorithms, Machine Learning (cs.LG), Computer Science - Distributed, Parallel, and Cluster Computing, Optimization and Control (math.OC), FOS: Mathematics, [MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC], Distributed, Parallel, and Cluster Computing (cs.DC), [INFO.INFO-DC]Computer Science [cs]/Distributed, Parallel, and Cluster Computing [cs.DC], Mathematics - Optimization and Control, Stochastic Gradient Descent
Abstract: In data-parallel optimization of machine learning models, workers collaborate to improve their estimates of the model: more accurate gradients allow them to use larger learning rates and optimize faster. In the decentralized setting, in which workers communicate over a sparse graph, current theory fails to capture important aspects of real-world behavior. First, the `spectral gap' of the communication graph is not predictive of its empirical performance in (deep) learning. Second, current theory does not explain that collaboration enables larger learning rates than training alone. In fact, it prescribes smaller learning rates, which further decrease as graphs become larger, failing to explain convergence dynamics in infinite graphs. This paper aims to paint an accurate picture of sparsely-connected distributed optimization. We quantify how the graph topology influences convergence in a quadratic toy problem and provide theoretical results for general smooth and (strongly) convex objectives. Our theory matches empirical observations in deep learning, and accurately describes the relative merits of different graph topologies. This paper is an extension of the conference paper by Vogels et. al. (2022). Code: https://github.com/epfml/topology-in-decentralized-learning., Comment: Extended version of the other paper (with the same name), that includes (among other things) theory for the heterogeneous case. arXiv admin note: substantial text overlap with arXiv:2206.03093
Published: 2023
Full Text: View/download PDF

10. Characterizing & Finding Good Data Orderings for Fast Convergence of Sequential Gradient Methods

Author: Mohtashami, Amirkeivan, Stich, Sebastian, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Machine Learning (cs.LG)
Abstract: While SGD, which samples from the data with replacement is widely studied in theory, a variant called Random Reshuffling (RR) is more common in practice. RR iterates through random permutations of the dataset and has been shown to converge faster than SGD. When the order is chosen deterministically, a variant called incremental gradient descent (IG), the existing convergence bounds show improvement over SGD but are worse than RR. However, these bounds do not differentiate between a good and a bad ordering and hold for the worst choice of order. Meanwhile, in some cases, choosing the right order when using IG can lead to convergence faster than RR. In this work, we quantify the effect of order on convergence speed, obtaining convergence bounds based on the chosen sequence of permutations while also recovering previous results for RR. In addition, we show benefits of using structured shuffling when various levels of abstractions (e.g. tasks, classes, augmentations, etc.) exists in the dataset in theory and in practice. Finally, relying on our measure, we develop a greedy algorithm for choosing good orders during training, achieving superior performance (by more than 14 percent in accuracy) over RR.
Published: 2022
Full Text: View/download PDF

11. Byzantine-Robust Decentralized Learning via ClippedGossip

Author: He, Lie, Karimireddy, Sai Praneeth, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In this paper, we study the challenging task of Byzantine-robust decentralized training on arbitrary communication graphs. Unlike federated learning where workers communicate through a server, workers in the decentralized environment can only talk to their neighbors, making it harder to reach consensus and benefit from collaborative training. To address these issues, we propose a ClippedGossip algorithm for Byzantine-robust consensus and optimization, which is the first to provably converge to a $O(\delta_{\max}\zeta^2/\gamma^2)$ neighborhood of the stationary point for non-convex objectives under standard assumptions. Finally, we demonstrate the encouraging empirical performance of ClippedGossip under a large number of attacks.
Published: 2022
Full Text: View/download PDF

12. Scalable Collaborative Learning via Representation Sharing

Author: Berdoz, Frédéric, Singh, Abhishek, Jaggi, Martin, and Raskar, Ramesh
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Machine Learning (cs.LG)
Abstract: Privacy-preserving machine learning has become a key conundrum for multi-party artificial intelligence. Federated learning (FL) and Split Learning (SL) are two frameworks that enable collaborative learning while keeping the data private (on device). In FL, each data holder trains a model locally and releases it to a central server for aggregation. In SL, the clients must release individual cut-layer activations (smashed data) to the server and wait for its response (during both inference and back propagation). While relevant in several settings, both of these schemes have a high communication cost, rely on server-level computation algorithms and do not allow for tunable levels of collaboration. In this work, we present a novel approach for privacy-preserving machine learning, where the clients collaborate via online knowledge distillation using a contrastive loss (contrastive w.r.t. the labels). The goal is to ensure that the participants learn similar features on similar classes without sharing their input data. To do so, each client releases averaged last hidden layer activations of similar labels to a central server that only acts as a relay (i.e., is not involved in the training or aggregation of the models). Then, the clients download these last layer activations (feature representations) of the ensemble of users and distill their knowledge in their personal model using a contrastive objective. For cross-device applications (i.e., small local datasets and limited computational capacity), this approach increases the utility of the models compared to independent learning and other federated knowledge distillation (FD) schemes, is communication efficient and is scalable with the number of clients. We prove theoretically that our framework is well-posed, and we benchmark its performance against standard FD and FL on various datasets using different model architectures.
Published: 2022
Full Text: View/download PDF

13. Special Properties of Gradient Descent with Large Learning Rates

Author: Mohtashami, Amirkeivan, Jaggi, Martin, and Stich, Sebastian
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: When training neural networks, it has been widely observed that a large step size is essential in stochastic gradient descent (SGD) for obtaining superior models. However, the effect of large step sizes on the success of SGD is not well understood theoretically. Several previous works have attributed this success to the stochastic noise present in SGD. However, we show through a novel set of experiments that the stochastic noise is not sufficient to explain good non-convex training, and that instead the effect of a large learning rate itself is essential for obtaining best performance.We demonstrate the same effects also in the noise-less case, i.e. for full-batch GD. We formally prove that GD with large step size -- on certain non-convex function classes -- follows a different trajectory than GD with a small step size, which can lead to convergence to a global minimum instead of a local one. Our settings provide a framework for future analysis which allows comparing algorithms based on behaviors that can not be observed in the traditional settings., Comment: A short version of this work appeared in ICML 22 ICML Workshop on Continuous Time Methods for Machine Learning under the title "The Gap Between Continuous and Discrete Gradient Descent"
Published: 2022
Full Text: View/download PDF

14. On Second-order Optimization Methods for Federated Learning

Author: Bischoff, Sebastian, Günnemann, Stephan, Jaggi, Martin, and Stich, Sebastian U.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We consider federated learning (FL), where the training data is distributed across a large number of clients. The standard optimization method in this setting is Federated Averaging (FedAvg), which performs multiple local first-order optimization steps between communication rounds. In this work, we evaluate the performance of several second-order distributed methods with local steps in the FL setting which promise to have favorable convergence properties. We (i) show that FedAvg performs surprisingly well against its second-order competitors when evaluated under fair metrics (equal amount of local computations)-in contrast to the results of previous work. Based on our numerical study, we propose (ii) a novel variant that uses second-order local information for updates and a global line search to counteract the resulting local specificity., Comment: ICML 2021 Workshop "Beyond first-order methods in ML systems"
Published: 2021
Full Text: View/download PDF

15. Linear Speedup in Personalized Collaborative Learning

Author: Chayti, El Mahdi, Karimireddy, Sai Praneeth, Stich, Sebastian U., Flammarion, Nicolas, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Collaborative training can improve the accuracy of a model for a user by trading off the model's bias (introduced by using data from other users who are potentially different) against its variance (due to the limited amount of data on any single user). In this work, we formalize the personalized collaborative learning problem as a stochastic optimization of a task 0 while giving access to N related but different tasks 1,..., N. We provide convergence guarantees for two algorithms in this setting -- a popular collaboration method known as weighted gradient averaging, and a novel bias correction method -- and explore conditions under which we can achieve linear speedup w.r.t. the number of auxiliary tasks N. Further, we also empirically study their performance confirming our theoretical insights.
Published: 2021
Full Text: View/download PDF

16. Lightweight Cross-Lingual Sentence Representation Learning

Author: Mao, Zhuoyuan, Gupta, Prakhar, Wang, Pei, Chu, Chenhui, Jaggi, Martin, and Kurohashi, Sadao
Subjects: FOS: Computer and information sciences, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computation and Language (cs.CL)
Abstract: Large-scale models for learning fixed-dimensional cross-lingual sentence representations like LASER (Artetxe and Schwenk, 2019b) lead to significant improvement in performance on downstream tasks. However, further increases and modifications based on such large-scale models are usually impractical due to memory limitations. In this work, we introduce a lightweight dual-transformer architecture with just 2 layers for generating memory-efficient cross-lingual sentence representations. We explore different training tasks and observe that current cross-lingual training tasks leave a lot to be desired for this shallow architecture. To ameliorate this, we propose a novel cross-lingual language model, which combines the existing single-word masked language model with the newly proposed cross-lingual token-level reconstruction task. We further augment the training task by the introduction of two computationally-lite sentence-level contrastive learning tasks to enhance the alignment of cross-lingual sentence representation space, which compensates for the learning bottleneck of the lightweight transformer for generative tasks. Our comparisons with competing models on cross-lingual sentence retrieval and multilingual document classification confirm the effectiveness of the newly proposed training tasks for a shallow model., Comment: ACL 2021 main conference; modified Eq. (2)
Published: 2021
Full Text: View/download PDF

17. Critical Parameters for Scalable Distributed Learning with Large Batches and Asynchronous Updates

Author: Stich, Sebastian Urban, Mohtashami, Amirkeivan, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, ml-ai, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: It has been experimentally observed that the efficiency of distributed training with stochastic gradient (SGD) depends decisively on the batch size and -- in asynchronous implementations -- on the gradient staleness. Especially, it has been observed that the speedup saturates beyond a certain batch size and/or when the delays grow too large. We identify a data-dependent parameter that explains the speedup saturation in both these settings. Our comprehensive theoretical analysis, for strongly convex, convex and non-convex settings, unifies and generalized prior work directions that often focused on only one of these two aspects. In particular, our approach allows us to derive improved speedup results under frequently considered sparsity assumptions. Our insights give rise to theoretically based guidelines on how the learning rates can be adjusted in practice. We show that our results are tight and illustrate key findings in numerical experiments., Comment: Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021
Published: 2021
Full Text: View/download PDF

18. Exact Optimization of Conformal Predictors via Incremental and Decremental Learning

Author: Cherubin, Giovanni, Chatzikokolakis, Konstantinos, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Conformal Predictors (CP) are wrappers around ML models, providing error guarantees under weak assumptions on the data distribution. They are suitable for a wide range of problems, from classification and regression to anomaly detection. Unfortunately, their very high computational complexity limits their applicability to large datasets. In this work, we show that it is possible to speed up a CP classifier considerably, by studying it in conjunction with the underlying ML method, and by exploiting incremental&decremental learning. For methods such as k-NN, KDE, and kernel LS-SVM, our approach reduces the running time by one order of magnitude, whilst producing exact solutions. With similar ideas, we also achieve a linear speed up for the harder case of bootstrapping. Finally, we extend these techniques to improve upon an optimization of k-NN CP for regression. We evaluate our findings empirically, and discuss when methods are suitable for CP optimization.
Published: 2021
Full Text: View/download PDF

19. A Field Guide to Federated Optimization

Author: Wang, Jianyu, Charles, Zachary, Xu, Zheng, Joshi, Gauri, McMahan, H. Brendan, Arcas, Blaise Aguera y, Al-Shedivat, Maruan, Andrew, Galen, Avestimehr, Salman, Daly, Katharine, Data, Deepesh, Diggavi, Suhas, Eichner, Hubert, Gadhikar, Advait, Garrett, Zachary, Girgis, Antonious M., Hanzely, Filip, Hard, Andrew, He, Chaoyang, Horvath, Samuel, Huo, Zhouyuan, Ingerman, Alex, Jaggi, Martin, Javidi, Tara, Kairouz, Peter, Kale, Satyen, Karimireddy, Sai Praneeth, Konecny, Jakub, Koyejo, Sanmi, Li, Tian, Liu, Luyang, Mohri, Mehryar, Qi, Hang, Reddi, Sashank J., Richtarik, Peter, Singhal, Karan, Smith, Virginia, Soltanolkotabi, Mahdi, Song, Weikang, Suresh, Ananda Theertha, Stich, Sebastian U., Talwalkar, Ameet, Wang, Hongyi, Woodworth, Blake, Wu, Shanshan, Yu, Felix X., Yuan, Honglin, Zaheer, Manzil, Zhang, Mi, Zhang, Tong, Zheng, Chunxiang, Zhu, Chen, and Zhu, Wennan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Federated learning and analytics are a distributed approach for collaboratively learning models (or statistics) from decentralized data, motivated by and designed for privacy protection. The distributed learning process can be formulated as solving federated optimization problems, which emphasize communication efficiency, data heterogeneity, compatibility with privacy and system requirements, and other constraints that are not primary considerations in other problem settings. This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance. The goal of this work is not to survey the current literature, but to inspire researchers and practitioners to design federated learning algorithms that can be used in various practical applications.
Published: 2021
Full Text: View/download PDF

20. Masked Training of Neural Networks with Partial Gradients

Author: Mohtashami, Amirkeivan, Jaggi, Martin, and Stich, Sebastian U.
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: State-of-the-art training algorithms for deep learning models are based on stochastic gradient descent (SGD). Recently, many variations have been explored: perturbing parameters for better accuracy (such as in Extragradient), limiting SGD updates to a subset of parameters for increased efficiency (such as meProp) or a combination of both (such as Dropout). However, the convergence of these methods is often not studied in theory. We propose a unified theoretical framework to study such SGD variants -- encompassing the aforementioned algorithms and additionally a broad variety of methods used for communication efficient training or model compression. Our insights can be used as a guide to improve the efficiency of such methods and facilitate generalization to new applications. As an example, we tackle the task of jointly training networks, a version of which (limited to sub-networks) is used to create Slimmable Networks. By training a low-rank Transformer jointly with a standard one we obtain superior performance than when it is trained separately., Comment: Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS) 2022
Published: 2021
Full Text: View/download PDF

21. RelaySum for Decentralized Deep Learning on Heterogeneous Data

Author: Vogels, Thijs, He, Lie, Koloskova, Anastasia, Lin, Tao, Karimireddy, Sai Praneeth, Stich, Sebastian U., and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In decentralized machine learning, workers compute model updates on their local data. Because the workers only communicate with few neighbors without central coordination, these updates propagate progressively over the network. This paradigm enables distributed training on networks without all-to-all connectivity, helping to protect data privacy as well as to reduce the communication cost of distributed training in data centers. A key challenge, primarily in decentralized deep learning, remains the handling of differences between the workers' local data distributions. To tackle this challenge, we introduce the RelaySum mechanism for information propagation in decentralized learning. RelaySum uses spanning trees to distribute information exactly uniformly across all workers with finite delays depending on the distance between nodes. In contrast, the typical gossip averaging mechanism only distributes data uniformly asymptotically while using the same communication volume per step as RelaySum. We prove that RelaySGD, based on this mechanism, is independent of data heterogeneity and scales to many workers, enabling highly accurate decentralized deep learning on heterogeneous data. Our code is available at http://github.com/epfml/relaysgd., Comment: Presented at NeurIPS 2021
Published: 2021
Full Text: View/download PDF

22. Understanding Memorization from the Perspective of Optimization via Efficient Influence Estimation

Author: Liu, Futong, Lin, Tao, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Over-parameterized deep neural networks are able to achieve excellent training accuracy while maintaining a small generalization error. It has also been found that they are able to fit arbitrary labels, and this behaviour is referred to as the phenomenon of memorization. In this work, we study the phenomenon of memorization with turn-over dropout, an efficient method to estimate influence and memorization, for data with true labels (real data) and data with random labels (random data). Our main findings are: (i) For both real data and random data, the optimization of easy examples (e.g., real data) and difficult examples (e.g., random data) are conducted by the network simultaneously, with easy ones at a higher speed; (ii) For real data, a correct difficult example in the training dataset is more informative than an easy one. By showing the existence of memorization on random data and real data, we highlight the consistency between them regarding optimization and we emphasize the implication of memorization during optimization.
Published: 2021
Full Text: View/download PDF

23. Multi-Head Attention: Collaborate Instead of Concatenate

Author: Cordonnier, Jean-Baptiste, Loukas, Andreas, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning, Machine Learning (stat.ML), Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Attention layers are widely used in natural language processing (NLP) and are beginning to influence computer vision architectures. Training very large transformer models allowed significant improvement in both fields, but once trained, these networks show symptoms of over-parameterization. For instance, it is known that many attention heads can be pruned without impacting accuracy. This work aims to enhance current understanding on how multiple heads interact. Motivated by the observation that attention heads learn redundant key/query projections, we propose a collaborative multi-head attention layer that enables heads to learn shared projections. Our scheme decreases the number of parameters in an attention layer and can be used as a drop-in replacement in any transformer architecture. Our experiments confirm that sharing key/query dimensions can be exploited in language understanding, machine translation and vision. We also show that it is possible to re-parametrize a pre-trained multi-head attention layer into our collaborative attention layer. Collaborative multi-head attention reduces the size of the key and query projections by 4 for same accuracy and speed. Our code is public.
Published: 2020
Full Text: View/download PDF

24. Dynamic Model Pruning with Feedback

Author: Lin, Tao, Stich, Sebastian Urban, Barba Flores, Luis Felipe, Dmitriev, Daniil, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, network pruning, Statistics - Machine Learning, deep learning, dynamic reparameterization, Machine Learning (stat.ML), model compression, optimization, Machine Learning (cs.LG)
Abstract: Deep neural networks often have millions of parameters. This can hinder their deployment to low-end devices, not only due to high memory requirements but also because of increased latency at inference. We propose a novel model compression method that generates a sparse trained model without additional overhead: by allowing (i) dynamic allocation of the sparsity pattern and (ii) incorporating feedback signal to reactivate prematurely pruned weights we obtain a performant sparse model in one single training pass (retraining is not needed, but can further improve the performance). We evaluate our method on CIFAR-10 and ImageNet, and show that the obtained sparse models can reach the state-of-the-art performance of dense models. Moreover, their performance surpasses that of models generated by all previously proposed pruning schemes., Comment: appearing at ICLR 2020
Published: 2020
Full Text: View/download PDF

25. Sparse Communication for Training Deep Networks

Author: Eghlidi, Negar Foroutan and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: Synchronous stochastic gradient descent (SGD) is the most common method used for distributed training of deep learning models. In this algorithm, each worker shares its local gradients with others and updates the parameters using the average gradients of all workers. Although distributed training reduces the computation time, the communication overhead associated with the gradient exchange forms a scalability bottleneck for the algorithm. There are many compression techniques proposed to reduce the number of gradients that needs to be communicated. However, compressing the gradients introduces yet another overhead to the problem. In this work, we study several compression schemes and identify how three key parameters affect the performance. We also provide a set of insights on how to increase performance and introduce a simple sparsification scheme, random-block sparsification, that reduces communication while keeping the performance close to standard SGD.
Published: 2020
Full Text: View/download PDF

26. Robust Cross-lingual Embeddings from Parallel Sentences

Author: Sabet, Ali, Gupta, Prakhar, Cordonnier, Jean-Baptiste, West, Robert, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, InformationSystems_INFORMATIONSTORAGEANDRETRIEVAL, ComputingMethodologies_DOCUMENTANDTEXTPROCESSING, Computation and Language (cs.CL), Information Retrieval (cs.IR), Computer Science - Information Retrieval, Machine Learning (cs.LG)
Abstract: Recent advances in cross-lingual word embeddings have primarily relied on mapping-based methods, which project pretrained word embeddings from different languages into a shared space through a linear transformation. However, these approaches assume word embedding spaces are isomorphic between different languages, which has been shown not to hold in practice (S{\o}gaard et al., 2018), and fundamentally limits their performance. This motivates investigating joint learning methods which can overcome this impediment, by simultaneously learning embeddings across languages via a cross-lingual term in the training objective. We propose a bilingual extension of the CBOW method which leverages sentence-aligned corpora to obtain robust cross-lingual word and sentence representations. Our approach significantly improves cross-lingual sentence retrieval performance over all other approaches while maintaining parity with the current state-of-the-art methods on word-translation. It also achieves parity with a deep RNN method on a zero-shot cross-lingual document classification task, requiring far fewer computational resources for training and inference. As an additional advantage, our bilingual method leads to a much more pronounced improvement in the the quality of monolingual word vectors compared to other competing methods.
Published: 2019
Full Text: View/download PDF

27. Decentralized Deep Learning with Arbitrary Communication Compression

Author: Koloskova, Anastasiia, Lin, Tao, Stich, Sebastian Urban, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, G.1.6, E.4, Machine Learning (stat.ML), F.2.1, Machine Learning (cs.LG), ml-ai, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Optimization and Control (math.OC), Computer Science - Data Structures and Algorithms, FOS: Mathematics, 68W10, 68W15, 68W40, 90C06, 90C25, 90C35, Data Structures and Algorithms (cs.DS), Distributed, Parallel, and Cluster Computing (cs.DC), Mathematics - Optimization and Control
Abstract: Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks, as well as for efficient scaling to large compute clusters. As current approaches suffer from limited bandwidth of the network, we propose the use of communication compression in the decentralized training context. We show that Choco-SGD $-$ recently introduced and analyzed for strongly-convex objectives only $-$ converges under arbitrary high compression ratio on general non-convex functions at the rate $O\bigl(1/\sqrt{nT}\bigr)$ where $T$ denotes the number of iterations and $n$ the number of workers. The algorithm achieves linear speedup in the number of workers and supports higher compression than previous state-of-the art methods. We demonstrate the practical performance of the algorithm in two key scenarios: the training of deep learning models (i) over distributed user devices, connected by a social network and (ii) in a datacenter (outperforming all-reduce time-wise).
Published: 2019
Full Text: View/download PDF

28. MLSys: The New Frontier of Machine Learning Systems

Author: Ratner, Alexander, Alistarh, Dan, Alonso, Gustavo, Andersen, David G., Bailis, Peter, Bird, Sarah, Carlini, Nicholas, Catanzaro, Bryan, Chayes, Jennifer, Chung, Eric, Dally, Bill, Dean, Jeff, Dhillon, Inderjit S., Dimakis, Alexandros, Dubey, Pradeep, Elkan, Charles, Fursin, Grigori, Ganger, Gregory R., Getoor, Lise, Gibbons, Phillip B., Gibson, Garth A., Gonzalez, Joseph E., Gottschlich, Justin, Han, Song, Hazelwood, Kim, Huang, Furong, Jaggi, Martin, Jamieson, Kevin, Jordan, Michael I., Joshi, Gauri, Khalaf, Rania, Knight, Jason, Konečný, Jakub, Kraska, Tim, Kumar, Arun, Kyrillidis, Anastasios, Lakshmiratan, Aparna, Li, Jing, Madden, Samuel, McMahan, H. Brendan, Meijer, Erik, Mitliagkas, Ioannis, Monga, Rajat, Murray, Derek, Olukotun, Kunle, Papailiopoulos, Dimitris, Pekhimenko, Gennady, Rekatsinas, Theodoros, Rostamizadeh, Afshin, Ré, Christopher, De Sa, Christopher, Sedghi, Hanie, Sen, Siddhartha, Smith, Virginia, Smola, Alex, Song, Dawn, Sparks, Evan, Stoica, Ion, Sze, Vivienne, Udell, Madeleine, Vanschoren, Joaquin, Venkataraman, Shivaram, Vinayak, Rashmi, Weimer, Markus, Wilson, Andrew Gordon, Xing, Eric, Zaharia, Matei, Zhang, Ce, and Talwalkar, Ameet
Subjects: Software Engineering (cs.SE), FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Software Engineering, Computer Science - Databases, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Databases (cs.DB), Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), Machine Learning (cs.LG)
Abstract: Machine learning (ML) techniques are enjoying rapidly increasing adoption. However, designing and implementing the systems that support ML models in real-world deployments remains a significant obstacle, in large part due to the radically different development and deployment profile of modern ML methods, and the range of practical concerns that come with broader adoption. We propose to foster a new systems machine learning research community at the intersection of the traditional systems and ML communities, focused on topics such as hardware systems for ML, software systems for ML, and ML optimized for metrics beyond predictive accuracy. To do this, we describe a new conference, MLSys, that explicitly targets research at the intersection of systems and machine learning with a program committee split evenly between experts in systems and ML, and an explicit focus on topics at the intersection of the two.
Published: 2019
Full Text: View/download PDF

29. Context Mover's Distance & Barycenters: Optimal Transport of Contexts for Building Representations

Author: Singh, Sidak Pal, Hug, Andreas, Dieuleveut, Aymeric, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Machine Learning (stat.ML), Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: We present a framework for building unsupervised representations of entities and their compositions, where each entity is viewed as a probability distribution rather than a vector embedding. In particular, this distribution is supported over the contexts which co-occur with the entity and are embedded in a suitable low-dimensional space. This enables us to consider representation learning from the perspective of Optimal Transport and take advantage of its tools such as Wasserstein distance and barycenters. We elaborate how the method can be applied for obtaining unsupervised representations of text and illustrate the performance (quantitatively as well as qualitatively) on tasks such as measuring sentence similarity, word entailment and similarity, where we empirically observe significant gains (e.g., 4.1% relative improvement over Sent2vec, GenSen). The key benefits of the proposed approach include: (a) capturing uncertainty and polysemy via modeling the entities as distributions, (b) utilizing the underlying geometry of the particular task (with the ground cost), (c) simultaneously providing interpretability with the notion of optimal transport between contexts and (d) easy applicability on top of existing point embedding methods. The code, as well as prebuilt histograms, are available under https://github.com/context-mover/., AISTATS 2020. Also, accepted previously at ICLR 2019 DeepGenStruct Workshop
Published: 2018
Full Text: View/download PDF

30. Don't Use Large Mini-Batches, Use Local SGD

Author: Lin, Tao, Stich, Sebastian U., Patel, Kumar Kshitij, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, ml-ai, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Mini-batch stochastic gradient methods (SGD) are state of the art for distributed training of deep neural networks. Drastic increases in the mini-batch sizes have lead to key efficiency and scalability gains in recent years. However, progress faces a major roadblock, as models trained with large batches often do not generalize well, i.e. they do not show good accuracy on new data. As a remedy, we propose a \emph{post-local} SGD and show that it significantly improves the generalization performance compared to large-batch training on standard benchmarks while enjoying the same efficiency (time-to-accuracy) and scalability. We further provide an extensive study of the communication efficiency vs. performance trade-offs associated with a host of \emph{local SGD} variants., Comment: To appear in ICLR 2020
Published: 2018
Full Text: View/download PDF

31. COLA: Decentralized Linear Learning

Author: He, Lie, Bian, An, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, convergence, algorithm, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning, Machine Learning (stat.ML), Distributed, Parallel, and Cluster Computing (cs.DC), optimization, Machine Learning (cs.LG)
Abstract: Decentralized machine learning is a promising emerging paradigm in view of global challenges of data ownership and privacy. We consider learning of linear classification and regression models, in the setting where the training data is decentralized over many user devices, and the learning algorithm must run on-device, on an arbitrary communication network, without a central coordinator. We propose COLA, a new decentralized training algorithm with strong theoretical guarantees and superior practical performance. Our framework overcomes many limitations of existing methods, and achieves communication efficiency, scalability, elasticity as well as resilience to changes in data and allows for unreliable and heterogeneous participating devices.
Published: 2018
Full Text: View/download PDF

32. Global linear convergence of Newton's method without strong-convexity or Lipschitz gradients

Author: Karimireddy, Sai Praneeth, Stich, Sebastian U., and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Optimization and Control (math.OC), G.1.6, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG), 90C25, 68Q25
Abstract: We show that Newton's method converges globally at a linear rate for objective functions whose Hessians are stable. This class of problems includes many functions which are not strongly convex, such as logistic regression. Our linear convergence result is (i) affine-invariant, and holds even if an (ii) approximate Hessian is used, and if the subproblems are (iii) only solved approximately. Thus we theoretically demonstrate the superiority of Newton's method over first-order methods, which would only achieve a sublinear $O(1/t^2)$ rate under similar conditions., Comment: 19 pages
Published: 2018
Full Text: View/download PDF

33. Training DNNs with Hybrid Block Floating Point

Author: Drumond, Mario, Lin, Tao, Jaggi, Martin, and Falsafi, Babak
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, ml-ai, Statistics - Machine Learning, Computer Science - Numerical Analysis, FOS: Mathematics, Machine Learning (stat.ML), Numerical Analysis (math.NA), Machine Learning (cs.LG)
Abstract: The wide adoption of DNNs has given birth to unrelenting computing requirements, forcing datacenter operators to adopt domain-specific accelerators to train them. These accelerators typically employ densely packed full precision floating-point arithmetic to maximize performance per area. Ongoing research efforts seek to further increase that performance density by replacing floating-point with fixed-point arithmetic. However, a significant roadblock for these attempts has been fixed point's narrow dynamic range, which is insufficient for DNN training convergence. We identify block floating point (BFP) as a promising alternative representation since it exhibits wide dynamic range and enables the majority of DNN operations to be performed with fixed-point logic. Unfortunately, BFP alone introduces several limitations that preclude its direct applicability. In this work, we introduce HBFP, a hybrid BFP-FP approach, which performs all dot products in BFP and other operations in floating point. HBFP delivers the best of both worlds: the high accuracy of floating point at the superior hardware density of fixed point. For a wide variety of models, we show that HBFP matches floating point's accuracy while enabling hardware implementations that deliver up to 8.5x higher throughput., Comment: 9 pages, 3 figures. Accepted in Neural Information Processing Systems 2018 (NeurIPS 2018)
Published: 2018
Full Text: View/download PDF

34. Approximate Steepest Coordinate Descent

Author: Stich, Sebastian Urban, Raj, Anant, and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Learning, ml-ai, Optimization and Control (math.OC), G.1.6, FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We propose a new selection rule for the coordinate selection in coordinate descent methods for huge-scale optimization. The efficiency of this novel scheme is provably better than the efficiency of uniformly random selection, and can reach the efficiency of steepest coordinate descent (SCD), enabling an acceleration of a factor of up to $n$, the number of coordinates. In many practical applications, our scheme can be implemented at no extra cost and computational efficiency very close to the faster uniform selection. Numerical experiments with Lasso and Ridge regression show promising improvements, in line with our theoretical guarantees., Comment: appearing at ICML 2017
Published: 2017
Full Text: View/download PDF

35. Distributed Optimization with Arbitrary Local Solvers

Author: Ma, Chenxin, Konecný, Jakub, Jaggi, Martin, Smith, Virginia, Jordan, Michael I., Richtárik, Peter, and Takác, Martin
Subjects: FOS: Computer and information sciences, distributed computing, Computer Science - Learning, machine learning, Optimization and Control (math.OC), primal-dual algorithm, FOS: Mathematics, Mathematics - Optimization and Control, convergence analysis, Machine Learning (cs.LG)
Abstract: With the growth of data and necessity for distributed optimization methods, solvers that work well on a single machine must be re-designed to leverage distributed computation. Recent work in this area has been limited by focusing heavily on developing highly specific methods for the distributed environment. These special-purpose methods are often unable to fully leverage the competitive performance of their well-tuned and customized single machine counterparts. Further, they are unable to easily integrate improvements that continue to be made to single machine methods. To this end, we present a framework for distributed optimization that both allows the flexibility of arbitrary solvers to be used on each (single) machine locally, and yet maintains competitive performance against other state-of-the-art special-purpose distributed methods. We give strong primal-dual convergence rate guarantees for our framework that hold for arbitrary local solvers. We demonstrate the impact of local solver selection both theoretically and in an extensive experimental comparison. Finally, we provide thorough implementation details for our framework, highlighting areas for practical performance gains.
Published: 2015
Full Text: View/download PDF

36. L1-Regularized Distributed Optimization: A Communication-Efficient Primal-Dual Framework

Author: Smith, Virginia, Forte, Simone, Jordan, Michael I., and Jaggi, Martin
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Machine Learning (cs.LG)
Abstract: Despite the importance of sparsity in many large-scale applications, there are few methods for distributed optimization of sparsity-inducing objectives. In this paper, we present a communication-efficient framework for L1-regularized optimization in the distributed environment. By viewing classical objectives in a more general primal-dual setting, we develop a new class of methods that can be efficiently distributed and applied to common sparsity-inducing models, such as Lasso, sparse logistic regression, and elastic net-regularized problems. We provide theoretical convergence guarantees for our framework, and demonstrate its efficiency and flexibility with a thorough experimental comparison on Amazon EC2. Our proposed framework yields speedups of up to 50x as compared to current state-of-the-art methods for distributed L1-regularized optimization.
Published: 2015
Full Text: View/download PDF

37. On the Global Linear Convergence of Frank-Wolfe Optimization Variants

Author: Lacoste-Julien, Simon, Jaggi, Martin, Statistical Machine Learning and Parsimony (SIERRA), Département d'informatique - ENS Paris (DI-ENS), École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS-PSL), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-Inria Paris-Rocquencourt, Institut National de Recherche en Informatique et en Automatique (Inria), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS), Microsoft Research - Inria Joint Centre (MSR - INRIA), Institut National de Recherche en Informatique et en Automatique (Inria)-Microsoft Research Laboratory Cambridge-Microsoft Corporation [Redmond, Wash.], Department of Computer Science [ETH Zürich] (D-INFK), Eidgenössische Technische Hochschule - Swiss Federal Institute of Technology [Zürich] (ETH Zürich), Google Research Award, MSR-Inria Joint Centre, Département d'informatique de l'École normale supérieure (DI-ENS), École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)-École normale supérieure - Paris (ENS Paris), Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)-Institut National de Recherche en Informatique et en Automatique (Inria)-École normale supérieure - Paris (ENS Paris), Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)-Inria Paris-Rocquencourt, and Université Paris sciences et lettres (PSL)-Université Paris sciences et lettres (PSL)
Subjects: FOS: Computer and information sciences, I.2.6, G.1.6, Machine Learning (stat.ML), Machine Learning (cs.LG), Computer Science - Learning, ml-ai, [INFO.INFO-LG]Computer Science [cs]/Machine Learning [cs.LG], [STAT.ML]Statistics [stat]/Machine Learning [stat.ML], Statistics - Machine Learning, Optimization and Control (math.OC), ACM: G.: Mathematics of Computing/G.1: NUMERICAL ANALYSIS/G.1.6: Optimization, FOS: Mathematics, [MATH.MATH-OC]Mathematics [math]/Optimization and Control [math.OC], Mathematics - Optimization and Control, ACM: I.: Computing Methodologies/I.2: ARTIFICIAL INTELLIGENCE/I.2.6: Learning, 90C52, 90C90, 68T05
Abstract: The Frank-Wolfe (FW) optimization algorithm has lately re-gained popularity thanks in particular to its ability to nicely handle the structured constraints appearing in machine learning applications. However, its convergence rate is known to be slow (sublinear) when the solution lies at the boundary. A simple less-known fix is to add the possibility to take 'away steps' during optimization, an operation that importantly does not require a feasibility oracle. In this paper, we highlight and clarify several variants of the Frank-Wolfe optimization algorithm that have been successfully applied in practice: away-steps FW, pairwise FW, fully-corrective FW and Wolfe's minimum norm point algorithm, and prove for the first time that they all enjoy global linear convergence, under a weaker condition than strong convexity of the objective. The constant in the convergence rate has an elegant interpretation as the product of the (classical) condition number of the function with a novel geometric quantity that plays the role of a 'condition number' of the constraint set. We provide pointers to where these algorithms have made a difference in practice, in particular with the flow polytope, the marginal polytope and the base polytope for submodular optimization., Comment: Appears in: Advances in Neural Information Processing Systems 28 (NIPS 2015). 26 pages
Published: 2015
Full Text: View/download PDF

38. Convex Optimization without Projection Steps

Author: Jaggi, Martin
Subjects: FOS: Computer and information sciences, I.5.1, Computer Science - Artificial Intelligence, MathematicsofComputing_NUMERICALANALYSIS, G.1.3, Systems and Control (eess.SY), F.2.2, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), 90C22, 90C20, 90C25, TheoryofComputation_ANALYSISOFALGORITHMSANDPROBLEMCOMPLEXITY, FOS: Mathematics, FOS: Electrical engineering, electronic engineering, information engineering, Computer Science - Systems and Control, Mathematics - Optimization and Control
Abstract: For the general problem of minimizing a convex function over a compact convex domain, we will investigate a simple iterative approximation algorithm based on the method by Frank & Wolfe 1956, that does not need projection steps in order to stay inside the optimization domain. Instead of a projection step, the linearized problem defined by a current subgradient is solved, which gives a step direction that will naturally stay in the domain. Our framework generalizes the sparse greedy algorithm of Frank & Wolfe and its primal-dual analysis by Clarkson 2010 (and the low-rank SDP approach by Hazan 2008) to arbitrary convex domains. We give a convergence proof guaranteeing {\epsilon}-small duality gap after O(1/{\epsilon}) iterations. The method allows us to understand the sparsity of approximate solutions for any l1-regularized convex optimization problem (and for optimization over the simplex), expressed as a function of the approximation quality. We obtain matching upper and lower bounds of {\Theta}(1/{\epsilon}) for the sparsity for l1-problems. The same bounds apply to low-rank semidefinite optimization with bounded trace, showing that rank O(1/{\epsilon}) is best possible here as well. As another application, we obtain sparse matrices of O(1/{\epsilon}) non-zero entries as {\epsilon}-approximate solutions when optimizing any convex function over a class of diagonally dominant symmetric matrices. We show that our proposed first-order method also applies to nuclear norm and max-norm matrix optimization problems. For nuclear norm regularized optimization, such as matrix completion and low-rank recovery, we demonstrate the practical efficiency and scalability of our algorithm for large matrix problems, as e.g. the Netflix dataset. For general convex optimization over bounded matrix max-norm, our algorithm is the first with a convergence guarantee, to the best of our knowledge.
Published: 2011
Full Text: View/download PDF

39. A Combinatorial Algorithm to Compute Regularization Paths

Author: Gärtner, Bernd, Giesen, Joachim, Jaggi, Martin, and Welsch, Torsten
Subjects: FOS: Computer and information sciences, Computer Science - Learning, I.5.1, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, F.2.2, Machine Learning (cs.LG)
Abstract: For a wide variety of regularization methods, algorithms computing the entire solution path have been developed recently. Solution path algorithms do not only compute the solution for one particular value of the regularization parameter but the entire path of solutions, making the selection of an optimal parameter much easier. Most of the currently used algorithms are not robust in the sense that they cannot deal with general or degenerate input. Here we present a new robust, generic method for parametric quadratic programming. Our algorithm directly applies to nearly all machine learning applications, where so far every application required its own different algorithm. We illustrate the usefulness of our method by applying it to a very low rank problem which could not be solved by existing path tracking methods, namely to compute part-worth values in choice based conjoint analysis, a popular technique from market research to estimate consumers preferences on a class of parameterized options., Comment: 7 Pages, 1 Figure
Published: 2009
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

39 results on '"Jaggi, Martin"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources