Author: "Zhao, Shen-Yi" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Zhao, Shen-Yi"' showing total 18 results

Start Over Author "Zhao, Shen-Yi"

18 results on '"Zhao, Shen-Yi"'

1. Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training

Author: Zhao, Shen-Yi, Shi, Chang-Wei, Xie, Yin-Peng, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Stochastic gradient descent~(SGD) and its variants have been the dominating optimization methods in machine learning. Compared to SGD with small-batch training, SGD with large-batch training can better utilize the computational power of current multi-core systems such as graphics processing units~(GPUs) and can reduce the number of communication rounds in distributed training settings. Thus, SGD with large-batch training has attracted considerable attention. However, existing empirical results showed that large-batch training typically leads to a drop in generalization accuracy. Hence, how to guarantee the generalization ability in large-batch training becomes a challenging task. In this paper, we propose a simple yet effective method, called stochastic normalized gradient descent with momentum~(SNGM), for large-batch training. We prove that with the same number of gradient computations, SNGM can adopt a larger batch size than momentum SGD~(MSGD), which is one of the most widely used variants of SGD, to converge to an $\epsilon$-stationary point. Empirical results on deep learning verify that when adopting the same large batch size, SNGM can achieve better test accuracy than MSGD and other state-of-the-art large-batch training methods.
Published: 2020
Full Text: View/download PDF

2. Stagewise Enlargement of Batch Size for SGD-based Learning

Author: Zhao, Shen-Yi, Xie, Yin-Peng, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Existing research shows that the batch size can seriously affect the performance of stochastic gradient descent~(SGD) based learning, including training speed and generalization ability. A larger batch size typically results in less parameter updates. In distributed training, a larger batch size also results in less frequent communication. However, a larger batch size can make a generalization gap more easily. Hence, how to set a proper batch size for SGD has recently attracted much attention. Although some methods about setting batch size have been proposed, the batch size problem has still not been well solved. In this paper, we first provide theory to show that a proper batch size is related to the gap between initialization and optimum of the model parameter. Then based on this theory, we propose a novel method, called \underline{s}tagewise \underline{e}nlargement of \underline{b}atch \underline{s}ize~(\mbox{SEBS}), to set proper batch size for SGD. More specifically, \mbox{SEBS} adopts a multi-stage scheme, and enlarges the batch size geometrically by stage. We theoretically prove that, compared to classical stagewise SGD which decreases learning rate by stage, \mbox{SEBS} can reduce the number of parameter updates without increasing generalization error. SEBS is suitable for \mbox{SGD}, momentum \mbox{SGD} and AdaGrad. Empirical results on real data successfully verify the theories of \mbox{SEBS}. Furthermore, empirical results also show that SEBS can outperform other baselines.
Published: 2020

3. ADASS: Adaptive Sample Selection for Training Acceleration

Author: Zhao, Shen-Yi, Gao, Hao, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Stochastic gradient decent~(SGD) and its variants, including some accelerated variants, have become popular for training in machine learning. However, in all existing SGD and its variants, the sample size in each iteration~(epoch) of training is the same as the size of the full training set. In this paper, we propose a new method, called \underline{ada}ptive \underline{s}ample \underline{s}election~(ADASS), for training acceleration. During different epoches of training, ADASS only need to visit different training subsets which are adaptively selected from the full training set according to the Lipschitz constants of the loss functions on samples. It means that in ADASS the sample size in each epoch of training can be smaller than the size of the full training set, by discarding some samples. ADASS can be seamlessly integrated with existing optimization methods, such as SGD and momentum SGD, for training acceleration. Theoretical results show that the learning accuracy of ADASS is comparable to that of counterparts with full training set. Furthermore, empirical results on both shallow models and deep models also show that ADASS can accelerate the training process of existing methods without sacrificing accuracy.
Published: 2019

4. Clustered Reinforcement Learning

Author: Ma, Xiao, Zhao, Shen-Yi, and Li, Wu-Jun
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning
Abstract: Exploration strategy design is one of the challenging problems in reinforcement learning~(RL), especially when the environment contains a large state space or sparse rewards. During exploration, the agent tries to discover novel areas or high reward~(quality) areas. In most existing methods, the novelty and quality in the neighboring area of the current state are not well utilized to guide the exploration of the agent. To tackle this problem, we propose a novel RL framework, called \underline{c}lustered \underline{r}einforcement \underline{l}earning~(CRL), for efficient exploration in RL. CRL adopts clustering to divide the collected states into several clusters, based on which a bonus reward reflecting both novelty and quality in the neighboring area~(cluster) of the current state is given to the agent. Experiments on a continuous control task and several \emph{Atari 2600} games show that CRL can outperform other state-of-the-art methods to achieve the best performance in most cases., Comment: 16pages, 3 figures
Published: 2019

5. On the Convergence of Memory-Based Distributed SGD

Author: Zhao, Shen-Yi, Gao, Hao, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Distributed stochastic gradient descent~(DSGD) has been widely used for optimizing large-scale machine learning models, including both convex and non-convex models. With the rapid growth of model size, huge communication cost has been the bottleneck of traditional DSGD. Recently, many communication compression methods have been proposed. Memory-based distributed stochastic gradient descent~(M-DSGD) is one of the efficient methods since each worker communicates a sparse vector in each iteration so that the communication cost is small. Recent works propose the convergence rate of M-DSGD when it adopts vanilla SGD. However, there is still a lack of convergence theory for M-DSGD when it adopts momentum SGD. In this paper, we propose a universal convergence analysis for M-DSGD by introducing \emph{transformation equation}. The transformation equation describes the relation between traditional DSGD and M-DSGD so that we can transform M-DSGD to its corresponding DSGD. Hence we get the convergence rate of M-DSGD with momentum for both convex and non-convex problems. Furthermore, we combine M-DSGD and stagewise learning that the learning rate of M-DSGD in each stage is a constant and is decreased by stage, instead of iteration. Using the transformation equation, we propose the convergence rate of stagewise M-DSGD which bridges the gap between theory and practice.
Published: 2019

6. Global Momentum Compression for Sparse Communication in Distributed Learning

Author: Shi, Chang-Wei, Zhao, Shen-Yi, Xie, Yin-Peng, Gao, Hao, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: With the rapid growth of data, distributed momentum stochastic gradient descent~(DMSGD) has been widely used in distributed learning, especially for training large-scale deep models. Due to the latency and limited bandwidth of the network, communication has become the bottleneck of distributed learning. Communication compression with sparsified gradient, abbreviated as \emph{sparse communication}, has been widely employed to reduce communication cost. All existing works about sparse communication in DMSGD employ local momentum, in which the momentum only accumulates stochastic gradients computed by each worker locally. In this paper, we propose a novel method, called \emph{\underline{g}}lobal \emph{\underline{m}}omentum \emph{\underline{c}}ompression~(GMC), for sparse communication. Different from existing works that utilize local momentum, GMC utilizes global momentum. Furthermore, to enhance the convergence performance when using more aggressive sparsification compressors (e.g., RBGS), we extend GMC to GMC+. We theoretically prove the convergence of GMC and GMC+. To the best of our knowledge, this is the first work that introduces global momentum for sparse communication in distributed learning. Empirical results demonstrate that, compared with the local momentum counterparts, our GMC and GMC+ can achieve higher test accuracy and exhibit faster convergence, especially under non-IID data distribution.
Published: 2019

7. Quantized Epoch-SGD for Communication-Efficient Distributed Learning

Author: Zhao, Shen-Yi, Gao, Hao, and Li, Wu-Jun
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Due to its efficiency and ease to implement, stochastic gradient descent (SGD) has been widely used in machine learning. In particular, SGD is one of the most popular optimization methods for distributed learning. Recently, quantized SGD (QSGD), which adopts quantization to reduce the communication cost in SGD-based distributed learning, has attracted much attention. Although several QSGD methods have been proposed, some of them are heuristic without theoretical guarantee, and others have high quantization variance which makes the convergence become slow. In this paper, we propose a new method, called Quantized Epoch-SGD (QESGD), for communication-efficient distributed learning. QESGD compresses (quantizes) the parameter with variance reduction, so that it can get almost the same performance as that of SGD with less communication cost. QESGD is implemented on the Parameter Server framework, and empirical results on distributed deep learning show that QESGD can outperform other state-of-the-art quantization methods to achieve the best performance.
Published: 2019

8. Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

Author: Zhao, Shen-Yi, Zhang, Gong-Duo, Li, Ming-Wei, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Machine Learning
Abstract: Distributed sparse learning with a cluster of multiple machines has attracted much attention in machine learning, especially for large-scale applications with high-dimensional data. One popular way to implement sparse learning is to use $L_1$ regularization. In this paper, we propose a novel method, called proximal \mbox{SCOPE}~(\mbox{pSCOPE}), for distributed sparse learning with $L_1$ regularization. pSCOPE is based on a \underline{c}ooperative \underline{a}utonomous \underline{l}ocal \underline{l}earning~(\mbox{CALL}) framework. In the \mbox{CALL} framework of \mbox{pSCOPE}, we find that the data partition affects the convergence of the learning procedure, and subsequently we define a metric to measure the goodness of a data partition. Based on the defined metric, we theoretically prove that pSCOPE is convergent with a linear convergence rate if the data partition is good enough. We also prove that better data partition implies faster convergence rate. Furthermore, pSCOPE is also communication efficient. Experimental results on real data sets show that pSCOPE can outperform other state-of-the-art distributed methods for sparse learning.
Published: 2018

9. Feature-Distributed SVRG for High-Dimensional Linear Classification

Author: Zhang, Gong-Duo, Zhao, Shen-Yi, Gao, Hao, and Li, Wu-Jun
Subjects: Computer Science - Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Statistics - Machine Learning
Abstract: Linear classification has been widely used in many high-dimensional applications like text classification. To perform linear classification for large-scale tasks, we often need to design distributed learning methods on a cluster of multiple machines. In this paper, we propose a new distributed learning method, called feature-distributed stochastic variance reduced gradient (FD-SVRG) for high-dimensional linear classification. Unlike most existing distributed learning methods which are instance-distributed, FD-SVRG is feature-distributed. FD-SVRG has lower communication cost than other instance-distributed methods when the data dimensionality is larger than the number of data instances. Experimental results on real data demonstrate that FD-SVRG can outperform other state-of-the-art distributed methods for high-dimensional linear classification in terms of both communication cost and wall-clock time, when the dimensionality is larger than the number of instances in training data.
Published: 2018

10. Lock-Free Optimization for Non-Convex Problems

Author: Zhao, Shen-Yi, Zhang, Gong-Duo, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Learning
Abstract: Stochastic gradient descent~(SGD) and its variants have attracted much attention in machine learning due to their efficiency and effectiveness for optimization. To handle large-scale problems, researchers have recently proposed several lock-free strategy based parallel SGD~(LF-PSGD) methods for multi-core systems. However, existing works have only proved the convergence of these LF-PSGD methods for convex problems. To the best of our knowledge, no work has proved the convergence of the LF-PSGD methods for non-convex problems. In this paper, we provide the theoretical proof about the convergence of two representative LF-PSGD methods, Hogwild! and AsySVRG, for non-convex problems. Empirical results also show that both Hogwild! and AsySVRG are convergent on non-convex problems, which successfully verifies our theoretical results.
Published: 2016

11. SCOPE: Scalable Composite Optimization for Learning on Spark

Author: Zhao, Shen-Yi, Xiang, Ru, Shi, Ying-Hao, Gao, Peng, and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Learning
Abstract: Many machine learning models, such as logistic regression~(LR) and support vector machine~(SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization~(DSO) methods have been proposed to solve the large-scale composite optimization problems, which have shown better performance than traditional batch methods. However, most of these DSO methods are not scalable enough. In this paper, we propose a novel DSO method, called \underline{s}calable \underline{c}omposite \underline{op}timization for l\underline{e}arning~({SCOPE}), and implement it on the fault-tolerant distributed platform \mbox{Spark}. SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE is convergent with linear convergence rate when the objective function is convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods.
Published: 2016

12. Fast Asynchronous Parallel Stochastic Gradient Decent

Author: Zhao, Shen-Yi and Li, Wu-Jun
Subjects: Statistics - Machine Learning, Computer Science - Learning
Abstract: Stochastic gradient descent~(SGD) and its variants have become more and more popular in machine learning due to their efficiency and effectiveness. To handle large-scale problems, researchers have recently proposed several parallel SGD methods for multicore systems. However, existing parallel SGD methods cannot achieve satisfactory performance in real applications. In this paper, we propose a fast asynchronous parallel SGD method, called AsySVRG, by designing an asynchronous strategy to parallelize the recently proposed SGD variant called stochastic variance reduced gradient~(SVRG). Both theoretical and empirical results show that AsySVRG can outperform existing state-of-the-art parallel SGD methods like Hogwild! in terms of convergence rate and computation cost.
Published: 2015

13. Scalable Stochastic Alternating Direction Method of Multipliers

Author: Zhao, Shen-Yi, Li, Wu-Jun, and Zhou, Zhi-Hua
Subjects: Computer Science - Learning
Abstract: Stochastic alternating direction method of multipliers (ADMM), which visits only one sample or a mini-batch of samples each time, has recently been proved to achieve better performance than batch ADMM. However, most stochastic methods can only achieve a convergence rate $O(1/\sqrt T)$ on general convex problems,where T is the number of iterations. Hence, these methods are not scalable with respect to convergence rate (computation cost). There exists only one stochastic method, called SA-ADMM, which can achieve convergence rate $O(1/T)$ on general convex problems. However, an extra memory is needed for SA-ADMM to store the historic gradients on all samples, and thus it is not scalable with respect to storage cost. In this paper, we propose a novel method, called scalable stochastic ADMM(SCAS-ADMM), for large-scale optimization and learning problems. Without the need to store the historic gradients, SCAS-ADMM can achieve the same convergence rate $O(1/T)$ as the best stochastic method SA-ADMM and batch ADMM on general convex problems. Experiments on graph-guided fused lasso show that SCAS-ADMM can achieve state-of-the-art performance in real applications
Published: 2015

14. On the convergence and improvement of stochastic normalized gradient descent

Author: Zhao, Shen-Yi, Xie, Yin-Peng, and Li, Wu-Jun
Published: 2021
Full Text: View/download PDF

15. Global Momentum Compression for Sparse Communication in Distributed SGD

Author: Zhao, Shen-Yi, Xie, Yin-Peng, Gao, Hao, and Li, Wu-Jun
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: With the rapid growth of data, distributed stochastic gradient descent~(DSGD) has been widely used for solving large-scale machine learning problems. Due to the latency and limited bandwidth of network, communication has become the bottleneck of DSGD when we need to train large scale models, like deep neural networks. Communication compression with sparsified gradient, abbreviated as \emph{sparse communication}, has been widely used for reducing communication cost in DSGD. Recently, there has appeared one method, called deep gradient compression~(DGC), to combine memory gradient and momentum SGD for sparse communication. DGC has achieved promising performance in practise. However, the theory about the convergence of DGC is lack. In this paper, we propose a novel method, called \emph{\underline{g}}lobal \emph{\underline{m}}omentum \emph{\underline{c}}ompression~(GMC), for sparse communication in DSGD. GMC also combines memory gradient and momentum SGD. But different from DGC which adopts local momentum, GMC adopts global momentum. We theoretically prove the convergence rate of GMC for both convex and non-convex problems. To the best of our knowledge, this is the first work that proves the convergence of distributed momentum SGD~(DMSGD) with sparse communication and memory gradient. Empirical results show that, compared with the DMSGD counterpart without sparse communication, GMC can reduce the communication cost by approximately 100 fold without loss of generalization accuracy. GMC can also achieve comparable~(sometimes better) performance compared with DGC, with extra theoretical guarantee.
Published: 2019

16. SCOPE: Scalable Composite Optimization for Learning on Spark

Author: Zhao, Shen-Yi, Xiang, Ru, Shi, Ying-Hao, Gao, Peng, and Li, Wu-Jun
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Machine Learning (stat.ML), General Medicine, Machine Learning (cs.LG)
Abstract: Many machine learning models, such as logistic regression (LR) and support vector machine (SVM), can be formulated as composite optimization problems. Recently, many distributed stochastic optimization (DSO) methods have been proposed to solve the large-scale composite optimization problems, which have shown better performance than traditional batch methods. However, most of these DSO methods might not be scalable enough. In this paper, we propose a novel DSO method, called scalable composite optimization for learning (SCOPE), and implement it on the fault-tolerant distributed platform Spark. SCOPE is both computation-efficient and communication-efficient. Theoretical analysis shows that SCOPE is convergent with linear convergence rate when the objective function is strongly convex. Furthermore, empirical results on real datasets show that SCOPE can outperform other state-of-the-art distributed learning methods on Spark, including both batch learning methods and DSO methods.
Published: 2017

17. Lock-Free Optimization for Non-Convex Problems

Author: Zhao, Shen-Yi, primary, Zhang, Gong-Duo, additional, and Li, Wu-Jun, additional
Published: 2017
Full Text: View/download PDF

18. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Author: Zhao, Shen-Yi, primary and Li, Wu-Jun, additional
Published: 2016
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

18 results on '"Zhao, Shen-Yi"'

1. Stochastic Normalized Gradient Descent with Momentum for Large-Batch Training

2. Stagewise Enlargement of Batch Size for SGD-based Learning

3. ADASS: Adaptive Sample Selection for Training Acceleration

4. Clustered Reinforcement Learning

5. On the Convergence of Memory-Based Distributed SGD

6. Global Momentum Compression for Sparse Communication in Distributed Learning

7. Quantized Epoch-SGD for Communication-Efficient Distributed Learning

8. Proximal SCOPE for Distributed Sparse Learning: Better Data Partition Implies Faster Convergence Rate

9. Feature-Distributed SVRG for High-Dimensional Linear Classification

10. Lock-Free Optimization for Non-Convex Problems

11. SCOPE: Scalable Composite Optimization for Learning on Spark

12. Fast Asynchronous Parallel Stochastic Gradient Decent

13. Scalable Stochastic Alternating Direction Method of Multipliers

14. On the convergence and improvement of stochastic normalized gradient descent

15. Global Momentum Compression for Sparse Communication in Distributed SGD

16. SCOPE: Scalable Composite Optimization for Learning on Spark

17. Lock-Free Optimization for Non-Convex Problems

18. Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

18 results on '"Zhao, Shen-Yi"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources