Author: "Zhao, Tuo" / Publisher: arxiv - Searchworks@Jio Institute Digital Library Search Results

1. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Author: Li, Yixiao, Yu, Yifan, Zhang, Qingru, Liang, Chen, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memories and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximations and pruning, while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts in neurons, while pruning removes the incoherent and non-expressive parts in neurons. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks. We show that it significantly outperforms existing compression methods.
Published: 2023
Full Text: View/download PDF

2. Machine Learning Force Fields with Data Cost Aware Training

Author: Bukharin, Alexander, Liu, Tianyi, Wang, Shengjie, Zuo, Simiao, Gao, Weihao, Yan, Wen, and Zhao, Tuo
Subjects: Chemical Physics (physics.chem-ph), FOS: Computer and information sciences, Computer Science - Machine Learning, Physics - Chemical Physics, FOS: Biological sciences, FOS: Physical sciences, Quantitative Biology - Quantitative Methods, Quantitative Methods (q-bio.QM), Machine Learning (cs.LG)
Abstract: Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as $O(n^3)$ to $O(n^7)$, with $n$ proportional to the number of basis functions. To address this issue, we propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train a MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting tahe potential bias of this data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabeled. Extensive experiments on MD datasets and downstream tasks validate the efficacy of ASTEROID. Our code and data are available at https://github.com/abukharin3/asteroid.
Published: 2023
Full Text: View/download PDF

3. On Deep Generative Models for Approximation and Estimation of Distributions on Manifolds

Author: Dahal, Biraj, Havrilla, Alex, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative networks have experienced great empirical successes in distribution learning. Many existing experiments have demonstrated that generative networks can generate high-dimensional complex data from a low-dimensional easy-to-sample distribution. However, this phenomenon can not be justified by existing theories. The widely held manifold hypothesis speculates that real-world data sets, such as natural images and signals, exhibit low-dimensional geometric structures. In this paper, we take such low-dimensional data structures into consideration by assuming that data distributions are supported on a low-dimensional manifold. We prove statistical guarantees of generative networks under the Wasserstein-1 loss. We show that the Wasserstein-1 loss converges to zero at a fast rate depending on the intrinsic dimension instead of the ambient data dimension. Our theory leverages the low-dimensional geometric structures in data sets and justifies the practical power of generative networks. We require no smoothness assumptions on the data distribution which is desirable in practice.
Published: 2023
Full Text: View/download PDF

4. Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories

Author: Zhang, Zixuan, Chen, Minshuo, Wang, Mengdi, Liao, Wenjing, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to the intrinsic data structures. In real world applications, such an assumption of data lying exactly on a low dimensional manifold is stringent. This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notation -- effective Minkowski dimension. We prove that, the sample complexity of deep nonparametric regression only depends on the effective Minkowski dimension of $\mathcal{S}$ denoted by $p$. We further illustrate our theoretical findings by considering nonparametric regression with an anisotropic Gaussian random design $N(0,\Sigma)$, where $\Sigma$ is full rank. When the eigenvalues of $\Sigma$ have an exponential or polynomial decay, the effective Minkowski dimension of such an Gaussian random design is $p=\mathcal{O}(\sqrt{\log n})$ or $p=\mathcal{O}(n^\gamma)$, respectively, where $n$ is the sample size and $\gamma\in(0,1)$ is a small constant depending on the polynomial decay rate. Our theory shows that, when the manifold assumption does not hold, deep neural networks can still adapt to the effective Minkowski dimension of the data, and circumvent the curse of the ambient dimensionality for moderate sample sizes.
Published: 2023
Full Text: View/download PDF

5. Efficient Long Sequence Modeling via State Space Augmented Transformer

Author: Zuo, Simiao, Liu, Xiaodong, Jiao, Jian, Charles, Denis, Manavoglu, Eren, Zhao, Tuo, and Gao, Jianfeng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. The SSM augments global information, which complements the lack of long-range dependency issue in local attention methods. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
Published: 2022
Full Text: View/download PDF

6. High Dimensional Binary Classification under Label Shift: Phase Transition and Regularization

Author: Cheng, Jiahui, Chen, Minshuo, Liu, Hao, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: Label Shift has been widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate the impact of the label shift, e.g., balancing the training data. However, these methods often consider the underparametrized regime, where the sample size is much larger than the data dimension. The research under the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that there exists a phase transition phenomenon: Under certain overparametrized regime, the classifier trained using imbalanced data outperforms the counterpart with reduced balanced data. Moreover, we investigate the impact of regularization to the label shift: The aforementioned phase transition vanishes as the regularization becomes strong.
Published: 2022
Full Text: View/download PDF

7. Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint

Author: Liu, Hao, Chen, Minshuo, Er, Siawpeng, Liao, Wenjing, Zhang, Tong, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural network themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarial robust image classification are provided to support our theory.
Published: 2022
Full Text: View/download PDF

8. PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance

Author: Zhang, Qingru, Zuo, Simiao, Liang, Chen, Bukharin, Alexander, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Large Transformer-based models have exhibited superior performance in various natural language processing and computer vision tasks. However, these models contain enormous amounts of parameters, which restrict their deployment to real-world applications. To reduce the model size, researchers prune these models based on the weights' importance scores. However, such scores are usually estimated on mini-batches during training, which incurs large variability/uncertainty due to mini-batch sampling and complicated training dynamics. As a result, some crucial weights could be pruned by commonly used pruning methods because of such uncertainty, which makes training unstable and hurts generalization. To resolve this issue, we propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation. In particular, for the weights with low importance scores but high uncertainty, PLATON tends to retain them and explores their capacity. We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification to validate the effectiveness of PLATON. Results demonstrate that PLATON manifests notable improvement under different sparsity levels. Our code is publicly available at https://github.com/QingruZhang/PLATON., Comment: Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
Published: 2022
Full Text: View/download PDF

9. CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing

Author: Liang, Chen, He, Pengcheng, Shen, Yelong, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Model ensemble is a popular approach to produce a low-variance and well-generalized model. However, it induces large memory and inference costs, which are often not affordable for real-world deployment. Existing work has resorted to sharing weights among models. However, when increasing the proportion of the shared weights, the resulting models tend to be similar, and the benefits of using model ensemble diminish. To retain ensemble benefits while maintaining a low memory cost, we propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO. Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity. Meanwhile, we apply a prediction consistency regularizer across the perturbed models to control the variance due to the model diversity. Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model. Specifically, CAMERO outperforms the standard ensemble of 8 BERT-base models on the GLUE benchmark by 0.7 with a significantly smaller model size (114.2M vs. 880.6M)., Comment: Proceedings of ACL 2022
Published: 2022
Full Text: View/download PDF

10. Block Policy Mirror Descent

Author: Lan, Guanghui, Li, Yan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Optimization and Control (math.OC), FOS: Mathematics, 90C40, 90C15, 90C26, 68Q25, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In this paper, we present a new policy gradient (PG) methods, namely the block policy mirror descent (BPMD) method for solving a class of regularized reinforcement learning (RL) problems with (strongly)-convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, we provide a unified analysis for several sampling schemes, and show that BPMD achieves fast linear convergence to the global optimality. In particular, uniform sampling leads to comparable worst-case total computational complexity as batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A}\right\rvert /\epsilon)$ (resp. $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A} \right\rvert /\epsilon^2)$) sample complexities are established for the strongly-convex (resp. non-strongly-convex) regularizers, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
Published: 2022
Full Text: View/download PDF

11. No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models

Author: Liang, Chen, Jiang, Haoming, Zuo, Simiao, He, Pengcheng, Liu, Xiaodong, Gao, Jianfeng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter's contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule. Analysis shows that the proposed schedule indeed reduces the redundancy and improves generalization performance., Comment: Proceedings of ICLR 2022
Published: 2022
Full Text: View/download PDF

12. Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces

Author: Liu, Hao, Yang, Haizhao, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Learning operators between infinitely dimensional spaces is an important learning task arising in wide applications in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.
Published: 2022
Full Text: View/download PDF

13. Differentially Private Estimation of Hawkes Process

Author: Zuo, Simiao, Liu, Tianyi, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
Abstract: Point process models are of great importance in real world applications. In certain critical applications, estimation of point process models involves large amounts of sensitive personal data from users. Privacy concerns naturally arise which have not been addressed in the existing literature. To bridge this glaring gap, we propose the first general differentially private estimation procedure for point process models. Specifically, we take the Hawkes process as an example, and introduce a rigorous definition of differential privacy for event stream data based on a discretized representation of the Hawkes process. We then propose two differentially private optimization algorithms, which can efficiently estimate Hawkes process models with the desired privacy and utility guarantees under two different settings. Experiments are provided to back up our theoretical analysis.
Published: 2022
Full Text: View/download PDF

14. Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Author: Liang, Chen, Zuo, Simiao, Zhang, Qingru, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Layer-wise distillation is a powerful tool to compress large models (i.e. teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the hidden representations of the teacher at every intermediate layer. However, layer-wise distillation is difficult. Since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for the target task's learning. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student to fit better on the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at https://github.com/cliang1453/task-aware-distillation., Comment: Proceedings of ICML 2023
Published: 2022
Full Text: View/download PDF

15. Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably

Author: Liu, Tianyi, Li, Yan, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We investigate the role of noise in optimization algorithms for learning over-parameterized models. Specifically, we consider the recovery of a rank one matrix $Y^*\in R^{d\times d}$ from a noisy observation $Y$ using an over-parameterization model. We parameterize the rank one matrix $Y^*$ by $XX^\top$, where $X\in R^{d\times d}$. We then show that under mild conditions, the estimator, obtained by the randomly perturbed gradient descent algorithm using the square loss function, attains a mean square error of $O(\sigma^2/d)$, where $\sigma^2$ is the variance of the observational noise. In contrast, the estimator obtained by gradient descent without random perturbation only attains a mean square error of $O(\sigma^2)$. Our result partially justifies the implicit regularization effect of noise when learning over-parameterized models, and provides new understanding of training over-parameterized neural networks.
Published: 2022
Full Text: View/download PDF

16. Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Author: Li, Yan, Lan, Guanghui, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of the these computational properties. As a by product, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \epsilon^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.
Published: 2022
Full Text: View/download PDF

17. Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Author: Li, Yan, Choudhary, Dhruv, Wei, Xiaohan, Yuan, Baichuan, Bhushanam, Bhargav, Zhao, Tuo, and Lan, Guanghui
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely accredited to their token-dependent learning rate. However, the underlying mechanism for the efficiency of token-dependent learning rate remains underexplored. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show token-dependent learning rate provably improves convergence for non-convex embedding learning problems., Comment: Additional experiments on Word2Vec embedding learning included
Published: 2021
Full Text: View/download PDF

18. Learning Generalizable Vision-Tactile Robotic Grasping Strategy for Deformable Objects via Transformer

Author: Han, Yunhai, Yu, Kelin, Batra, Rahul, Boyd, Nathan, Zhao, Tuo, She, Yu, Hutchinson, Seth, and Zhao, Ye
Subjects: FOS: Computer and information sciences, Robotics (cs.RO)
Abstract: Reliable robotic grasping, especially with deformable objects such as fruits, remains a challenging task due to underactuated contact interactions with a gripper, unknown object dynamics and geometries. In this study, we propose a Transformer-based robotic grasping framework for rigid grippers that leverage tactile and visual information for safe object grasping. Specifically, the Transformer models learn physical feature embeddings with sensor feedback through performing two pre-defined explorative actions (pinching and sliding) and predict a grasping outcome through a multilayer perceptron (MLP) with a given grasping strength. Using these predictions, the gripper predicts a safe grasping strength via inference. Compared with convolutional recurrent networks, the Transformer models can capture the long-term dependencies across the image sequences and process spatial-temporal features simultaneously. We first benchmark the Transformer models on a public dataset for slip detection. Following that, we show that the Transformer models outperform a CNN+LSTM model in terms of grasping accuracy and computational efficiency. We also collect a new fruit grasping dataset and conduct online grasping experiments using the proposed framework for both seen and unseen fruits. Our codes and dataset are public on GitHub.
Published: 2021
Full Text: View/download PDF

19. Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Author: Jiang, Haoming, Dai, Bo, Yang, Mengjiao, Zhao, Tuo, and Wei, Wei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores., Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Published: 2021
Full Text: View/download PDF

20. Taming Sparsely Activated Transformer with Stochastic Experts

Author: Zuo, Simiao, Liu, Xiaodong, Jiao, Jian, Kim, Young Jin, Hassan, Hany, Zhang, Ruofei, Zhao, Tuo, and Gao, Jianfeng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs models by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts., Comment: ICLR 2022
Published: 2021
Full Text: View/download PDF

21. Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks

Author: Liu, Hao, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Most of existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit low-dimensional geometric structures of the real world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNet) in terms of function approximation and statistical estimation for binary classification. Specifically, given the data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk in the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
Published: 2021
Full Text: View/download PDF

22. Doubly Robust Off-Policy Learning on Low-Dimensional Manifolds by Deep Neural Networks

Author: Chen, Minshuo, Liu, Hao, Liao, Wenjing, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Causal inference explores the causation between actions and the consequent rewards on a covariate set. Recently deep learning has achieved a remarkable performance in causal inference, but existing statistical theories cannot well explain such an empirical success, especially when the covariates are high-dimensional. Most theoretical results in causal inference are asymptotic, suffer from the curse of dimensionality, and only work for the finite-action scenario. To bridge such a gap between theory and practice, this paper studies doubly robust off-policy learning by deep neural networks. When the covariates lie on a low-dimensional manifold, we prove nonasymptotic regret bounds, which converge at a fast rate depending on the intrinsic dimension of the manifold. Our results cover both the finite- and continuous-action scenarios. Our theory shows that deep neural networks are adaptive to the low-dimensional geometric structures of the covariates, and partially explains the success of deep learning for causal inference.
Published: 2020
Full Text: View/download PDF

23. Transformer Hawkes Process

Author: Zuo, Simiao, Jiang, Haoming, Li, Zichong, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Modern data acquisition routinely produce massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most of the existing recurrent neural network based point process models fail to capture such dependencies, and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.
Published: 2020
Full Text: View/download PDF

24. The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R

Author: Li, Xingguo, Zhao, Tuo, Yuan, Xiaoming, and Liu, Han
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics::Theory, Statistics::Machine Learning, Statistics - Machine Learning, Statistics::Methodology, Computer Science - Mathematical Software, Machine Learning (stat.ML), Mathematical Software (cs.MS), Statistics::Computation, Machine Learning (cs.LG)
Abstract: This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, $\ell_q$ Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling flexibility, estimation robustness, and tuning insensitiveness. The developed solver is based on the alternating direction method of multipliers (ADMM). The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
Published: 2020
Full Text: View/download PDF

25. Deep Reinforcement Learning with Robust and Smooth Policy

Author: Shen, Qianli, Li, Yan, Jiang, Haoming, Wang, Zhaoran, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Deep reinforcement learning (RL) has achieved great empirical successes in various domains. However, the large search space of neural networks requires a large amount of data, which makes the current RL algorithms not sample efficient. Motivated by the fact that many environments with continuous state space have smooth transitions, we propose to learn a smooth policy that behaves smoothly with respect to states. We develop a new framework -- \textbf{S}mooth \textbf{R}egularized \textbf{R}einforcement \textbf{L}earning ($\textbf{SR}^2\textbf{L}$), where the policy is trained with smoothness-inducing regularization. Such regularization effectively constrains the search space, and enforces smoothness in the learned policy. Moreover, our proposed framework can also improve the robustness of policy against measurement error in the state space, and can be naturally extended to distribubutionally robust setting. We apply the proposed framework to both on-policy (TRPO) and off-policy algorithm (DDPG). Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness., Comment: ICML 2020
Published: 2020
Full Text: View/download PDF

26. On Computation and Generalization of Generative Adversarial Imitation Learning

Author: Chen, Minshuo, Wang, Yizhou, Liu, Tianyi, Yang, Zhuoran, Li, Xingguo, Wang, Zhaoran, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies. Different from Reinforcement Learning (RL), GAIL takes advantage of demonstration data by experts (e.g., human), and learns both the policy and reward function of the unknown environment. Despite the significant empirical progresses, the theory behind GAIL is still largely unknown. The major difficulty comes from the underlying temporal dependency of the demonstration data and the minimax computational formulation of GAIL without convex-concave structure. To bridge such a gap between theory and practice, this paper investigates the theoretical properties of GAIL. Specifically, we show: (1) For GAIL with general reward parameterization, the generalization can be guaranteed as long as the class of the reward functions is properly controlled; (2) For GAIL, where the reward is parameterized as a reproducing kernel function, GAIL can be efficiently solved by stochastic first order optimization algorithms, which attain sublinear convergence to a stationary solution. To the best of our knowledge, these are the first results on statistical and computational guarantees of imitation learning with reward/policy function approximation. Numerical experiments are provided to support our analysis.
Published: 2020
Full Text: View/download PDF

27. How Important is the Train-Validation Split in Meta-Learning?

Author: Bai, Yu, Chen, Minshuo, Zhou, Pan, Zhao, Tuo, Lee, Jason D., Kakade, Sham, Wang, Huan, and Xiong, Caiming
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct \emph{train-train method}, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta testing, and this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a \emph{strictly better} excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
Published: 2020
Full Text: View/download PDF

28. Distribution Approximation and Statistical Estimation Guarantees of Generative Adversarial Networks

Author: Chen, Minshuo, Liao, Wenjing, Zha, Hongyuan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative Adversarial Networks (GANs) have achieved a great success in unsupervised learning. Despite its remarkable empirical performance, there are limited theoretical studies on the statistical properties of GANs. This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions that have densities in a H\"{o}lder space. Our main result shows that, if the generator and discriminator network architectures are properly chosen, GANs are consistent estimators of data distributions under strong discrepancy metrics, such as the Wasserstein-1 distance. Furthermore, when the data distribution exhibits low-dimensional structures, we show that GANs are capable of capturing the unknown low-dimensional structures in data and enjoy a fast statistical convergence, which is free of curse of the ambient dimensionality. Our analysis for low-dimensional data builds upon a universal approximation theory of neural networks with Lipschitz continuity guarantees, which may be of independent interest., Comment: Version two extends to low-dimensional linear and mixture distributions
Published: 2020
Full Text: View/download PDF

29. Inductive Bias of Gradient Descent based Adversarial Training on Separable Data

Author: Li, Yan, Fang, Ethan X., Xu, Huan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Adversarial training is a principled approach for training robust neural networks. Despite of tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient descent based adversarial training by studying its computational properties, specifically on its inductive bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that when the adversarial perturbation during training has bounded $\ell_2$-norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum $\ell_2$-norm margin classifier at the rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$, significantly faster than the rate $\mathcal{O}(1/\log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded $\ell_q$-norm for some $q\ge 1$, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum $\ell_2$-norm margin classifier under worst-case $\ell_q$-norm perturbation to the data. Our findings provide theoretical backups for adversarial training that it indeed promotes robustness against adversarial perturbation.
Published: 2019
Full Text: View/download PDF

30. Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery

Author: Chen, Minshuo, Jiang, Haoming, Liao, Wenjing, and Zhao, Tuo
Subjects: Statistics and Probability, FOS: Computer and information sciences, Computer Science - Machine Learning, Numerical Analysis, Computational Theory and Mathematics, Statistics - Machine Learning, Applied Mathematics, Machine Learning (stat.ML), Analysis, Machine Learning (cs.LG)
Abstract: Real-world data often exhibit low-dimensional geometric structures and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep Rectified Linear Unit (ReLU) networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{\ s,\alpha }$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha )}{2(s+\alpha ) + d}}\log ^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures in data and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
Published: 2019
Full Text: View/download PDF

31. On Scalable and Efficient Computation of Large Scale Optimal Transport

Author: Xie, Yujia, Chen, Minshuo, Jiang, Haoming, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Optimal Transport (OT) naturally arises in many machine learning applications, yet the heavy computational burden limits its wide-spread uses. To address the scalability issue, we propose an implicit generative learning-based framework called SPOT (Scalable Push-forward of Optimal Transport). Specifically, we approximate the optimal transport plan by a pushforward of a reference distribution, and cast the optimal transport problem into a minimax problem. We then can solve OT problems efficiently using primal dual stochastic gradient-type algorithms. We also show that we can recover the density of the optimal transport plan using neural ordinary differential equations. Numerical experiments on both synthetic and real datasets illustrate that SPOT is robust and has favorable convergence behavior. SPOT also allows us to efficiently sample from the optimal transport plan, which benefits downstream applications such as domain adaptation.
Published: 2019
Full Text: View/download PDF

32. Towards Understanding the Importance of Noise in Training Neural Networks

Author: Zhou, Mo, Liu, Tianyi, Li, Yan, Lin, Dachao, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Numerous empirical evidence has corroborated that the noise plays a crucial rule in effective and efficient training of neural networks. The theory behind, however, is still largely unknown. This paper studies this fundamental problem through training a simple two-layer convolutional neural network model. Although training such a network requires solving a nonconvex optimization problem with a spurious local optimum and a global optimum, we prove that perturbed gradient descent and perturbed mini-batch stochastic gradient algorithms in conjunction with noise annealing is guaranteed to converge to a global optimum in polynomial time with arbitrary initialization. This implies that the noise enables the algorithm to efficiently escape from the spurious local optimum. Numerical experiments are provided to support our theory., Comment: International Conference on Machine Learning (ICML), 2019
Published: 2019
Full Text: View/download PDF

33. On Tighter Generalization Bound for Deep Neural Networks: CNNs, ResNets, and Beyond

Author: Li, Xingguo, Lu, Junwei, Wang, Zhaoran, Haupt, Jarvis, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Science::Neural and Evolutionary Computation, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We establish a margin based data dependent generalization error bound for a general family of deep neural networks in terms of the depth and width, as well as the Jacobian of the networks. Through introducing a new characterization of the Lipschitz properties of neural network family, we achieve significantly tighter generalization bounds than existing results. Moreover, we show that the generalization bound can be further improved for bounded losses. Aside from the general feedforward deep neural networks, our results can be applied to derive new bounds for popular architectures, including convolutional neural networks (CNNs) and residual networks (ResNets). When achieving same generalization errors with previous arts, our bounds allow for the choice of larger parameter spaces of weight matrices, inducing potentially stronger expressive ability for neural networks. Numerical evaluation is also provided to support our theory.
Published: 2018
Full Text: View/download PDF

34. Provable Gaussian Embedding with One Observation

Author: Yu, Ming, Yang, Zhuoran, Zhao, Tuo, Kolar, Mladen, and Wang, Zhaoran
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The success of machine learning methods heavily relies on having an appropriate representation for data at hand. Traditionally, machine learning approaches relied on user-defined heuristics to extract features encoding structural information about data. However, recently there has been a surge in approaches that learn how to encode the data automatically in a low dimensional space. Exponential family embedding provides a probabilistic framework for learning low-dimensional representation for various types of high-dimensional data. Though successful in practice, theoretical underpinnings for exponential family embeddings have not been established. In this paper, we study the Gaussian embedding model and develop the first theoretical results for exponential family embedding models. First, we show that, under mild condition, the embedding structure can be learned from one observation by leveraging the parameter sharing between different contexts even though the data are dependent with each other. Second, we study properties of two algorithms used for learning the embedding structure and establish convergence results for each of them. The first algorithm is based on a convex relaxation, while the other solved the non-convex formulation of the problem directly. Experiments demonstrate the effectiveness of our approach.
Published: 2018
Full Text: View/download PDF

35. Learning to Defend by Learning to Attack

Author: Jiang, Haoming, Chen, Zhehui, Shi, Yuyang, Dai, Bo, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
Abstract: Adversarial training provides a principled approach for training robust neural networks. From an optimization perspective, adversarial training is essentially solving a bilevel optimization problem. The leader problem is trying to learn a robust classifier, while the follower problem is trying to generate adversarial samples. Unfortunately, such a bilevel problem is difficult to solve due to its highly complicated structure. This work proposes a new adversarial training method based on a generic learning-to-learn (L2L) framework. Specifically, instead of applying existing hand-designed algorithms for the inner problem, we learn an optimizer, which is parametrized as a convolutional neural network. At the same time, a robust classifier is learned to defense the adversarial attack generated by the learned optimizer. Experiments over CIFAR-10 and CIFAR-100 datasets demonstrate that L2L outperforms existing adversarial training methods in both classification accuracy and computational efficiency. Moreover, our L2L framework can be extended to generative adversarial imitation learning and stabilize the training.
Published: 2018
Full Text: View/download PDF

36. Towards Understanding Acceleration Tradeoff between Momentum and Asynchrony in Nonconvex Stochastic Optimization

Author: Liu, Tianyi, Li, Shiyang, Shi, Jianping, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Asynchronous momentum stochastic gradient descent algorithms (Async-MSGD) is one of the most popular algorithms in distributed machine learning. However, its convergence properties for these complicated nonconvex problems is still largely unknown, because of the current technical limit. Therefore, in this paper, we propose to analyze the algorithm through a simpler but nontrivial nonconvex problem - streaming PCA, which helps us to understand Aync-MSGD better even for more general problems. Specifically, we establish the asymptotic rate of convergence of Async-MSGD for streaming PCA by diffusion approximation. Our results indicate a fundamental tradeoff between asynchrony and momentum: To ensure convergence and acceleration through asynchrony, we have to reduce the momentum (compared with Sync-MSGD). To the best of our knowledge, this is the first theoretical attempt on understanding Async-MSGD for distributed nonconvex stochastic optimization. Numerical experiments on both streaming PCA and training deep neural networks are provided to support our findings for Async-MSGD., Comment: arXiv admin note: text overlap with arXiv:1802.05155
Published: 2018
Full Text: View/download PDF

37. A Diffusion Approximation Theory of Momentum SGD in Nonconvex Optimization

Author: Liu, Tianyi, Chen, Zhehui, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Momentum Stochastic Gradient Descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning, e.g., training deep neural networks, variational Bayesian inference, and etc. Despite its empirical success, there is still a lack of theoretical understanding of convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that the momentum helps escape from saddle points, but hurts the convergence within the neighborhood of optima (if without the step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks., Comment: arXiv admin note: text overlap with arXiv:1806.01660
Published: 2018
Full Text: View/download PDF

38. Deep Hyperspherical Learning

Author: Liu, Weiyang, Zhang, Yan-Ming, Li, Xingguo, Yu, Zhiding, Dai, Bo, Zhao, Tuo, and Song, Le
Subjects: FOS: Computer and information sciences, Computer Science - Learning, ComputingMethodologies_PATTERNRECOGNITION, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Convolution as inner product has been the founding basis of convolutional neural networks (CNNs) and the key to end-to-end visual representation learning. Benefiting from deeper architectures, recent CNNs have demonstrated increasingly strong representation abilities. Despite such improvement, the increased depth and larger parameter space have also led to challenges in properly training a network. In light of such challenges, we propose hyperspherical convolution (SphereConv), a novel learning framework that gives angular representations on hyperspheres. We introduce SphereNet, deep hyperspherical convolution networks that are distinct from conventional inner product based convolutional networks. In particular, SphereNet adopts SphereConv as its basic convolution operator and is supervised by generalized angular softmax loss - a natural loss formulation under SphereConv. We show that SphereNet can effectively encode discriminative representation and alleviate training difficulty, leading to easier optimization, faster convergence and comparable (even better) classification accuracy over convolutional counterparts. We also provide some theoretical insights for the advantages of learning on hyperspheres. In addition, we introduce the learnable SphereConv, i.e., a natural improvement over prefixed SphereConv, and SphereNorm, i.e., hyperspherical learning as a normalization method. Experiments have verified our conclusions., Comment: NIPS 2017 (Spotlight)
Published: 2017
Full Text: View/download PDF

39. Nonconvex Sparse Learning via Stochastic Optimization with Progressive Variance Reduction

Author: Li, Xingguo, Arora, Raman, Liu, Han, Haupt, Jarvis, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Learning, Statistics - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We propose a stochastic variance reduced optimization algorithm for solving sparse learning problems with cardinality constraints. Sufficient conditions are provided, under which the proposed algorithm enjoys strong linear convergence guarantees and optimal estimation accuracy in high dimensions. We further extend the proposed algorithm to an asynchronous parallel variant with a near linear speedup. Numerical experiments demonstrate the efficiency of our algorithm in terms of both parameter estimation and computational performance.
Published: 2016
Full Text: View/download PDF

40. On Fast Convergence of Proximal Algorithms for SQRT-Lasso Optimization: Don't Worry About Its Nonsmooth Loss Function

Author: Li, Xingguo, Jiang, Haoming, Haupt, Jarvis, Arora, Raman, Liu, Han, Hong, Mingyi, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics::Machine Learning, Statistics - Machine Learning, Optimization and Control (math.OC), MathematicsofComputing_NUMERICALANALYSIS, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Statistics::Computation, Machine Learning (cs.LG)
Abstract: Many machine learning techniques sacrifice convenient computational structures to gain estimation robustness and modeling flexibility. However, by exploring the modeling structures, we find these "sacrifices" do not always require more computational efforts. To shed light on such a "free-lunch" phenomenon, we study the square-root-Lasso (SQRT-Lasso) type regression problem. Specifically, we show that the nonsmooth loss functions of SQRT-Lasso type regression ease tuning effort and gain adaptivity to inhomogeneous noise, but is not necessarily more challenging than Lasso in computation. We can directly apply proximal algorithms (e.g. proximal gradient descent, proximal Newton, and proximal Quasi-Newton algorithms) without worrying the nonsmoothness of the loss function. Theoretically, we prove that the proximal algorithms combined with the pathwise optimization scheme enjoy fast convergence guarantees with high probability. Numerical results are provided to support our theory.
Published: 2016
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

40 results on '"Zhao, Tuo"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources