Author: "Zhao, Tuo" / Topic: machine learning (cs.lg) - Searchworks@Jio Institute Digital Library Search Results

1. Nonparametric Classification on Low Dimensional Manifolds using Overparameterized Convolutional Residual Networks

Author: Zhang, Kaiqi, Zhang, Zixuan, Chen, Minshuo, Wang, Mengdi, Zhao, Tuo, and Wang, Yu-Xiang
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Convolutional residual neural networks (ConvResNets), though overparameterized, can achieve remarkable prediction performance in practice, which cannot be well explained by conventional wisdom. To bridge this gap, we study the performance of ConvResNeXts, which cover ConvResNets as a special case, trained with weight decay from the perspective of nonparametric classification. Our analysis allows for infinitely many building blocks in ConvResNeXts, and shows that weight decay implicitly enforces sparsity on these blocks. Specifically, we consider a smooth target function supported on a low-dimensional manifold, then prove that ConvResNeXts can adapt to the function smoothness and low-dimensional structures and efficiently learn the function without suffering from the curse of dimensionality. Our findings partially justify the advantage of overparameterized ConvResNeXts over conventional machine learning models., 20 pages, 1 figure
Published: 2023

2. Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Author: Zhang, Qingru, Chen, Minshuo, Bukharin, Alexander, He, Pengcheng, Cheng, Yu, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Fine-tuning large pre-trained language models on downstream tasks has become an important paradigm in NLP. However, common practice fine-tunes all of the parameters in a pre-trained model, which becomes prohibitive when a large number of downstream tasks are present. Therefore, many fine-tuning methods are proposed to learn incremental updates of pre-trained weights in a parameter efficient way, e.g., low-rank increments. These methods often evenly distribute the budget of incremental updates across all pre-trained weight matrices, and overlook the varying importance of different weight parameters. As a consequence, the fine-tuning performance is suboptimal. To bridge this gap, we propose AdaLoRA, which adaptively allocates the parameter budget among weight matrices according to their importance score. In particular, AdaLoRA parameterizes the incremental updates in the form of singular value decomposition. Such a novel approach allows us to effectively prune the singular values of unimportant updates, which is essentially to reduce their parameter budget but circumvent intensive exact SVD computations. We conduct extensive experiments with several pre-trained models on natural language processing, question answering, and natural language generation to validate the effectiveness of AdaLoRA. Results demonstrate that AdaLoRA manifests notable improvement over baselines, especially in the low budget settings. Our code is publicly available at https://github.com/QingruZhang/AdaLoRA ., The 11th International Conference on Learning Representations (ICLR 2023)
Published: 2023

3. HomoDistil: Homotopic Task-Agnostic Distillation of Pre-trained Transformers

Author: Liang, Chen, Jiang, Haoming, Li, Zheng, Tang, Xianfeng, Yin, Bin, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Knowledge distillation has been shown to be a powerful model compression approach to facilitate the deployment of pre-trained language models in practice. This paper focuses on task-agnostic distillation. It produces a compact pre-trained model that can be easily fine-tuned on various tasks with small computational costs and memory footprints. Despite the practical benefits, task-agnostic distillation is challenging. Since the teacher model has a significantly larger capacity and stronger representation power than the student model, it is very difficult for the student to produce predictions that match the teacher's over a massive amount of open-domain training data. Such a large prediction discrepancy often diminishes the benefits of knowledge distillation. To address this challenge, we propose Homotopic Distillation (HomoDistil), a novel task-agnostic distillation approach equipped with iterative pruning. Specifically, we initialize the student model from the teacher model, and iteratively prune the student's neurons until the target width is reached. Such an approach maintains a small discrepancy between the teacher's and student's predictions throughout the distillation process, which ensures the effectiveness of knowledge transfer. Extensive experiments demonstrate that HomoDistil achieves significant improvements on existing baselines.
Published: 2023

4. Score Approximation, Estimation and Distribution Recovery of Diffusion Models on Low-Dimensional Data

Author: Chen, Minshuo, Huang, Kaixuan, Zhao, Tuo, and Wang, Mengdi
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Diffusion models achieve state-of-the-art performance in various generation tasks. However, their theoretical foundations fall far behind. This paper studies score approximation, estimation, and distribution recovery of diffusion models, when data are supported on an unknown low-dimensional linear subspace. Our result provides sample complexity bounds for distribution estimation using diffusion models. We show that with a properly chosen neural network architecture, the score function can be both accurately approximated and efficiently estimated. Furthermore, the generated distribution based on the estimated score function captures the data geometric structures and converges to a close vicinity of the data distribution. The convergence rate depends on the subspace dimension, indicating that diffusion models can circumvent the curse of data ambient dimensionality., 52 pages, 4 figures
Published: 2023

5. LoSparse: Structured Compression of Large Language Models based on Low-Rank and Sparse Approximation

Author: Li, Yixiao, Yu, Yifan, Zhang, Qingru, Liang, Chen, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Transformer models have achieved remarkable results in various natural language tasks, but they are often prohibitively large, requiring massive memories and computational resources. To reduce the size and complexity of these models, we propose LoSparse (Low-Rank and Sparse approximation), a novel model compression technique that approximates a weight matrix by the sum of a low-rank matrix and a sparse matrix. Our method combines the advantages of both low-rank approximations and pruning, while avoiding their limitations. Low-rank approximation compresses the coherent and expressive parts in neurons, while pruning removes the incoherent and non-expressive parts in neurons. Pruning enhances the diversity of low-rank approximations, and low-rank approximation prevents pruning from losing too many expressive neurons. We evaluate our method on natural language understanding, question answering, and natural language generation tasks. We show that it significantly outperforms existing compression methods.
Published: 2023
Full Text: View/download PDF

6. Machine Learning Force Fields with Data Cost Aware Training

Author: Bukharin, Alexander, Liu, Tianyi, Wang, Shengjie, Zuo, Simiao, Gao, Weihao, Yan, Wen, and Zhao, Tuo
Subjects: Chemical Physics (physics.chem-ph), FOS: Computer and information sciences, Computer Science - Machine Learning, Physics - Chemical Physics, FOS: Biological sciences, FOS: Physical sciences, Quantitative Biology - Quantitative Methods, Quantitative Methods (q-bio.QM), Machine Learning (cs.LG)
Abstract: Machine learning force fields (MLFF) have been proposed to accelerate molecular dynamics (MD) simulation, which finds widespread applications in chemistry and biomedical research. Even for the most data-efficient MLFFs, reaching chemical accuracy can require hundreds of frames of force and energy labels generated by expensive quantum mechanical algorithms, which may scale as $O(n^3)$ to $O(n^7)$, with $n$ proportional to the number of basis functions. To address this issue, we propose a multi-stage computational framework -- ASTEROID, which lowers the data cost of MLFFs by leveraging a combination of cheap inaccurate data and expensive accurate data. The motivation behind ASTEROID is that inaccurate data, though incurring large bias, can help capture the sophisticated structures of the underlying force field. Therefore, we first train a MLFF model on a large amount of inaccurate training data, employing a bias-aware loss function to prevent the model from overfitting tahe potential bias of this data. We then fine-tune the obtained model using a small amount of accurate training data, which preserves the knowledge learned from the inaccurate training data while significantly improving the model's accuracy. Moreover, we propose a variant of ASTEROID based on score matching for the setting where the inaccurate training data are unlabeled. Extensive experiments on MD datasets and downstream tasks validate the efficacy of ASTEROID. Our code and data are available at https://github.com/abukharin3/asteroid.
Published: 2023
Full Text: View/download PDF

7. On Deep Generative Models for Approximation and Estimation of Distributions on Manifolds

Author: Dahal, Biraj, Havrilla, Alex, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative networks have experienced great empirical successes in distribution learning. Many existing experiments have demonstrated that generative networks can generate high-dimensional complex data from a low-dimensional easy-to-sample distribution. However, this phenomenon can not be justified by existing theories. The widely held manifold hypothesis speculates that real-world data sets, such as natural images and signals, exhibit low-dimensional geometric structures. In this paper, we take such low-dimensional data structures into consideration by assuming that data distributions are supported on a low-dimensional manifold. We prove statistical guarantees of generative networks under the Wasserstein-1 loss. We show that the Wasserstein-1 loss converges to zero at a fast rate depending on the intrinsic dimension instead of the ambient data dimension. Our theory leverages the low-dimensional geometric structures in data sets and justifies the practical power of generative networks. We require no smoothness assumptions on the data distribution which is desirable in practice.
Published: 2023
Full Text: View/download PDF

8. Effective Minkowski Dimension of Deep Nonparametric Regression: Function Approximation and Statistical Theories

Author: Zhang, Zixuan, Chen, Minshuo, Wang, Mengdi, Liao, Wenjing, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Existing theories on deep nonparametric regression have shown that when the input data lie on a low-dimensional manifold, deep neural networks can adapt to the intrinsic data structures. In real world applications, such an assumption of data lying exactly on a low dimensional manifold is stringent. This paper introduces a relaxed assumption that the input data are concentrated around a subset of $\mathbb{R}^d$ denoted by $\mathcal{S}$, and the intrinsic dimension of $\mathcal{S}$ can be characterized by a new complexity notation -- effective Minkowski dimension. We prove that, the sample complexity of deep nonparametric regression only depends on the effective Minkowski dimension of $\mathcal{S}$ denoted by $p$. We further illustrate our theoretical findings by considering nonparametric regression with an anisotropic Gaussian random design $N(0,\Sigma)$, where $\Sigma$ is full rank. When the eigenvalues of $\Sigma$ have an exponential or polynomial decay, the effective Minkowski dimension of such an Gaussian random design is $p=\mathcal{O}(\sqrt{\log n})$ or $p=\mathcal{O}(n^\gamma)$, respectively, where $n$ is the sample size and $\gamma\in(0,1)$ is a small constant depending on the polynomial decay rate. Our theory shows that, when the manifold assumption does not hold, deep neural networks can still adapt to the effective Minkowski dimension of the data, and circumvent the curse of the ambient dimensionality for moderate sample sizes.
Published: 2023
Full Text: View/download PDF

9. First-order Policy Optimization for Robust Markov Decision Process

Author: Li, Yan, Lan, Guanghui, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Optimization and Control (math.OC), Computer Science - Artificial Intelligence, FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We consider the problem of solving robust Markov decision process (MDP), which involves a set of discounted, finite state, finite action space MDPs with uncertain transition kernels. The goal of planning is to find a robust policy that optimizes the worst-case values against the transition uncertainties, and thus encompasses the standard MDP planning as a special case. For $(\mathbf{s},\mathbf{a})$-rectangular uncertainty sets, we establish several structural observations on the robust objective, which facilitates the development of a policy-based first-order method, namely the robust policy mirror descent (RPMD). An $\mathcal{O}(\log(1/\epsilon))$ iteration complexity for finding an $\epsilon$-optimal policy is established with linearly increasing stepsizes. We further develop a stochastic variant of the robust policy mirror descent method, named SRPMD, when the first-order information is only available through online interactions with the nominal environment. We show that the optimality gap converges linearly up to the noise level, and consequently establish an $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity by developing a temporal difference learning method for policy evaluation. Both iteration and sample complexities are also discussed for RPMD with a constant stepsize. To the best of our knowledge, all the aforementioned results appear to be new for policy-based first-order methods applied to the robust MDP problem.
Published: 2022

10. DiP-GNN: Discriminative Pre-Training of Graph Neural Networks

Author: Zuo, Simiao, Jiang, Haoming, Yin, Qingyu, Tang, Xianfeng, Yin, Bing, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Graph neural network (GNN) pre-training methods have been proposed to enhance the power of GNNs. Specifically, a GNN is first pre-trained on a large-scale unlabeled graph and then fine-tuned on a separate small labeled graph for downstream applications, such as node classification. One popular pre-training method is to mask out a proportion of the edges, and a GNN is trained to recover them. However, such a generative method suffers from graph mismatch. That is, the masked graph inputted to the GNN deviates from the original graph. To alleviate this issue, we propose DiP-GNN (Discriminative Pre-training of Graph Neural Networks). Specifically, we train a generator to recover identities of the masked edges, and simultaneously, we train a discriminator to distinguish the generated edges from the original graph's edges. In our framework, the graph seen by the discriminator better matches the original graph because the generator can recover a proportion of the masked edges. Extensive experiments on large-scale homogeneous and heterogeneous graphs demonstrate the effectiveness of the proposed framework.
Published: 2022

11. Sample Complexity of Nonparametric Off-Policy Evaluation on Low-Dimensional Manifolds using Deep Networks

Author: Ji, Xiang, Chen, Minshuo, Wang, Mengdi, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider the off-policy evaluation problem of reinforcement learning using deep convolutional neural networks. We analyze the deep fitted Q-evaluation method for estimating the expected cumulative reward of a target policy, when the data are generated from an unknown behavior policy. We show that, by choosing network size appropriately, one can leverage any low-dimensional manifold structure in the Markov decision process and obtain a sample-efficient estimator without suffering from the curse of high data ambient dimensionality. Specifically, we establish a sharp error bound for fitted Q-evaluation, which depends on the intrinsic dimension of the state-action space, the smoothness of Bellman operator, and a function class-restricted $\chi^2$-divergence. It is noteworthy that the restricted $\chi^2$-divergence measures the behavior and target policies' {\it mismatch in the function space}, which can be small even if the two policies are not close to each other in their tabular forms. We also develop a novel approximation result for convolutional neural networks in Q-function estimation. Numerical experiments are provided to support our theoretical analysis., Comment: 52 pages, 2 figures
Published: 2022

12. Efficient Long Sequence Modeling via State Space Augmented Transformer

Author: Zuo, Simiao, Liu, Xiaodong, Jiao, Jian, Charles, Denis, Manavoglu, Eren, Zhao, Tuo, and Gao, Jianfeng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Transformer models have achieved superior performance in various natural language processing tasks. However, the quadratic computational cost of the attention mechanism limits its practicality for long sequences. There are existing attention variants that improve the computational efficiency, but they have limited ability to effectively compute global information. In parallel to Transformer models, state space models (SSMs) are tailored for long sequences, but they are not flexible enough to capture complicated local information. We propose SPADE, short for $\underline{\textbf{S}}$tate s$\underline{\textbf{P}}$ace $\underline{\textbf{A}}$ugmente$\underline{\textbf{D}}$ Transform$\underline{\textbf{E}}$r. Specifically, we augment a SSM into the bottom layer of SPADE, and we employ efficient local attention methods for the other layers. The SSM augments global information, which complements the lack of long-range dependency issue in local attention methods. Experimental results on the Long Range Arena benchmark and language modeling tasks demonstrate the effectiveness of the proposed method. To further demonstrate the scalability of SPADE, we pre-train large encoder-decoder models and present fine-tuning results on natural language understanding and natural language generation tasks.
Published: 2022
Full Text: View/download PDF

13. High Dimensional Binary Classification under Label Shift: Phase Transition and Regularization

Author: Cheng, Jiahui, Chen, Minshuo, Liu, Hao, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, FOS: Mathematics, Mathematics - Statistics Theory, Statistics Theory (math.ST), Machine Learning (cs.LG)
Abstract: Label Shift has been widely believed to be harmful to the generalization performance of machine learning models. Researchers have proposed many approaches to mitigate the impact of the label shift, e.g., balancing the training data. However, these methods often consider the underparametrized regime, where the sample size is much larger than the data dimension. The research under the overparametrized regime is very limited. To bridge this gap, we propose a new asymptotic analysis of the Fisher Linear Discriminant classifier for binary classification with label shift. Specifically, we prove that there exists a phase transition phenomenon: Under certain overparametrized regime, the classifier trained using imbalanced data outperforms the counterpart with reduced balanced data. Moreover, we investigate the impact of regularization to the label shift: The aforementioned phase transition vanishes as the regularization becomes strong.
Published: 2022
Full Text: View/download PDF

14. Benefits of Overparameterized Convolutional Residual Networks: Function Approximation under Smoothness Constraint

Author: Liu, Hao, Chen, Minshuo, Er, Siawpeng, Liao, Wenjing, Zhang, Tong, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Overparameterized neural networks enjoy great representation power on complex data, and more importantly yield sufficiently smooth output, which is crucial to their generalization and robustness. Most existing function approximation theories suggest that with sufficiently many parameters, neural networks can well approximate certain classes of functions in terms of the function value. The neural network themselves, however, can be highly nonsmooth. To bridge this gap, we take convolutional residual networks (ConvResNets) as an example, and prove that large ConvResNets can not only approximate a target function in terms of function value, but also exhibit sufficient first-order smoothness. Moreover, we extend our theory to approximating functions supported on a low-dimensional manifold. Our theory partially justifies the benefits of using deep and wide networks in practice. Numerical experiments on adversarial robust image classification are provided to support our theory.
Published: 2022
Full Text: View/download PDF

15. PLATON: Pruning Large Transformer Models with Upper Confidence Bound of Weight Importance

Author: Zhang, Qingru, Zuo, Simiao, Liang, Chen, Bukharin, Alexander, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Large Transformer-based models have exhibited superior performance in various natural language processing and computer vision tasks. However, these models contain enormous amounts of parameters, which restrict their deployment to real-world applications. To reduce the model size, researchers prune these models based on the weights' importance scores. However, such scores are usually estimated on mini-batches during training, which incurs large variability/uncertainty due to mini-batch sampling and complicated training dynamics. As a result, some crucial weights could be pruned by commonly used pruning methods because of such uncertainty, which makes training unstable and hurts generalization. To resolve this issue, we propose PLATON, which captures the uncertainty of importance scores by upper confidence bound (UCB) of importance estimation. In particular, for the weights with low importance scores but high uncertainty, PLATON tends to retain them and explores their capacity. We conduct extensive experiments with several Transformer-based models on natural language understanding, question answering and image classification to validate the effectiveness of PLATON. Results demonstrate that PLATON manifests notable improvement under different sparsity levels. Our code is publicly available at https://github.com/QingruZhang/PLATON., Comment: Proceedings of the 39th International Conference on Machine Learning (ICML 2022)
Published: 2022
Full Text: View/download PDF

16. CAMERO: Consistency Regularized Ensemble of Perturbed Language Models with Weight Sharing

Author: Liang, Chen, He, Pengcheng, Shen, Yelong, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Model ensemble is a popular approach to produce a low-variance and well-generalized model. However, it induces large memory and inference costs, which are often not affordable for real-world deployment. Existing work has resorted to sharing weights among models. However, when increasing the proportion of the shared weights, the resulting models tend to be similar, and the benefits of using model ensemble diminish. To retain ensemble benefits while maintaining a low memory cost, we propose a consistency-regularized ensemble learning approach based on perturbed models, named CAMERO. Specifically, we share the weights of bottom layers across all models and apply different perturbations to the hidden representations for different models, which can effectively promote the model diversity. Meanwhile, we apply a prediction consistency regularizer across the perturbed models to control the variance due to the model diversity. Our experiments using large language models demonstrate that CAMERO significantly improves the generalization performance of the ensemble model. Specifically, CAMERO outperforms the standard ensemble of 8 BERT-base models on the GLUE benchmark by 0.7 with a significantly smaller model size (114.2M vs. 880.6M)., Comment: Proceedings of ACL 2022
Published: 2022
Full Text: View/download PDF

17. Block Policy Mirror Descent

Author: Lan, Guanghui, Li, Yan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Optimization and Control (math.OC), FOS: Mathematics, 90C40, 90C15, 90C26, 68Q25, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: In this paper, we present a new policy gradient (PG) methods, namely the block policy mirror descent (BPMD) method for solving a class of regularized reinforcement learning (RL) problems with (strongly)-convex regularizers. Compared to the traditional PG methods with a batch update rule, which visits and updates the policy for every state, BPMD method has cheap per-iteration computation via a partial update rule that performs the policy update on a sampled state. Despite the nonconvex nature of the problem and a partial update rule, we provide a unified analysis for several sampling schemes, and show that BPMD achieves fast linear convergence to the global optimality. In particular, uniform sampling leads to comparable worst-case total computational complexity as batch PG methods. A necessary and sufficient condition for convergence with on-policy sampling is also identified. With a hybrid sampling scheme, we further show that BPMD enjoys potential instance-dependent acceleration, leading to improved dependence on the state space and consequently outperforming batch PG methods. We then extend BPMD methods to the stochastic setting, by utilizing stochastic first-order information constructed from samples. With a generative model, $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A}\right\rvert /\epsilon)$ (resp. $\tilde{\mathcal{O}}(\left\lvert \mathcal{S}\right\rvert \left\lvert \mathcal{A} \right\rvert /\epsilon^2)$) sample complexities are established for the strongly-convex (resp. non-strongly-convex) regularizers, where $\epsilon$ denotes the target accuracy. To the best of our knowledge, this is the first time that block coordinate descent methods have been developed and analyzed for policy optimization in reinforcement learning, which provides a new perspective on solving large-scale RL problems.
Published: 2022
Full Text: View/download PDF

18. No Parameters Left Behind: Sensitivity Guided Adaptive Learning Rate for Training Large Transformer Models

Author: Liang, Chen, Jiang, Haoming, Zuo, Simiao, He, Pengcheng, Liu, Xiaodong, Gao, Jianfeng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Recent research has shown the existence of significant redundancy in large Transformer models. One can prune the redundant parameters without significantly sacrificing the generalization performance. However, we question whether the redundant parameters could have contributed more if they were properly trained. To answer this question, we propose a novel training strategy that encourages all parameters to be trained sufficiently. Specifically, we adaptively adjust the learning rate for each parameter according to its sensitivity, a robust gradient-based measure reflecting this parameter's contribution to the model performance. A parameter with low sensitivity is redundant, and we improve its fitting by increasing its learning rate. In contrast, a parameter with high sensitivity is well-trained, and we regularize it by decreasing its learning rate to prevent further overfitting. We conduct extensive experiments on natural language understanding, neural machine translation, and image classification to demonstrate the effectiveness of the proposed schedule. Analysis shows that the proposed schedule indeed reduces the redundancy and improves generalization performance., Comment: Proceedings of ICLR 2022
Published: 2022
Full Text: View/download PDF

19. Deep Nonparametric Estimation of Operators between Infinite Dimensional Spaces

Author: Liu, Hao, Yang, Haizhao, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Learning operators between infinitely dimensional spaces is an important learning task arising in wide applications in machine learning, imaging science, mathematical modeling and simulations, etc. This paper studies the nonparametric estimation of Lipschitz operators using deep neural networks. Non-asymptotic upper bounds are derived for the generalization error of the empirical risk minimizer over a properly chosen network class. Under the assumption that the target operator exhibits a low dimensional structure, our error bounds decay as the training sample size increases, with an attractive fast rate depending on the intrinsic dimension in our estimation. Our assumptions cover most scenarios in real applications and our results give rise to fast rates by exploiting low dimensional structures of data in operator estimation. We also investigate the influence of network structures (e.g., network width, depth, and sparsity) on the generalization error of the neural network estimator and propose a general suggestion on the choice of network structures to maximize the learning efficiency quantitatively.
Published: 2022
Full Text: View/download PDF

20. Differentially Private Estimation of Hawkes Process

Author: Zuo, Simiao, Liu, Tianyi, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Cryptography and Security, Statistics - Machine Learning, Machine Learning (stat.ML), Cryptography and Security (cs.CR), Machine Learning (cs.LG)
Abstract: Point process models are of great importance in real world applications. In certain critical applications, estimation of point process models involves large amounts of sensitive personal data from users. Privacy concerns naturally arise which have not been addressed in the existing literature. To bridge this glaring gap, we propose the first general differentially private estimation procedure for point process models. Specifically, we take the Hawkes process as an example, and introduce a rigorous definition of differential privacy for event stream data based on a discretized representation of the Hawkes process. We then propose two differentially private optimization algorithms, which can efficiently estimate Hawkes process models with the desired privacy and utility guarantees under two different settings. Experiments are provided to back up our theoretical analysis.
Published: 2022
Full Text: View/download PDF

21. Less is More: Task-aware Layer-wise Distillation for Language Model Compression

Author: Liang, Chen, Zuo, Simiao, Zhang, Qingru, He, Pengcheng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Layer-wise distillation is a powerful tool to compress large models (i.e. teacher models) into small ones (i.e., student models). The student distills knowledge from the teacher by mimicking the hidden representations of the teacher at every intermediate layer. However, layer-wise distillation is difficult. Since the student has a smaller model capacity than the teacher, it is often under-fitted. Furthermore, the hidden representations of the teacher contain redundant information that the student does not necessarily need for the target task's learning. To address these challenges, we propose a novel Task-aware layEr-wise Distillation (TED). TED designs task-aware filters to align the hidden representations of the student and the teacher at each layer. The filters select the knowledge that is useful for the target task from the hidden representations. As such, TED reduces the knowledge gap between the two models and helps the student to fit better on the target task. We evaluate TED in two scenarios: continual pre-training and fine-tuning. TED demonstrates significant and consistent improvements over existing distillation methods in both scenarios. Code is available at https://github.com/cliang1453/task-aware-distillation., Comment: Proceedings of ICML 2023
Published: 2022
Full Text: View/download PDF

22. Noise Regularizes Over-parameterized Rank One Matrix Recovery, Provably

Author: Liu, Tianyi, Li, Yan, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: We investigate the role of noise in optimization algorithms for learning over-parameterized models. Specifically, we consider the recovery of a rank one matrix $Y^*\in R^{d\times d}$ from a noisy observation $Y$ using an over-parameterization model. We parameterize the rank one matrix $Y^*$ by $XX^\top$, where $X\in R^{d\times d}$. We then show that under mild conditions, the estimator, obtained by the randomly perturbed gradient descent algorithm using the square loss function, attains a mean square error of $O(\sigma^2/d)$, where $\sigma^2$ is the variance of the observational noise. In contrast, the estimator obtained by gradient descent without random perturbation only attains a mean square error of $O(\sigma^2)$. Our result partially justifies the implicit regularization effect of noise when learning over-parameterized models, and provides new understanding of training over-parameterized neural networks.
Published: 2022
Full Text: View/download PDF

23. Homotopic Policy Mirror Descent: Policy Convergence, Implicit Regularization, and Improved Sample Complexity

Author: Li, Yan, Lan, Guanghui, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: We propose a new policy gradient method, named homotopic policy mirror descent (HPMD), for solving discounted, infinite horizon MDPs with finite state and action spaces. HPMD performs a mirror descent type policy update with an additional diminishing regularization term, and possesses several computational properties that seem to be new in the literature. We first establish the global linear convergence of HPMD instantiated with Kullback-Leibler divergence, for both the optimality gap, and a weighted distance to the set of optimal policies. Then local superlinear convergence is obtained for both quantities without any assumption. With local acceleration and diminishing regularization, we establish the first result among policy gradient methods on certifying and characterizing the limiting policy, by showing, with a non-asymptotic characterization, that the last-iterate policy converges to the unique optimal policy with the maximal entropy. We then extend all the aforementioned results to HPMD instantiated with a broad class of decomposable Bregman divergences, demonstrating the generality of the these computational properties. As a by product, we discover the finite-time exact convergence for some commonly used Bregman divergences, implying the continuing convergence of HPMD to the limiting policy even if the current policy is already optimal. Finally, we develop a stochastic version of HPMD and establish similar convergence properties. By exploiting the local acceleration, we show that for small optimality gap, a better than $\tilde{\mathcal{O}}(\left|\mathcal{S}\right| \left|\mathcal{A}\right| / \epsilon^2)$ sample complexity holds with high probability, when assuming a generative model for policy evaluation.
Published: 2022
Full Text: View/download PDF

24. Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect

Author: Wang, Yuqing, Chen, Minshuo, Zhao, Tuo, and Tao, Molei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Dynamical Systems (math.DS), Mathematics - Dynamical Systems, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Recent empirical advances show that training deep models with large learning rate often improves generalization performance. However, theoretical justifications on the benefits of large learning rate are highly limited, due to challenges in analysis. In this paper, we consider using Gradient Descent (GD) with a large learning rate on a homogeneous matrix factorization problem, i.e., $\min_{X, Y} \|A - XY^\top\|_{\sf F}^2$. We prove a convergence theory for constant large learning rates well beyond $2/L$, where $L$ is the largest eigenvalue of Hessian at the initialization. Moreover, we rigorously establish an implicit bias of GD induced by such a large learning rate, termed 'balancing', meaning that magnitudes of $X$ and $Y$ at the limit of GD iterations will be close even if their initialization is significantly unbalanced. Numerical experiments are provided to support our theory., 45 pages
Published: 2021

25. Adversarially Regularized Policy Learning Guided by Trajectory Optimization

Author: Zhao, Zhigen, Zuo, Simiao, Zhao, Tuo, and Zhao, Ye
Subjects: FOS: Computer and information sciences, Computer Science - Robotics, Computer Science - Machine Learning, Robotics (cs.RO), Machine Learning (cs.LG)
Abstract: Recent advancement in combining trajectory optimization with function approximation (especially neural networks) shows promise in learning complex control policies for diverse tasks in robot systems. Despite their great flexibility, the large neural networks for parameterizing control policies impose significant challenges. The learned neural control policies are often overcomplex and non-smooth, which can easily cause unexpected or diverging robot motions. Therefore, they often yield poor generalization performance in practice. To address this issue, we propose adVErsarially Regularized pOlicy learNIng guided by trajeCtory optimizAtion (VERONICA) for learning smooth control policies. Specifically, our proposed approach controls the smoothness (local Lipschitz continuity) of the neural control policies by stabilizing the output control with respect to the worst-case perturbation to the input state. Our experiments on robot manipulation show that our proposed approach not only improves the sample efficiency of neural policy learning but also enhances the robustness of the policy against various types of disturbances, including sensor noise, environmental uncertainty, and model mismatch., Accepted at L4DC 2022
Published: 2021

26. Implicit Regularization of Bregman Proximal Point Algorithm and Mirror Descent on Separable Data

Author: Li, Yan, Ju, Caleb, Fang, Ethan X., and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Machine Learning (cs.LG)
Abstract: Bregman proximal point algorithm (BPPA), as one of the centerpieces in the optimization toolbox, has been witnessing emerging applications. With simple and easy to implement update rule, the algorithm bears several compelling intuitions for empirical successes, yet rigorous justifications are still largely unexplored. We study the computational properties of BPPA through classification tasks with separable data, and demonstrate provable algorithmic regularization effects associated with BPPA. We show that BPPA attains non-trivial margin, which closely depends on the condition number of the distance generating function inducing the Bregman divergence. We further demonstrate that the dependence on the condition number is tight for a class of problems, thus showing the importance of divergence in affecting the quality of the obtained solutions. In addition, we extend our findings to mirror descent (MD), for which we establish similar connections between the margin and Bregman divergence. We demonstrate through a concrete example, and show BPPA/MD converges in direction to the maximal margin solution with respect to the Mahalanobis distance. Our theoretical findings are among the first to demonstrate the benign learning properties BPPA/MD, and also provide corroborations for a careful choice of divergence in the algorithmic design., Additional CIFAR-100 experiments with BPPA added
Published: 2021

27. Reinforcement Learning for Adaptive Mesh Refinement

Author: Yang, Jiachen, Dzanic, Tarik, Petersen, Brenden, Kudo, Jun, Mittal, Ketan, Tomov, Vladimir, Camier, Jean-Sylvain, Zhao, Tuo, Zha, Hongyuan, Kolev, Tzanio, Anderson, Robert, and Faissol, Daniel
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, FOS: Mathematics, Numerical Analysis (math.NA), Mathematics - Numerical Analysis, Machine Learning (cs.LG)
Abstract: Large-scale finite element simulations of complex physical systems governed by partial differential equations (PDE) crucially depend on adaptive mesh refinement (AMR) to allocate computational budget to regions where higher resolution is required. Existing scalable AMR methods make heuristic refinement decisions based on instantaneous error estimation and thus do not aim for long-term optimality over an entire simulation. We propose a novel formulation of AMR as a Markov decision process and apply deep reinforcement learning (RL) to train refinement policies directly from simulation. AMR poses a new problem for RL as both the state dimension and available action set changes at every step, which we solve by proposing new policy architectures with differing generality and inductive bias. The model sizes of these policy architectures are independent of the mesh size and hence can be deployed on larger simulations than those used at train time. We demonstrate in comprehensive experiments on static function estimation and time-dependent equations that RL policies can be trained on problems without using ground truth solutions, are competitive with a widely-used error estimator, and generalize to larger, more complex, and unseen test problems., AISTATS 2023. 18 pages, 15 figures
Published: 2021

28. Noisy Gradient Descent Converges to Flat Minima for Nonconvex Matrix Factorization

Author: Liu, Tianyi, Li, Yan, Wei, Song, Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Numerous empirical evidences have corroborated the importance of noise in nonconvex optimization problems. The theory behind such empirical observations, however, is still largely unknown. This paper studies this fundamental problem through investigating the nonconvex rectangular matrix factorization problem, which has infinitely many global minima due to rotation and scaling invariance. Hence, gradient descent (GD) can converge to any optimum, depending on the initialization. In contrast, we show that a perturbed form of GD with an arbitrary initialization converges to a global optimum that is uniquely determined by the injected noise. Our result implies that the noise imposes implicit bias towards certain optima. Numerical experiments are provided to support our theory.
Published: 2021

29. Adversarial Regularization as Stackelberg Game: An Unrolled Optimization Approach

Author: Zuo, Simiao, Liang, Chen, Jiang, Haoming, Liu, Xiaodong, He, Pengcheng, Gao, Jianfeng, Chen, Weizhu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Adversarial regularization has been shown to improve the generalization performance of deep learning models in various natural language processing tasks. Existing works usually formulate the method as a zero-sum game, which is solved by alternating gradient descent/ascent algorithms. Such a formulation treats the adversarial and the defending players equally, which is undesirable because only the defending player contributes to the generalization performance. To address this issue, we propose Stackelberg Adversarial Regularization (SALT), which formulates adversarial regularization as a Stackelberg game. This formulation induces a competition between a leader and a follower, where the follower generates perturbations, and the leader trains the model subject to the perturbations. Different from conventional approaches, in SALT, the leader is in an advantageous position. When the leader moves, it recognizes the strategy of the follower and takes the anticipated follower's outcomes into consideration. Such a leader's advantage enables us to improve the model fitting to the unperturbed data. The leader's strategic information is captured by the Stackelberg gradient, which is obtained using an unrolling algorithm. Our experimental results on a set of machine translation and natural language understanding tasks show that SALT outperforms existing adversarial regularization baselines across all tasks. Our code is available at https://github.com/SimiaoZuo/Stackelberg-Adv., Comment: EMNLP 2021
Published: 2021

30. Frequency-aware SGD for Efficient Embedding Learning with Provable Benefits

Author: Li, Yan, Choudhary, Dhruv, Wei, Xiaohan, Yuan, Baichuan, Bhushanam, Bhargav, Zhao, Tuo, and Lan, Guanghui
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), FOS: Mathematics, Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Embedding learning has found widespread applications in recommendation systems and natural language modeling, among other domains. To learn quality embeddings efficiently, adaptive learning rate algorithms have demonstrated superior empirical performance over SGD, largely accredited to their token-dependent learning rate. However, the underlying mechanism for the efficiency of token-dependent learning rate remains underexplored. We show that incorporating frequency information of tokens in the embedding learning problems leads to provably efficient algorithms, and demonstrate that common adaptive algorithms implicitly exploit the frequency information to a large extent. Specifically, we propose (Counter-based) Frequency-aware Stochastic Gradient Descent, which applies a frequency-dependent learning rate for each token, and exhibits provable speed-up compared to SGD when the token distribution is imbalanced. Empirically, we show the proposed algorithms are able to improve or match adaptive algorithms on benchmark recommendation tasks and a large-scale industrial recommendation system, closing the performance gap between SGD and adaptive algorithms. Our results are the first to show token-dependent learning rate provably improves convergence for non-convex embedding learning problems., Comment: Additional experiments on Word2Vec embedding learning included
Published: 2021
Full Text: View/download PDF

31. Towards Automatic Evaluation of Dialog Systems: A Model-Free Off-Policy Evaluation Approach

Author: Jiang, Haoming, Dai, Bo, Yang, Mengjiao, Zhao, Tuo, and Wei, Wei
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Reliable automatic evaluation of dialogue systems under an interactive environment has long been overdue. An ideal environment for evaluating dialog systems, also known as the Turing test, needs to involve human interaction, which is usually not affordable for large-scale experiments. Though researchers have attempted to use metrics (e.g., perplexity, BLEU) in language generation tasks or some model-based reinforcement learning methods (e.g., self-play evaluation) for automatic evaluation, these methods only show a very weak correlation with the actual human evaluation in practice. To bridge such a gap, we propose a new framework named ENIGMA for estimating human evaluation scores based on recent advances of off-policy evaluation in reinforcement learning. ENIGMA only requires a handful of pre-collected experience data, and therefore does not involve human interaction with the target policy during the evaluation, making automatic evaluations feasible. More importantly, ENIGMA is model-free and agnostic to the behavior policies for collecting the experience data (see details in Section 2), which significantly alleviates the technical difficulties of modeling complex dialogue environments and human behaviors. Our experiments show that ENIGMA significantly outperforms existing methods in terms of correlation with human evaluation scores., Comment: Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021
Published: 2021
Full Text: View/download PDF

32. Taming Sparsely Activated Transformer with Stochastic Experts

Author: Zuo, Simiao, Liu, Xiaodong, Jiao, Jian, Kim, Young Jin, Hassan, Hany, Zhang, Ruofei, Zhao, Tuo, and Gao, Jianfeng
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Computer Science - Computation and Language, Computation and Language (cs.CL), Machine Learning (cs.LG)
Abstract: Sparsely activated models (SAMs), such as Mixture-of-Experts (MoE), can easily scale to have outrageously large amounts of parameters without significant increase in computational cost. However, SAMs are reported to be parameter inefficient such that larger models do not always lead to better performance. While most on-going research focuses on improving SAMs models by exploring methods of routing inputs to experts, our analysis reveals that such research might not lead to the solution we expect, i.e., the commonly-used routing methods based on gating mechanisms do not work better than randomly routing inputs to experts. In this paper, we propose a new expert-based model, THOR (Transformer witH StOchastic ExpeRts). Unlike classic expert-based models, such as the Switch Transformer, experts in THOR are randomly activated for each input during training and inference. THOR models are trained using a consistency regularized loss, where experts learn not only from training data but also from other experts as teachers, such that all the experts make consistent predictions. We validate the effectiveness of THOR on machine translation tasks. Results show that THOR models are more parameter efficient in that they significantly outperform the Transformer and MoE models across various settings. For example, in multilingual translation, THOR outperforms the Switch Transformer by 2 BLEU scores, and obtains the same BLEU score as that of a state-of-the-art MoE model that is 18 times larger. Our code is publicly available at: https://github.com/microsoft/Stochastic-Mixture-of-Experts., Comment: ICLR 2022
Published: 2021
Full Text: View/download PDF

33. Besov Function Approximation and Binary Classification on Low-Dimensional Manifolds Using Convolutional Residual Networks

Author: Liu, Hao, Chen, Minshuo, Zhao, Tuo, and Liao, Wenjing
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Most of existing statistical theories on deep neural networks have sample complexities cursed by the data dimension and therefore cannot well explain the empirical success of deep learning on high-dimensional data. To bridge this gap, we propose to exploit low-dimensional geometric structures of the real world data sets. We establish theoretical guarantees of convolutional residual networks (ConvResNet) in terms of function approximation and statistical estimation for binary classification. Specifically, given the data lying on a $d$-dimensional manifold isometrically embedded in $\mathbb{R}^D$, we prove that if the network architecture is properly chosen, ConvResNets can (1) approximate Besov functions on manifolds with arbitrary accuracy, and (2) learn a classifier by minimizing the empirical logistic risk, which gives an excess risk in the order of $n^{-\frac{s}{2s+2(s\vee d)}}$, where $s$ is a smoothness parameter. This implies that the sample complexity depends on the intrinsic dimension $d$, instead of the data dimension $D$. Our results demonstrate that ConvResNets are adaptive to low-dimensional structures of data sets.
Published: 2021
Full Text: View/download PDF

34. A Hypergradient Approach to Robust Regression without Correspondence

Author: Xie, Yujia, Mao, Yixiu, Zuo, Simiao, Xu, Hongteng, Ye, Xiaojing, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: We consider a variant of regression problem, where the correspondence between input and output data is not available. Such shuffled data is commonly observed in many real world problems. Taking flow cytometry as an example, the measuring instruments may not be able to maintain the correspondence between the samples and the measurements. Due to the combinatorial nature of the problem, most existing methods are only applicable when the sample size is small, and limited to linear regression models. To overcome such bottlenecks, we propose a new computational framework -- ROBOT -- for the shuffled regression problem, which is applicable to large data and complex nonlinear models. Specifically, we reformulate the regression without correspondence as a continuous optimization problem. Then by exploiting the interaction between the regression model and the data correspondence, we develop a hypergradient approach based on differentiable programming techniques. Such a hypergradient approach essentially views the data correspondence as an operator of the regression, and therefore allows us to find a better descent direction for the model parameter by differentiating through the data correspondence. ROBOT can be further extended to the inexact correspondence setting, where there may not be an exact alignment between the input and output data. Thorough numerical experiments show that ROBOT achieves better performance than existing methods in both linear and nonlinear regression tasks, including real-world applications such as flow cytometry and multi-object tracking.
Published: 2020

35. Picasso: A Sparse Learning Library for High Dimensional Data Analysis in R and Python

Author: Ge, Jason, Li, Xingguo, Jiang, Haoming, Liu, Han, Zhang, Tong, Wang, Mengdi, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Statistics::Machine Learning, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, MathematicsofComputing_NUMERICALANALYSIS, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG), Statistics::Computation
Abstract: We describe a new library named picasso, which implements a unified framework of pathwise coordinate optimization for a variety of sparse learning problems (e.g., sparse linear regression, sparse logistic regression, sparse Poisson regression and scaled sparse linear regression) combined with efficient active set selection strategies. Besides, the library allows users to choose different sparsity-inducing regularizers, including the convex $\ell_1$, nonconvex MCP and SCAD regularizers. The library is coded in C++ and has user-friendly R and Python wrappers. Numerical experiments demonstrate that picasso can scale up to large problems efficiently.
Published: 2020

36. Towards Understanding Hierarchical Learning: Benefits of Neural Representations

Author: Chen, Minshuo, Bai, Yu, Lee, Jason D., Zhao, Tuo, Wang, Huan, Xiong, Caiming, and Socher, Richard
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Deep neural networks can empirically perform efficient hierarchical learning, in which the layers learn useful representations of the data. However, how they make use of the intermediate representations are not explained by recent theories that relate them to "shallow learners" such as kernels. In this work, we demonstrate that intermediate neural representations add more flexibility to neural networks and can be advantageous over raw inputs. We consider a fixed, randomly initialized neural network as a representation function fed into another trainable network. When the trainable network is the quadratic Taylor model of a wide two-layer network, we show that neural representation can achieve improved sample complexities compared with the raw input: For learning a low-rank degree-$p$ polynomial ($p \geq 4$) in $d$ dimension, neural representation requires only $\tilde{O}(d^{\lceil p/2 \rceil})$ samples, while the best-known sample complexity upper bound for the raw input is $\tilde{O}(d^{p-1})$. We contrast our result with a lower bound showing that neural representations do not improve over the raw input (in the infinite width limit), when the trainable network is instead a neural tangent kernel. Our results characterize when neural representations are beneficial, and may provide a new perspective on why depth is important in deep learning., 41 pages, published in NeurIPS 2020
Published: 2020

37. Differentiable Top-k Operator with Optimal Transport

Author: Xie, Yujia, Dai, Hanjun, Chen, Minshuo, Dai, Bo, Zhao, Tuo, Zha, Hongyuan, Wei, Wei, and Pfister, Tomas
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: The top-k operation, i.e., finding the k largest or smallest elements from a collection of scores, is an important model component, which is widely used in information retrieval, machine learning, and data mining. However, if the top-k operation is implemented in an algorithmic way, e.g., using bubble algorithm, the resulting model cannot be trained in an end-to-end way using prevalent gradient descent algorithms. This is because these implementations typically involve swapping indices, whose gradient cannot be computed. Moreover, the corresponding mapping from the input scores to the indicator vector of whether this element belongs to the top-k set is essentially discontinuous. To address the issue, we propose a smoothed approximation, namely the SOFT (Scalable Optimal transport-based diFferenTiable) top-k operator. Specifically, our SOFT top-k operator approximates the output of the top-k operation as the solution of an Entropic Optimal Transport (EOT) problem. The gradient of the SOFT operator can then be efficiently approximated based on the optimality conditions of EOT problem. We apply the proposed operator to the k-nearest neighbors and beam search algorithms, and demonstrate improved performance.
Published: 2020

38. Why Do Deep Residual Networks Generalize Better than Deep Feedforward Networks? -- A Neural Tangent Kernel Perspective

Author: Huang, Kaixuan, Wang, Yuqing, Tao, Molei, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Deep residual networks (ResNets) have demonstrated better generalization performance than deep feedforward networks (FFNets). However, the theory behind such a phenomenon is still largely unknown. This paper studies this fundamental problem in deep learning from a so-called "neural tangent kernel" perspective. Specifically, we first show that under proper conditions, as the width goes to infinity, training deep ResNets can be viewed as learning reproducing kernel functions with some kernel function. We then compare the kernel of deep ResNets with that of deep FFNets and discover that the class of functions induced by the kernel of FFNets is asymptotically not learnable, as the depth goes to infinity. In contrast, the class of functions induced by the kernel of ResNets does not exhibit such degeneracy. Our discovery partially justifies the advantages of deep ResNets over deep FFNets in generalization abilities. Numerical results are provided to support our claim., Accepted in NeurIPS 2020
Published: 2020

39. Doubly Robust Off-Policy Learning on Low-Dimensional Manifolds by Deep Neural Networks

Author: Chen, Minshuo, Liu, Hao, Liao, Wenjing, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Causal inference explores the causation between actions and the consequent rewards on a covariate set. Recently deep learning has achieved a remarkable performance in causal inference, but existing statistical theories cannot well explain such an empirical success, especially when the covariates are high-dimensional. Most theoretical results in causal inference are asymptotic, suffer from the curse of dimensionality, and only work for the finite-action scenario. To bridge such a gap between theory and practice, this paper studies doubly robust off-policy learning by deep neural networks. When the covariates lie on a low-dimensional manifold, we prove nonasymptotic regret bounds, which converge at a fast rate depending on the intrinsic dimension of the manifold. Our results cover both the finite- and continuous-action scenarios. Our theory shows that deep neural networks are adaptive to the low-dimensional geometric structures of the covariates, and partially explains the success of deep learning for causal inference.
Published: 2020
Full Text: View/download PDF

40. Transformer Hawkes Process

Author: Zuo, Simiao, Jiang, Haoming, Li, Zichong, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Modern data acquisition routinely produce massive amounts of event sequence data in various domains, such as social media, healthcare, and financial markets. These data often exhibit complicated short-term and long-term temporal dependencies. However, most of the existing recurrent neural network based point process models fail to capture such dependencies, and yield unreliable prediction performance. To address this issue, we propose a Transformer Hawkes Process (THP) model, which leverages the self-attention mechanism to capture long-term dependencies and meanwhile enjoys computational efficiency. Numerical experiments on various datasets show that THP outperforms existing models in terms of both likelihood and event prediction accuracy by a notable margin. Moreover, THP is quite general and can incorporate additional structural knowledge. We provide a concrete example, where THP achieves improved prediction performance for learning multiple point processes when incorporating their relational information.
Published: 2020
Full Text: View/download PDF

41. The flare Package for High Dimensional Linear Regression and Precision Matrix Estimation in R

Author: Li, Xingguo, Zhao, Tuo, Yuan, Xiaoming, and Liu, Han
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics::Theory, Statistics::Machine Learning, Statistics - Machine Learning, Statistics::Methodology, Computer Science - Mathematical Software, Machine Learning (stat.ML), Mathematical Software (cs.MS), Statistics::Computation, Machine Learning (cs.LG)
Abstract: This paper describes an R package named flare, which implements a family of new high dimensional regression methods (LAD Lasso, SQRT Lasso, $\ell_q$ Lasso, and Dantzig selector) and their extensions to sparse precision matrix estimation (TIGER and CLIME). These methods exploit different nonsmooth loss functions to gain modeling flexibility, estimation robustness, and tuning insensitiveness. The developed solver is based on the alternating direction method of multipliers (ADMM). The package flare is coded in double precision C, and called from R by a user-friendly interface. The memory usage is optimized by using the sparse matrix output. The experiments show that flare is efficient and can scale up to large problems.
Published: 2020
Full Text: View/download PDF

42. Deep Reinforcement Learning with Robust and Smooth Policy

Author: Shen, Qianli, Li, Yan, Jiang, Haoming, Wang, Zhaoran, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Artificial Intelligence (cs.AI), Computer Science - Artificial Intelligence, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Deep reinforcement learning (RL) has achieved great empirical successes in various domains. However, the large search space of neural networks requires a large amount of data, which makes the current RL algorithms not sample efficient. Motivated by the fact that many environments with continuous state space have smooth transitions, we propose to learn a smooth policy that behaves smoothly with respect to states. We develop a new framework -- \textbf{S}mooth \textbf{R}egularized \textbf{R}einforcement \textbf{L}earning ($\textbf{SR}^2\textbf{L}$), where the policy is trained with smoothness-inducing regularization. Such regularization effectively constrains the search space, and enforces smoothness in the learned policy. Moreover, our proposed framework can also improve the robustness of policy against measurement error in the state space, and can be naturally extended to distribubutionally robust setting. We apply the proposed framework to both on-policy (TRPO) and off-policy algorithm (DDPG). Through extensive experiments, we demonstrate that our method achieves improved sample efficiency and robustness., Comment: ICML 2020
Published: 2020
Full Text: View/download PDF

43. On Computation and Generalization of Generative Adversarial Imitation Learning

Author: Chen, Minshuo, Wang, Yizhou, Liu, Tianyi, Yang, Zhuoran, Li, Xingguo, Wang, Zhaoran, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative Adversarial Imitation Learning (GAIL) is a powerful and practical approach for learning sequential decision-making policies. Different from Reinforcement Learning (RL), GAIL takes advantage of demonstration data by experts (e.g., human), and learns both the policy and reward function of the unknown environment. Despite the significant empirical progresses, the theory behind GAIL is still largely unknown. The major difficulty comes from the underlying temporal dependency of the demonstration data and the minimax computational formulation of GAIL without convex-concave structure. To bridge such a gap between theory and practice, this paper investigates the theoretical properties of GAIL. Specifically, we show: (1) For GAIL with general reward parameterization, the generalization can be guaranteed as long as the class of the reward functions is properly controlled; (2) For GAIL, where the reward is parameterized as a reproducing kernel function, GAIL can be efficiently solved by stochastic first order optimization algorithms, which attain sublinear convergence to a stationary solution. To the best of our knowledge, these are the first results on statistical and computational guarantees of imitation learning with reward/policy function approximation. Numerical experiments are provided to support our analysis.
Published: 2020
Full Text: View/download PDF

44. How Important is the Train-Validation Split in Meta-Learning?

Author: Bai, Yu, Chen, Minshuo, Zhou, Pan, Zhao, Tuo, Lee, Jason D., Kakade, Sham, Wang, Huan, and Xiong, Caiming
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Meta-learning aims to perform fast adaptation on a new task through learning a "prior" from multiple existing tasks. A common practice in meta-learning is to perform a train-validation split (\emph{train-val method}) where the prior adapts to the task on one split of the data, and the resulting predictor is evaluated on another split. Despite its prevalence, the importance of the train-validation split is not well understood either in theory or in practice, particularly in comparison to the more direct \emph{train-train method}, which uses all the per-task data for both training and evaluation. We provide a detailed theoretical study on whether and when the train-validation split is helpful in the linear centroid meta-learning problem. In the agnostic case, we show that the expected loss of the train-val method is minimized at the optimal prior for meta testing, and this is not the case for the train-train method in general without structural assumptions on the data. In contrast, in the realizable case where the data are generated from linear models, we show that both the train-val and train-train losses are minimized at the optimal prior in expectation. Further, perhaps surprisingly, our main result shows that the train-train method achieves a \emph{strictly better} excess loss in this realizable case, even when the regularization parameter and split ratio are optimally tuned for both methods. Our results highlight that sample splitting may not always be preferable, especially when the data is realizable by the model. We validate our theories by experimentally showing that the train-train method can indeed outperform the train-val method, on both simulations and real meta-learning tasks.
Published: 2020
Full Text: View/download PDF

45. Distribution Approximation and Statistical Estimation Guarantees of Generative Adversarial Networks

Author: Chen, Minshuo, Liao, Wenjing, Zha, Hongyuan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Generative Adversarial Networks (GANs) have achieved a great success in unsupervised learning. Despite its remarkable empirical performance, there are limited theoretical studies on the statistical properties of GANs. This paper provides approximation and statistical guarantees of GANs for the estimation of data distributions that have densities in a H\"{o}lder space. Our main result shows that, if the generator and discriminator network architectures are properly chosen, GANs are consistent estimators of data distributions under strong discrepancy metrics, such as the Wasserstein-1 distance. Furthermore, when the data distribution exhibits low-dimensional structures, we show that GANs are capable of capturing the unknown low-dimensional structures in data and enjoy a fast statistical convergence, which is free of curse of the ambient dimensionality. Our analysis for low-dimensional data builds upon a universal approximation theory of neural networks with Lipschitz continuity guarantees, which may be of independent interest., Comment: Version two extends to low-dimensional linear and mixture distributions
Published: 2020
Full Text: View/download PDF

46. On Generalization Bounds of a Family of Recurrent Neural Networks

Author: Chen, Minshuo, Li, Xingguo, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Recurrent Neural Networks (RNNs) have been widely applied to sequential data analysis. Due to their complicated modeling structures, however, the theory behind is still largely missing. To connect theory and practice, we study the generalization properties of vanilla RNNs as well as their variants, including Minimal Gated Unit (MGU), Long Short Term Memory (LSTM), and Convolutional (Conv) RNNs. Specifically, our theory is established under the PAC-Learning framework. The generalization bound is presented in terms of the spectral norms of the weight matrices and the total number of parameters. We also establish refined generalization bounds with additional norm assumptions, and draw a comparison among these bounds. We remark: (1) Our generalization bound for vanilla RNNs is significantly tighter than the best of existing results; (2) We are not aware of any other generalization bounds for MGU, LSTM, and Conv RNNs in the exiting literature; (3) We demonstrate the advantages of these variants in generalization., 30 pages, 5 figures
Published: 2019

47. Towards Understanding the Importance of Shortcut Connections in Residual Networks

Author: Liu, Tianyi, Chen, Minshuo, Zhou, Mo, Du, Simon S., Zhou, Enlu, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Optimization and Control (math.OC), Statistics - Machine Learning, FOS: Mathematics, Machine Learning (stat.ML), Mathematics - Optimization and Control, Machine Learning (cs.LG)
Abstract: Residual Network (ResNet) is undoubtedly a milestone in deep learning. ResNet is equipped with shortcut connections between layers, and exhibits efficient training using simple first order algorithms. Despite of the great empirical success, the reason behind is far from being well understood. In this paper, we study a two-layer non-overlapping convolutional ResNet. Training such a network requires solving a non-convex optimization problem with a spurious local optimum. We show, however, that gradient descent combined with proper normalization, avoids being trapped by the spurious local optimum, and converges to a global optimum in polynomial time, when the weight of the first layer is initialized at 0, and that of the second layer is initialized arbitrarily in a ball. Numerical experiments are provided to support our theory., Thirty-third Conference on Neural Information Processing Systems, 2019
Published: 2019

48. Meta Learning with Relational Information for Short Sequences

Author: Xie, Yujia, Jiang, Haoming, Liu, Feng, Zhao, Tuo, and Zha, Hongyuan
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: This paper proposes a new meta-learning method -- named HARMLESS (HAwkes Relational Meta LEarning method for Short Sequences) for learning heterogeneous point process models from short event sequence data along with a relational network. Specifically, we propose a hierarchical Bayesian mixture Hawkes process model, which naturally incorporates the relational information among sequences into point process modeling. Compared with existing methods, our model can capture the underlying mixed-community patterns of the relational network, which simultaneously encourages knowledge sharing among sequences and facilitates adaptive learning for each individual sequence. We further propose an efficient stochastic variational meta expectation maximization algorithm that can scale to large problems. Numerical experiments on both synthetic and real data show that HARMLESS outperforms existing methods in terms of predicting the future events.
Published: 2019

49. Inductive Bias of Gradient Descent based Adversarial Training on Separable Data

Author: Li, Yan, Fang, Ethan X., Xu, Huan, and Zhao, Tuo
Subjects: FOS: Computer and information sciences, Computer Science - Machine Learning, Statistics - Machine Learning, Machine Learning (stat.ML), Machine Learning (cs.LG)
Abstract: Adversarial training is a principled approach for training robust neural networks. Despite of tremendous successes in practice, its theoretical properties still remain largely unexplored. In this paper, we provide new theoretical insights of gradient descent based adversarial training by studying its computational properties, specifically on its inductive bias. We take the binary classification task on linearly separable data as an illustrative example, where the loss asymptotically attains its infimum as the parameter diverges to infinity along certain directions. Specifically, we show that when the adversarial perturbation during training has bounded $\ell_2$-norm, the classifier learned by gradient descent based adversarial training converges in direction to the maximum $\ell_2$-norm margin classifier at the rate of $\tilde{\mathcal{O}}(1/\sqrt{T})$, significantly faster than the rate $\mathcal{O}(1/\log T)$ of training with clean data. In addition, when the adversarial perturbation during training has bounded $\ell_q$-norm for some $q\ge 1$, the resulting classifier converges in direction to a maximum mixed-norm margin classifier, which has a natural interpretation of robustness, as being the maximum $\ell_2$-norm margin classifier under worst-case $\ell_q$-norm perturbation to the data. Our findings provide theoretical backups for adversarial training that it indeed promotes robustness against adversarial perturbation.
Published: 2019
Full Text: View/download PDF

50. Nonparametric Regression on Low-Dimensional Manifolds using Deep ReLU Networks : Function Approximation and Statistical Recovery

Author: Chen, Minshuo, Jiang, Haoming, Liao, Wenjing, and Zhao, Tuo
Subjects: Statistics and Probability, FOS: Computer and information sciences, Computer Science - Machine Learning, Numerical Analysis, Computational Theory and Mathematics, Statistics - Machine Learning, Applied Mathematics, Machine Learning (stat.ML), Analysis, Machine Learning (cs.LG)
Abstract: Real-world data often exhibit low-dimensional geometric structures and can be viewed as samples near a low-dimensional manifold. This paper studies nonparametric regression of Hölder functions on low-dimensional manifolds using deep Rectified Linear Unit (ReLU) networks. Suppose $n$ training data are sampled from a Hölder function in $\mathcal{H}^{\ s,\alpha }$ supported on a $d$-dimensional Riemannian manifold isometrically embedded in $\mathbb{R}^D$. A deep ReLU network architecture is designed to estimate the underlying function from the training data. The mean squared error of the empirical estimator is proved to converge in the order of $n^{-\frac{2(s+\alpha )}{2(s+\alpha ) + d}}\log ^3 n$. This result shows that deep ReLU networks give rise to a fast convergence rate depending on the data intrinsic dimension $d$, which is usually much smaller than the ambient dimension $D$. It therefore demonstrates the adaptivity of deep ReLU networks to low-dimensional geometric structures in data and partially explains the power of deep ReLU networks in tackling high-dimensional data with low-dimensional geometric structures.
Published: 2019
Full Text: View/download PDF

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Database

Publisher

68 results on '"Zhao, Tuo"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources