10 results for "Liao, Fangshuo"
Search Results
2. On the Error-Propagation of Inexact Hotelling's Deflation for Principal Component Analysis
- Author
- Liao, Fangshuo, Kim, Junhyung Lyle, Barnum, Cruz, and Kyrillidis, Anastasios
- Subjects
- Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
- Abstract
Principal Component Analysis (PCA) aims to find subspaces spanned by the so-called principal components that best represent the variance in the dataset. The deflation method is a popular meta-algorithm that sequentially finds individual principal components, starting from the most important ones and working towards the less important ones. However, as deflation proceeds, numerical errors from the imprecise estimation of principal components propagate due to its sequential nature. This paper mathematically characterizes the error propagation of the inexact Hotelling's deflation method. We consider two scenarios: $i)$ when the sub-routine for finding the leading eigenvector is abstract and can represent various algorithms; and $ii)$ when power iteration is used as the sub-routine. In the latter case, the additional directional information from power iteration allows us to obtain a tighter error bound than in the sub-routine-agnostic case. For both scenarios, we explicitly characterize how the errors progress and affect subsequent principal component estimations. (A minimal sketch of Hotelling's deflation with power iteration follows this entry.)
- Comment: ICML 2024
- Published
- 2023
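The deflation procedure described above is easy to state concretely. Below is a minimal sketch, assuming a NumPy implementation with a fixed number of power-iteration steps as the inexact sub-routine; the function names, iteration counts, and toy data are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch (not the paper's code): inexact Hotelling's deflation for PCA,
# with power iteration as the sub-routine for the leading eigenvector.
import numpy as np

def power_iteration(A, num_iters=50, rng=None):
    """Approximate the leading eigenvector of a symmetric PSD matrix A."""
    rng = np.random.default_rng() if rng is None else rng
    v = rng.standard_normal(A.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A @ v
        v /= np.linalg.norm(v)
    return v

def hotelling_deflation(A, k, num_iters=50):
    """Sequentially estimate k principal components. With finitely many
    power-iteration steps each estimate is inexact, and the error propagates
    through the deflated matrices."""
    A = A.copy()
    components = []
    for _ in range(k):
        v = power_iteration(A, num_iters=num_iters)
        components.append(v)
        # Hotelling's deflation: subtract the estimated component's contribution.
        A = A - (v @ A @ v) * np.outer(v, v)
    return np.stack(components)

# Toy usage: sample covariance of random data, estimate the top-3 components.
X = np.random.default_rng(0).standard_normal((200, 10))
C = X.T @ X / X.shape[0]
V = hotelling_deflation(C, k=3)
```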
3. Provable Accelerated Convergence of Nesterov's Momentum for Deep ReLU Neural Networks
- Author
- Liao, Fangshuo and Kyrillidis, Anastasios
- Subjects
- Computer Science - Machine Learning, Mathematics - Optimization and Control
- Abstract
Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum achieves acceleration in theory for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, making this, to the best of our knowledge, the first work to prove an accelerated convergence rate for non-trivial neural network architectures. (A minimal sketch of the Nesterov momentum update follows this entry.)
- Comment: Accepted by ALT 2024
- Published
- 2023
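For reference, the Nesterov momentum update analyzed in this line of work takes a gradient step at a look-ahead point and then extrapolates. The sketch below applies it to a simple quadratic stand-in rather than the deep ReLU objective studied in the paper; the step size and momentum coefficient are assumed values for illustration.

```python
# Minimal sketch of Nesterov's momentum on a smooth objective (a toy quadratic,
# not the deep ReLU setting analyzed in the paper).
import numpy as np

def nesterov(grad, x0, lr=0.1, momentum=0.9, num_steps=100):
    x = x0.copy()
    y = x0.copy()                              # "look-ahead" point
    for _ in range(num_steps):
        x_next = y - lr * grad(y)              # gradient step at the look-ahead point
        y = x_next + momentum * (x_next - x)   # extrapolation (momentum) step
        x = x_next
    return x

# Toy usage: minimize f(x) = 0.5 * x^T A x - b^T x.
A = np.diag([1.0, 10.0])
b = np.array([1.0, 1.0])
grad = lambda x: A @ x - b
x_star = nesterov(grad, x0=np.zeros(2), lr=0.05, momentum=0.9)
```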
4. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time
- Author
- Liu, Zichang, Desai, Aditya, Liao, Fangshuo, Wang, Weitao, Xie, Victor, Xu, Zhaozhuo, Kyrillidis, Anastasios, and Shrivastava, Anshumali
- Subjects
- Computer Science - Machine Learning, Computer Science - Computation and Language
- Abstract
Large language models (LLMs) have sparked a new wave of exciting AI applications. Hosting these models at scale requires significant memory resources. One crucial memory bottleneck for deployment stems from the context window. It is commonly recognized that model weights are memory hungry; however, the size of the key-value embeddings stored during the generation process (the KV cache) can easily surpass the model size. The enormous size of the KV cache puts constraints on the inference batch size, which is crucial for high-throughput inference workloads. Inspired by an interesting observation of the attention scores, we hypothesize the persistence of importance: only pivotal tokens, which had a substantial influence at one step, will significantly influence future generations. Based on our empirical verification and theoretical analysis around this hypothesis, we propose Scissorhands, a system that maintains the memory usage of the KV cache at a fixed budget without finetuning the model. In essence, Scissorhands manages the KV cache by storing the pivotal tokens with a higher probability. We validate that Scissorhands reduces the inference memory usage of the KV cache by up to 5X without compromising model quality. We further demonstrate that Scissorhands can be combined with 4-bit quantization, traditionally used to compress model weights, to achieve up to 20X compression. (A toy sketch of budget-constrained KV-cache eviction follows this entry.)
- Published
- 2023
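The persistence-of-importance hypothesis suggests a simple cache-management policy: once the KV cache exceeds a fixed budget, retain the tokens whose accumulated attention scores are largest. The sketch below is a deterministic top-k toy version of that idea (the paper itself keeps pivotal tokens with a higher probability rather than deterministically); all names and shapes are assumptions, not the Scissorhands system.

```python
# Toy illustration of budget-constrained KV-cache eviction guided by
# accumulated attention scores. A simplified sketch, not Scissorhands itself.
import numpy as np

def compress_kv_cache(keys, values, attn_history, budget):
    """keys, values: (seq_len, d); attn_history: accumulated attention score per
    token, shape (seq_len,). Returns a cache trimmed to at most `budget` tokens."""
    if keys.shape[0] <= budget:
        return keys, values, attn_history
    # Tokens that mattered before are assumed to keep mattering: retain the
    # `budget` tokens with the highest accumulated attention.
    keep = np.argsort(attn_history)[-budget:]
    keep.sort()  # preserve the original ordering of the kept tokens
    return keys[keep], values[keep], attn_history[keep]

# Toy usage: a 16-token cache trimmed to an 8-token budget.
rng = np.random.default_rng(0)
K, V = rng.standard_normal((16, 4)), rng.standard_normal((16, 4))
scores = rng.random(16)
K_small, V_small, scores_small = compress_kv_cache(K, V, scores, budget=8)
```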
5. Strong Lottery Ticket Hypothesis with $\varepsilon$-perturbation
- Author
- Xiong, Zheyang, Liao, Fangshuo, and Kyrillidis, Anastasios
- Subjects
- Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Information Theory, Mathematics - Optimization and Control
- Abstract
The strong Lottery Ticket Hypothesis (LTH) claims the existence of a subnetwork in a sufficiently large, randomly initialized neural network that approximates some target neural network without the need for training. We extend the theoretical guarantees of the strong LTH literature to a scenario closer to the original LTH, by generalizing the weight change in the pre-training step to some perturbation around initialization. In particular, we focus on the following open questions: by allowing an $\varepsilon$-scale perturbation on the random initial weights, can we reduce the over-parameterization requirement for the candidate network in the strong LTH? Furthermore, does the weight change produced by SGD coincide with a good set of such perturbations? We answer the first question by first extending the theoretical result on subset sum to allow perturbation of the candidates. Applying this result to the neural network setting, we show that such $\varepsilon$-perturbation reduces the over-parameterization requirement of the strong LTH. To answer the second question, we show via experiments that the perturbed weights obtained by projected SGD perform better under strong-LTH pruning. (A toy subset-sum sketch of the $\varepsilon$-perturbation idea follows this entry.)
- Published
- 2022
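The subset-sum viewpoint mentioned in the abstract can be pictured directly: approximate a target weight by the sum of a subset of random candidate weights, where each chosen candidate may additionally be shifted by at most epsilon. The brute-force check below is a toy illustration under that assumption, not the paper's construction.

```python
# Toy subset-sum sketch behind the strong LTH with epsilon-perturbation:
# each chosen candidate may be moved by at most epsilon toward the target.
from itertools import combinations
import numpy as np

def best_subset_sum_error(target, candidates, epsilon=0.0):
    """Smallest achievable |target - sum(subset)| when every chosen candidate
    can be shifted by at most epsilon."""
    best = abs(target)  # the empty subset is always allowed
    n = len(candidates)
    for r in range(1, n + 1):
        for subset in combinations(candidates, r):
            gap = abs(target - sum(subset))
            # Each of the r chosen candidates can absorb up to epsilon of the gap.
            best = min(best, max(0.0, gap - epsilon * r))
    return best

# Toy usage: a small epsilon lets the same random candidates reach a smaller error.
rng = np.random.default_rng(0)
cands = rng.uniform(-1, 1, size=8).tolist()
print(best_subset_sum_error(0.37, cands, epsilon=0.0))
print(best_subset_sum_error(0.37, cands, epsilon=0.05))
```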
6. LOFT: Finding Lottery Tickets through Filter-wise Training
- Author
- Wang, Qihan, Dun, Chen, Liao, Fangshuo, Jermaine, Chris, and Kyrillidis, Anastasios
- Subjects
- Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Information Theory, Mathematics - Optimization and Control
- Abstract
Recent work on the Lottery Ticket Hypothesis (LTH) shows that there exist "winning tickets" in large neural networks. These tickets represent sparse versions of the full model that can be trained independently to achieve accuracy comparable to the full model. However, finding the winning tickets requires pretraining the large model for at least a number of epochs, which can be a burdensome task, especially as the original neural network gets larger. In this paper, we explore how one can efficiently identify the emergence of such winning tickets, and use this observation to design efficient pretraining algorithms. For clarity of exposition, our focus is on convolutional neural networks (CNNs). To identify good filters, we propose a novel filter distance metric that well represents model convergence. As our theory dictates, our filter analysis behaves consistently with recent findings on neural network learning dynamics. Motivated by these observations, we present the LOttery ticket through Filter-wise Training algorithm, dubbed LoFT. LoFT is a model-parallel pretraining algorithm that partitions convolutional layers by filters to train them independently in a distributed setting, resulting in reduced memory and communication costs during pretraining. Experiments show that LoFT $i)$ preserves and finds good lottery tickets, while $ii)$ it achieves non-trivial computation and communication savings and maintains comparable or even better accuracy than other pretraining methods. (A toy filter-distance sketch follows this entry.)
- Published
- 2022
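As one way to picture a filter distance metric, the sketch below compares the direction-normalized weights of each convolutional filter across two checkpoints; small distances indicate filters that have stopped moving. This specific metric is an assumption for illustration and may differ from the one proposed in LoFT.

```python
# Toy sketch of a per-filter distance between two checkpoints of a conv layer,
# in the spirit of LoFT's filter distance metric (the paper's metric may differ).
import numpy as np

def filter_distance(W_old, W_new):
    """W_old, W_new: conv weights of shape (out_channels, in_channels, kh, kw).
    Returns one distance per output filter, comparing directions only (norms are
    factored out), so the value reflects how much each filter has rotated."""
    f_old = W_old.reshape(W_old.shape[0], -1)
    f_new = W_new.reshape(W_new.shape[0], -1)
    f_old = f_old / np.linalg.norm(f_old, axis=1, keepdims=True)
    f_new = f_new / np.linalg.norm(f_new, axis=1, keepdims=True)
    return np.linalg.norm(f_old - f_new, axis=1)

# Toy usage: distances shrink as training converges and filters stop moving.
rng = np.random.default_rng(0)
W0 = rng.standard_normal((16, 3, 3, 3))
W1 = W0 + 0.01 * rng.standard_normal(W0.shape)
print(filter_distance(W0, W1).mean())
```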
7. On the Convergence of Shallow Neural Network Training with Randomly Masked Neurons
- Author
- Liao, Fangshuo and Kyrillidis, Anastasios
- Subjects
- Computer Science - Machine Learning
- Abstract
With the motive of training all the parameters of a neural network, we study why and when one can achieve this by iteratively creating, training, and combining randomly selected subnetworks. Such scenarios have either implicitly or explicitly emerged in the recent literature: see e.g., the Dropout family of regularization techniques, or some distributed ML training protocols that reduce communication/computation complexities, such as the Independent Subnet Training protocol. While these methods are studied empirically and utilized in practice, they often enjoy partial or no theoretical support, especially when applied on neural network-based objectives. In this manuscript, our focus is on overparameterized single hidden layer neural networks with ReLU activations in the lazy training regime. By carefully analyzing $i)$ the subnetworks' neural tangent kernel, $ii)$ the surrogate functions' gradient, and $iii)$ how we sample and combine the surrogate functions, we prove a linear convergence rate of the training error -- up to a neighborhood around the optimal point -- for an overparameterized single-hidden layer perceptron with a regression loss. Our analysis reveals a dependency of the size of the neighborhood around the optimal point on the number of surrogate models and the number of local training steps for each selected subnetwork. Moreover, the considered framework generalizes and provides new insights on dropout training, multi-sample dropout training, as well as Independent Subnet Training; for each case, we provide convergence results as corollaries of our main theorem. (A toy masked-subnetwork training sketch follows this entry.)
- Published
- 2021
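The training scheme studied here, sampling a random subnetwork, training the surrogate, and folding the update back into the full model, can be sketched in a few lines for a single-hidden-layer ReLU network. The learning rate, mask probability, dropout-style rescaling, and plain MSE objective below are assumptions for illustration, not the paper's exact setup.

```python
# Minimal sketch: train a one-hidden-layer ReLU network by repeatedly sampling a
# random mask over hidden neurons and updating only the masked (surrogate)
# subnetwork, in the spirit of dropout / Independent Subnet Training.
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 64, 5, 256                                # samples, input dim, hidden width
X = rng.standard_normal((n, d))
y = np.sin(X[:, 0])                                 # toy regression target
W = rng.standard_normal((m, d)) / np.sqrt(d)        # hidden-layer weights (trained)
a = rng.choice([-1.0, 1.0], size=m) / np.sqrt(m)    # output weights (fixed, lazy regime)

lr, p = 0.5, 0.5
for step in range(500):
    mask = rng.random(m) < p                        # randomly selected subnetwork
    H = np.maximum(X @ W.T, 0.0)                    # hidden activations, shape (n, m)
    pred = (H * mask) @ (a / p)                     # surrogate prediction (dropout rescaling)
    resid = pred - y
    # Gradient of the surrogate MSE w.r.t. the masked hidden weights only.
    grad_W = ((resid[:, None] * (X @ W.T > 0)) * (mask * a / p)).T @ X / n
    W -= lr * grad_W * mask[:, None]

# Training error of the combined (full) network after masked updates.
print(float(np.mean(((np.maximum(X @ W.T, 0.0) @ a) - y) ** 2)))
```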
8. How much pre-training is enough to discover a good subnetwork?
- Author
- Wolfe, Cameron R., Liao, Fangshuo, Wang, Qihan, Kim, Junhyung Lyle, and Kyrillidis, Anastasios
- Subjects
- Statistics - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Mathematics - Optimization and Control, 68T07, I.2.6, I.2.10, I.4.0
- Abstract
Neural network pruning is useful for discovering efficient, high-performing subnetworks within pre-trained, dense network architectures. More often than not, it involves a three-step process -- pre-training, pruning, and re-training -- that is computationally expensive, as the dense model must be fully pre-trained. While previous work has revealed through experiments the relationship between the amount of pre-training and the performance of the pruned network, a theoretical characterization of such dependency is still missing. Aiming to mathematically analyze the amount of dense network pre-training needed for a pruned network to perform well, we discover a simple theoretical bound on the number of gradient descent pre-training iterations on a two-layer, fully-connected network, beyond which pruning via greedy forward selection [61] yields a subnetwork that achieves good training error. Interestingly, this threshold is shown to be logarithmically dependent upon the size of the dataset, meaning that experiments with larger datasets require more pre-training for subnetworks obtained via pruning to perform well. Lastly, we empirically validate our theoretical results on a multi-layer perceptron trained on MNIST. (A toy greedy-forward-selection sketch follows this entry.)
- Comment: 29 pages
- Published
- 2021
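Greedy forward selection, the pruning rule referenced above, starts from an empty subnetwork and repeatedly adds the hidden neuron whose inclusion most reduces the training loss. The sketch below uses a simplified least-squares refit of the output layer at each step; the details of the cited procedure [61] may differ.

```python
# Toy sketch of greedy forward selection for pruning a two-layer network:
# greedily add the hidden neuron whose inclusion most reduces the training loss.
import numpy as np

def greedy_forward_selection(H, y, k):
    """H: (n, m) hidden-layer activations of the pre-trained dense network,
    y: (n,) targets. Greedily pick k neurons, refitting output weights each step."""
    selected = []
    for _ in range(k):
        best_j, best_loss = None, np.inf
        for j in range(H.shape[1]):
            if j in selected:
                continue
            cols = selected + [j]
            # Least-squares output layer on the candidate subnetwork.
            w, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
            loss = np.mean((H[:, cols] @ w - y) ** 2)
            if loss < best_loss:
                best_j, best_loss = j, loss
        selected.append(best_j)
    return selected

# Toy usage: select 5 of 50 random-feature neurons for a regression target.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
H = np.maximum(X @ rng.standard_normal((10, 50)), 0.0)
y = np.sin(X[:, 0])
print(greedy_forward_selection(H, y, k=5))
```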
9. GIST: distributed training for large-scale graph convolutional networks
- Author
- Wolfe, Cameron R., Yang, Jingkang, Liao, Fangshuo, Chowdhury, Arindam, Dun, Chen, Bayer, Artun, Segarra, Santiago, and Kyrillidis, Anastasios
- Published
- 2023
10. Accelerated Convergence of Nesterov's Momentum for Deep Neural Networks under Partial Strong Convexity
- Author
- Liao, Fangshuo and Kyrillidis, Anastasios
- Subjects
- FOS: Computer and information sciences, FOS: Mathematics, Computer Science - Machine Learning (cs.LG), Mathematics - Optimization and Control (math.OC)
- Abstract
Current state-of-the-art analyses of the convergence of gradient descent for training neural networks focus on characterizing properties of the loss landscape, such as the Polyak-Łojasiewicz (PL) condition and restricted strong convexity. While gradient descent converges linearly under such conditions, it remains an open question whether Nesterov's momentum enjoys accelerated convergence under similar settings and assumptions. In this work, we consider a new class of objective functions, where only a subset of the parameters satisfies strong convexity, and show that Nesterov's momentum achieves acceleration in theory for this objective class. We provide two realizations of the problem class, one of which is deep ReLU networks, making this, to the best of our knowledge, the first work to prove an accelerated convergence rate for non-trivial neural network architectures.
- Published
- 2023