Author: "Allen Zhu" - Searchworks@Jio Institute Digital Library Search Results

Search keyword "Allen Zhu": 369 results in total

1. Epitranscriptomic reader YTHDF2 regulates SEK1(MAP2K4)-JNK-cJUN inflammatory signaling in astrocytes during neurotoxic stress

Author: Emir Malovic, Alyssa Ealy, Cameron Miller, Ahyoung Jang, Phillip J. Hsu, Souvarish Sarkar, Dharmin Rokad, Cody Goeser, Aleah Kristen Hartman, Allen Zhu, Bharathi Palanisamy, Gary Zenitsky, Huajun Jin, Vellareddy Anantharam, Arthi Kanthasamy, Chuan He, and Anumantha G. Kanthasamy
Subjects: Molecular mechanism of gene regulation, Molecular network, Neurotoxicology, Cell biology, Transcriptomics, Science
Abstract: As the most abundant glial cells in the central nervous system (CNS), astrocytes dynamically respond to neurotoxic stress; however, the key molecular regulators controlling the inflammatory status of these sentinels during neurotoxic stress are many and complex. Herein, we demonstrate that the m6A epitranscriptomic mRNA modification tightly regulates the pro-inflammatory functions of astrocytes. Specifically, the astrocytic neurotoxic stressor, manganese (Mn), downregulated the m6A reader YTHDF2 in human and mouse astrocyte cultures and in the mouse brain. Functionally, YTHDF2 knockdown augmented, while its overexpression dampened, the neurotoxic stress-induced proinflammatory response, suggesting YTHDF2 serves as a key upstream regulator of inflammatory responses in astrocytes. Mechanistically, YTHDF2 RIP-sequencing identified MAP2K4 (MKK4; SEK1) mRNA as a YTHDF2 target influencing inflammatory signaling. Our target validation revealed that Mn-exposed astrocytes mediate proinflammatory responses by activating the phosphorylation of SEK1, JNK, and cJUN signaling. Collectively, YTHDF2 serves as a key upstream ‘molecular switch’ controlling SEK1(MAP2K4)-JNK-cJUN proinflammatory signaling in astrocytes.
Published: 2024

2. REPIC: a database for exploring the N6-methyladenosine methylome

Author: Shun Liu, Allen Zhu, Chuan He, and Mengjie Chen
Subjects: m6A modification, Database, Tissue specificity, Genome browser, Biology (General), QH301-705.5, Genetics, QH426-470
Abstract: The REPIC (RNA EPItranscriptome Collection) database records about 10 million peaks called from publicly available m6A-seq and MeRIP-seq data using our unified pipeline. These data were collected from 672 samples of 49 studies, covering 61 cell lines or tissues in 11 organisms. REPIC allows users to query N6-methyladenosine (m6A) modification sites by specific cell lines or tissue types. In addition, it integrates m6A/MeRIP-seq data with 1418 histone ChIP-seq and 118 DNase-seq data tracks from the ENCODE project in a modern genome browser to present a comprehensive atlas of m6A methylation sites, histone modification sites, and chromatin accessibility regions. REPIC is accessible at https://repicmod.uchicago.edu/repic .
Published: 2020

3. RADAR: differential analysis of MeRIP-seq data with a random effect model

Author: Zijie Zhang, Qi Zhan, Mark Eckert, Allen Zhu, Agnieszka Chryplewicz, Dario F. De Jesus, Decheng Ren, Rohit N. Kulkarni, Ernst Lengyel, Chuan He, and Mengjie Chen
Subjects: N6-adenosine methylation (m6A), Differential methylation, MeRIP-seq, Biology (General), QH301-705.5, Genetics, QH426-470
Abstract: Epitranscriptome profiling using MeRIP-seq is a powerful technique for in vivo functional studies of reversible RNA modifications. We develop RADAR, a comprehensive analytical tool for detecting differentially methylated loci in MeRIP-seq data. RADAR enables accurate identification of altered methylation sites by accommodating variability of pre-immunoprecipitation expression level and post-immunoprecipitation count using different strategies. In addition, it is compatible with complex study design when covariates need to be incorporated in the analysis. Through simulation and real dataset analyses, we show that RADAR leads to more accurate and reproducible differential methylation analysis results than alternatives. RADAR is available at https://github.com/scottzijiezhang/RADAR.
Published: 2019

4. Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems

Author: Ye, Tian, Xu, Zicheng, Li, Yuanzhi, and Allen-Zhu, Zeyuan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language models have demonstrated remarkable performance in solving reasoning tasks; however, even the strongest models still occasionally make reasoning mistakes. Recently, there has been active research aimed at improving reasoning accuracy, particularly by using pretrained language models to "self-correct" their mistakes via multi-round prompting. In this paper, we follow this line of work but focus on understanding the usefulness of incorporating "error-correction" data directly into the pretraining stage. This data consists of erroneous solution steps immediately followed by their corrections. Using a synthetic math dataset, we show promising results: this type of pretrain data can help language models achieve higher reasoning accuracy directly (i.e., through simple auto-regression, without multi-round prompting) compared to pretraining on the same amount of error-free data. We also delve into many details, such as (1) how this approach differs from beam search, (2) how such data can be prepared, (3) whether masking is needed on the erroneous tokens, (4) the amount of error required, (5) whether such data can be deferred to the fine-tuning stage, and many others.
Comment: arXiv admin note: text overlap with arXiv:2407.20311
Published: 2024

5. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process

Author: Ye, Tian, Xu, Zicheng, Li, Yuanzhi, and Allen-Zhu, Zeyuan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Recent advances in language models have demonstrated their capability to solve mathematical reasoning problems, achieving near-perfect accuracy on grade-school level math benchmarks like GSM8K. In this paper, we formally study how language models solve these problems. We design a series of controlled experiments to address several fundamental questions: (1) Can language models truly develop reasoning skills, or do they simply memorize templates? (2) What is the model's hidden (mental) reasoning process? (3) Do models solve math questions using skills similar to or different from humans? (4) Do models trained on GSM8K-like datasets develop reasoning skills beyond those necessary for solving GSM8K problems? (5) What mental process causes models to make reasoning mistakes? (6) How large or deep must a model be to effectively solve GSM8K-level math questions? Our study uncovers many hidden mechanisms by which language models solve mathematical questions, providing insights that extend beyond current understandings of LLMs.
Comment: video appeared in ICML 2024 tutorial
Published: 2024

6. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Scaling laws describe the relationship between the size of language models and their capabilities. Unlike prior studies that evaluate a model's capability via loss or benchmarks, we estimate the number of knowledge bits a model stores. We focus on factual knowledge represented as tuples, such as (USA, capital, Washington D.C.) from a Wikipedia page. Through multiple controlled datasets, we establish that language models can and only can store 2 bits of knowledge per parameter, even when quantized to int8, and such knowledge can be flexibly extracted for downstream applications. Consequently, a 7B model can store 14B bits of knowledge, surpassing the English Wikipedia and textbooks combined based on our estimation. More broadly, we present 12 results on how (1) training duration, (2) model architecture, (3) quantization, (4) sparsity constraints such as MoE, and (5) data signal-to-noise ratio affect a model's knowledge storage capacity. Notable insights include: * The GPT-2 architecture, with rotary embedding, matches or even surpasses LLaMA/Mistral architectures in knowledge storage, particularly over shorter training durations. This arises because LLaMA/Mistral uses GatedMLP, which is less stable and harder to train. * Prepending training data with domain names (e.g., wikipedia.org) significantly increases a model's knowledge capacity. Language models can autonomously identify and prioritize domains rich in knowledge, optimizing their storage capacity.
Published: 2024
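
The capacity arithmetic quoted in the abstract above can be checked with a short sketch. It relies only on the 2-bits-per-parameter figure stated there; the function and constant names are hypothetical and do not come from the paper.

```python
# Figure quoted in the abstract: about 2 bits of knowledge per parameter.
BITS_PER_PARAM = 2


def knowledge_capacity_bits(num_params: int, bits_per_param: int = BITS_PER_PARAM) -> int:
    """Estimated knowledge capacity (in bits) for a model with num_params parameters."""
    return num_params * bits_per_param


# A 7B-parameter model, as in the abstract's example.
capacity_7b = knowledge_capacity_bits(7_000_000_000)
print(f"{capacity_7b:,} bits")
```

For a 7B-parameter model this reproduces the 14B-bit estimate given in the abstract.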

7. Reverse Training to Nurse the Reversal Curse

Author: Golovneva, Olga, Allen-Zhu, Zeyuan, Weston, Jason, and Sukhbaatar, Sainbayar
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have a surprising failure: when trained on "A has a feature B", they do not generalize to "B is a feature of A", which is termed the Reversal Curse. Even when training with trillions of tokens, this issue still appears due to Zipf's law, and hence would persist even if we trained on the entire internet. This work proposes an alternative training scheme, called reverse training, whereby all words are used twice, doubling the amount of available tokens. The LLM is trained in both forward and reverse directions by reversing the training strings while preserving (i.e., not reversing) chosen substrings, such as entities. We show that data-matched reverse-trained models provide superior performance to standard models on standard tasks, and compute-matched reverse-trained models provide far superior performance on reversal tasks, helping resolve the reversal curse issue.
Published: 2024
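
The string transformation described in the abstract above (reversing a training string while leaving chosen substrings such as entities intact) might look roughly like the following sketch. Word-level reversal, whitespace tokenization, and the function name `reverse_preserving` are assumptions made for illustration; the abstract does not specify the granularity or the entity-detection method the authors use.

```python
def reverse_preserving(text: str, entities: list[str]) -> str:
    """Reverse the word order of text, keeping each listed entity's words in order.

    Hypothetical helper: entities are matched as exact whitespace-separated
    word sequences, longest entity first.
    """
    words = text.split()
    # Try longer entities first so that overlapping candidates resolve predictably.
    entity_words = sorted((e.split() for e in entities), key=len, reverse=True)
    chunks = []
    i = 0
    while i < len(words):
        for ent in entity_words:
            if ent and words[i:i + len(ent)] == ent:
                chunks.append(" ".join(ent))  # keep the entity as one unit
                i += len(ent)
                break
        else:
            chunks.append(words[i])  # ordinary word: its own unit
            i += 1
    return " ".join(reversed(chunks))


forward = "Alice Smith lives in New York"
backward = reverse_preserving(forward, ["Alice Smith", "New York"])
print(backward)  # New York in lives Alice Smith
```

Under this sketch, a training set would contain both `forward` and `backward`, so every word is seen twice, once in each direction.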

8. Egr2 overexpression in Schwann cells increases myelination frequency in vitro

Author: Markus Tammia, Ruifa Mi, Valentin M. Sluch, Allen Zhu, Tiffany Chung, Daniel Shinn, Donald J. Zack, Ahmet Höke, and Hai-Quan Mao
Subjects: Neuroscience, Cell biology, Science (General), Q1-390, Social sciences (General), H1-99
Abstract: Schwann cells are key players in peripheral nerve regeneration, and are uniquely capable of remyelinating axons in this context. Schwann cells orchestrate this process via a set of transcription factors. While it has been shown that overexpression of specific genes, e.g. Egr2, upregulates myelin-related transcripts, it remains unknown if such manipulation can functionalize the cells and enhance their myelination frequency. The ability to do so could have implications in the use of human stem cell-derived Schwann cells, where myelination is hard to achieve. After screening four candidate transcription factors (Sox10, Oct6, Brn2 and Egr2), we found that overexpression of Egr2 in rat Schwann cells co-cultured with sensory neurons enhanced myelination frequency and reduced cell proliferation. However, in a mouse model of sciatic nerve repair with cells engrafted within a nerve guide, myelination frequency in the engrafted cells was reduced upon Egr2 overexpression. Our results show that while overexpression of Egr2 can enhance the myelination frequency in vitro, it is context-dependent, potentially influenced by the microenvironment, timing of association with axons, expression level, species differences, or other factors.
Published: 2018

9. Carbon trading, co-pollutants, and environmental equity: Evidence from California's cap-and-trade program (2011-2015).

Author: Lara Cushing, Dan Blaustein-Rejto, Madeline Wander, Manuel Pastor, James Sadd, Allen Zhu, and Rachel Morello-Frosch
Subjects: Medicine
Abstract: BACKGROUND: Policies to mitigate climate change by reducing greenhouse gas (GHG) emissions can yield public health benefits by also reducing emissions of hazardous co-pollutants, such as air toxics and particulate matter. Socioeconomically disadvantaged communities are typically disproportionately exposed to air pollutants, and therefore climate policy could also potentially reduce these environmental inequities. We sought to explore potential social disparities in GHG and co-pollutant emissions under an existing carbon trading program, the dominant approach to GHG regulation in the US and globally. METHODS AND FINDINGS: We examined the relationship between multiple measures of neighborhood disadvantage and the location of GHG and co-pollutant emissions from facilities regulated under California's cap-and-trade program, the world's fourth largest operational carbon trading program. We examined temporal patterns in annual average emissions of GHGs, particulate matter (PM2.5), nitrogen oxides, sulfur oxides, volatile organic compounds, and air toxics before (January 1, 2011 to December 31, 2012) and after (January 1, 2013 to December 31, 2015) the initiation of carbon trading. We found that facilities regulated under California's cap-and-trade program are disproportionately located in economically disadvantaged neighborhoods with higher proportions of residents of color, and that the quantities of co-pollutant emissions from these facilities were correlated with GHG emissions through time. Moreover, the majority (52%) of regulated facilities reported higher annual average local (in-state) GHG emissions since the initiation of trading. Neighborhoods that experienced increases in annual average GHG and co-pollutant emissions from regulated facilities nearby after trading began had higher proportions of people of color and poor, less educated, and linguistically isolated residents, compared to neighborhoods that experienced decreases in GHGs. These study results reflect preliminary emissions and social equity patterns of the first 3 years of California's cap-and-trade program for which data are available. Due to data limitations, this analysis did not assess the emissions and equity implications of GHG reductions from transportation-related emission sources. Future emission patterns may shift, due to changes in industrial production decisions and policy initiatives that further incentivize local GHG and co-pollutant reductions in disadvantaged communities. CONCLUSIONS: To our knowledge, this is the first study to examine social disparities in GHG and co-pollutant emissions under an existing carbon trading program. Our results indicate that, thus far, California's cap-and-trade program has not yielded improvements in environmental equity with respect to health-damaging co-pollutant emissions. This could change, however, as the cap on GHG emissions is gradually lowered in the future. The incorporation of additional policy and regulatory elements that incentivize more local emission reductions in disadvantaged communities could enhance the local air quality and environmental equity benefits of California's climate change mitigation efforts.
Published: 2018

10. Physics of Language Models: Part 3.2, Knowledge Manipulation

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language models can store vast factual knowledge, yet their ability to flexibly use this knowledge for downstream tasks (e.g., via instruction finetuning) remains questionable. This paper investigates four fundamental knowledge manipulation tasks: retrieval (e.g., "What is person A's attribute X?"), classification (e.g., "Is A's attribute X even or odd?"), comparison (e.g., "Is A greater than B in attribute X?"), and inverse search (e.g., "Which person's attribute X equals T?"). We show that language models excel in knowledge retrieval but struggle even in the simplest classification or comparison tasks unless Chain of Thoughts (CoTs) are employed during both training and inference. Moreover, their performance in inverse knowledge search is virtually 0%, regardless of the prompts. Our primary contribution is a controlled, synthetic experiment that confirms these weaknesses are inherent to language models: they cannot efficiently manipulate knowledge from pre-training data, even when such knowledge is perfectly stored in the models, despite adequate training and sufficient model size. Our findings also apply to modern pretrained language models such as GPT-4, thus giving rise to many Turing tests to distinguish Humans from contemporary AIs.
Comment: V2 polishes writing and includes additional Llama/Mistral experiments and larger data; but the conclusions remain unchanged
Published: 2023

11. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) can store a vast amount of world knowledge, often extractable via question-answering (e.g., "What is Abraham Lincoln's birthday?"). However, do they answer such questions based on exposure to similar questions during training (i.e., cheating), or by genuinely learning to extract knowledge from sources like Wikipedia? In this paper, we investigate this issue using a controlled biography dataset. We find a strong correlation between the model's ability to extract knowledge and various diversity measures of the training data. Essentially, for knowledge to be reliably extracted, it must be sufficiently augmented (e.g., through paraphrasing, sentence shuffling, translations) during pretraining. Without such augmentation, knowledge may be memorized but not extractable, leading to 0% accuracy, regardless of subsequent instruction fine-tuning. To understand why this occurs, we employ (nearly) linear probing to demonstrate a strong connection between the observed correlation and how the model internally encodes knowledge -- whether it is linearly encoded in the hidden embeddings of entity names or distributed across other token embeddings in the training text. This paper provides several key recommendations for LLM pretraining in the industry: (1) rewrite the pretraining data -- using small, auxiliary models -- to provide knowledge augmentation, and (2) incorporate more instruction-finetuning data into the pretraining stage before it becomes too late.
Comment: V2 polishes writing + fixes author name; V3 includes additional Llama experiments and writing improvements
Published: 2023

12. SALSA VERDE: a machine learning attack on Learning With Errors with sparse small secrets

Author: Li, Cathy Yuanchen, Wenger, Emily, Allen-Zhu, Zeyuan, Charton, Francois, and Lauter, Kristin
Subjects: Computer Science - Cryptography and Security
Abstract: Learning with Errors (LWE) is a hard math problem used in post-quantum cryptography. Homomorphic Encryption (HE) schemes rely on the hardness of the LWE problem for their security, and two LWE-based cryptosystems were recently standardized by NIST for digital signatures and key exchange (KEM). Thus, it is critical to continue assessing the security of LWE and specific parameter choices. For example, HE uses secrets with small entries, and the HE community has considered standardizing small sparse secrets to improve efficiency and functionality. However, prior work, SALSA and PICANTE, showed that ML attacks can recover sparse binary secrets. Building on these, we propose VERDE, an improved ML attack that can recover sparse binary, ternary, and narrow Gaussian secrets. Using improved preprocessing and secret recovery techniques, VERDE can attack LWE with larger dimensions ($n=512$) and smaller moduli ($\log_2 q=12$ for $n=256$), using less time and power. We propose novel architectures for scaling. Finally, we develop a theory that explains the success of ML LWE attacks.
Comment: 18 pages, accepted to NeurIPS 2023
Published: 2023

13. Physics of Language Models: Part 1, Learning Hierarchical Language Structures

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Transformer-based language models are effective but complex, and understanding their inner workings is a significant challenge. Previous research has primarily explored how these models handle simple tasks like name copying or selection, and we extend this by investigating how these models grasp complex, recursive language structures defined by context-free grammars (CFGs). We introduce a family of synthetic CFGs that produce hierarchical rules, capable of generating lengthy sentences (e.g., hundreds of tokens) that are locally ambiguous and require dynamic programming to parse. Despite this complexity, we demonstrate that generative models like GPT can accurately learn this CFG language and generate sentences based on it. We explore the model's internals, revealing that its hidden states precisely capture the structure of CFGs, and its attention patterns resemble the information passing in a dynamic programming algorithm. This paper also presents several corollaries, including showing why positional embedding is inferior to relative attention or rotary embedding; demonstrating that encoder-based models (e.g., BERT, deBERTa) cannot learn very deeply nested CFGs as effectively as generative models (e.g., GPT); and highlighting the necessity of adding structural and syntactic errors to the pretraining data to make the model more robust to corrupted language prefixes.
Comment: V2+V3 polishes writing; V3 includes Figures 6 and 10 for better illustrations of our results
Published: 2023

14. LoRA: Low-Rank Adaptation of Large Language Models

Author: Hu, Edward J., Shen, Yelong, Wallis, Phillip, Allen-Zhu, Zeyuan, Li, Yuanzhi, Wang, Shean, Wang, Lu, and Chen, Weizhu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: An important paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, full fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example -- deploying independent instances of fine-tuned models, each with 175B parameters, is prohibitively expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. Compared to GPT-3 175B fine-tuned with Adam, LoRA can reduce the number of trainable parameters by 10,000 times and the GPU memory requirement by 3 times. LoRA performs on-par or better than fine-tuning in model quality on RoBERTa, DeBERTa, GPT-2, and GPT-3, despite having fewer trainable parameters, a higher training throughput, and, unlike adapters, no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptation, which sheds light on the efficacy of LoRA. We release a package that facilitates the integration of LoRA with PyTorch models and provide our implementations and model checkpoints for RoBERTa, DeBERTa, and GPT-2 at https://github.com/microsoft/LoRA.
Comment: Draft V2 includes better baselines, experiments on GLUE, and more on adapter latency
Published: 2021
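
The idea summarized in the abstract above (freeze the pre-trained weight matrix and train only a low-rank update) can be illustrated with a minimal pure-Python sketch. This is not the released package's API: the class name `LoRALinear`, the helper `matvec`, and the tiny dimensions are hypothetical, and the initialization shown (one factor zero, the other small random values) is stated here as an assumption rather than taken from the abstract.

```python
import random


def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


class LoRALinear:
    """Frozen weight W0 (d_out x d_in) plus a trainable low-rank update B @ A."""

    def __init__(self, W0, rank, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(W0), len(W0[0])
        self.W0 = W0  # pre-trained weights: kept frozen
        # Assumed initialization: A small random, B zero, so the update starts at 0.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def forward(self, x):
        base = matvec(self.W0, x)                    # W0 x
        update = matvec(self.B, matvec(self.A, x))   # B (A x)
        return [b + u for b, u in zip(base, update)]

    def trainable_params(self):
        # Only A (rank x d_in) and B (d_out x rank) would be trained.
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


layer = LoRALinear([[1, 2, 3], [4, 5, 6]], rank=1)
print(layer.forward([1, 1, 1]))   # equals W0 x while B is still zero
print(layer.trainable_params())   # 5 trainable values vs. 6 in W0
```

At this toy size the saving is negligible; the reduction becomes large only when the rank is much smaller than both matrix dimensions, since the trainable count grows as rank × (d_in + d_out) rather than d_in × d_out.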

15. Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Generative adversarial networks (GANs) are among the most successful models for learning high-complexity, real-world distributions. However, in theory, due to the highly non-convex, non-concave landscape of the minmax training objective, GAN remains one of the least understood deep learning models. In this work, we formally study how GANs can efficiently learn certain hierarchically generated distributions that are close to the distribution of real-life images. We prove that when a distribution has a structure that we refer to as Forward Super-Resolution, then simply training generative adversarial networks using stochastic gradient descent ascent (SGDA) can learn this distribution efficiently, both in sample and time complexities. We also provide empirical evidence that our assumption "forward super-resolution" is very natural in practice, and that the underlying learning mechanisms we study in this paper (which allow us to efficiently train GANs via SGDA in theory) simulate the actual learning process of GANs on real-world problems.
Comment: v2 polishes writing
Published: 2021

16. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2024

17. Byzantine-Resilient Non-Convex Stochastic Gradient Descent

Author: Allen-Zhu, Zeyuan, Ebrahimian, Faeze, Li, Jerry, and Alistarh, Dan
Subjects: Computer Science - Machine Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control
Abstract: We study adversary-resilient stochastic distributed optimization, in which $m$ machines can independently compute stochastic gradients, and cooperate to jointly optimize over their local objective functions. However, an $\alpha$-fraction of the machines are Byzantine, in that they may behave in arbitrary, adversarial ways. We consider a variant of this procedure in the challenging non-convex case. Our main result is a new algorithm SafeguardSGD which can provably escape saddle points and find approximate local minima of the non-convex objective. The algorithm is based on a new concentration filtering technique, and its sample and time complexity bounds match the best known theoretical bounds in the stochastic, distributed setting when no Byzantine machines are present. Our algorithm is very practical: it improves upon the performance of all prior methods when training deep neural networks, it is relatively lightweight, and it is the first method to withstand two recently-proposed Byzantine attacks.
Comment: V1.5 polishes writing and V2 rewrites the experiments
Published: 2020

18. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We show that ensemble/knowledge distillation in Deep Learning works very differently from traditional learning theory (such as boosting or NTKs, neural tangent kernels). To properly understand them, we develop a theory showing that when data has a structure we refer to as "multi-view", then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the "dark knowledge" is hidden in the outputs of the ensemble and can be used in distillation. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
Comment: v2/V3 polishes writing
Published: 2020

19. Feature Purification: How Adversarial Training Performs Robust Deep Learning

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
Comment: v2 and V3 polish writing and experiments, V4 adds experiments showing that adversarial training can be done through low-rank updates
Published: 2020

20. Physics of Language Models: Part 2.2, How to Learn From Mistakes on Grade-School Math Problems.

Author: Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu
Published: 2024

21. Physics of Language Models: Part 2.1, Grade-School Math and the Hidden Reasoning Process.

Author: Tian Ye, Zicheng Xu, Yuanzhi Li, and Zeyuan Allen-Zhu
Published: 2024

22. Reverse Training to Nurse the Reversal Curse.

Author: Olga Golovneva, Zeyuan Allen-Zhu, Jason Weston, and Sainbayar Sukhbaatar
Published: 2024

23. Physics of Language Models: Part 3.3, Knowledge Capacity Scaling Laws.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2024

24. Backward Feature Correction: How Deep Learning Performs Deep (Hierarchical) Learning

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Deep learning is also known as hierarchical learning, where the learner _learns_ to represent a complicated target function by decomposing it into a sequence of simpler functions to reduce sample and time complexity. This paper formally analyzes how multi-layer neural networks can perform such hierarchical learning _efficiently_ and _automatically_ by SGD on the training objective. On the conceptual side, we present a theoretical characterization of how certain types of deep (i.e. super-constant layer) neural networks can still be sample and time efficiently trained on some hierarchical tasks, when no existing algorithm (including layerwise training, kernel method, etc) is known to be efficient. We establish a new principle called "backward feature correction", where the errors in the lower-level features can be automatically corrected when training together with the higher-level layers. We believe this is a key behind how deep learning is performing deep (hierarchical) learning, as opposed to layerwise learning or simulating some non-hierarchical method. On the technical side, we show for every input dimension $d > 0$, there is a concept class of degree $\omega(1)$ multi-variate polynomials so that, using $\omega(1)$-layer neural networks as learners, SGD can learn any function from this class in $\mathsf{poly}(d)$ time to any $\frac{1}{\mathsf{poly}(d)}$ error, through learning to represent it as a composition of $\omega(1)$ layers of quadratic functions using "backward feature correction." In contrast, we do not know any other simpler algorithm (including layerwise training, applying kernel method sequentially, training a two-layer network, etc) that can learn this concept class in $\mathsf{poly}(d)$ time even to any $d^{-0.01}$ error. As a side result, we prove $d^{\omega(1)}$ lower bounds for several non-hierarchical learners, including any kernel methods., Comment: V2 adds more experiments, V3 polishes writing and improves experiments, V4 makes minor fixes to the figures, V5/V6 polish writing
Published: 2020

25. What Can ResNet Learn Efficiently, Going Beyond Kernels?

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: How can neural networks such as ResNet efficiently learn CIFAR-10 with test accuracy more than 96%, while other methods, especially kernel methods, fall relatively behind? Can we provide more theoretical justifications for this gap? Recently, there is an influential line of work relating neural networks to kernels in the over-parameterized regime, proving they can learn certain concept class that is also learnable by kernels with similar test error. Yet, can neural networks provably learn some concept class BETTER than kernels? We answer this positively in the distribution-free setting. We prove neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption. At the same time, we prove there are simple functions in this class such that with the same number of training examples, the test error obtained by neural networks can be MUCH SMALLER than ANY kernel method, including neural tangent kernels (NTK). The main intuition is that multi-layer neural networks can implicitly perform hierarchical learning using different layers, which reduces the sample complexity compared to "one-shot" learning algorithms such as kernel methods. In a follow-up work [2], this theory of hierarchical learning is further strengthened to incorporate the "backward feature correction" process when training deep networks. In the end, we also prove a computation complexity advantage of ResNet with respect to other learning methods including linear regression over arbitrary feature mappings., Comment: V2 slightly improves lower bound, V3 strengthens experiments and adds citation to "backward feature correction" which is an even stronger form of hierarchical learning [2]
Published: 2019

26. Can SGD Learn Recurrent Neural Networks with Provable Generalization?

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Recurrent Neural Networks (RNNs) are among the most popular models in sequential data analysis. Yet, in the foundational PAC learning language, what concept class can it learn? Moreover, how can the same recurrent unit simultaneously learn functions from different input tokens to different output tokens, without affecting each other? Existing generalization bounds for RNN scale exponentially with the input length, significantly limiting their practical implications. In this paper, we show using the vanilla stochastic gradient descent (SGD), RNN can actually learn some notable concept class efficiently, meaning that both time and sample complexity scale polynomially in the input length (or almost polynomially, depending on the concept). This concept class at least includes functions where each output token is generated from inputs of earlier tokens using a smooth two-layer neural network., Comment: V2 polishes writing
Published: 2019

27. The Lingering of Gradients: Theory and Applications

Author: Allen-Zhu, Zeyuan, Simchi-Levi, David, and Wang, Xinshang
Subjects: Mathematics - Optimization and Control, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the 'lingering' of gradients: once a gradient is computed at $x_k$, the additional time to compute gradients at $x_{k+1},x_{k+2},\dots$ may be reduced. We show how this improves the running time of several first-order methods. For instance, if the 'additional time' scales linearly with respect to the traveled distance, then the 'convergence rate' of gradient descent can be improved from $1/T$ to $\exp(-T^{1/3})$. On the application side, we solve a hypothetical revenue management problem on the Yahoo! Front Page Today Module with 4.6m users to $10^{-6}$ error using only 6 passes of the dataset; and solve a real-life support vector machine problem to an accuracy that is two orders of magnitude better compared to the state-of-the-art algorithm.
Published: 2019

28. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Author: Allen-Zhu, Zeyuan, Li, Yuanzhi, and Liang, Yingyu
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: The fundamental learning theory behind neural networks remains largely open. What classes of functions can neural networks actually learn? Why doesn't the trained network overfit when it is overparameterized? In this work, we prove that overparameterized neural networks can learn some notable concept classes, including two and three-layer networks with fewer parameters and smooth activations. Moreover, the learning can be simply done by SGD (stochastic gradient descent) or its variants in polynomial time using polynomially many samples. The sample complexity can also be almost independent of the number of parameters in the network. On the technique side, our analysis goes beyond the so-called NTK (neural tangent kernel) linearization of neural networks in prior works. We establish a new notion of quadratic approximation of the neural network (that can be viewed as a second-order variant of NTK), and connect it to the SGD theory of escaping saddle points., Comment: V1/V2/V3/V4 polish writing, V5 adds experiments, V6 reflects our camera ready version
Published: 2018

29. A Convergence Theory for Deep Learning via Over-Parameterization

Author: Allen-Zhu, Zeyuan, Li, Yuanzhi, and Song, Zhao
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet)., Comment: V2 adds citation and V3/V4/V5 polish writing
Published: 2018

30. On the Convergence Rate of Training Recurrent Neural Networks

Author: Allen-Zhu, Zeyuan, Li, Yuanzhi, and Song, Zhao
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: How can local-search methods such as stochastic gradient descent (SGD) avoid bad local minima in training multi-layer neural networks? Why can they fit random labels even given non-convex and non-smooth architectures? Most existing theory only covers networks with one hidden layer, so can we go deeper? In this paper, we focus on recurrent neural networks (RNNs) which are multi-layer networks widely used in natural language processing. They are harder to analyze than feedforward neural networks, because the $\textit{same}$ recurrent unit is repeatedly applied across the entire time horizon of length $L$, which is analogous to feedforward networks of depth $L$. We show when the number of neurons is sufficiently large, meaning polynomial in the training data size and in $L$, then SGD is capable of minimizing the regression loss in the linear convergence rate. This gives theoretical evidence of how RNNs can memorize data. More importantly, in this paper we build general toolkits to analyze multi-layer networks with ReLU activations. For instance, we prove why ReLU activations can prevent exponential gradient explosion or vanishing, and build a perturbation theory to analyze first-order approximation of multi-layer networks., Comment: V2/V3/V4 polish writing
Published: 2018

31. Is Q-learning Provably Efficient?

Author: Jin, Chi, Allen-Zhu, Zeyuan, Bubeck, Sebastien, and Jordan, Michael I.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Model-free reinforcement learning (RL) algorithms, such as Q-learning, directly parameterize and update value functions or policies without explicitly modeling the environment. They are typically simpler, more flexible to use, and thus more prevalent in modern deep RL than model-based approaches. However, empirical work has suggested that model-free algorithms may require more samples to learn [Deisenroth and Rasmussen 2011, Schulman et al. 2015]. The theoretical question of "whether model-free algorithms can be made sample efficient" is one of the most fundamental questions in RL, and remains unsolved even in the basic scenario with finitely many states and actions. We prove that, in an episodic MDP setting, Q-learning with UCB exploration achieves regret $\tilde{O}(\sqrt{H^3 SAT})$, where $S$ and $A$ are the numbers of states and actions, $H$ is the number of steps per episode, and $T$ is the total number of steps. This sample efficiency matches the optimal regret that can be achieved by any model-based approach, up to a single $\sqrt{H}$ factor. To the best of our knowledge, this is the first analysis in the model-free setting that establishes $\sqrt{T}$ regret without requiring access to a "simulator.", Comment: Best paper in ICML 2018 workshop "Exploration in RL"
Published: 2018

32. Operator Scaling via Geodesically Convex Optimization, Invariant Theory and Polynomial Identity Testing

Author: Allen-Zhu, Zeyuan, Garg, Ankit, Li, Yuanzhi, Oliveira, Rafael, and Wigderson, Avi
Subjects: Computer Science - Data Structures and Algorithms, Computer Science - Computational Complexity, Mathematics - Algebraic Geometry, Mathematics - Optimization and Control
Abstract: We propose a new second-order method for geodesically convex optimization on the natural hyperbolic metric over positive definite matrices. We apply it to solve the operator scaling problem in time polynomial in the input size and logarithmic in the error. This is an exponential improvement over previous algorithms which were analyzed in the usual Euclidean, "commutative" metric (for which the above problem is not convex). Our method is general and applicable to other settings. As a consequence, we solve the equivalence problem for the left-right group action underlying the operator scaling problem. This yields a deterministic polynomial-time algorithm for a new class of Polynomial Identity Testing (PIT) problems, which was the original motivation for studying operator scaling., Comment: abstract to appear in STOC 2018
Published: 2018

33. Byzantine Stochastic Gradient Descent

Author: Alistarh, Dan, Allen-Zhu, Zeyuan, and Li, Jerry
Subjects: Computer Science - Learning, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Data Structures and Algorithms, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: This paper studies the problem of distributed stochastic optimization in an adversarial setting where, out of the $m$ machines which allegedly compute stochastic gradients every iteration, an $\alpha$-fraction are Byzantine, and can behave arbitrarily and adversarially. Our main result is a variant of stochastic gradient descent (SGD) which finds $\varepsilon$-approximate minimizers of convex functions in $T = \tilde{O}\big( \frac{1}{\varepsilon^2 m} + \frac{\alpha^2}{\varepsilon^2} \big)$ iterations. In contrast, traditional mini-batch SGD needs $T = O\big( \frac{1}{\varepsilon^2 m} \big)$ iterations, but cannot tolerate Byzantine failures. Further, we provide a lower bound showing that, up to logarithmic factors, our algorithm is information-theoretically optimal both in terms of sampling complexity and time complexity.
Published: 2018

34. Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2023

35. Forward Super-Resolution: How Can GANs Learn Hierarchical Generative Models for Real-World Distributions.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2023

36. SALSA VERDE: a machine learning attack on LWE with sparse small secrets.

Author: Cathy Yuanchen Li, Emily Wenger, Zeyuan Allen-Zhu, François Charton, and Kristin E. Lauter
Published: 2023

37. Katyusha X: Practical Momentum Method for Stochastic Sum-of-Nonconvex Optimization

Author: Allen-Zhu, Zeyuan
Subjects: Computer Science - Learning, Computer Science - Data Structures and Algorithms, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: The problem of minimizing sum-of-nonconvex functions (i.e., convex functions that are average of non-convex ones) is becoming increasingly important in machine learning, and is the core machinery for PCA, SVD, regularized Newton's method, accelerated non-convex optimization, and more. We show how to provably obtain an accelerated stochastic algorithm for minimizing sum-of-nonconvex functions, by $\textit{adding one additional line}$ to the well-known SVRG method. This line corresponds to momentum, and shows how to directly apply momentum to the finite-sum stochastic minimization of sum-of-nonconvex functions. As a side result, our method enjoys linear parallel speed-up using mini-batch.
Published: 2018

38. Make the Minority Great Again: First-Order Regret Bound for Contextual Bandits

Author: Allen-Zhu, Zeyuan, Bubeck, Sébastien, and Li, Yuanzhi
Subjects: Computer Science - Learning
Abstract: Regret bounds in online learning compare the player's performance to $L^*$, the optimal performance in hindsight with a fixed strategy. Typically such bounds scale with the square root of the time horizon $T$. The more refined concept of first-order regret bound replaces this with a scaling $\sqrt{L^*}$, which may be much smaller than $\sqrt{T}$. It is well known that minor variants of standard algorithms satisfy first-order regret bounds in the full information and multi-armed bandit settings. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire raised the issue that existing techniques do not seem sufficient to obtain first-order regret bounds for the contextual bandit problem. In the present paper, we resolve this open problem by presenting a new strategy based on augmenting the policy space., Comment: 15 pages
Published: 2018

39. How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD

Author: Allen-Zhu, Zeyuan
Subjects: Computer Science - Machine Learning, Computer Science - Data Structures and Algorithms, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Stochastic gradient descent (SGD) gives an optimal convergence rate when minimizing convex stochastic objectives $f(x)$. However, in terms of making the gradients small, the original SGD does not give an optimal rate, even when $f(x)$ is convex. If $f(x)$ is convex, to find a point with gradient norm $\varepsilon$, we design an algorithm SGD3 with a near-optimal rate $\tilde{O}(\varepsilon^{-2})$, improving the best known rate $O(\varepsilon^{-8/3})$ of [18]. If $f(x)$ is nonconvex, to find its $\varepsilon$-approximate local minimum, we design an algorithm SGD5 with rate $\tilde{O}(\varepsilon^{-3.5})$, where previously SGD variants only achieve $\tilde{O}(\varepsilon^{-4})$ [6, 15, 33]. This is no slower than the best known stochastic version of Newton's method in all parameter regimes [30]., Comment: V2 added two applications to nonconvex stochastic optimization, and V3 corrects a citation. arXiv admin note: text overlap with arXiv:1708.08694
Published: 2018

40. Neon2: Finding Local Minima via First-Order Oracles

Author: Allen-Zhu, Zeyuan and Li, Yuanzhi
Subjects: Computer Science - Learning, Computer Science - Data Structures and Algorithms, Computer Science - Neural and Evolutionary Computing, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We propose a reduction for non-convex optimization that can (1) turn a stationary-point finding algorithm into a local-minimum finding one, and (2) replace the Hessian-vector product computations with only gradient computations. It works both in the stochastic and the deterministic settings, without hurting the algorithm's performance. As applications, our reduction turns Natasha2 into a first-order method without hurting its performance. It also converts SGD, GD, SCSG, and SVRG into algorithms finding approximate local minima, outperforming some best known results., Comment: version 2 and 3 improve writing
Published: 2017

41. Near-Optimal Discrete Optimization for Experimental Design: A Regret Minimization Approach

Author: Allen-Zhu, Zeyuan, Li, Yuanzhi, Singh, Aarti, and Wang, Yining
Subjects: Statistics - Machine Learning, Computer Science - Learning, Statistics - Computation
Abstract: The experimental design problem concerns the selection of k points from a potentially large design pool of p-dimensional vectors, so as to maximize the statistical efficiency regressed on the selected k design points. Statistical efficiency is measured by optimality criteria, including A(verage), D(eterminant), T(race), E(igen), V(ariance) and G-optimality. Except for the T-optimality, exact optimization is NP-hard. We propose a polynomial-time regret minimization framework to achieve a $(1+\varepsilon)$ approximation with only $O(p/\varepsilon^2)$ design points, for all the optimality criteria above. In contrast, to the best of our knowledge, before our work, no polynomial-time algorithm achieves $(1+\varepsilon)$ approximations for D/E/G-optimality, and the best poly-time algorithm achieving $(1+\varepsilon)$-approximation for A/V-optimality requires $k = \Omega(p^2/\varepsilon)$ design points., Comment: 33 pages, 4 tables. A preliminary version of this paper titled "Near-Optimal Experimental Design via Regret Minimization" with weaker results appeared in the Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney
Published: 2017

42. Natasha 2: Faster Non-Convex Optimization Than SGD

Author: Allen-Zhu, Zeyuan
Subjects: Mathematics - Optimization and Control, Computer Science - Data Structures and Algorithms, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing, Statistics - Machine Learning
Abstract: We design a stochastic algorithm to train any smooth neural network to $\varepsilon$-approximate local minima, using $O(\varepsilon^{-3.25})$ backpropagations. The best result was essentially $O(\varepsilon^{-4})$ by SGD. More broadly, it finds $\varepsilon$-approximate local minima of any smooth nonconvex function in rate $O(\varepsilon^{-3.25})$, with only oracle access to stochastic gradients., Comment: V2 and V3 polished writing; V4 was a deep revision and simplified proofs
Published: 2017

43. Linear Convergence of a Frank-Wolfe Type Algorithm over Trace-Norm Balls

Author: Allen-Zhu, Zeyuan, Hazan, Elad, Hu, Wei, and Li, Yuanzhi
Subjects: Computer Science - Learning, Computer Science - Data Structures and Algorithms, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: We propose a rank-$k$ variant of the classical Frank-Wolfe algorithm to solve convex optimization over a trace-norm ball. Our algorithm replaces the top singular-vector computation ($1$-SVD) in Frank-Wolfe with a top-$k$ singular-vector computation ($k$-SVD), which can be done by repeatedly applying $1$-SVD $k$ times. Alternatively, our algorithm can be viewed as a rank-$k$ restricted version of projected gradient descent. We show that our algorithm has a linear convergence rate when the objective function is smooth and strongly convex, and the optimal solution has rank at most $k$. This improves the convergence rate and the total time complexity of the Frank-Wolfe method and its variants., Comment: In NIPS 2017
Published: 2017

44. Much Faster Algorithms for Matrix Scaling

Author: Allen-Zhu, Zeyuan, Li, Yuanzhi, Oliveira, Rafael, and Wigderson, Avi
Subjects: Computer Science - Data Structures and Algorithms, Mathematics - Combinatorics, Mathematics - Optimization and Control
Abstract: We develop several efficient algorithms for the classical \emph{Matrix Scaling} problem, which is used in many diverse areas, from preconditioning linear systems to approximation of the permanent. On an input $n\times n$ matrix $A$, this problem asks to find diagonal (scaling) matrices $X$ and $Y$ (if they exist), so that $X A Y$ $\varepsilon$-approximates a doubly stochastic, or more generally a matrix with prescribed row and column sums. We address the general scaling problem as well as some important special cases. In particular, if $A$ has $m$ nonzero entries, and if there exist $X$ and $Y$ with polynomially large entries such that $X A Y$ is doubly stochastic, then we can solve the problem in total complexity $\tilde{O}(m + n^{4/3})$. This greatly improves on the best known previous results, which were either $\tilde{O}(n^4)$ or $O(m n^{1/2}/\varepsilon)$. Our algorithms are based on tailor-made first and second order techniques, combined with other recent advances in continuous optimization, which may be of independent interest for solving similar problems.
Published: 2017

45. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction.

Author: Zeyuan Allen Zhu and Yuanzhi Li
Published: 2023
Full Text: View/download PDF

46. Physics of Language Models: Part 3.2, Knowledge Manipulation.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2023
Full Text: View/download PDF

47. SALSA VERDE: a machine learning attack on Learning With Errors with sparse small secrets.

Author: Cathy Yuanchen Li, Jana Sotáková, Emily Wenger, Zeyuan Allen-Zhu, François Charton, and Kristin E. Lauter
Published: 2023
Full Text: View/download PDF

48. Physics of Language Models: Part 1, Context-Free Grammar.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2023
Full Text: View/download PDF

49. Feature Purification: How Adversarial Training Performs Robust Deep Learning.

Author: Zeyuan Allen-Zhu and Yuanzhi Li
Published: 2021
Full Text: View/download PDF

50. Natasha: Faster Non-Convex Stochastic Optimization Via Strongly Non-Convex Parameter

Author: Allen-Zhu, Zeyuan
Subjects: Mathematics - Optimization and Control, Computer Science - Data Structures and Algorithms, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Given a nonconvex function that is an average of $n$ smooth functions, we design stochastic first-order methods to find its approximate stationary points. The convergence of our new methods depends on the smallest (negative) eigenvalue $-\sigma$ of the Hessian, a parameter that describes how nonconvex the function is. Our methods outperform known results for a range of parameter $\sigma$, and can be used to find approximate local minima. Our result implies an interesting dichotomy: there exists a threshold $\sigma_0$ so that the currently fastest methods for $\sigma>\sigma_0$ and for $\sigma<\sigma_0$ have different behaviors: the former scales with $n^{2/3}$ and the latter scales with $n^{3/4}$., Comment: V2-V5 corrected typos, polished writing, and added citations. (We mis-stated the complexity of the prior work repeatSVRG in V1-V4, and have fixed this mistake in V5.)
Published: 2017
