123 results on '"Zheng, Lianmin"'
Search Results
2. Post-Training Sparse Attention with Double Sparsity
- Author
- Yang, Shuo, Sheng, Ying, Gonzalez, Joseph E., Stoica, Ion, and Zheng, Lianmin
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Computation and Language
- Abstract
The inference process for large language models is slow and memory-intensive, with one of the most critical bottlenecks being excessive Key-Value (KV) cache accesses. This paper introduces "Double Sparsity," a novel post-training sparse attention technique designed to alleviate this bottleneck by reducing KV cache access. Double Sparsity combines token sparsity, which focuses on utilizing only the important tokens for computing self-attention, with channel sparsity, an approach that uses important feature channels for identifying important tokens. Our key insight is that the pattern of channel sparsity is relatively static, allowing us to use offline calibration to make it efficient at runtime, thereby enabling accurate and efficient identification of important tokens. Moreover, this method can be combined with offloading to achieve significant memory usage reduction. Experimental results demonstrate that Double Sparsity can achieve $\frac{1}{16}$ token and channel sparsity with minimal impact on accuracy across various tasks, including wiki-2 perplexity, key-value retrieval, and long context benchmarks with models including Llama-2-7B, Llama-2-70B, and Mixtral-8x7B. It brings up to a 14.1$\times$ acceleration in attention operations and a 1.9$\times$ improvement in end-to-end inference on GPUs. With offloading, it achieves a decoding speed acceleration of 16.3$\times$ compared to state-of-the-art solutions at a sequence length of 256K. Our code is publicly available at https://github.com/andy-yang-1/DoubleSparse.
- Published
- 2024
3. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
- Author
- Chiang, Wei-Lin, Zheng, Lianmin, Sheng, Ying, Angelopoulos, Anastasios Nikolas, Li, Tianle, Li, Dacheng, Zhang, Hao, Zhu, Banghua, Jordan, Michael, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Artificial Intelligence; Computer Science - Computation and Language
- Abstract
Large Language Models (LLMs) have unlocked new capabilities and applications; however, evaluating the alignment with human preferences still poses significant challenges. To address this issue, we introduce Chatbot Arena, an open platform for evaluating LLMs based on human preferences. Our methodology employs a pairwise comparison approach and leverages input from a diverse user base through crowdsourcing. The platform has been operational for several months, amassing over 240K votes. This paper describes the platform, analyzes the data we have collected so far, and explains the tried-and-true statistical methods we are using for efficient and accurate evaluation and ranking of models. We confirm that the crowdsourced questions are sufficiently diverse and discriminating and that the crowdsourced human votes are in good agreement with those of expert raters. These analyses collectively establish a robust foundation for the credibility of Chatbot Arena. Because of its unique value and openness, Chatbot Arena has emerged as one of the most referenced LLM leaderboards, widely cited by leading LLM developers and companies. Our demo is publicly available at https://chat.lmsys.org.
- Published
- 2024
4. SGLang: Efficient Execution of Structured Language Model Programs
- Author
- Zheng, Lianmin, Yin, Liangsheng, Xie, Zhiqiang, Sun, Chuyue, Huang, Jeff, Yu, Cody Hao, Cao, Shiyi, Kozyrakis, Christos, Stoica, Ion, Gonzalez, Joseph E., Barrett, Clark, and Sheng, Ying
- Subjects
Computer Science - Artificial Intelligence; Computer Science - Programming Languages
- Abstract
Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang
- Published
- 2023
5. Rethinking Benchmark and Contamination for Language Models with Rephrased Samples
- Author
- Yang, Shuo, Chiang, Wei-Lin, Zheng, Lianmin, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- Abstract
Large language models are increasingly trained on all the data ever produced by humans. Many have raised concerns about the trustworthiness of public benchmarks due to potential contamination in pre-training or fine-tuning datasets. While most data decontamination efforts apply string matching (e.g., n-gram overlap) to remove benchmark data, we show that these methods are insufficient, and simple variations of test data (e.g., paraphrasing, translation) can easily bypass these decontamination measures. Furthermore, we demonstrate that if such variation of test data is not eliminated, a 13B model can easily overfit a test benchmark and achieve drastically high performance, on par with GPT-4. We validate such observations in widely used benchmarks such as MMLU, GSM8K, and HumanEval. To address this growing risk, we propose a stronger LLM-based decontamination method and apply it to widely used pre-training and fine-tuning datasets, revealing significant previously unknown test overlap. For example, in pre-training sets such as RedPajama-Data-1T and StarCoder-Data, we identified that 8-18% of the HumanEval benchmark overlaps. Interestingly, we also find such contamination in synthetic datasets generated by GPT-3.5/4, suggesting a potential risk of unintentional contamination. We urge the community to adopt stronger decontamination approaches when using public benchmarks. Moreover, we call for the community to actively develop fresh one-time exams to evaluate models accurately. Our decontamination tool is publicly available at https://github.com/lm-sys/llm-decontaminator.
- Published
- 2023
6. S-LoRA: Serving Thousands of Concurrent LoRA Adapters
- Author
- Sheng, Ying, Cao, Shiyi, Li, Dacheng, Hooper, Coleman, Lee, Nicholas, Yang, Shuo, Chou, Christopher, Zhu, Banghua, Zheng, Lianmin, Keutzer, Kurt, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
The "pretrain-then-finetune" paradigm is commonly adopted in the deployment of large language models. Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning method, is often employed to adapt a base model to a multitude of tasks, resulting in a substantial collection of LoRA adapters derived from one base model. We observe that this paradigm presents significant opportunities for batched inference during serving. To capitalize on these opportunities, we present S-LoRA, a system designed for the scalable serving of many LoRA adapters. S-LoRA stores all adapters in the main memory and fetches the adapters used by the currently running queries to the GPU memory. To efficiently use the GPU memory and reduce fragmentation, S-LoRA proposes Unified Paging. Unified Paging uses a unified memory pool to manage dynamic adapter weights with different ranks and KV cache tensors with varying sequence lengths. Additionally, S-LoRA employs a novel tensor parallelism strategy and highly optimized custom CUDA kernels for heterogeneous batching of LoRA computation. Collectively, these features enable S-LoRA to serve thousands of LoRA adapters on a single GPU or across multiple GPUs with a small overhead. Compared to state-of-the-art libraries such as HuggingFace PEFT and vLLM (with naive support of LoRA serving), S-LoRA can improve the throughput by up to 4 times and increase the number of served adapters by several orders of magnitude. As a result, S-LoRA enables scalable serving of many task-specific fine-tuned models and offers the potential for large-scale customized fine-tuning services. The code is available at https://github.com/S-LoRA/S-LoRA
- Published
- 2023
7. Mapping electrostatic potential in electrolyte solution
- Author
- Huang, Bo, Yang, Yining, Han, Ruinong, Chen, Keke, Wang, Zhiyuan, Yun, Longteng, Wang, Yian, Chen, Haowei, Du, Yingchao, Hao, Yuxia, Lv, Peng, Ma, Haoran, Ji, Pengju, Tan, Yuemei, Zheng, Lianmin, Liu, Lihong, Li, Renkai, and Yang, Jie
- Subjects
Condensed Matter - Soft Condensed Matter; Physics - Chemical Physics
- Abstract
Mapping the electrostatic potential (ESP) distribution around ions in electrolyte solution is crucial for the establishment of a microscopic understanding of electrolyte solution properties. For solutions in the bulk phase, it has not been possible to measure the ESP distribution on the Angstrom scale. Here we show that a liquid electron scattering experiment using a state-of-the-art relativistic electron beam can be used to measure the Debye screening length of aqueous LiCl, KCl, and KI solutions across a wide range of concentrations. We observe that the Debye screening length is long-ranged at low concentration and short-ranged at high concentration, providing key insight into the decades-long debate over whether the impact of ions in water is long-ranged or short-ranged. In addition, we show that the measured ESP can be used to retrieve the non-local dielectric function of electrolyte solution, which can serve as a promising route to investigate the electrostatic origin of special ion effects. Our observations show that interaction, as one of the two fundamental perspectives for understanding electrolyte solution, can provide much richer information than structure.
Comment: The small-angle signal in Fig. 2 C-H is highly likely to be an experimental artifact, because the electron beam was placed too close to the edge of the liquid sheet. This artifact invalidates the main conclusion of the paper.
- Published
- 2023
8. LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
- Author
- Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Li, Tianle, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Li, Zhuohan, Lin, Zi, Xing, Eric P., Gonzalez, Joseph E., Stoica, Ion, and Zhang, Hao
- Subjects
Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- Abstract
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
- Published
- 2023
9. Efficient Memory Management for Large Language Model Serving with PagedAttention
- Author
- Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao, Gonzalez, Joseph E., Zhang, Hao, and Stoica, Ion
- Subjects
Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for each request is huge and grows and shrinks dynamically. When managed inefficiently, this memory can be significantly wasted by fragmentation and redundant duplication, limiting the batch size. To address this problem, we propose PagedAttention, an attention algorithm inspired by the classical virtual memory and paging techniques in operating systems. On top of it, we build vLLM, an LLM serving system that achieves (1) near-zero waste in KV cache memory and (2) flexible sharing of KV cache within and across requests to further reduce memory usage. Our evaluations show that vLLM improves the throughput of popular LLMs by 2-4$\times$ with the same level of latency compared to the state-of-the-art systems, such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. vLLM's source code is publicly available at https://github.com/vllm-project/vllm
Comment: SOSP 2023
- Published
- 2023
10. H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models
- Author
- Zhang, Zhenyu, Sheng, Ying, Zhou, Tianyi, Chen, Tianlong, Zheng, Lianmin, Cai, Ruisi, Song, Zhao, Tian, Yuandong, Ré, Christopher, Barrett, Clark, Wang, Zhangyang, and Chen, Beidi
- Subjects
Computer Science - Machine Learning
- Abstract
Large Language Models (LLMs), despite their recent impressive accomplishments, are notably cost-prohibitive to deploy, particularly for applications involving long-content generation, such as dialogue systems and story writing. Often, a large amount of transient state information, referred to as the KV cache, is stored in GPU memory in addition to model parameters, scaling linearly with the sequence length and batch size. In this paper, we introduce a novel approach for implementing the KV cache which significantly reduces its memory footprint. Our approach is based on the noteworthy observation that a small portion of tokens contributes most of the value when computing attention scores. We call these tokens Heavy Hitters (H$_2$). Through a comprehensive investigation, we find that (i) the emergence of H$_2$ is natural and strongly correlates with the frequent co-occurrence of tokens in the text, and (ii) removing them results in significant performance degradation. Based on these insights, we propose Heavy Hitter Oracle (H$_2$O), a KV cache eviction policy that dynamically retains a balance of recent and H$_2$ tokens. We formulate the KV cache eviction as a dynamic submodular problem and prove (under mild assumptions) a theoretical guarantee for our novel eviction algorithm which could help guide future work. We validate the accuracy of our algorithm with OPT, LLaMA, and GPT-NeoX across a wide range of tasks. Our implementation of H$_2$O with 20% heavy hitters improves the throughput over three leading inference systems DeepSpeed Zero-Inference, Hugging Face Accelerate, and FlexGen by up to 29$\times$, 29$\times$, and 3$\times$ on OPT-6.7B and OPT-30B. With the same batch size, H$_2$O can reduce the latency by up to 1.9$\times$. The code is available at https://github.com/FMInference/H2O.
- Published
- 2023
11. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
- Author
- Zheng, Lianmin, Chiang, Wei-Lin, Sheng, Ying, Zhuang, Siyuan, Wu, Zhanghao, Zhuang, Yonghao, Lin, Zi, Li, Zhuohan, Li, Dacheng, Xing, Eric P., Zhang, Hao, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Computation and Language; Computer Science - Artificial Intelligence
- Abstract
Evaluating large language model (LLM) based chat assistants is challenging due to their broad capabilities and the inadequacy of existing benchmarks in measuring human preferences. To address this, we explore using strong LLMs as judges to evaluate these models on more open-ended questions. We examine the usage and limitations of LLM-as-a-judge, including position, verbosity, and self-enhancement biases, as well as limited reasoning ability, and propose solutions to mitigate some of them. We then verify the agreement between LLM judges and human preferences by introducing two benchmarks: MT-bench, a multi-turn question set; and Chatbot Arena, a crowdsourced battle platform. Our results reveal that strong LLM judges like GPT-4 can match both controlled and crowdsourced human preferences well, achieving over 80% agreement, the same level of agreement between humans. Hence, LLM-as-a-judge is a scalable and explainable way to approximate human preferences, which are otherwise very expensive to obtain. Additionally, we show our benchmark and traditional benchmarks complement each other by evaluating several variants of LLaMA and Vicuna. The MT-bench questions, 3K expert votes, and 30K conversations with human preferences are publicly available at https://github.com/lm-sys/FastChat/tree/main/fastchat/llm_judge.
Comment: NeurIPS 2023 Datasets and Benchmarks Track
- Published
- 2023
12. On Optimal Caching and Model Multiplexing for Large Model Inference
- Author
- Zhu, Banghua, Sheng, Ying, Zheng, Lianmin, Barrett, Clark, Jordan, Michael I., and Jiao, Jiantao
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Performance; Electrical Engineering and Systems Science - Systems and Control; Statistics - Machine Learning
- Abstract
Large Language Models (LLMs) and other large foundation models have achieved noteworthy success, but their size exacerbates existing resource consumption and latency challenges. In particular, the large-scale deployment of these models is hindered by the significant resource requirements during inference. In this paper, we study two approaches for mitigating these challenges: employing a cache to store previous queries and learning a model multiplexer to choose from an ensemble of models for query processing. Theoretically, we provide an optimal algorithm for jointly optimizing both approaches to reduce the inference cost in both offline and online tabular settings. By combining a caching algorithm, namely Greedy Dual Size with Frequency (GDSF) or Least Expected Cost (LEC), with a model multiplexer, we achieve optimal rates in both offline and online settings. Empirically, simulations show that the combination of our caching and model multiplexing algorithms greatly improves over the baselines, with up to $50\times$ improvement over the baseline when the ratio between the maximum cost and minimum cost is $100$. Experiments on real datasets show a $4.3\times$ improvement in FLOPs over the baseline when the ratio for FLOPs is $10$, and a $1.8\times$ improvement in latency when the ratio for average latency is $1.85$.
- Published
- 2023
13. FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU
- Author
- Sheng, Ying, Zheng, Lianmin, Yuan, Binhang, Li, Zhuohan, Ryabinin, Max, Fu, Daniel Y., Xie, Zhiqiang, Chen, Beidi, Barrett, Clark, Gonzalez, Joseph E., Liang, Percy, Ré, Christopher, Stoica, Ion, and Zhang, Ce
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Performance
- Abstract
The high computational and memory requirements of large language model (LLM) inference make it feasible only with multiple high-end accelerators. Motivated by the emerging demand for latency-insensitive tasks with batched processing, this paper initiates the study of high-throughput LLM inference using limited resources, such as a single commodity GPU. We present FlexGen, a high-throughput generation engine for running LLMs with limited GPU memory. FlexGen can be flexibly configured under various hardware resource constraints by aggregating memory and computation from the GPU, CPU, and disk. By solving a linear programming problem, it searches for efficient patterns to store and access tensors. FlexGen further compresses the weights and the attention cache to 4 bits with negligible accuracy loss. These techniques enable FlexGen to have a larger space of batch size choices and thus significantly increase maximum throughput. As a result, when running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems, reaching a generation throughput of 1 token/s for the first time with an effective batch size of 144. On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours. The code is available at https://github.com/FMInference/FlexGen
- Published
- 2023
14. AlpaServe: Statistical Multiplexing with Model Parallelism for Deep Learning Serving
- Author
- Li, Zhuohan, Zheng, Lianmin, Zhong, Yinmin, Liu, Vincent, Sheng, Ying, Jin, Xin, Huang, Yanping, Chen, Zhifeng, Zhang, Hao, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Networking and Internet Architecture
- Abstract
Model parallelism is conventionally viewed as a method to scale a single large deep learning model beyond the memory limits of a single device. In this paper, we demonstrate that model parallelism can be additionally used for the statistical multiplexing of multiple devices when serving multiple models, even when a single model can fit into a single device. Our work reveals a fundamental trade-off between the overhead introduced by model parallelism and the opportunity to exploit statistical multiplexing to reduce serving latency in the presence of bursty workloads. We explore the new trade-off space and present a novel serving system, AlpaServe, that determines an efficient strategy for placing and parallelizing collections of large deep learning models across a distributed cluster. Evaluation results on production workloads show that AlpaServe can process requests at up to 10x higher rates or 6x more burstiness while staying within latency constraints for more than 99% of requests.
Comment: OSDI 2023
- Published
- 2023
15. On Optimizing the Communication of Model Parallelism
- Author
- Zhuang, Yonghao, Zhao, Hexu, Zheng, Lianmin, Li, Zhuohan, Xing, Eric P., Ho, Qirong, Gonzalez, Joseph E., Stoica, Ion, and Zhang, Hao
- Subjects
Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing
- Abstract
We study a novel and important communication pattern in large-scale model-parallel deep learning (DL), which we call cross-mesh resharding. This pattern emerges when the two paradigms of model parallelism - intra-operator and inter-operator parallelism - are combined to support large models on large clusters. In cross-mesh resharding, a sharded tensor needs to be sent from a source device mesh to a destination device mesh, on which the tensor may be distributed with the same or different layouts. We formalize this as a many-to-many multicast communication problem, and show that existing approaches either are sub-optimal or do not generalize to different network topologies or tensor layouts, which result from different model architectures and parallelism strategies. We then propose two contributions to address cross-mesh resharding: an efficient broadcast-based communication system, and an "overlapping-friendly" pipeline schedule. On microbenchmarks, our overall system outperforms existing ones by up to 10x across various tensor and mesh layouts. On end-to-end training of two large models, GPT-3 and U-Transformer, we improve throughput by 10% and 50%, respectively.
- Published
- 2022
16. TensorIR: An Abstraction for Automatic Tensorized Program Optimization
- Author
- Feng, Siyuan, Hou, Bohan, Jin, Hongyi, Lin, Wuwei, Shao, Junru, Lai, Ruihang, Ye, Zihao, Zheng, Lianmin, Yu, Cody Hao, Yu, Yong, and Chen, Tianqi
- Subjects
Computer Science - Machine Learning; Computer Science - Artificial Intelligence; Computer Science - Programming Languages
- Abstract
Deploying deep learning models on various devices has become an important topic. The wave of hardware specialization brings a diverse set of acceleration primitives for multi-dimensional tensor computations. These new acceleration primitives, along with the emerging machine learning models, bring tremendous engineering challenges. In this paper, we present TensorIR, a compiler abstraction for optimizing programs with these tensor computation primitives. TensorIR generalizes the loop nest representation used in existing machine learning compilers to bring tensor computation as the first-class citizen. Finally, we build an end-to-end framework on top of our abstraction to automatically optimize deep learning models for given tensor computation primitives. Experimental results show that TensorIR compilation automatically uses the tensor computation primitives for given hardware backends and delivers performance that is competitive with state-of-the-art hand-optimized systems across platforms.
Comment: Accepted to ASPLOS 2023
- Published
- 2022
17. NumS: Scalable Array Programming for the Cloud
- Author
- Elibol, Melih, Benara, Vinamra, Yagati, Samyu, Zheng, Lianmin, Cheung, Alvin, Jordan, Michael I., and Stoica, Ion
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Machine Learning; Computer Science - Mathematical Software; Statistics - Applications
- Abstract
Scientists increasingly rely on Python tools to perform scalable distributed memory array operations using rich, NumPy-like expressions. However, many of these tools rely on dynamic schedulers optimized for abstract task graphs, which often encounter memory and network bandwidth-related bottlenecks due to sub-optimal data and operator placement decisions. Tools built on the message passing interface (MPI), such as ScaLAPACK and SLATE, have better scaling properties, but these solutions require specialized knowledge to use. In this work, we present NumS, an array programming library which optimizes NumPy-like expressions on task-based distributed systems. This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS). LSHS is a local search method which optimizes operator placement by minimizing maximum memory and network load on any given node within a distributed system. Coupled with a heuristic for load balanced data layouts, our approach is capable of attaining communication lower bounds on some common numerical operations, and our empirical study shows that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem. On terabyte-scale data, NumS achieves competitive performance to SLATE on DGEMM, up to 20x speedup over Dask on a key operation for tensor factorization, and a 2x speedup on logistic regression compared to Dask ML and Spark's MLlib.
- Published
- 2022
18. GACT: Activation Compressed Training for Generic Network Architectures
- Author
- Liu, Xiaoxuan, Zheng, Lianmin, Wang, Dequan, Cen, Yukuo, Chen, Weize, Han, Xu, Chen, Jianfei, Liu, Zhiyuan, Tang, Jie, Gonzalez, Joey, Mahoney, Michael, and Cheung, Alvin
- Subjects
Computer Science - Machine Learning
- Abstract
Training large neural network (NN) models requires extensive memory resources, and Activation Compressed Training (ACT) is a promising approach to reduce training memory footprint. This paper presents GACT, an ACT framework to support a broad range of machine learning tasks for generic NN architectures with limited domain knowledge. By analyzing a linearized version of ACT's approximate gradient, we prove the convergence of GACT without prior knowledge on operator type or model architecture. To make training stable, we propose an algorithm that decides the compression ratio for each tensor by estimating its impact on the gradient at run time. We implement GACT as a PyTorch library that readily applies to any NN architecture. GACT reduces the activation memory for convolutional NNs, transformers, and graph NNs by up to 8.1x, enabling training with a 4.2x to 24.7x larger batch size, with negligible accuracy loss. We implement GACT as a PyTorch library at https://github.com/LiuXiaoxuanPKU/GACT-ICML.
- Published
- 2022
19. Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning
- Author
- Zheng, Lianmin, Li, Zhuohan, Zhang, Hao, Zhuang, Yonghao, Chen, Zhifeng, Huang, Yanping, Wang, Yida, Xu, Yuanzhong, Zhuo, Danyang, Xing, Eric P., Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Machine Learning; Computer Science - Distributed, Parallel, and Cluster Computing; Computer Science - Programming Languages
- Abstract
Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization plan or automatically generate one from a limited space of model parallelism configurations. They do not suffice to scale out complex DL models on distributed compute devices. Alpa distributes the training of large DL models by viewing parallelisms as two hierarchical levels: inter-operator and intra-operator parallelisms. Based on it, Alpa constructs a new hierarchical space for massive model-parallel execution plans. Alpa designs a number of compilation passes to automatically derive efficient parallel execution plans at each parallelism level. Alpa implements an efficient runtime to orchestrate the two-level parallel execution on distributed compute devices. Our evaluation shows Alpa generates parallelization plans that match or outperform hand-tuned model-parallel training systems even on models they are designed for. Unlike specialized systems, Alpa also generalizes to models with heterogeneous architectures and models without manually-designed plans. Alpa's source code is publicly available at https://github.com/alpa-projects/alpa
Comment: OSDI 2022
- Published
- 2022
20. Eliminating uncertainty of thermal emittance measurement in solenoid scans due to rf and solenoid fields overlap
- Author
- Zheng, Lianmin, Du, Yingchao, and Huang, Pengwei
- Subjects
Physics - Accelerator Physics
- Abstract
The solenoid scan is one of the most common methods for the in-situ measurement of the thermal emittance of a photocathode in an rf photoinjector. The fringe field of the solenoid overlaps with the gun rf field in quite a number of photoinjectors, which makes accurate knowledge of the transfer matrix challenging and thus increases the measurement uncertainty of the thermal emittance. This paper summarizes two methods that have been used to solve the overlap issue and explains their deficiencies. Furthermore, we provide a new method to eliminate the measurement error due to the overlap issue in solenoid scans. The new method is systematically demonstrated using theoretical derivations, beam dynamics simulations, and experimental data based on the photoinjector configurations from three different groups, proving that the measurement error with the new method is very small and can be ignored in most of the photoinjector configurations.
- Published
- 2021
21. ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training
- Author
-
Chen, Jianfei, Zheng, Lianmin, Yao, Zhewei, Wang, Dequan, Stoica, Ion, Mahoney, Michael W., and Gonzalez, Joseph E.
- Subjects
Computer Science - Machine Learning ,Computer Science - Computer Vision and Pattern Recognition ,Statistics - Machine Learning - Abstract
The increasing size of neural network models has been critical for improvements in their accuracy, but device memory is not growing at the same rate. This creates fundamental challenges for training neural networks within limited memory environments. In this work, we propose ActNN, a memory-efficient training framework that stores randomly quantized activations for back propagation. We prove the convergence of ActNN for general network architectures, and we characterize the impact of quantization on the convergence via an exact expression for the gradient variance. Using our theory, we propose novel mixed-precision quantization strategies that exploit the activation's heterogeneity across feature dimensions, samples, and layers. These techniques can be readily applied to existing dynamic graph frameworks, such as PyTorch, simply by substituting the layers. We evaluate ActNN on mainstream computer vision models for classification, detection, and segmentation tasks. On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss. ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size., Comment: to be published in ICML 2021
- Published
- 2021
22. Development of an L-band continuous-wave buncher at Tsinghua University
- Author
-
Gao, Bin, Zheng, Lianmin, Chen, Han, Li, Zizheng, Du, Yingchao, Li, Renkai, Shi, Jiaru, Tian, Qili, You, Jingjing, Zhu, Yijiang, Jia, Yanqing, Huang, Wenhui, and Tang, Chuanxiang
- Published
- 2024
23. Ansor: Generating High-Performance Tensor Programs for Deep Learning
- Author
-
Zheng, Lianmin, Jia, Chengfan, Sun, Minmin, Wu, Zhao, Yu, Cody Hao, Haj-Ali, Ameer, Wang, Yida, Yang, Jun, Zhuo, Danyang, Sen, Koushik, Gonzalez, Joseph E., and Stoica, Ion
- Subjects
Computer Science - Machine Learning ,Computer Science - Neural and Evolutionary Computing ,Computer Science - Performance ,Computer Science - Programming Languages ,Statistics - Machine Learning - Abstract
High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to $3.8\times$, $2.6\times$, and $1.7\times$, respectively., Comment: OSDI 2020
- Published
- 2020
24. Scalable and Efficient Systems for Large Deep Learning Models
- Author
-
Zheng, Lianmin
- Subjects
Computer science ,artificial intelligence ,compilers ,large language models ,parallel computing - Abstract
Recent advancements in machine learning have primarily been driven by large-scale deep learning models, particularly large language models. The large scale and new capabilities of these models present challenges in designing infrastructure systems to support their entire lifecycle, from training and serving to evaluation. To meet the high computational and memory requirements of these models, while fully utilizing and accurately evaluating their capabilities, we need to redesign many system components, such as compilers, distributed computing platforms, programming systems, and evaluation methods. In this dissertation, we introduce a suite of systems designed and built to support large models, covering training, serving, and evaluation phases. First, we discuss Alpa, a system for large-scale model-parallel training, which automatically generates distributed execution plans integrating both inter- and intra-operator parallelism. Moving on to serving, we introduce Ansor, a compiler that produces high-performance implementations of tensor programs for various hardware backends. We also explore SGLang, a system for deploying large language models that includes both a flexible front-end programming interface and an optimized back-end runtime for fast inference. Lastly, in the evaluation phase, we detail our efforts in model evaluation, which include Chatbot Arena, a crowdsourced live benchmark platform, and LLM-as-a-Judge, an automated evaluation pipeline. These tools collectively form a full-stack system for the continuous improvement of large models.
- Published
- 2024
25. A Unified Optimization Approach for CNN Model Inference on Integrated GPUs
- Author
-
Wang, Leyuan, Chen, Zhi, Liu, Yizhi, Wang, Yao, Zheng, Lianmin, Li, Mu, and Wang, Yida
- Subjects
Computer Science - Distributed, Parallel, and Cluster Computing - Abstract
Modern deep learning applications increasingly call for model inference to take place on edge devices for multiple reasons, such as achieving shorter latency, relieving the burden of the network connecting to the cloud, and protecting user privacy. The Convolutional Neural Network (CNN) is one of the most widely used model families in these applications. Given the high computational complexity of CNN models, it is favorable to execute them on the integrated GPUs of edge devices, which are ubiquitous and have more power and better energy efficiency than the accompanying CPUs. However, programming integrated GPUs efficiently is challenging due to the variety of their architectures and programming interfaces. This paper proposes an end-to-end solution to execute CNN model inference on the integrated GPUs at the edge, which uses a unified IR to represent and optimize vision-specific operators on integrated GPUs from multiple vendors, and leverages machine learning-based scheduling search schemes to optimize computationally-intensive operators like convolution. Our solution even provides a fallback mechanism for operators not suitable or convenient to run on GPUs. The evaluation results suggest that compared to state-of-the-art solutions backed up by the vendor-provided high-performance libraries on Intel Graphics, ARM Mali GPU, and Nvidia integrated Maxwell GPU, our solution achieves similar, or even better (up to 1.62$\times$), performance on a number of popular image classification and object detection models. In addition, our solution has a wider model coverage and is more flexible to embrace new models. Our solution has been adopted in production services in AWS and is open-sourced., Comment: 10 pages, 3 figures, 48th International Conference on Parallel Processing
- Published
- 2019
26. Development and high-power testing of an X-band dielectric-loaded power extractor
- Author
-
Shao, Jiahang, Jing, Chunguang, Wisniewski, Eric, Ha, Gwanghui, Conde, Manoel, Liu, Wanming, Power, John, and Zheng, Lianmin
- Subjects
Physics - Accelerator Physics - Abstract
Dielectric loaded structures are promising candidates for use in the structure wakefield acceleration (SWFA) technique, for both the collinear wakefield and the two-beam acceleration (CWA and TBA respectively) approaches, due to their low fabrication cost, low rf losses, and the potential to withstand high gradient. A short pulse (<=20 ns) TBA program is under development at the Argonne Wakefield Accelerator (AWA) facility where dielectric loaded structures are being used for both the power extractor/transfer structure (PETS) and the accelerator. In this study, an X-band 11.7 GHz dielectric PETS was developed and tested at the AWA facility to demonstrate high power wakefield generation. The PETS was driven by a train of eight electron bunches separated by 769.2 ps (9 times the X-band rf period) in order to achieve coherent wakefield superposition. A total train charge of 360 nC was passed through the PETS structure to generate ~200 MW, ~3 ns flat-top rf pulses without rf breakdown. A future experiment is being planned to increase the generated rf power to approximately 1 GW by optimizing the structure design and improving the drive beam quality.
- Published
- 2019
27. Rapid thermal emittance and quantum efficiency mapping of a cesium telluride cathode in an rf photoinjector using multiple laser beamlets
- Author
-
Zheng, Lianmin, Shao, Jiahang, Wisniewski, Eric E., Power, John G., Du, Yingchao, Liu, Wanming, Whiteford, Charles E., Conde, Manoel, Doran, Scott, Jing, Chunguang, and Tang, Chuanxiang
- Subjects
Physics - Accelerator Physics - Abstract
Thermal emittance and quantum efficiency (QE) are key figures of merit of photocathodes, and their uniformity is critical to high-performance photoinjectors. Several QE mapping technologies have been successfully developed; however, there is still a dearth of information on thermal emittance maps. This is because of the extremely time-consuming procedure to gather measurements by scanning a small beam across the cathode with fine steps. To simplify the mapping procedure, and to reduce the time required to take measurements, we propose a new method that requires only a single scan of the solenoid current to simultaneously obtain thermal emittance and QE distribution by using a pattern beam with multiple beamlets. In this paper, its feasibility has been confirmed by both beam dynamics simulation and theoretical analysis. The method has been successfully demonstrated in a proof-of-principle experiment using an L-band radiofrequency photoinjector with a cesium telluride cathode. In the experiment, seven beamlets were generated from a microlens array system and their corresponding thermal emittance and QE varied from 0.93 to 1.14 $\mu$m/mm and from 4.6 to 8.7%, respectively. We also discuss the limitations and future improvements of the method in this paper.
- Published
- 2019
28. Experimental demonstration of the correction of coupled transverse dynamics aberration in an rf photoinjector
- Author
-
Zheng, Lianmin, Shao, Jiahang, Du, Yingchao, Power, John G., Wisniewski, Eric E., Liu, Wanming, Whiteford, Charles E., Conde, Manoel, Doran, Scott, Jing, Chunguang, Tang, Chuanxiang, and Gai, Wei
- Subjects
Physics - Accelerator Physics - Abstract
The production of electron bunches with low transverse emittance approaches the thermal emittance of the photocathode as various aberrations are corrected. Recently, the coupled transverse dynamics aberration was theoretically identified as a significant source of emittance growth and a corrector magnet was proposed for its elimination [D.H. Dowell, F. Zhou, and J. Schmerge, PRAB 21, 010101 (2018)]. This aberration arises when the beam acquires an asymmetric distribution that is then rotated with respect to the transverse reference axis thus introducing a correlation in the vertical and horizontal planes. The asymmetry is introduced by a weak quadrupole field in the rf gun or emittance compensation solenoid and the rotation is caused by the solenoid. This Letter presents an experimental study of the coupled transverse dynamics aberration in an rf photoinjector and demonstrates its elimination by a quadrupole corrector consisting of a normal and a skew quadrupole. The experimental results agree well with theoretical predictions and numerical simulations. The study also demonstrates the emittance of a low charge beam can be preserved during transportation at its thermal value, which was 1.05 mm mrad/mm, for the cesium telluride photocathode and 248 nm UV laser used., Comment: 5 pages, 4 figures
- Published
- 2019
29. An ultrahigh-vacuum S-band photocathode radio-frequency electron gun
- Author
-
Huang, Peng-Wei, Chen, Han, Zheng, Lianmin, Chen, Huaibi, Du, Yingchao, Huang, Wenhui, Shi, Jiaru, Li, Renkai, and Tang, Chuanxiang
- Published
- 2023
30. Overestimation of thermal emittance in solenoid scans due to coupled transverse motion
- Author
-
Zheng, Lianmin, Shao, Jiahang, Du, Yingchao, Power, John G., Wisniewski, Eric E., Liu, Wanming, Whiteford, Charles E., Conde, Manoel, Doran, Scott, Jing, Chunguang, Tang, Chuanxiang, and Gai, Wei
- Subjects
Physics - Accelerator Physics - Abstract
The solenoid scan is a widely used method for the in-situ measurement of the thermal emittance in a photocathode gun. The popularity of this method is due to its simplicity and convenience since all rf photocathode guns are equipped with an emittance compensation solenoid. This paper shows that the solenoid scan measurement overestimates the thermal emittance in the ordinary measurement configuration due to a weak quadrupole field (present in either the rf gun or gun solenoid) followed by a rotation in the solenoid. This coupled transverse dynamics aberration introduces a correlation between the beam's horizontal and vertical motion leading to an increase in the measured 2D transverse emittance, thus the overestimation of the thermal emittance. This effect was systematically studied using both analytic expressions and numerical simulations. These studies were experimentally verified using an L-band 1.6-cell rf photocathode gun with a cesium telluride cathode, which shows a thermal emittance overestimation of 35% with a rms laser spot size of 2.7 mm. The paper concludes by showing that the accuracy of the solenoid scan can be improved by using a quadrupole magnet corrector, consisting of a pair of normal and skew quadrupole magnets., Comment: 12 pages, 13 figures
- Published
- 2018
31. A Hardware-Software Blueprint for Flexible Deep Learning Specialization
- Author
-
Moreau, Thierry, Chen, Tianqi, Vega, Luis, Roesch, Jared, Yan, Eddie, Zheng, Lianmin, Fromm, Josh, Jiang, Ziheng, Ceze, Luis, Guestrin, Carlos, and Krishnamurthy, Arvind
- Subjects
Computer Science - Machine Learning ,Computer Science - Distributed, Parallel, and Cluster Computing ,Statistics - Machine Learning - Abstract
Specialized Deep Learning (DL) acceleration stacks, designed for a specific set of frameworks, model architectures, operators, and data types, offer the allure of high performance while sacrificing flexibility. Changes in algorithms, models, operators, or numerical systems threaten the viability of specialized hardware accelerators. We propose VTA, a programmable deep learning architecture template designed to be extensible in the face of evolving workloads. VTA achieves this flexibility via a parametrizable architecture, two-level ISA, and a JIT compiler. The two-level ISA is based on (1) a task-ISA that explicitly orchestrates concurrent compute and memory tasks and (2) a microcode-ISA which implements a wide variety of operators with single-cycle tensor-tensor operations. Next, we propose a runtime system equipped with a JIT compiler for flexible code-generation and heterogeneous execution that enables effective use of the VTA architecture. VTA is integrated and open-sourced into Apache TVM, a state-of-the-art deep learning compilation stack that provides flexibility for diverse models and divergent hardware backends. We propose a flow that performs design space exploration to generate a customized hardware architecture and software operator library that can be leveraged by mainstream learning frameworks. We demonstrate our approach by deploying optimized deep learning models used for object classification and style transfer on edge-class FPGAs., Comment: 6 pages plus references, 8 figures
- Published
- 2018
32. Learning to Optimize Tensor Programs
- Author
-
Chen, Tianqi, Zheng, Lianmin, Yan, Eddie, Jiang, Ziheng, Moreau, Thierry, Ceze, Luis, Guestrin, Carlos, and Krishnamurthy, Arvind
- Subjects
Computer Science - Machine Learning ,Statistics - Machine Learning - Abstract
We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU., Comment: NeurIPS 2018
- Published
- 2018
33. TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
- Author
-
Chen, Tianqi, Moreau, Thierry, Jiang, Ziheng, Zheng, Lianmin, Yan, Eddie, Cowan, Meghan, Shen, Haichen, Wang, Leyuan, Hu, Yuwei, Ceze, Luis, Guestrin, Carlos, and Krishnamurthy, Arvind
- Subjects
Computer Science - Machine Learning ,Computer Science - Artificial Intelligence ,Computer Science - Programming Languages - Abstract
There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies., Comment: Significantly improved version, add automated optimization
- Published
- 2018
34. Size-to-depth: A New Perspective for Single Image Depth Estimation
- Author
-
Wu, Yiran, Ying, Sihao, and Zheng, Lianmin
- Subjects
Computer Science - Computer Vision and Pattern Recognition - Abstract
In this paper we consider the problem of single monocular image depth estimation. It is a challenging problem due to its ill-posed nature and has found wide application in industry. Previous efforts belong roughly to two families: learning-based methods and interactive methods. Learning-based methods, in which deep convolutional neural networks (CNNs) are widely used, can achieve good results. But they suffer from low generalization ability and typically perform poorly on unfamiliar scenes. Besides, the data-hungry nature of such methods makes data acquisition expensive and time-consuming. Interactive methods require human annotation of depth which, however, is erroneous and of large variance. To overcome these problems, we propose a new perspective for the single monocular image depth estimation problem: size to depth. Our method requires sparse labels for the real-world size of objects rather than raw depth. A coarse depth map is then inferred following geometric relationships according to size labels. Then we refine the depth map by doing energy function optimization on a conditional random field (CRF). We experimentally demonstrate that our method outperforms traditional depth-labeling methods and can produce satisfactory depth maps.
- Published
- 2018
35. MAgent: A Many-Agent Reinforcement Learning Platform for Artificial Collective Intelligence
- Author
-
Zheng, Lianmin, Yang, Jiacheng, Cai, Han, Zhang, Weinan, Wang, Jun, and Yu, Yong
- Subjects
Computer Science - Learning ,Computer Science - Artificial Intelligence ,Computer Science - Multiagent Systems - Abstract
We introduce MAgent, a platform to support research and development of many-agent reinforcement learning. Unlike previous research platforms on single or multi-agent reinforcement learning, MAgent focuses on supporting the tasks and the applications that require hundreds to millions of agents. Within the interactions among a population of agents, it enables not only the study of learning algorithms for agents' optimal policies, but more importantly, the observation and understanding of individual agents' behaviors and social phenomena emerging from the AI society, including communication languages, leadership, and altruism. MAgent is highly scalable and can host up to one million agents on a single GPU server. MAgent also provides flexible configurations for AI researchers to design their customized environments and agents. In this demo, we present three environments designed on MAgent and show emerged collective intelligence by learning from scratch., Comment: NIPS 2017 & AAAI 2018 Demo
- Published
- 2017
36. Structural Study of Aqueous Electrolyte Solution by MeV Liquid Electron Scattering.
- Author
-
Huang, Bo, Yun, Longteng, Yang, Yining, Han, Ruinong, Chen, Keke, Wang, Zhiyuan, Wang, Yian, Chen, Haowei, Du, Yingchao, Hao, Yuxia, Lv, Peng, Ji, Pengju, Tan, Yuemei, Zheng, Lianmin, Liu, Lihong, Li, Renkai, and Yang, Jie
- Published
- 2024
37. Eliminating uncertainty of thermal emittance measurement in solenoid scans due to rf and solenoid fields overlap
- Author
-
Zheng, Lianmin, Du, Yingchao, and Huang, Pengwei
- Published
- 2022
38. Effects of Laser Pulse Heating of Copper Photocathodes on High-brightness Electron Beam Production at Blowout Regime
- Author
-
Zheng, Lianmin, Du, Yingchao, Tang, Chuanxiang, and Gai, Wei
- Subjects
Physics - Accelerator Physics - Abstract
Producing high-brightness and high-charge (>100 pC) electron bunches at blowout regime requires ultrashort laser pulses with high fluence. The effects of laser pulse heating of the copper photocathode are analyzed in this paper. The electron and lattice temperature is calculated using an improved two-temperature model, and an extended Dowell-Schmerge model is employed to calculate the thermal emittance and quantum efficiency. A time-dependent growth of the thermal emittance and the quantum efficiency is observed. For a fixed amount of charge, the projected thermal emittance increases with decreasing laser radius, and this effect should be taken into account in laser optimization at blowout regime. Moreover, laser damage threshold fluence is simulated, showing that the maximum local fluence should be less than 40 mJ/cm^2 to prevent damage to the cathode., Comment: 8 pages, 10 figures
- Published
- 2016
39. Development of an L-band continuous-wave buncher at Tsinghua University
- Author
-
Gao, Bin, Zheng, Lianmin, Chen, Han, Li, Zizheng, Du, Yingchao, Li, Renkai, Shi, Jiaru, Tian, Qili, You, Jingjing, Zhu, Yijiang, Jia, Yanqing, Huang, Wenhui, and Tang, Chuanxiang
- Published
- 2023
40. Design, fabrication, and beam commissioning of a 216.667 MHz continuous-wave photocathode very-high-frequency electron gun
- Author
-
Zheng, Lianmin, Chen, Han, Gao, Bin, Dong, Zixuan, Li, Zizheng, Jia, Yanqing, Tian, Qili, Xia, Qianxu, Zhu, Yijiang, You, Jingjing, Du, Yingchao, Yan, Lixin, Shi, Jiaru, Xing, Qingzi, Li, Renkai, Huang, Wenhui, Chen, Huaibi, and Tang, Chuanxiang
- Published
- 2023
41. Efficient Memory Management for Large Language Model Serving with PagedAttention
- Author
-
Kwon, Woosuk, Li, Zhuohan, Zhuang, Siyuan, Sheng, Ying, Zheng, Lianmin, Yu, Cody Hao, Gonzalez, Joseph, Zhang, Hao, and Stoica, Ion
- Published
- 2023
42. Monitoring of electron bunch length by using Terahertz coherent transition radiation
- Author
-
Su, Xiaolu, Yan, Lixin, Du, Yingchao, Zhang, Zhen, Zhou, Zheng, Wang, Dong, Zheng, Lianmin, Tian, Qili, Huang, Wenhui, and Tang, Chuanxiang
- Published
- 2017
43. High-throughput Generative Inference of Large Language Models with a Single GPU
- Author
-
Sheng, Ying, Zheng, Lianmin, Yuan, Binhang, Li, Zhuohan, Ryabinin, Max, Fu, Daniel Y., Xie, Zhiqiang, Chen, Beidi, Barrett, Clark, Gonzalez, Joseph E., Liang, Percy, Ré, Christopher, Stoica, Ion, and Zhang, Ce
- Published
- 2023
44. Development of S-band photocathode RF guns at Tsinghua University
- Author
-
Zheng, Lianmin, Du, Yingchao, Zhang, Zhe, Qian, Houjun, Yan, Lixin, Shi, Jiaru, Zhang, Zhen, Zhou, Zheng, Wu, Xiaowei, Su, Xiaolu, Wang, Dong, Tian, Qili, Huang, Wenhui, Chen, Huaibi, and Tang, Chuanxiang
- Published
- 2016
45. TensorIR: An Abstraction for Automatic Tensorized Program Optimization
- Author
-
Feng, Siyuan, Hou, Bohan, Jin, Hongyi, Lin, Wuwei, Shao, Junru, Lai, Ruihang, Ye, Zihao, Zheng, Lianmin, Yu, Cody Hao, Yu, Yong, and Chen, Tianqi
- Published
- 2023
46. Simulation of dark current and dark current-induced background photons in the Thomson scattering X-ray source
- Author
-
Zheng, Lianmin, Du, Yingchao, Huang, Wenhui, and Tang, Chuanxiang
- Published
- 2015
47. Analysis of slice transverse emittance evolution in a very-high-frequency gun photoinjector
- Author
-
Chen, Han, Zheng, Lianmin, Huang, Pengwei, Song, Chuangye, Du, Yingchao, Li, Renkai, Huang, Wenhui, and Tang, Chuanxiang
- Published
- 2021
48. Simple and Automatic Distributed Machine Learning on Ray
- Author
-
Zhang, Hao, Li, Zhuohan, Zheng, Lianmin, and Stoica, Ion
- Published
- 2021
49. Rapid thermal emittance and quantum efficiency mapping of a cesium telluride cathode in an rf photoinjector using multiple laser beamlets
- Author
-
Zheng, Lianmin, Shao, Jiahang, Wisniewski, Eric E., Power, John G., Du, Yingchao, Liu, Wanming, Whiteford, Charles E., Conde, Manoel, Doran, Scott, Jing, Chunguang, and Tang, Chuanxiang
- Published
- 2020
50. Development and high-power testing of an X-band dielectric-loaded power extractor
- Author
-
Shao, Jiahang, Jing, Chunguang, Wisniewski, Eric, Ha, Gwanghui, Conde, Manoel, Liu, Wanming, Power, John, and Zheng, Lianmin
- Published
- 2020