Author: "Yazdanbakhsh, Amir" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Yazdanbakhsh, Amir"' showing total 118 results


1. When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

Author: You, Haoran, Fu, Yichao, Wang, Zheng, Yazdanbakhsh, Amir, and Lin, Yingyan Celine
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM., Comment: Accepted by ICML 2024; 17 pages; 10 figures; 16 tables
Published: 2024

2. ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization

Author: You, Haoran, Guo, Yipin, Fu, Yichao, Zhou, Wei, Shi, Huihong, Zhang, Xiaofan, Kundu, Souvik, Yazdanbakhsh, Amir, and Lin, Yingyan Celine
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large language models (LLMs) have shown impressive performance on language tasks but face challenges when deployed on resource-constrained devices due to their extensive parameters and reliance on dense multiplications, resulting in high memory demands and latency bottlenecks. Shift-and-add reparameterization offers a promising solution by replacing costly multiplications with hardware-friendly primitives in both the attention and multi-layer perceptron (MLP) layers of an LLM. However, current reparameterization techniques require training from scratch or full parameter fine-tuning to restore accuracy, which is resource-intensive for LLMs. To address this, we propose accelerating pretrained LLMs through post-training shift-and-add reparameterization, creating efficient multiplication-free models, dubbed ShiftAddLLM. Specifically, we quantize each weight matrix into binary matrices paired with group-wise scaling factors. The associated multiplications are reparameterized into (1) shifts between activations and scaling factors and (2) queries and adds according to the binary matrices. To reduce accuracy loss, we present a multi-objective optimization method to minimize both weight and output activation reparameterization errors. Additionally, based on varying sensitivity across layers to reparameterization, we develop an automated bit allocation strategy to further reduce memory usage and latency. Experiments on five LLM families and eight tasks consistently validate the effectiveness of ShiftAddLLM, achieving average perplexity improvements of 5.6 and 22.7 points at comparable or lower latency compared to the most competitive quantized LLMs at 3 and 2 bits, respectively, and more than 80% memory and energy reductions over the original LLMs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddLLM.
Published: 2024
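
Note: The shift-and-add idea described in the abstract above can be illustrated with a minimal Python sketch. It assumes integer activations, weights restricted to +1 or -1, and a single power-of-two scaling factor. The function name `shift_add_dot` and these simplifications are illustrative assumptions; this is not the ShiftAddLLM implementation, which according to the abstract uses group-wise scaling factors and queries driven by the binary matrices.

```python
from typing import Sequence


def shift_add_dot(activations: Sequence[int], signs: Sequence[int], shift: int) -> int:
    """Dot product of integer activations with weights of the form sign * 2**shift.

    Only bit shifts, additions, and subtractions are used (no multiplications).
    """
    if len(activations) != len(signs):
        raise ValueError("activations and signs must have the same length")
    total = 0
    for x, s in zip(activations, signs):
        if s not in (1, -1):
            raise ValueError("signs must be +1 or -1")
        scaled = x << shift  # equivalent to x * 2**shift for Python integers
        total = total + scaled if s == 1 else total - scaled
    return total


# Reference check against ordinary multiplication.
acts, sgns, k = [3, 5, 2], [1, 1, -1], 1
assert shift_add_dot(acts, sgns, k) == sum(a * s * 2**k for a, s in zip(acts, sgns))
```

The sketch only shows why a power-of-two scale turns each multiplication into a shift and why binary weights turn the accumulation into additions and subtractions.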

3. Effective Interplay between Sparsity and Quantization: From Theory to Practice

Author: Harma, Simla Burcu, Chakraborty, Ayan, Kostenok, Elizaveta, Mishin, Danila, Ha, Dongho, Falsafi, Babak, Jaggi, Martin, Liu, Ming, Oh, Yunho, Subramanian, Suvinay, and Yazdanbakhsh, Amir
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The increasing size of deep neural networks necessitates effective model compression to improve computational efficiency and reduce their memory footprint. Sparsity and quantization are two prominent compression methods that have individually demonstrated significant reduction in computational and memory footprints while preserving model accuracy. While effective, the interplay between these two methods remains an open question. In this paper, we investigate the interaction between these two methods and assess whether their combination impacts final model accuracy. We mathematically prove that applying sparsity before quantization is the optimal sequence for these operations, minimizing error in computation. Our empirical studies across a wide range of models, including OPT and Llama model families (125M-8B) and ViT corroborate these theoretical findings. In addition, through rigorous analysis, we demonstrate that sparsity and quantization are not orthogonal; their interaction can significantly harm model accuracy, with quantization error playing a dominant role in this degradation. Our findings extend to the efficient deployment of large models in resource-limited compute platforms and reduce serving cost, offering insights into best practices for applying these compression methods to maximize efficacy without compromising accuracy.
Published: 2024

4. SLoPe: Double-Pruned Sparse Plus Lazy Low-Rank Adapter Pretraining of LLMs

Author: Mozaffari, Mohammad, Yazdanbakhsh, Amir, Zhang, Zhao, and Dehnavi, Maryam Mehri
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: We propose SLoPe, a Double-Pruned Sparse Plus Lazy Low-rank Adapter Pretraining method for LLMs that improves the accuracy of sparse LLMs while accelerating their pretraining and inference and reducing their memory footprint. Sparse pretraining of LLMs reduces the accuracy of the model; to overcome this, prior work uses dense models during fine-tuning. SLoPe improves the accuracy of sparsely pretrained models by adding low-rank adapters in the final 1% of pretraining iterations without adding significant overhead to model pretraining and inference. In addition, SLoPe uses a double-pruned backward pass formulation that prunes the transposed weight matrix using N:M sparsity structures to enable an accelerated sparse backward pass. SLoPe accelerates the training and inference of models with billions of parameters by up to 1.14× and 1.34×, respectively (OPT-33B and OPT-66B), while reducing their memory usage by up to 0.77× and 0.51× for training and inference, respectively.
Published: 2024

5. Tao: Re-Thinking DL-based Microarchitecture Simulation

Author: Pandey, Santosh, Yazdanbakhsh, Amir, and Liu, Hang
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, and optimize new hardware that meets specific design requirements. While the quest for a fast, accurate and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short at different aspects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses the execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy but fail to provide adequate low-level microarchitectural performance metrics crucial for microarchitectural bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture. Re-thinking the advantages and limitations of the aforementioned simulation paradigms, this paper introduces TAO that redesigns the DL-based simulation with three primary contributions: First, we propose a new training dataset design such that the subsequent simulation only needs functional trace as inputs, which can be rapidly generated and reused across microarchitectures. Second, we redesign the input features and the DL model using self-attention to support predicting various performance metrics. Third, we propose techniques to train a microarchitecture agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and reduces the re-training overhead of conventional DL-based simulators. Our extensive evaluation shows TAO can reduce the overall training and simulation time by 18.06x over the state-of-the-art DL-based endeavors., Comment: Published in POMACS and SIGMETRICS'24
Published: 2024

6. DaCapo: Accelerating Continuous Learning in Autonomous Systems for Video Analytics

Author: Kim, Yoonsung, Oh, Changhun, Hwang, Jinwoo, Kim, Wonung, Oh, Seongryong, Lee, Yubin, Sharma, Hardik, Yazdanbakhsh, Amir, and Park, Jongse
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning, Computer Science - Robotics
Abstract: Deep neural network (DNN) video analytics is crucial for autonomous systems such as self-driving vehicles, unmanned aerial vehicles (UAVs), and security robots. However, real-world deployment faces challenges due to their limited computational resources and battery power. To tackle these challenges, continuous learning exploits a lightweight "student" model at deployment (inference), leverages a larger "teacher" model for labeling sampled data (labeling), and continuously retrains the student model to adapt to changing scenarios (retraining). This paper highlights the limitations in state-of-the-art continuous learning systems: (1) they focus on computations for retraining, while overlooking the compute needs for inference and labeling, (2) they rely on power-hungry GPUs, unsuitable for battery-operated autonomous systems, and (3) they are located on a remote centralized server, intended for multi-tenant scenarios, again unsuitable for autonomous systems due to privacy, network availability, and latency concerns. We propose a hardware-algorithm co-designed solution for continuous learning, DaCapo, that enables autonomous systems to perform concurrent executions of inference, labeling, and training in a performant and energy-efficient manner. DaCapo comprises (1) a spatially-partitionable and precision-flexible accelerator enabling parallel execution of kernels on sub-accelerators at their respective precisions, and (2) a spatiotemporal resource allocation algorithm that strategically navigates the resource-accuracy tradeoff space, facilitating optimal decisions for resource allocation to achieve maximal accuracy. Our evaluation shows that DaCapo achieves 6.5% and 5.5% higher accuracy than the state-of-the-art GPU-based continuous learning systems Ekya and EOMU, respectively, while consuming 254x less power.
Published: 2024

7. Progressive Gradient Flow for Robust N:M Sparsity Training in Transformers

Author: Bambhaniya, Abhimanyu Rajeshkumar, Yazdanbakhsh, Amir, Subramanian, Suvinay, Kao, Sheng-Chun, Agrawal, Shivani, Evci, Utku, and Krishna, Tushar
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: N:M structured sparsity has garnered significant interest as a result of relatively modest overhead and improved efficiency. Additionally, this form of sparsity holds considerable appeal for reducing the memory footprint owing to its modest representation overhead. Although there have been efforts to develop training recipes for N:M structured sparsity, they primarily focus on low-sparsity regions (~50%). The performance of models trained using these approaches tends to decline when confronted with high-sparsity regions (>80%). In this work, we study the effectiveness of existing sparse training recipes at high-sparsity regions and argue that these methods fail to sustain the model quality on par with low-sparsity regions. We demonstrate that the significant factor contributing to this disparity is the presence of elevated levels of induced noise in the gradient magnitudes. To mitigate this undesirable effect, we employ decay mechanisms to progressively restrict the flow of gradients towards pruned elements. Our approach improves the model quality by up to 2% and 5% in vision and language models in the high-sparsity regime, respectively. We also evaluate the trade-off between model accuracy and training compute cost in terms of FLOPs. At iso-training FLOPs, our method yields better performance compared to conventional sparse training recipes, exhibiting an accuracy improvement of up to 2%. The source code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity., Comment: 18 pages, 8 figures, 17 tables. Code is available at https://github.com/abhibambhaniya/progressive_gradient_flow_nm_sparsity
Published: 2024
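
Note: For readers unfamiliar with the N:M pattern named in the abstract above, the following minimal Python sketch builds a magnitude-based N:M mask: in every group of M consecutive weights, the N largest-magnitude entries are kept and the rest are pruned. The function name `nm_mask` and the tie-breaking rule (lower index wins) are illustrative assumptions; the sketch does not implement the decay-based training recipe proposed in the paper.

```python
from typing import List, Sequence


def nm_mask(weights: Sequence[float], n: int, m: int) -> List[int]:
    """Return a 0/1 mask keeping the n largest-magnitude weights in each group of m."""
    if not 0 < n <= m:
        raise ValueError("require 0 < n <= m")
    if len(weights) % m != 0:
        raise ValueError("number of weights must be a multiple of m")
    mask: List[int] = []
    for start in range(0, len(weights), m):
        group = weights[start:start + m]
        # Rank positions by descending magnitude; ties go to the lower index.
        ranked = sorted(range(m), key=lambda i: (-abs(group[i]), i))
        keep = set(ranked[:n])
        mask.extend(1 if i in keep else 0 for i in range(m))
    return mask


# 2:4 sparsity keeps two of every four weights (50% sparsity).
print(nm_mask([0.1, -0.9, 0.5, 0.05, 1.0, 0.0, -0.2, 0.3], n=2, m=4))
# → [0, 1, 1, 0, 1, 0, 0, 1]
```

With this convention, 2:4 corresponds to 50% sparsity and 1:4 to 75% sparsity.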

8. USM-Lite: Quantization and Sparsity Aware Fine-tuning for Speech Recognition with Universal Speech Models

Author: Ding, Shaojin, Qiu, David, Rim, David, He, Yanzhang, Rybakov, Oleg, Li, Bo, Prabhavalkar, Rohit, Wang, Weiran, Sainath, Tara N., Han, Zhonglin, Li, Jian, Yazdanbakhsh, Amir, and Agrawal, Shivani
Subjects: Electrical Engineering and Systems Science - Audio and Speech Processing, Computer Science - Sound
Abstract: End-to-end automatic speech recognition (ASR) models have seen revolutionary quality gains with the recent development of large-scale universal speech models (USM). However, deploying these massive USMs is extremely expensive due to the enormous memory usage and computational cost. Therefore, model compression is an important research topic to fit USM-based ASR under budget in real-world scenarios. In this study, we propose a USM fine-tuning approach for ASR, with a low-bit quantization and N:M structured sparsity aware paradigm on the model weights, reducing the model complexity from parameter precision and matrix topology perspectives. We conducted extensive experiments with a 2-billion parameter USM on a large-scale voice search dataset to evaluate our proposed method. A series of ablation studies validate the effectiveness of up to int4 quantization and 2:4 sparsity. However, a single compression technique fails to recover the performance well under extreme setups including int2 quantization and 1:4 sparsity. By contrast, our proposed method can compress the model to have 9.4% of the size, at the cost of only 7.3% relative word error rate (WER) regressions. We also provided in-depth analyses on the results and discussions on the limitations and potential solutions, which would be valuable for future studies., Comment: Accepted by ICASSP 2024. Preprint
Published: 2023

9. JaxPruner: A concise library for sparsity research

Author: Lee, Joo Hyung, Park, Wonpyo, Mitchell, Nicole, Pilault, Jonathan, Obando-Ceron, Johan, Kim, Han-Byul, Lee, Namhoon, Frantar, Elias, Long, Yun, Yazdanbakhsh, Amir, Agrawal, Shivani, Subramanian, Suvinay, Wang, Xin, Kao, Sheng-Chun, Zhang, Xingyao, Gale, Trevor, Bik, Aart, Han, Woohyun, Ferev, Milen, Han, Zhonglin, Kim, Hong-Seok, Dauphin, Yann, Dziugaite, Gintare Karolina, Castro, Pablo Samuel, and Evci, Utku
Subjects: Computer Science - Machine Learning, Computer Science - Software Engineering
Abstract: This paper introduces JaxPruner, an open-source JAX-based pruning and sparse training library for machine learning research. JaxPruner aims to accelerate research on sparse neural networks by providing concise implementations of popular pruning and sparse training algorithms with minimal memory and latency overhead. Algorithms implemented in JaxPruner use a common API and work seamlessly with the popular optimization library Optax, which, in turn, enables easy integration with existing JAX based libraries. We demonstrate this ease of integration by providing examples in four different codebases: Scenic, t5x, Dopamine and FedJAX and provide baseline experiments on popular benchmarks., Comment: Jaxpruner is hosted at http://github.com/google-research/jaxpruner
Published: 2023

10. Self-Refine: Iterative Refinement with Self-Feedback

Author: Madaan, Aman, Tandon, Niket, Gupta, Prakhar, Hallinan, Skyler, Gao, Luyu, Wiegreffe, Sarah, Alon, Uri, Dziri, Nouha, Prabhumoye, Shrimai, Yang, Yiming, Gupta, Shashank, Majumder, Bodhisattwa Prasad, Hermann, Katherine, Welleck, Sean, Yazdanbakhsh, Amir, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Like humans, large language models (LLMs) do not always generate the best output on their first try. Motivated by how humans refine their written text, we introduce Self-Refine, an approach for improving initial outputs from LLMs through iterative feedback and refinement. The main idea is to generate an initial output using an LLM; then, the same LLM provides feedback on its output and uses it to refine itself, iteratively. Self-Refine does not require any supervised training data, additional training, or reinforcement learning, and instead uses a single LLM as the generator, refiner, and feedback provider. We evaluate Self-Refine across 7 diverse tasks, ranging from dialog response generation to mathematical reasoning, using state-of-the-art (GPT-3.5, ChatGPT, and GPT-4) LLMs. Across all evaluated tasks, outputs generated with Self-Refine are preferred by humans and automatic metrics over those generated with the same LLM using conventional one-step generation, improving by ~20% absolute on average in task performance. Our work demonstrates that even state-of-the-art LLMs like GPT-4 can be further improved at test time using our simple, standalone approach., Comment: Code, data, and demo at https://selfrefine.info/
Published: 2023
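
Note: The generate, feedback, refine loop described in the abstract above can be sketched in a few lines of Python. The prompt prefixes, the `stop_token` convention, the iteration cap, and the `toy_llm` stand-in are all hypothetical choices made for illustration; they are not taken from the Self-Refine code release. A real model call would replace `toy_llm`.

```python
from typing import Callable


def self_refine(task: str, llm: Callable[[str], str],
                max_iters: int = 3, stop_token: str = "LGTM") -> str:
    """Iteratively improve an output using feedback from the same model."""
    output = llm(f"TASK: {task}")  # initial generation
    for _ in range(max_iters):
        feedback = llm(f"FEEDBACK ON: {output}")  # model critiques its own output
        if stop_token in feedback:  # model reports nothing left to fix
            break
        output = llm(f"REFINE: {output}\nFEEDBACK: {feedback}")  # apply the feedback
    return output


def toy_llm(prompt: str) -> str:
    """Deterministic stand-in for a model, used only to exercise the loop."""
    if prompt.startswith("TASK:"):
        return "hello"
    if prompt.startswith("FEEDBACK ON:"):
        text = prompt[len("FEEDBACK ON:"):].strip()
        return "LGTM" if text.endswith("!") else "add an exclamation mark"
    # REFINE prompt: the first line carries the previous output.
    previous = prompt.splitlines()[0][len("REFINE:"):].strip()
    return previous + "!"


print(self_refine("greet the user", toy_llm))  # → hello!
```

A single callable plays all three roles (generator, feedback provider, refiner), which mirrors the abstract's point that no additional model or training is required.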

11. In-Storage Domain-Specific Acceleration for Serverless Computing

Author: Mahapatra, Rohan, Ghodrati, Soroush, Ahn, Byung Hoon, Kinzer, Sean, Wang, Shu-ting, Xu, Hanyang, Karthikeyan, Lavanya, Sharma, Hardik, Yazdanbakhsh, Amir, Alian, Mohammad, and Esmaeilzadeh, Hadi
Subjects: Computer Science - Hardware Architecture
Abstract: While (1) serverless computing is emerging as a popular form of cloud execution, datacenters are going through major changes: (2) storage disaggregation at the system infrastructure level and (3) integration of domain-specific accelerators at the hardware level. Each of these three trends individually provides significant benefits; however, when combined, the benefits diminish. Specifically, the paper makes the key observation that for serverless functions, the overhead of accessing disaggregated persistent storage overshadows the gains from accelerators. Therefore, to benefit from all these trends in conjunction, we propose Domain-Specific Computational Storage for Serverless (DSCS-Serverless). This idea contributes a serverless model that leverages a programmable accelerator within computational storage to combine the benefits of acceleration and storage disaggregation simultaneously. Our results with eight applications show that integrating a comparatively small accelerator within the storage (DSCS-Serverless) that fits within its power constraints (15 Watts) significantly outperforms a traditional disaggregated system that utilizes the NVIDIA RTX 2080 Ti GPU (250 Watts). Further, the work highlights that disaggregation, the serverless model, and the limited power budget for computation in storage require a different design than the conventional practices of integrating microprocessors and FPGAs. This insight is in contrast with current practices of designing computational storage that are yet to address the challenges associated with the shifts in datacenters. In comparison with two such conventional designs that either use quad-core ARM A57 or a Xilinx FPGA, DSCS-Serverless provides 3.7x and 1.7x end-to-end application speedup, 4.3x and 1.9x energy reduction, and 3.2x and 2.3x higher cost efficiency, respectively.
Published: 2023

12. Learning Performance-Improving Code Edits

Author: Shypula, Alexander, Madaan, Aman, Zeng, Yimeng, Alon, Uri, Gardner, Jacob, Hashemi, Milad, Neubig, Graham, Ranganathan, Parthasarathy, Bastani, Osbert, and Yazdanbakhsh, Amir
Subjects: Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Computer Science - Performance
Abstract: With the decline of Moore's law, optimizing program performance has become a major focus of software research. However, high-level optimizations such as API and algorithm changes remain elusive due to the difficulty of understanding the semantics of code. Simultaneously, pretrained large language models (LLMs) have demonstrated strong capabilities at solving a wide range of programming tasks. To that end, we introduce a framework for adapting LLMs to high-level program optimization. First, we curate a dataset of performance-improving edits made by human programmers of over 77,000 competitive C++ programming submission pairs, accompanied by extensive unit tests. A major challenge is the significant variability of measuring performance on commodity hardware, which can lead to spurious "improvements." To isolate and reliably evaluate the impact of program optimizations, we design an environment based on the gem5 full system simulator, the de facto simulator used in academia and industry. Next, we propose a broad range of adaptation strategies for code optimization; for prompting, these include retrieval-based few-shot prompting and chain-of-thought, and for finetuning, these include performance-conditioned generation and synthetic data augmentation based on self-play. A combination of these techniques achieves a mean speedup of 6.86 with eight generations, higher than average optimizations from individual programmers (3.66). Using our model's fastest generations, we set a new upper limit on the fastest speedup possible for our dataset at 9.64 compared to using the fastest human submissions available (9.56)., Comment: Published as a conference paper at ICLR 2024 (Spotlight). Project website: https://pie4perf.com/
Published: 2023

13. STEP: Learning N:M Structured Sparsity Masks from Scratch with Precondition

Author: Lu, Yucheng, Agrawal, Shivani, Subramanian, Suvinay, Rybakov, Oleg, De Sa, Christopher, and Yazdanbakhsh, Amir
Subjects: Computer Science - Machine Learning
Abstract: Recent innovations in hardware (e.g. Nvidia A100) have motivated learning N:M structured sparsity masks from scratch for fast model inference. However, state-of-the-art learning recipes in this regime (e.g. SR-STE) are proposed for non-adaptive optimizers like momentum SGD, and incur a non-trivial accuracy drop for Adam-trained models like attention-based LLMs. In this paper, we first demonstrate that this gap originates from a poorly estimated second moment (i.e. variance) in the Adam states given by the masked weights. We conjecture that learning N:M masks with Adam should take the critical regime of variance estimation into account. In light of this, we propose STEP, an Adam-aware recipe that learns N:M masks in two phases: first, STEP calculates a reliable variance estimate (precondition phase) and subsequently, the variance remains fixed and is used as a precondition to learn N:M masks (mask-learning phase). STEP automatically identifies the switching point between the two phases by dynamically sampling variance changes over the training trajectory and testing the sample concentration. Empirically, we evaluate STEP and other baselines such as ASP and SR-STE on multiple tasks including CIFAR classification, machine translation and LLM fine-tuning (BERT-Base, GPT-2). We show STEP mitigates the accuracy drop of baseline recipes and is robust to aggressive structured sparsity ratios.
Published: 2023

14. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation

Author: Sykora, Ondrej, Phothilimthana, Phitchaya Mangpo, Mendis, Charith, and Yazdanbakhsh, Amir
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: Analytical hardware performance models yield swift estimation of desired hardware performance metrics. However, developing these analytical models for modern processors with sophisticated microarchitectures is an extremely laborious task and requires a firm understanding of target microarchitecture's internal structure. In this paper, we introduce GRANITE, a new machine learning model that estimates the throughput of basic blocks across different microarchitectures. GRANITE uses a graph representation of basic blocks that captures both structural and data dependencies between instructions. This representation is processed using a graph neural network that takes advantage of the relational information captured in the graph and learns a rich neural representation of the basic block that allows more precise throughput estimation. Our results establish a new state-of-the-art for basic block performance estimation with an average test error of 6.9% across a wide range of basic blocks and microarchitectures for the x86-64 target. Compared to recent work, this reduced the error by 1.7% while improving training and inference throughput by approximately 3.0x. In addition, we propose the use of multi-task learning with independent multi-layer feed forward decoder networks. Our results show that this technique further improves precision of all learned models while significantly reducing per-microarchitecture training costs. We perform an extensive set of ablation studies and comparisons with prior work, concluding a set of methods to achieve high accuracy for basic block performance estimation., Comment: 13 pages; 5 figures; published at IISWC 2022; Included IEEE copyright
Published: 2022

15. Text and Patterns: For Effective Chain of Thought, It Takes Two to Tango

Author: Madaan, Aman and Yazdanbakhsh, Amir
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: The past decade has witnessed dramatic gains in natural language processing and an unprecedented scaling of large language models. These developments have been accelerated by the advent of few-shot techniques such as chain of thought (CoT) prompting. Specifically, CoT pushes the performance of large language models in a few-shot setup by augmenting the prompts with intermediate steps. Despite impressive results across various tasks, the reasons behind their success have not been explored. This work uses counterfactual prompting to develop a deeper understanding of CoT-based few-shot prompting mechanisms in large language models. We first systematically identify and define the key components of a prompt: symbols, patterns, and text. Then, we devise and conduct an exhaustive set of experiments across four different tasks, by querying the model with counterfactual prompts where only one of these components is altered. Our experiments across three models (PaLM, GPT-3, and CODEX) reveal several surprising findings and bring into question the conventional wisdom around few-shot prompting. First, the presence of factual patterns in a prompt is practically immaterial to the success of CoT. Second, our results conclude that the primary role of intermediate steps may not be to facilitate learning how to solve a task. The intermediate steps are rather a beacon for the model to realize what symbols to replicate in the output to form a factual answer. Further, text imbues patterns with commonsense knowledge and meaning. Our empirical and qualitative analysis reveals that a symbiotic relationship between text and patterns explains the success of few-shot prompting: text helps extract commonsense from the question to help patterns, and patterns enforce task understanding and direct text generation., Comment: Shortened version with additional results from CODEX and GPT-3. The authors contributed equally. Work done when Aman Madaan was a student researcher at Google Research, Brain Team
Published: 2022

16. Training Recipe for N:M Structured Sparsity with Decaying Pruning Mask

Author: Kao, Sheng-Chun, Yazdanbakhsh, Amir, Subramanian, Suvinay, Agrawal, Shivani, Evci, Utku, and Krishna, Tushar
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Hardware Architecture, Computer Science - Performance
Abstract: Sparsity has become one of the promising methods to compress and accelerate Deep Neural Networks (DNNs). Among different categories of sparsity, structured sparsity has gained more attention due to its efficient execution on modern accelerators. Particularly, N:M sparsity is attractive because there are already hardware accelerator architectures that can leverage certain forms of N:M structured sparsity to yield higher compute-efficiency. In this work, we focus on N:M sparsity and extensively study and evaluate various training recipes for N:M sparsity in terms of the trade-off between model accuracy and compute cost (FLOPs). Building upon this study, we propose two new decay-based pruning methods, namely "pruning mask decay" and "sparse structure decay". Our evaluations indicate that these proposed methods consistently deliver state-of-the-art (SOTA) model accuracy, comparable to unstructured sparsity, on a Transformer-based model for a translation task. The increase in the accuracy of the sparse model using the new training recipes comes at the cost of a marginal increase in the total training compute (FLOPs)., Comment: 11 pages, 2 figures, and 9 tables. Published at the ICML Workshop on Sparsity in Neural Networks Advancing Understanding and Practice, 2022. First two authors contributed equally
Published: 2022

17. Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

Author: Yazdanbakhsh, Amir, Moradifirouzabadi, Ashkan, Li, Zheng, and Kang, Mingu
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: As its core computation, a self-attention mechanism gauges pairwise correlations across the entire input sequence. Despite favorable performance, calculating pairwise correlations is prohibitively costly. While recent work has shown the benefits of runtime pruning of elements with low attention scores, the quadratic complexity of self-attention mechanisms and their on-chip memory capacity demands are overlooked. This work addresses these constraints by architecting an accelerator, called SPRINT, which leverages the inherent parallelism of ReRAM crossbar arrays to compute attention scores in an approximate manner. Our design prunes the low attention scores using a lightweight analog thresholding circuitry within ReRAM, enabling SPRINT to fetch only a small subset of relevant data to on-chip memory. To mitigate potential negative repercussions for model accuracy, SPRINT re-computes the attention scores for the few fetched data in digital. The combined in-memory pruning and on-chip recompute of the relevant attention scores enables SPRINT to transform quadratic complexity to a merely linear one. In addition, we identify and leverage a dynamic spatial locality between the adjacent attention operations even after pruning, which eliminates costly yet redundant data fetches. We evaluate our proposed technique on a wide range of state-of-the-art transformer models. On average, SPRINT yields 7.5x speedup and 19.6x energy reduction when total 16KB on-chip memory is used, while virtually on par with iso-accuracy of the baseline models (on average 0.36% degradation)., Comment: 15 pages; 14 figures; published at MICRO 2022; First three authors contributed equally
Published: 2022

18. Accelerating Attention through Gradient-Based Learned Runtime Pruning

Author: Li, Zheng, Ghodrati, Soroush, Yazdanbakhsh, Amir, Esmaeilzadeh, Hadi, and Kang, Mingu
Subjects: Computer Science - Computation and Language, Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Self-attention is a key enabler of state-of-the-art accuracy for various transformer-based Natural Language Processing models. This attention mechanism calculates a correlation score for each word with respect to the other words in a sentence. Commonly, only a small subset of words highly correlates with the word under attention, which is only determined at runtime. As such, a significant amount of computation is inconsequential due to low attention scores and can potentially be pruned. The main challenge is finding the threshold for the scores below which subsequent computation will be inconsequential. Although such a threshold is discrete, this paper formulates its search through a soft differentiable regularizer integrated into the loss function of the training. This formulation piggybacks on the back-propagation training to analytically co-optimize the threshold and the weights simultaneously, striking a formally optimal balance between accuracy and computation pruning. To best utilize this mathematical innovation, we devise a bit-serial architecture, dubbed LeOPArd, for transformer language models with a bit-level early-termination microarchitectural mechanism. We evaluate our design across 43 back-end tasks for MemN2N, BERT, ALBERT, GPT-2, and Vision transformer models. Post-layout results show that, on average, LeOPArd yields 1.9x and 3.9x speedup and energy reduction, respectively, while keeping the average accuracy virtually intact (<0.2% degradation)., Comment: First three authors contributed equally; published at ISCA 2022
Published: 2022

19. Data-Driven Offline Optimization For Architecting Hardware Accelerators

Author: Kumar, Aviral, Yazdanbakhsh, Amir, Hashemi, Milad, Swersky, Kevin, and Levine, Sergey
Subjects: Computer Science - Hardware Architecture, Computer Science - Machine Learning
Abstract: Industry has gradually moved towards application-specific hardware accelerators in order to attain higher efficiency. While such a paradigm shift is already starting to show promising results, designers need to spend considerable manual effort and perform a large number of time-consuming simulations to find accelerators that can accelerate multiple target applications while obeying design constraints. Moreover, such a "simulation-driven" approach must be re-run from scratch every time the set of target applications or design constraints changes. An alternative paradigm is to use a "data-driven", offline approach that utilizes logged simulation data to architect hardware accelerators, without needing any form of simulations. Such an approach not only alleviates the need to run time-consuming simulations, but also enables data reuse and applies even when the set of target applications changes. In this paper, we develop such a data-driven offline optimization method for designing hardware accelerators, dubbed PRIME, that enjoys all of these properties. Our approach learns a conservative, robust estimate of the desired cost function, utilizes infeasible points, and optimizes the design against this estimate without any additional simulator queries during optimization. PRIME architects accelerators -- tailored towards both single and multiple applications -- improving performance upon state-of-the-art simulation-driven methods by about 1.54x and 1.20x, while considerably reducing the required total simulation time by 93% and 99%, respectively. In addition, PRIME also architects effective accelerators for unseen applications in a zero-shot setting, outperforming simulation-based methods by 1.26x., Comment: First two authors contributed equally; published at ICLR 2022
Published: 2021

20. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Author: Kao, Sheng-Chun, Subramanian, Suvinay, Agrawal, Gaurav, Yazdanbakhsh, Amir, and Krishna, Tushar
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: Attention mechanisms, primarily designed to capture pairwise correlations between words, have become the backbone of machine learning, expanding beyond natural language processing into other domains. This growth in adaptation comes at the cost of prohibitively large memory requirements and computational complexity, especially at higher numbers of input elements. This limitation is due to inherently limited data reuse opportunities and quadratic growth in memory footprints, leading to severe memory-boundedness and limited scalability of input elements. This work addresses these challenges by devising a tailored dataflow optimization, called FLAT, for attention mechanisms without altering their functionality. This dataflow processes costly attention operations through a unique fusion mechanism, transforming the quadratic growth of the memory footprint to merely a linear one. To realize the full potential of this bespoke mechanism, we propose a tiling approach to enhance the data reuse across attention operations. Our method both mitigates the off-chip bandwidth bottleneck and reduces the on-chip memory requirement. FLAT delivers 1.94x (1.76x) speedup and 49% (42%) energy savings compared to the state-of-the-art Edge (Cloud) accelerators with no customized dataflow optimization. When on-chip resources are scarce (20 KB-200 KB), FLAT yields, on average, 1.5x end-to-end latency reduction across a diverse range of conventional attention-based models with input sequence lengths ranging from 512-token to 64K-token. Our evaluations demonstrate that state-of-the-art DNN dataflow applied to attention operations reaches the efficiency limit for inputs above 512 elements. In contrast, FLAT unblocks transformer models for inputs with up to 64K elements.
Published: 2021

21. An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

Author: Seshadri, Kiran, Akin, Berkin, Laudon, James, Narayanaswami, Ravi, and Yazdanbakhsh, Amir
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: Edge TPUs are a domain of accelerators for low-power, edge devices and are widely used in various Google products such as Coral and Pixel devices. In this paper, we first discuss the major microarchitectural details of Edge TPUs. Then, we extensively evaluate three classes of Edge TPUs, covering different computing ecosystems, that are either currently deployed in Google products or are in the product pipeline, across 423K unique convolutional neural networks. Building upon this extensive study, we discuss critical and interpretable microarchitectural insights about the studied classes of Edge TPUs. Mainly, we discuss how Edge TPU accelerators perform across convolutional neural networks with different structures. Finally, we present our ongoing efforts in developing high-accuracy learned machine learning models to estimate the major performance metrics of accelerators such as latency and energy consumption. These learned models enable significantly faster (on the order of milliseconds) evaluations of accelerators as an alternative to time-consuming cycle-accurate simulators and establish an exciting opportunity for rapid hardware/software co-design., Comment: 13 pages, 15 figures, 8 tables, published in IISWC 2022
Published: 2021

22. Rethinking Co-design of Neural Architectures and Hardware Accelerators

Author: Zhou, Yanqi, Dong, Xuanyi, Akin, Berkin, Tan, Mingxing, Peng, Daiyi, Meng, Tianjian, Yazdanbakhsh, Amir, Huang, Da, Narayanaswami, Ravi, and Laudon, James
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: Neural architectures and hardware accelerators have been two driving forces for the progress in deep learning. Previous works typically attempt to optimize hardware given a fixed model architecture or model architecture given fixed hardware. And the dominant hardware architecture explored in this prior work is FPGAs. In our work, we target the optimization of hardware and software configurations on an industry-standard edge accelerator. We systematically study the importance and strategies of co-designing neural architectures and hardware accelerators. We make three observations: 1) the software search space has to be customized to fully leverage the targeted hardware architecture, 2) the search for the model architecture and hardware architecture should be done jointly to achieve the best of both worlds, and 3) different use cases lead to very different search outcomes. Our experiments show that the joint search method consistently outperforms previous platform-aware neural architecture search, manually crafted models, and the state-of-the-art EfficientNet on all latency targets by around 1% on ImageNet top-1 accuracy. Our method can reduce energy consumption of an edge accelerator by up to 2x under the same accuracy constraint, when co-adapting the model architecture and hardware accelerator configurations.
Published: 2021

23. Apollo: Transferable Architecture Exploration

Author: Yazdanbakhsh, Amir, Angermueller, Christof, Akin, Berkin, Zhou, Yanqi, Jones, Albin, Hashemi, Milad, Swersky, Kevin, Chatterjee, Satrajit, Narayanaswami, Ravi, and Laudon, James
Subjects: Computer Science - Machine Learning, Computer Science - Hardware Architecture
Abstract: The looming end of Moore's Law and ascending use of deep learning drive the design of custom accelerators that are optimized for specific neural architectures. Architecture exploration for such accelerators forms a challenging constrained optimization problem over a complex, high-dimensional, and structured input space with a costly-to-evaluate objective function. Existing approaches for accelerator design are sample-inefficient and do not transfer knowledge between related optimization tasks with different design constraints, such as area and/or latency budget, or neural architecture configurations. In this work, we propose a transferable architecture exploration framework, dubbed Apollo, that leverages recent advances in black-box function optimization for sample-efficient accelerator design. We use this framework to optimize accelerator configurations of a diverse set of neural architectures with alternative design constraints. We show that our framework finds high-reward design configurations (up to 24.6% speedup) more sample-efficiently than a baseline black-box optimization approach. We further show that by transferring knowledge between target architectures with different design constraints, Apollo is able to find optimal configurations faster and often with better objective value (up to 25% improvements). This encouraging outcome portrays a promising path forward to facilitate generating higher quality accelerators., Comment: 10 pages, 5 figures, Accepted to Workshop on ML for Systems at the 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
Published: 2021

24. Chameleon: Adaptive Code Optimization for Expedited Deep Neural Network Compilation

Author: Ahn, Byung Hoon, Pilligundla, Prannoy, Yazdanbakhsh, Amir, and Esmaeilzadeh, Hadi
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Achieving faster execution with shorter compilation time can foster further diversity and innovation in neural networks. However, the current paradigm of executing neural networks relies on either hand-optimized libraries, traditional compilation heuristics, or very recently genetic algorithms and other stochastic methods. These methods suffer from frequent costly hardware measurements, rendering them not only too time-consuming but also suboptimal. As such, we devise a solution that can learn to quickly adapt to a previously unseen design space for code optimization, both accelerating the search and improving the output performance. This solution, dubbed Chameleon, leverages reinforcement learning whose solution takes fewer steps to converge, and develops an adaptive sampling algorithm that not only focuses the costly samples (real hardware measurements) on representative points but also uses a domain-knowledge inspired logic to improve the samples themselves. Experimentation with real hardware shows that Chameleon provides 4.45x speedup in optimization time over AutoTVM, while also improving inference time of the modern deep networks by 5.6%., Comment: Published as a conference paper at ICLR 2020. arXiv admin note: text overlap with arXiv:1905.12799
Published: 2020

25. Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic

Author: Ghodrati, Soroush, Sharma, Hardik, Kinzer, Sean, Yazdanbakhsh, Amir, Samadi, Kambiz, Kim, Nam Sung, Burger, Doug, and Esmaeilzadeh, Hadi
Subjects: Computer Science - Hardware Architecture
Abstract: The low-power potential of mixed-signal design makes it an alluring option to accelerate Deep Neural Networks (DNNs). However, mixed-signal circuitry suffers from limited range for information encoding, susceptibility to noise, and Analog to Digital (A/D) conversion overheads. This paper aims to address these challenges by offering and leveraging the insight that a vector dot-product (the basic operation in DNNs) can be bit-partitioned into groups of spatially parallel low-bitwidth operations, and interleaved across multiple elements of the vectors. As such, the building blocks of our accelerator become a group of wide, yet low-bitwidth multiply-accumulate units that operate in the analog domain and share a single A/D converter. The low-bitwidth operation tackles the encoding range limitation and facilitates noise mitigation. Moreover, we utilize the switched-capacitor design for our bit-level reformulation of DNN operations. The proposed switched-capacitor circuitry performs the group multiplications in the charge domain and accumulates the results of the group in its capacitors over multiple cycles. The capacitive accumulation combined with wide bit-partitioned operations alleviates the need for A/D conversion per operation. With such mathematical reformulation and its switched-capacitor implementation, we define a 3D-stacked microarchitecture, dubbed BIHIWE.
Published: 2019

26. ReLeQ: A Reinforcement Learning Approach for Deep Quantization of Neural Networks

Author: Elthakeb, Ahmed T., Pilligundla, Prannoy, Mireshghallah, FatemehSadat, Yazdanbakhsh, Amir, and Esmaeilzadeh, Hadi
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: Deep Neural Networks (DNNs) typically require a massive amount of computation resources in inference tasks for computer vision applications. Quantization can significantly reduce DNN computation and storage by decreasing the bitwidth of network encodings. Recent research affirms that carefully selecting the quantization levels for each layer can preserve the accuracy while pushing the bitwidth below eight bits. However, without arduous manual effort, this deep quantization can lead to significant accuracy loss, leaving it in a position of questionable utility. As such, deep quantization opens a large hyper-parameter space (bitwidth of the layers), the exploration of which is a major challenge. We propose a systematic approach to tackle this problem, by automating the process of discovering the quantization levels through an end-to-end deep reinforcement learning framework (ReLeQ). We adapt policy optimization methods to the problem of quantization, and focus on finding the best design decisions in choosing the state and action spaces, network architecture and training framework, as well as the tuning of various hyperparameters. We show how ReLeQ can balance speed and quality, and provide an asymmetric general solution for quantization of a large variety of deep networks (AlexNet, CIFAR-10, LeNet, MobileNet-V1, ResNet-20, SVHN, and VGG-11) that virtually preserves the accuracy (<= 0.3% loss) while minimizing the computation and storage cost. With these DNNs, ReLeQ enables conventional hardware to achieve 2.2x speedup over 8-bit execution. Similarly, a custom DNN accelerator achieves 2.0x speedup and energy reduction compared to 8-bit runs. These encouraging results mark ReLeQ as the initial step towards automating the deep quantization of neural networks., Comment: Presented as a spotlight paper at NeurIPS Workshop on ML for Systems 2018
Published: 2018

27. TAO: Re-Thinking DL-based Microarchitecture Simulation

Author: Pandey, Santosh, primary, Yazdanbakhsh, Amir, additional, and Liu, Hang, additional
Published: 2024
Full Text: View/download PDF

28. USM-Lite: Quantization and Sparsity Aware Fine-Tuning for Speech Recognition with Universal Speech Models

Author: Ding, Shaojin, primary, Qiu, David, additional, Rim, David, additional, He, Yanzhang, additional, Rybakov, Oleg, additional, Li, Bo, additional, Prabhavalkar, Rohit, additional, Wang, Weiran, additional, Sainath, Tara N., additional, Han, Zhonglin, additional, Li, Jian, additional, Yazdanbakhsh, Amir, additional, and Agrawal, Shivani, additional
Published: 2024
Full Text: View/download PDF

29. GANAX: A Unified MIMD-SIMD Acceleration for Generative Adversarial Networks

Author: Yazdanbakhsh, Amir, Falahati, Hajar, Wolfe, Philip J., Samadi, Kambiz, Kim, Nam Sung, and Esmaeilzadeh, Hadi
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Hardware Architecture, Computer Science - Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Generative Adversarial Networks (GANs) are one of the most recent deep learning models that generate synthetic data from limited genuine datasets. GANs are on the frontier as further extension of deep learning into many domains (e.g., medicine, robotics, content synthesis) requires massive sets of labeled data that is generally either unavailable or prohibitively costly to collect. Although GANs are gaining prominence in various fields, there are no accelerators for these new models. In fact, GANs leverage a new operator, called transposed convolution, that exposes unique challenges for hardware acceleration. This operator first inserts zeros within the multidimensional input, then convolves a kernel over this expanded array to add information to the embedded zeros. Even though there is a convolution stage in this operator, the inserted zeros lead to underutilization of the compute resources when a conventional convolution accelerator is employed. We propose the GANAX architecture to alleviate the sources of inefficiency associated with the acceleration of GANs using conventional convolution accelerators, making the first GAN accelerator design possible. We propose a reorganization of the output computations to allocate compute rows with similar patterns of zeros to adjacent processing engines, which also avoids inconsequential multiply-adds on the zeros. This compulsory adjacency reclaims data reuse across these neighboring processing engines, which had otherwise diminished due to the inserted zeros. The reordering breaks the full SIMD execution model, which is prominent in convolution accelerators. Therefore, we propose a unified MIMD-SIMD design for GANAX that leverages repeated patterns in the computation to create distinct microprograms that execute concurrently in SIMD mode., Comment: Proceedings of the 45th International Symposium on Computer Architecture (ISCA), 2018
Published: 2018

30. Exploiting Intel® Advanced Matrix Extensions (AMX) for Large Language Model Inference

Author: Kim, Hyungyo, primary, Ye, Gaohan, additional, Wang, Nachuan, additional, Yazdanbakhsh, Amir, additional, and Kim, Nam Sung, additional
Published: 2024
Full Text: View/download PDF

31. An assessment of the role of nanosilica in thermal/thermooxidative degradation mechanism of poly(lactic acid)/polybutylene adipate terephthalate blend nanocomposites.

Author: Khonakdar, Hanieh, Khasraghi, Samaneh Salkhi, Yazdanbakhsh, Amir Hossein, Mousavi, Seyed Rasoul, Ahmadi, Shervin, Arabi, Hasan, Nobre, Marcos A. L., and Khonakdar, Hossein Ali
Subjects: LACTIC acid, COMPATIBILIZERS, TRANSMISSION electron microscopes, POLYMER blends, PACKAGING materials, POLYBUTYLENE terephthalate, ACTIVATION energy
Abstract: As packaging material candidates, polylactic acid/polybutylene adipate terephthalate (PLA/PBAT) blends (90/10 and 75/25 wt/wt) containing 1, 3, and 5 phr hydrophilic (HPL) and hydrophobic (HPB) nanosilica (NS) particles were prepared in the presence of a multifunctional epoxide compatibilizer by melt mixing in a twin-screw extruder. Scanning electron microscopic studies confirmed a matrix-droplet morphology with finer dispersed domains at higher NS content. Energy dispersive spectroscopy mapping also indicated uniform dispersion of NS in blends, with some agglomerates at higher content of nanoparticles. Moreover, transmission electron microscopy was applied to study the impact of nanofillers' localization on the systems' morphology. It was observed that NS particles localized at the PLA-PBAT interface. In addition, thermogravimetric analysis (TGA) was used to investigate thermal stability and thermal degradation kinetics. Data indicate that hydrophobic NS improved thermal stability. The activation energy of degradation was calculated using several techniques of data modeling, including Friedman, Flynn-Ozawa-Wall, and Kissinger-Akahira-Sunose models. Among blend nanocomposites, the 75/25 blend containing 5 phr HPB NS had the maximum degradation activation energy, suggesting that this sample had the most resistance to heat degradation. The intensity of the TGA/FTIR peaks of the evolved products was found to be correlated with the activation energy. Criado's technique was also used to investigate the changes in the thermal degradation mechanism. [ABSTRACT FROM AUTHOR]
Published: 2024
Full Text: View/download PDF

32. Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction

Author: Yazdanbakhsh, Amir, Pekhimenko, Gennady, Esmaeilzadeh, Hadi, Mutlu, Onur, Mowry, Todd C., Reda, Sherief, editor, and Shafique, Muhammad, editor
Published: 2019
Full Text: View/download PDF

33. Architecture 2.0: Challenges and Opportunities

Author: Reddi, Vijay Janapa, primary and Yazdanbakhsh, Amir, additional
Published: 2023
Full Text: View/download PDF

34. ArchGym: An Open-Source Gymnasium for Machine Learning Assisted Architecture Design

Author: Krishnan, Srivatsan, primary, Yazdanbakhsh, Amir, additional, Prakash, Shvetank, additional, Jabbour, Jason, additional, Uchendu, Ikechukwu, additional, Ghosh, Susobhan, additional, Boroujerdian, Behzad, additional, Richins, Daniel, additional, Tripathy, Devashree, additional, Faust, Aleksandra, additional, and Janapa Reddi, Vijay, additional
Published: 2023
Full Text: View/download PDF

35. MESA: Microarchitecture Extensions for Spatial Architecture Generation

Author: Wang, Dong Kai, primary, Lou, Jiaqi, additional, Jin, Naiyin, additional, Mascarenhas, Edwin, additional, Mahapatra, Rohan, additional, Kinzer, Sean, additional, Ghodrati, Soroush, additional, Yazdanbakhsh, Amir, additional, Esmaeilzadeh, Hadi, additional, and Kim, Nam Sung, additional
Published: 2023
Full Text: View/download PDF

36. Towards Breaking the Memory Bandwidth Wall Using Approximate Value Prediction

Author: Yazdanbakhsh, Amir, primary, Pekhimenko, Gennady, additional, Esmaeilzadeh, Hadi, additional, Mutlu, Onur, additional, and Mowry, Todd C., additional
Published: 2018
Full Text: View/download PDF

37. FLAT: An Optimized Dataflow for Mitigating Attention Bottlenecks

Author: Kao, Sheng-Chun, primary, Subramanian, Suvinay, additional, Agrawal, Gaurav, additional, Yazdanbakhsh, Amir, additional, and Krishna, Tushar, additional
Published: 2023
Full Text: View/download PDF

38. What Makes Chain-of-Thought Prompting Effective? A Counterfactual Study

Author: Madaan, Aman, primary, Hermann, Katherine, additional, and Yazdanbakhsh, Amir, additional
Published: 2023
Full Text: View/download PDF

39. Domain-Specific Computational Storage for Serverless Computing

Author: Mahapatra, Rohan, Ghodrati, Soroush, Ahn, Byung Hoon, Kinzer, Sean, Wang, Shu-ting, Xu, Hanyang, Karthikeyan, Lavanya, Sharma, Hardik, Yazdanbakhsh, Amir, Alian, Mohammad, and Esmaeilzadeh, Hadi
Subjects: FOS: Computer and information sciences, Hardware Architecture (cs.AR), Computer Science - Hardware Architecture
Abstract: While (1) serverless computing is emerging as a popular form of cloud execution, datacenters are going through major changes: (2) storage disaggregation in the system infrastructure level and (3) integration of domain-specific accelerators in the hardware level. Each of these three trends individually provides significant benefits; however, when combined the benefits diminish. Specifically, the paper makes the key observation that for serverless functions, the overhead of accessing disaggregated persistent storage overshadows the gains from accelerators. Therefore, to benefit from all these trends in conjunction, we propose Domain-Specific Computational Storage for Serverless (DSCS-Serverless). This idea contributes a serverless model that leverages a programmable accelerator within computational storage to conjugate the benefits of acceleration and storage disaggregation simultaneously. Our results with eight applications show that integrating a comparatively small accelerator within the storage (DSCS-Serverless) that fits within its power constraints (15 Watts) significantly outperforms a traditional disaggregated system that utilizes the NVIDIA RTX 2080 Ti GPU (250 Watts). Further, the work highlights that disaggregation, serverless model, and the limited power budget for computation in storage require a different design than the conventional practices of integrating microprocessors and FPGAs. This insight is in contrast with current practices of designing computational storage that are yet to address the challenges associated with the shifts in datacenters. In comparison with two such conventional designs that either use quad-core ARM A57 or a Xilinx FPGA, DSCS-Serverless provides 3.7x and 1.7x end-to-end application speedup, 4.3x and 1.9x energy reduction, and 3.2x and 2.3x higher cost efficiency, respectively.
Published: 2023

40. An Evaluation of Edge TPU Accelerators for Convolutional Neural Networks

Author: Seshadri, Kiran, primary, Akin, Berkin, additional, Laudon, James, additional, Narayanaswami, Ravi, additional, and Yazdanbakhsh, Amir, additional
Published: 2022
Full Text: View/download PDF

41. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation

Author: Sykora, Ondrej, primary, Phothilimthana, Phitchaya Mangpo, additional, Mendis, Charith, additional, and Yazdanbakhsh, Amir, additional
Published: 2022
Full Text: View/download PDF

42. Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation

Author: Yazdanbakhsh, Amir, primary, Moradifirouzabadi, Ashkan, additional, Li, Zheng, additional, and Kang, Mingu, additional
Published: 2022
Full Text: View/download PDF

43. Implementation-aware selection of the custom instruction set for extensible processors

Author: Yazdanbakhsh, Amir, Kamal, Mehdi, Fakhraie, Sied Mehdi, Afzali-Kusha, Ali, Safari, Saeed, and Pedram, Massoud
Published: 2014
Full Text: View/download PDF

44. IO-Aware Custom Instruction Exploration for Customizing Embedded Processors

Author: Yazdanbakhsh, Amir, Salehi, Mostafa E., Park, James J., editor, Yang, Laurence T., editor, and Lee, Changhoon, editor
Published: 2011
Full Text: View/download PDF

45. Accelerating attention through gradient-based learned runtime pruning

Author: Li, Zheng, primary, Ghodrati, Soroush, additional, Yazdanbakhsh, Amir, additional, Esmaeilzadeh, Hadi, additional, and Kang, Mingu, additional
Published: 2022
Full Text: View/download PDF

46. Instruction set architectural guidelines for embedded packet-processing engines

Author: Salehi, Mostafa E., Fakhraie, Sied Mehdi, and Yazdanbakhsh, Amir
Published: 2012
Full Text: View/download PDF

47. Customized pipeline and instruction set architecture for embedded processing engines

Author: Yazdanbakhsh, Amir, Salehi, Mostafa E., and Fakhraie, Sied Mehdi
Published: 2014
Full Text: View/download PDF

48. A new merit function for custom instruction selection under an area budget constraint

Author: Kamal, Mehdi, Yazdanbakhsh, Amir, Noori, Hamid, Afzali-Kusha, Ali, and Pedram, Massoud
Published: 2013
Full Text: View/download PDF

49. Mixed-Signal Charge-Domain Acceleration of Deep Neural Networks through Interleaved Bit-Partitioned Arithmetic

Author: Ghodrati, Soroush, primary, Sharma, Hardik, additional, Kinzer, Sean, additional, Yazdanbakhsh, Amir, additional, Park, Jongse, additional, Kim, Nam Sung, additional, Burger, Doug, additional, and Esmaeilzadeh, Hadi, additional
Published: 2020
Full Text: View/download PDF

50. ReLeQ: A Reinforcement Learning Approach for Automatic Deep Quantization of Neural Networks

Author: Elthakeb, Ahmed T., primary, Pilligundla, Prannoy, additional, Mireshghallah, Fatemehsadat, additional, Yazdanbakhsh, Amir, additional, and Esmaeilzadeh, Hadi, additional
Published: 2020
Full Text: View/download PDF
