Author: "Bhatele P" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Bhatele P"' showing total 78 results

Start Over Author "Bhatele P"

78 results on '"Bhatele P"'

1. Democratizing AI: Open-source Scalable LLM Training on GPU-based Supercomputers

Author: Singh, Siddharth, Singhania, Prajwal, Ranjan, Aditya, Kirchenbauer, John, Geiping, Jonas, Wen, Yuxin, Jain, Neel, Hans, Abhimanyu, Shu, Manli, Tomar, Aditya, Goldstein, Tom, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Training and fine-tuning large language models (LLMs) with hundreds of billions to trillions of parameters requires tens of thousands of GPUs, and a highly scalable software stack. In this work, we present a novel four-dimensional hybrid parallel algorithm implemented in a highly scalable, portable, open-source framework called AxoNN. We describe several performance optimizations in AxoNN to improve matrix multiply kernel performance, overlap non-blocking collectives with computation, and performance modeling to choose performance optimal configurations. These have resulted in unprecedented scaling and peak flop/s (bf16) for training of GPT-style transformer models on Perlmutter (620.1 Petaflop/s), Frontier (1.381 Exaflop/s) and Alps (1.423 Exaflop/s). While the abilities of LLMs improve with the number of trainable parameters, so do privacy and copyright risks caused by memorization of training data, which can cause disclosure of sensitive or private information at inference time. We highlight this side effect of scale through experiments that explore "catastrophic memorization", where models are sufficiently large to memorize training data in a single pass, and present an approach to prevent it. As part of this study, we demonstrate fine-tuning of a 405-billion parameter LLM using AxoNN on Frontier.
Published: 2025

2. Gemstones: A Model Suite for Multi-Faceted Scaling Laws

Author: McLeish, Sean, Kirchenbauer, John, Miller, David Yu, Singh, Siddharth, Bhatele, Abhinav, Goldblum, Micah, Panda, Ashwinee, and Goldstein, Tom
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Scaling laws are typically fit using a family of models with a narrow range of frozen hyper-parameter choices. In this work we study scaling laws using a wide range of architecture and hyper-parameter choices, and highlight their impact on resulting prescriptions. As a primary artifact of our research, we release the Gemstones: the most comprehensive open-source scaling law dataset to date, consisting of over 4000 checkpoints from transformers with up to 2 billion parameters; these models have been trained with different learning rates, cooldown schedules, and architectural shapes. Our checkpoints enable more complex studies of scaling, such as a law that predicts language modeling performance as a function of model width and depth. By examining the various facets of our model suite, we find that the prescriptions of scaling laws can be highly sensitive to the experimental design process and the specific model checkpoints used during fitting. Code: https://github.com/mcleish7/gemstone-scaling-laws
Published: 2025

3. Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach

Author: Geiping, Jonas, McLeish, Sean, Jain, Neel, Kirchenbauer, John, Singh, Siddharth, Bartoldson, Brian R., Kailkhura, Bhavya, Bhatele, Abhinav, and Goldstein, Tom
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: We study a novel language model architecture that is capable of scaling test-time computation by implicitly reasoning in latent space. Our model works by iterating a recurrent block, thereby unrolling to arbitrary depth at test-time. This stands in contrast to mainstream reasoning models that scale up compute by producing more tokens. Unlike approaches based on chain-of-thought, our approach does not require any specialized training data, can work with small context windows, and can capture types of reasoning that are not easily represented in words. We scale a proof-of-concept model to 3.5 billion parameters and 800 billion tokens. We show that the resulting model can improve its performance on reasoning benchmarks, sometimes dramatically, up to a computation load equivalent to 50 billion parameters., Comment: The model is available at https://huggingface.co/tomg-group-umd/huginn-0125. Code and data recipe can be found at https://github.com/seal-rg/recurrent-pretraining
Published: 2025

4. HPC-Coder-V2: Studying Code LLMs Across Low-Resource Parallel Languages

Author: Chaturvedi, Aman, Nichols, Daniel, Singh, Siddharth, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Software Engineering
Abstract: Large Language Model (LLM) based coding tools have been tremendously successful as software development assistants, yet they are often designed for general purpose programming tasks and perform poorly for more specialized domains such as high performance computing. Creating specialized models and tools for these domains is crucial towards gaining the benefits of LLMs in areas such as HPC. While previous work has explored HPC-specific models, LLMs still struggle to generate parallel code and it is not at all clear what hurdles are still holding back these LLMs and what must be done to overcome them. In this work, we conduct an in-depth study along the many axes of fine-tuning a specialized HPC LLM in order to better understand the challenges. Based on our findings we fine-tune and evaluate a specialized HPC LLM that is shown to be the best performing open-source code LLM for parallel code generation to date.
Published: 2024

5. From Pixels to Prose: A Large Dataset of Dense Image Captions

Author: Singla, Vasu, Yue, Kaiyu, Paul, Sukriti, Shirkavand, Reza, Jayawardhana, Mayuka, Ganjdanesh, Alireza, Huang, Heng, Bhatele, Abhinav, Somepalli, Gowthami, and Goldstein, Tom
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Training large vision-language models requires extensive, high-quality image-text pairs. Existing web-scraped datasets, however, are noisy and lack detailed image descriptions. To bridge this gap, we introduce PixelProse, a comprehensive dataset of over 16M (million) synthetically generated captions, leveraging cutting-edge vision-language models for detailed and accurate descriptions. To ensure data integrity, we rigorously analyze our dataset for problematic content, including child sexual abuse material (CSAM), personally identifiable information (PII), and toxicity. We also provide valuable metadata such as watermark presence and aesthetic scores, aiding in further dataset filtering. We hope PixelProse will be a valuable resource for future vision-language research. PixelProse is available at https://huggingface.co/datasets/tomg-group-umd/pixelprose, Comment: pixelprose 16M dataset
Published: 2024

6. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs

Author: Hans, Abhimanyu, Wen, Yuxin, Jain, Neel, Kirchenbauer, John, Kazemi, Hamid, Singhania, Prajwal, Singh, Siddharth, Somepalli, Gowthami, Geiping, Jonas, Bhatele, Abhinav, and Goldstein, Tom
Subjects: Computer Science - Computation and Language
Abstract: Large language models can memorize and repeat their training data, causing privacy and copyright risks. To mitigate memorization, we introduce a subtle modification to the next-token training objective that we call the goldfish loss. During training, randomly sampled subsets of tokens are excluded from the loss computation. These dropped tokens are not memorized by the model, which prevents verbatim reproduction of a complete chain of tokens from the training set. We run extensive experiments training billion-scale Llama-2 models, both pre-trained and trained from scratch, and demonstrate significant reductions in extractable memorization with little to no impact on downstream benchmarks., Comment: 10 pages, 8 figures, and 1 table in the main body. Code available at https://github.com/ahans30/goldfish-loss and checkpoints at https://huggingface.co/collections/tomg-group-umd/goldfish-loss-mitigating-memorization-in-llms-66c175becb6aab07744f7272
Published: 2024

7. Loki: Low-rank Keys for Efficient Sparse Attention

Author: Singhania, Prajwal, Singh, Siddharth, He, Shwai, Feizi, Soheil, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning
Abstract: Inference on large language models (LLMs) can be expensive in terms of the compute and memory costs involved, especially when long sequence lengths are used. In particular, the self-attention mechanism used in LLM inference contributes significantly to these costs, which has sparked an interest in approximating the self-attention computation to reduce such costs. In this work, we propose to approximate self-attention by focusing on the dimensionality of key vectors computed in the attention block. Our analysis reveals that key vectors lie in a significantly lower-dimensional space, consistently across several datasets and models. Exploiting this observation, we propose Loki, a novel sparse attention method that ranks and selects tokens in the KV-cache based on attention scores computed in low-dimensional space. Our evaluations show that Loki is able to speed up the attention computation due to reduced data movement (load/store) and compute costs while maintaining the efficacy of the models better than other popular approximation methods., Comment: Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems (Main Conference Track)
Published: 2024

8. Transformers Can Do Arithmetic with the Right Embeddings

Author: McLeish, Sean, Bansal, Arpit, Stein, Alex, Jain, Neel, Kirchenbauer, John, Bartoldson, Brian R., Kailkhura, Bhavya, Bhatele, Abhinav, Geiping, Jonas, Schwarzschild, Avi, and Goldstein, Tom
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: The poor performance of transformers on arithmetic tasks seems to stem in large part from their inability to keep track of the exact position of each digit inside of a large span of digits. We mend this problem by adding an embedding to each digit that encodes its position relative to the start of the number. In addition to the boost these embeddings provide on their own, we show that this fix enables architectural modifications such as input injection and recurrent layers to improve performance even further. With positions resolved, we can study the logical extrapolation ability of transformers. Can they solve arithmetic problems that are larger and more complex than those in their training data? We find that training on only 20 digit numbers with a single GPU for one day, we can reach state-of-the-art performance, achieving up to 99% accuracy on 100 digit addition problems. Finally, we show that these gains in numeracy also unlock improvements on other multi-step reasoning tasks including sorting and multiplication.
Published: 2024

9. Performance-Aligned LLMs for Generating Fast Code

Author: Nichols, Daniel, Polasam, Pranav, Menon, Harshitha, Marathe, Aniruddha, Gamblin, Todd, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence, Computer Science - Software Engineering
Abstract: Optimizing scientific software is a difficult task because codebases are often large and complex, and performance can depend upon several factors including the algorithm, its implementation, and hardware among others. Causes of poor performance can originate from disparate sources and be difficult to diagnose. Recent years have seen a multitude of work that use large language models (LLMs) to assist in software development tasks. However, these tools are trained to model the distribution of code as text, and are not specifically designed to understand performance aspects of code. In this work, we introduce a reinforcement learning based methodology to align the outputs of code LLMs with performance. This allows us to build upon the current code modeling capabilities of LLMs and extend them to generate better performing code. We demonstrate that our fine-tuned model improves the expected speedup of generated code over base models for a set of benchmark tasks from 0.9 to 1.6 for serial code and 1.9 to 4.5 for OpenMP code.
Published: 2024

10. Efficacy of intrauterine autologous blood cell derivatives in enhancing endometrial thickness and IVF outcomes for women with recurrent implantation failure: a retrospective cohort study

Author: Tiwari, Shivangi, Poojari, Vidyashree G., Mundkur, Anjali, Adiga, Prashanth, Kumar, Pratap, Bhatele, Prashant, and Palanivel, Vasanthi
Published: 2024
Full Text: View/download PDF

11. Taking GPU Programming Models to Task for Performance Portability

Author: Davis, Joshua H., Sivaraman, Pranav, Kitson, Joy, Parasyris, Konstantinos, Menon, Harshitha, Minn, Isaac, Georgakoudis, Giorgis, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Portability is critical to ensuring high productivity in developing and maintaining scientific software as the diversity in on-node hardware architectures increases. While several programming models provide portability for diverse GPU platforms, they don't make any guarantees about performance portability. In this work, we explore several programming models -- CUDA, HIP, Kokkos, RAJA, OpenMP, OpenACC, and SYCL, to study if the performance of these models is consistently good across NVIDIA and AMD GPUs. We use five proxy applications from different scientific domains, create implementations where missing, and use them to present a comprehensive comparative evaluation of the programming models. We provide a Spack scripting-based methodology to ensure reproducibility of experiments conducted in this work. Finally, we attempt to answer the question -- to what extent does each programming model provide performance portability for heterogeneous systems in real-world usage?, Comment: 12 pages, 4 figures
Published: 2024

12. Automated Programmatic Performance Analysis of Parallel Programs

Author: Cankur, Onur, Tomar, Aditya, Nichols, Daniel, Scully-Allison, Connor, Isaacs, Katherine E., and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Developing efficient parallel applications is critical to advancing scientific development but requires significant performance analysis and optimization. Performance analysis tools help developers manage the increasing complexity and scale of performance data, but often rely on the user to manually explore low-level data and are rigid in how the data can be manipulated. We propose a Python-based API, Chopper, which provides high-level and flexible performance analysis for both single and multiple executions of parallel applications. Chopper facilitates performance analysis and reduces developer effort by providing configurable high-level methods for common performance analysis tasks such as calculating load imbalance, hot paths, scalability bottlenecks, correlation between metrics and CCT nodes, and causes of performance variability within a robust and mature Python environment that provides fluid access to lower-level data manipulations. We demonstrate how Chopper allows developers to quickly and succinctly explore performance and identify issues across applications such as AMG, Laghos, LULESH, Quicksilver and Tortuga.
Published: 2024

13. Can Large Language Models Write Parallel Code?

Author: Nichols, Daniel, Davis, Joshua H., Xie, Zhaojun, Rajaram, Arjun, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Large language models are increasingly becoming a popular tool for software development. Their ability to model and generate source code has been demonstrated in a variety of contexts, including code completion, summarization, translation, and lookup. However, they often struggle to generate code for complex programs. In this paper, we study the capabilities of state-of-the-art language models to generate parallel code. In order to evaluate language models, we create a benchmark, ParEval, consisting of prompts that represent 420 different coding tasks related to scientific and parallel computing. We use ParEval to evaluate the effectiveness of several state-of-the-art open- and closed-source language models on these tasks. We introduce novel metrics for evaluating the performance of generated code, and use them to explore how well each large language model performs for 12 different computational problem types and six different parallel programming models.
Published: 2024
Full Text: View/download PDF

14. Pandemics In Silico: Scaling an Agent-Based Simulation on Realistic Social Contact Networks

Author: Kitson, Joy, Costello, Ian, Chen, Jiangzhuo, Jiménez, Diego, Hoops, Stefan, Mortveit, Henning, Meneses, Esteban, Yeom, Jae-Seung, Marathe, Madhav V., and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Preventing the spread of infectious diseases requires implementing interventions at various levels of government and evaluating the potential impact and efficacy of those preemptive measures. Agent-based modeling can be used for detailed studies of epidemic diffusion and possible interventions. Modeling of epidemic diffusion in large social contact networks requires the use of parallel algorithms and resources. In this work, we present Loimos, a scalable parallel framework for simulating epidemic diffusion. Loimos uses a hybrid of time-stepping and discrete-event simulation to model disease spread, and is implemented on top of an asynchronous, many-task runtime. We demonstrate that Loimos is to able to achieve significant speedups while scaling to large core counts. In particular, Loimos is able to simulate 200 days of a COVID-19 outbreak on a digital twin of California in about 42 seconds, for an average of 4.6 billion traversed edges per second (TEPS), using 4096 cores on Perlmutter at NERSC., Comment: 12 pages (including references), 9 figures
Published: 2024

15. ML-based Modeling to Predict I/O Performance on Different Storage Sub-systems

Author: Xu, Yiheng, Sivaraman, Pranav, Devarajan, Hariharan, Mohror, Kathryn, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Parallel applications can spend a significant amount of time performing I/O on large-scale supercomputers. Fast near-compute storage accelerators called burst buffers can reduce the time a processor spends performing I/O and mitigate I/O bottlenecks. However, determining if a given application could be accelerated using burst buffers is not straightforward even for storage experts. The relationship between an application's I/O characteristics (such as I/O volume, processes involved, etc.) and the best storage sub-system for it can be complicated. As a result, adapting parallel applications to use burst buffers efficiently is a trial-and-error process. In this work, we present a Python-based tool called PrismIO that enables programmatic analysis of I/O traces. Using PrismIO, we identify bottlenecks on burst buffers and parallel file systems and explain why certain I/O patterns perform poorly. Further, we use machine learning to model the relationship between I/O characteristics and burst buffer selections. We run IOR (an I/O benchmark) with various I/O characteristics on different storage systems and collect performance data. We use the data as the input for training the model. Our model can predict if a file of an application should be placed on BBs for unseen IOR scenarios with an accuracy of 94.47% and for four real applications with an accuracy of 95.86%.
Published: 2023

16. COVID-19 Detection: A Systematic Review of Machine and Deep Learning-Based Approaches Utilizing Chest X-Rays and CT Scans

Author: Bhatele, Kirti Raj, Jha, Anand, Tiwari, Devanshu, Bhatele, Mukta, Sharma, Sneha, Mithora, Muktasha R., and Singhal, Stuti
Published: 2024
Full Text: View/download PDF

17. Action dystonia in Chorea-Acanthocytosis

Author: Bhatele, Prashant and Pai, Aparna R
Published: 2024
Full Text: View/download PDF

18. Jorge: Approximate Preconditioning for GPU-efficient Second-order Optimization

Author: Singh, Siddharth, Sating, Zachary, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: Despite their better convergence properties compared to first-order optimizers, second-order optimizers for deep learning have been less popular due to their significant computational costs. The primary efficiency bottleneck in such optimizers is matrix inverse calculations in the preconditioning step, which are expensive to compute on GPUs. In this paper, we introduce Jorge, a second-order optimizer that promises the best of both worlds -- rapid convergence benefits of second-order methods, and high computational efficiency typical of first-order methods. We address the primary computational bottleneck of computing matrix inverses by completely eliminating them using an approximation of the preconditioner computation. This makes Jorge extremely efficient on GPUs in terms of wall-clock time. Further, we describe an approach to determine Jorge's hyperparameters directly from a well-tuned SGD baseline, thereby significantly minimizing tuning efforts. Our empirical evaluations demonstrate the distinct advantages of using Jorge, outperforming state-of-the-art optimizers such as SGD, AdamW, and Shampoo across multiple deep learning models, both in terms of sample efficiency and wall-clock time.
Published: 2023

19. HPC-Coder: Modeling Parallel Programs using Large Language Models

Author: Nichols, Daniel, Marathe, Aniruddha, Menon, Harshitha, Gamblin, Todd, and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Artificial Intelligence
Abstract: Parallel programs in high performance computing (HPC) continue to grow in complexity and scale in the exascale era. The diversity in hardware and parallel programming models make developing, optimizing, and maintaining parallel software even more burdensome for developers. One way to alleviate some of these burdens is with automated development and analysis tools. Such tools can perform complex and/or remedial tasks for developers that increase their productivity and decrease the chance for error. Until recently, such tools for code development and performance analysis have been limited in the complexity of tasks they can perform, especially for parallel programs. However, with recent advancements in language modeling, and the availability of large amounts of open-source code related data, these tools have started to utilize predictive language models to automate more complex tasks. In this paper, we show how large language models (LLMs) can be applied to tasks specific to high performance and scientific codes. We introduce a new dataset of HPC and scientific codes and use it to fine-tune several pre-trained models. We compare several pre-trained LLMs on HPC-related tasks and introduce a new model, HPC-Coder, fine-tuned on parallel codes. In our experiments, we show that this model can auto-complete HPC functions where generic models cannot, decorate for loops with OpenMP pragmas, and model performance changes in scientific application repositories as well as programming competition solutions.
Published: 2023
Full Text: View/download PDF

20. Pipit: Scripting the analysis of parallel execution traces

Author: Bhatele, Abhinav, Dhakal, Rakrish, Movsesyan, Alexander, Ranjan, Aditya K., and Cankur, Onur
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Performance analysis is a critical step in the oft-repeated, iterative process of performance tuning of parallel programs. Per-process, per-thread traces (detailed logs of events with timestamps) enable in-depth analysis of parallel program execution to identify different kinds of performance issues. Often times, trace collection tools provide a graphical tool to analyze the trace output. However, these GUI-based tools only support specific file formats, are challenging to scale to large trace sizes, limit data exploration to the implemented graphical views, and do not support automated comparisons of two or more datasets. In this paper, we present a programmatic approach to analyzing parallel execution traces by leveraging pandas, a powerful Python-based data analysis library. We have developed a Python library, Pipit, on top of pandas that can read traces in different file formats (OTF2, HPCToolkit, Projections, Nsight Systems, etc.) and provides a uniform data structure in the form of a pandas DataFrame. Pipit provides operations to aggregate, filter, and transform the events in a trace to present the data in different ways. We also provide several functions to quickly and easily identify performance issues in parallel executions. More importantly, the API is easily extensible to support custom analyses by different end users.
Published: 2023

21. A 4D Hybrid Algorithm to Scale Parallel Training to Thousands of GPUs

Author: Singh, Siddharth, Singhania, Prajwal, Ranjan, Aditya K., Sating, Zack, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Heavy communication, in particular, collective operations, can become a critical performance bottleneck in scaling the training of billion-parameter neural networks to large-scale parallel systems. This paper introduces a four-dimensional (4D) approach to optimize communication in parallel training. This 4D approach is a hybrid of 3D tensor and data parallelism, and is implemented in the AxoNN framework. In addition, we employ two key strategies to further minimize communication overheads. First, we aggressively overlap expensive collective operations (reduce-scatter, all-gather, and all-reduce) with computation. Second, we develop an analytical model to identify high-performing configurations within the large search space defined by our 4D algorithm. This model empowers practitioners by simplifying the tuning process for their specific training workloads. When training an 80-billion parameter GPT on 1024 GPUs of Perlmutter, AxoNN surpasses Megatron-LM, a state-of-the-art framework, by a significant 26%. Additionally, it achieves a significantly high 57% of the theoretical peak FLOP/s or 182 PFLOP/s in total.
Published: 2023

22. A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training

Author: Singh, Siddharth, Ruwase, Olatunji, Awan, Ammar Ahmad, Rajbhandari, Samyam, He, Yuxiong, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Mixture-of-Experts (MoE) is a neural network architecture that adds sparsely activated expert blocks to a base model, increasing the number of parameters without impacting computational costs. However, current distributed deep learning frameworks are limited in their ability to train high-quality MoE models with large base models. In this work, we present DeepSpeed-TED, a novel, three-dimensional, hybrid parallel algorithm that combines data, tensor, and expert parallelism to enable the training of MoE models with 4 to 8x larger base models than the current state-of-the-art. We also describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement. We implement our approach in DeepSpeed and achieve speedups of 26% over a baseline (i.e. without our communication optimizations) when training a 40 billion parameter MoE model (6.7 billion base model with 16 experts) on 128 V100 GPUs.
Published: 2023
Full Text: View/download PDF

23. Exploiting Sparsity in Pruned Neural Networks to Optimize Large Model Training

Author: Singh, Siddharth and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Parallel training of neural networks at scale is challenging due to significant overheads arising from communication. Recently, deep learning researchers have developed a variety of pruning algorithms that are capable of pruning (i.e. setting to zero) 80-90% of the parameters in a neural network to yield sparse subnetworks that equal the accuracy of the unpruned parent network. In this work, we propose a novel approach that exploits these sparse subnetworks to optimize the memory utilization and communication in two popular algorithms for parallel deep learning namely -- data and inter-layer parallelism. We integrate our approach into AxoNN, a highly scalable framework for parallel deep learning that relies on data and inter-layer parallelism, and demonstrate the reduction in communication time and memory utilization. On 512 NVIDIA V100 GPUs, our optimizations reduce the memory consumption of a 2.7 billion parameter model by 74%, and the total communication time by 40%, thus providing an overall speedup of 34% over AxoNN, 32% over DeepSpeed-3D and 46% over Sputnik, a sparse matrix computation baseline.
Published: 2023

24. Design Concerns for Integrated Scripting and Interactive Visualization in Notebook Environments

Author: Scully-Allison, Connor, Lumsden, Ian, Williams, Katy, Bartels, Jesse, Taufer, Michela, Brink, Stephanie, Bhatele, Abhinav, Pearce, Olga, and Isaacs, Katherine E.
Subjects: Computer Science - Human-Computer Interaction
Abstract: Interactive visualization can support fluid exploration but is often limited to predetermined tasks. Scripting can support a vast range of queries but may be more cumbersome for free-form exploration. Embedding interactive visualization in scripting environments, such as computational notebooks, provides an opportunity to leverage the strengths of both direct manipulation and scripting. We investigate interactive visualization design methodology, choices, and strategies under this paradigm through a design study of calling context trees used in performance analysis, a field which exemplifies typical exploratory data analysis workflows with Big Data and hard to define problems. We first produce a formal task analysis assigning tasks to graphical or scripting contexts based on their specificity, frequency, and suitability. We then design a notebook-embedded interactive visualization and validate it with intended users. In a follow-up study, we present participants with multiple graphical and scripting interaction modes to elicit feedback about notebook-embedded visualization design, finding consensus in support of the interaction model. We report and reflect on observations regarding the process and design implications for combining visualization and scripting in notebooks., Comment: Submitted to IEEE VIS 2022
Published: 2022
Full Text: View/download PDF

25. A Survey and Empirical Evaluation of Parallel Deep Learning Frameworks

Author: Nichols, Daniel, Singh, Siddharth, Lin, Shu-Huai, and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Distributed, Parallel, and Cluster Computing
Abstract: The field of deep learning has witnessed a remarkable shift towards extremely compute- and memory-intensive neural networks. These newer larger models have enabled researchers to advance state-of-the-art tools across a variety of fields. This phenomenon has spurred the development of algorithms for distributed training of neural networks over a larger number of hardware accelerators. In this paper, we discuss and compare current state-of-the-art frameworks for large scale distributed deep learning. First, we survey current practices in distributed learning and identify the different types of parallelism used. Then, we present empirical results comparing their performance on large image and language training tasks. Additionally, we address their statistical efficiency and memory consumption behavior. Based on our results, we discuss algorithmic and implementation portions of each framework which hinder performance.
Published: 2021

26. AxoNN: An asynchronous, message-driven parallel framework for extreme-scale deep learning

Author: Singh, Siddharth and Bhatele, Abhinav
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: In the last few years, the memory requirements to train state-of-the-art neural networks have far exceeded the DRAM capacities of modern hardware accelerators. This has necessitated the development of efficient algorithms to train these neural networks in parallel on large-scale GPU-based clusters. Since computation is relatively inexpensive on modern GPUs, designing and implementing extremely efficient communication in these parallel training algorithms is critical for extracting the maximum performance. This paper presents AxoNN, a parallel deep learning framework that exploits asynchrony and message-driven execution to schedule neural network operations on each GPU, thereby reducing GPU idle time and maximizing hardware efficiency. By using the CPU memory as a scratch space for offloading data periodically during training, AxoNN is able to reduce GPU memory consumption by four times. This allows us to increase the number of parameters per GPU by four times, thus reducing the amount of communication and increasing performance by over 13%. When tested against large transformer models with 12-100 billion parameters on 48-384 NVIDIA Tesla V100 GPUs, AxoNN achieves a per-GPU throughput of 49.4-54.78% of theoretical peak and reduces the training time by 22-37 days (15-25% speedup) as compared to the state-of-the-art., Comment: Proceedings of the IEEE International Parallel & Distributed Processing Symposium (IPDPS). IEEE Computer Society, May 2022
Published: 2021

27. Analytics of Longitudinal System Monitoring Data for Performance Prediction

Author: Costello, Ian J. and Bhatele, Abhinav
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Machine Learning, Computer Science - Performance
Abstract: In recent years, several HPC facilities have started continuous monitoring of their systems and jobs to collect performance-related data for understanding performance and operational efficiency. Such data can be used to optimize the performance of individual jobs and the overall system by creating data-driven models that can predict the performance of jobs waiting in the scheduler queue. In this paper, we model the performance of representative control jobs using longitudinal system-wide monitoring data and machine learning to explore the causes of performance variability. We analyze these prediction models in great detail to identify the features that are dominant predictors of performance. We demonstrate that such models can be application-agnostic and can be used for predicting performance of applications that are not included in training.
Published: 2020

28. Scalable Comparative Visualization of Ensembles of Call Graphs

Author: Kesavan, Suraj P., Bhatia, Harsh, Bhatele, Abhinav, Gamblin, Todd, Bremer, Peer-Timo, and Ma, Kwan-Liu
Subjects: Computer Science - Distributed, Parallel, and Cluster Computing, Computer Science - Performance
Abstract: Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present an enhanced version of CallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes., Comment: 12 pages, 6 figures, Submitted to IEEE VIS 2020
Published: 2020

29. JMCD Dataset for Brain Tumor Detection and Analysis Using Explainable Deep Learning

Author: Verma, Aryan, Gupta, Nidhi, Bhatele, Pushpraj, and Khanna, Pritee
Published: 2023
Full Text: View/download PDF

30. Multiclass classification of central nervous system brain tumor types based on proposed hybrid texture feature extraction methods and ensemble learning

Author: Bhatele, Kirti Raj and Bhadauria, Sarita Singh
Published: 2023
Full Text: View/download PDF

31. Neurodegenerative diseases-Caps: a capsule network based early screening system for the classification of neurodegenerative diseases

Author: Bhatele, Kirti Raj, Jha, Anand, Kapoor, Kavish, and Tiwari, Devanshu
Published: 2022
Full Text: View/download PDF

32. Primary CNS vasculitis presenting with spinal cord involvement: a rare presentation

Author: Bhatele, Prashant, Vibha, Deepti, Singh, Rajesh Kumar, Elavarasi, A., and Tripathi, Manjari
Published: 2022
Full Text: View/download PDF

33. Machine learning application in Glioma classification: review and comparison analysis

Author: Bhatele, Kirti Raj and Bhadauria, Sarita Singh
Published: 2022
Full Text: View/download PDF

34. Three territory sign in COVID-19

Author: Bhatele, Prashant, Das, Animesh, Pandit, Awadh Kishor, Radhakrishnan, Divya M., and Srivastava, Achal Kumar
Published: 2022
Full Text: View/download PDF

35. Brain structural disorders detection and classification approaches: a review

Author: Bhatele, Kirti Raj and Bhadauria, Sarita Singh
Published: 2020
Full Text: View/download PDF

36. Design Concerns for Integrated Scripting and Interactive Visualization in Notebook Environments

Author: Scully-Allison, Connor, Lumsden, Ian, Williams, Katy, Bartels, Jesse, Taufer, Michela, Brink, Stephanie, Bhatele, Abhinav, Pearce, Olga, and Isaacs, Katherine E.
Abstract: Interactive visualization can support fluid exploration but is often limited to predetermined tasks. Scripting can support a vast range of queries but may be more cumbersome for free-form exploration. Embedding interactive visualization in scripting environments, such as computational notebooks, provides an opportunity to leverage the strengths of both direct manipulation and scripting. We investigate interactive visualization design methodology, choices, and strategies under this paradigm through a design study of calling context trees used in performance analysis, a field which exemplifies typical exploratory data analysis workflows with Big Data and hard to define problems. We first produce a formal task analysis assigning tasks to graphical or scripting contexts based on their specificity, frequency, and suitability. We then design a notebook-embedded interactive visualization and validate it with intended users. In a follow-up study, we present participants with multiple graphical and scripting interaction modes to elicit feedback about notebook-embedded visualization design, finding consensus in support of the interaction model. We report and reflect on observations regarding the process and design implications for combining visualization and scripting in notebooks.
Published: 2024
Full Text: View/download PDF

37. Guar

Author: Kuravadi, N.A., primary, Verma, S., additional, Pareek, S., additional, Gahlot, P., additional, Kumari, S., additional, Tanwar, U.K., additional, Bhatele, P., additional, Choudhary, M., additional, Gill, K.S., additional, Pruthi, V., additional, Tripathi, S.K., additional, Dhugga, K.S., additional, and Randhawa, G.S., additional
Published: 2013
Full Text: View/download PDF

38. Titanium

Author: Robert, Yves, Shende, Sameer, Malony, Allen D, Morris, Alan, Spear, Wyatt, Biersdorff, Scott, Smith, Burton, Wang, Dali, Ricciuto, Daniel, Post, Wilfred, Berry, Michael W, Irigoin, François, Yelick, Katherine, Graham, Susan L, Hilfinger, Paul, Bonachea, Dan, Su, Jimmy, Kamil, Amir, Datta, Kaushik, Colella, Phillip, Wen, Tong, Dongarra, Jack, Luszczek, Piotr, Bhatele, Abhinav, Freudenberger, Stefan M, Diekert, Volker, Muscholl, Anca, Herlihy, Maurice, and Moss, J Eliot B
Abstract: Titanium is a parallel programming language designed for high-performance scientific computing. It is based on Java and uses a Single Program Multiple Data (SPMD) parallelism model with a Partitioned Global Address Space (PGAS).
Published: 2011

39. Scalable Comparative Visualization of Ensembles of Call Graphs

Author: Kesavan, Suraj P., Bhatia, Harsh, Bhatele, Abhinav, Brink, Stephanie, Pearce, Olga, Gamblin, Todd, Bremer, Peer-Timo, and Ma, Kwan-Liu
Abstract: Optimizing the performance of large-scale parallel codes is critical for efficient utilization of computing resources. Code developers often explore various execution parameters, such as hardware configurations, system software choices, and application parameters, and are interested in detecting and understanding bottlenecks in different executions. They often collect hierarchical performance profiles represented as call graphs, which combine performance metrics with their execution contexts. The crucial task of exploring multiple call graphs together is tedious and challenging because of the many structural differences in the execution contexts and significant variability in the collected performance metrics (e.g., execution runtime). In this paper, we present Ensemble CallFlow to support the exploration of ensembles of call graphs using new types of visualizations, analysis, graph operations, and features. We introduce ensemble-Sankey, a new visual design that combines the strengths of resource-flow (Sankey) and box-plot visualization techniques. Whereas the resource-flow visualization can easily and intuitively describe the graphical nature of the call graph, the box plots overlaid on the nodes of Sankey convey the performance variability within the ensemble. Our interactive visual interface provides linked views to help explore ensembles of call graphs, e.g., by facilitating the analysis of structural differences, and identifying similar or distinct call graphs. We demonstrate the effectiveness and usefulness of our design through case studies on large-scale parallel codes.
Published: 2023
Full Text: View/download PDF

40. Reducing Perinatal Mortality in India: Two-Years Results of the IRIA Fetal Radiology Samrakshan Program

Author: Choorakuttil, Rijo M., Rajalingam, Bavaharan, Satarkar, Shilpa R., Sharma, Lalit K., Gupta, Anjali, Baghel, Akanksha, Jain, Neelam, Palanisamy, Devarajan, Shenoy, Ramesh, Senthilvel, Karthik, Dhankar, Sandhya, Aneja, Kavita, Dwivedi, Somya, Nagar, Shweta, Soni, Sonali Kimmatkar, Chhajer, Gulab, Pradeep, Sunitha, Onkar, Prashant M., Skandhan, Avni K.P., Rajput, Eesha, Sharma, Renu, Shentar, Srinivas, Saboo, Suresh, Antony, Amel, Nair, M.R. Balachandran, Patekar, Tejashree Y., Ahuja, Bhupendra, Patel, Hemant, Kunnumal, Mohanan, Sodani, Rajendra K., Rao, M.V. Kameswar, Bhatele, Pushparaj, Kavthale, Sandeep, Patkar, Deepak, Singh, Rajeev, Chelladurai, Amarnath, and Nirmalan, Praveen K.
Published: 2022
Full Text: View/download PDF

41. Chapter 3 - Guar: An Industrial Crop from Marginal Farms

Author: Kuravadi, N.A., Verma, S., Pareek, S., Gahlot, P., Kumari, S., Tanwar, U.K., Bhatele, P., Choudhary, M., Gill, K.S., Pruthi, V., Tripathi, S.K., Dhugga, K.S., and Randhawa, G.S.
Published: 2013
Full Text: View/download PDF

42. Glioma detection on brain MRIs using texture and morphological features with ensemble learning.

Author: Gupta, Nidhi, Bhatele, Pushpraj, and Khanna, Pritee
Subjects: GLIOMAS, MAGNETIC resonance imaging, TREATMENT of brain cancer, INFRATENTORIAL brain tumors, GLIOBLASTOMA multiforme treatment
Abstract: Highlights • A CAD system with three level classifications is developed to identify Gliomas. • The system is planned to work irrespective of the image pulse sequences. • The system uses two segmentation schemes, features fusion, and ensemble classifier. • Experiments are performed on two clinical brain MRI datasets to identify Gliomas. • Performance of the system is statistically validated and verified by domain experts. Abstract The real time usage of Computer Aided Diagnosis (CAD) systems to detect brain tumors as proposed in the literature is yet to be explored. Gliomas are the most commonly found brain tumors in human. The proposed non-invasive CAD system based on brain Magnetic Resonance Imaging (MRIs) is capable of assisting radiologists and clinicians to detect not only the presence, but also the type of glioma tumors. The system is devised to work irrespective of the image pulse sequence. It uses different segmentation schemes for different pulse sequences, fusion of texture features, and ensemble classifier to perform three levels of classification. Once the tumor is detected at the first level of classification, its location is analyzed using tentorium of brain and it is classified into superatentorial or infratentorial in the next level. Based on the morphological and inherent characteristics of tumor (area, perimeter, solidity, and orientation), the system identifies tumor type at the third level of classification. The system reports average accuracy of 97.76% on JMCD (a dataset collected from local medical college) and 97.13% on BRATS datasets at the first level of classification. Average accuracy of 97.87% for astrocytomas, 94.24% for ependymoma, 96.29% for oligodendroglioma, and 98.69% for glioblastoma multiforme is observed for histologically classified JMCD dataset. The same is observed as 95.45% for low grade and 95.50% for high grade tumors in publically available BRATS dataset. The performance of the proposed CAD system is statistically examined through hypothetical Student's t -test and Wilcoxon matched pair test. The performance of the system is also validated by domain experts for its possible real time usage. [ABSTRACT FROM AUTHOR]
Published: 2019
Full Text: View/download PDF

43. Incomplete reprogramming of DNA replication timing in induced pluripotent stem cells

Author: Edwards, Matthew M., Wang, Ning, Massey, Dashiell J., Bhatele, Sakshi, Egli, Dieter, and Koren, Amnon
Abstract: Induced pluripotent stem cells (iPSCs) are the foundation of cell therapy. Differences in gene expression, DNA methylation, and chromatin conformation, which could affect differentiation capacity, have been identified between iPSCs and embryonic stem cells (ESCs). Less is known about whether DNA replication timing, a process linked to both genome regulation and genome stability, is efficiently reprogrammed to the embryonic state. To answer this, we compare genome-wide replication timing between ESCs, iPSCs, and cells reprogrammed by somatic cell nuclear transfer (NT-ESCs). While NT-ESCs replicate their DNA in a manner indistinguishable from ESCs, a subset of iPSCs exhibits delayed replication at heterochromatic regions containing genes downregulated in iPSCs with incompletely reprogrammed DNA methylation. DNA replication delays are not the result of gene expression or DNA methylation aberrations and persist after cells differentiate to neuronal precursors. Thus, DNA replication timing can be resistant to reprogramming and influence the quality of iPSCs.
Published: 2024
Full Text: View/download PDF

44. Identification of Gliomas from brain MRI through adaptive segmentation and run length of centralized patterns.

Author: Gupta, Nidhi, Bhatele, Pushpraj, and Khanna, Pritee
Subjects: GLIOMAS, MAGNETIC resonance imaging of the brain, TUMOR classification, BRAIN tumors, RADIOLOGISTS, DIAGNOSIS
Abstract: Brain tumor detection and identification of its severity is a challenging task for radiologists and clinicians. This work aims to develop a novel clinical decision support system to assist radiologists and clinicians efficiently in real-time. The proposed clinical decision support system utilizes fusion of MRI pulse sequences as each of them gives salient information for tumor identification. An adaptive thresholding is proposed for segmentation and centralized patterns are observed from LBP image of so obtained segmented image. Run length matrix extracted from these centralized patterns is used for tumor identification. The developed features successfully identify and classify tumor with Naive Bayes classifier. The proposed decision support system not only detects tumors, but also identifies its grading in terms of severity. As Glioma tumors are the most frequent among brain tumors, the proposed system is tested for the presence of low grade (Astrocytoma and Ependymoma) as well as high grade (Oligodendroglioma and Glioblastoma Multiforme) Glioma tumors on images collected from NSCB Medical College Jabalpur, India and BRATS dataset. The experiments performed on two datasets give more than 96% accuracy. The proposed decision support system is quite sensitive towards the detection and specification of tumors. All the results are verified by domain experts in real time. [ABSTRACT FROM AUTHOR]
Published: 2018
Full Text: View/download PDF

45. Endoscopic Vascular Decompression of the Trigeminal Nerve

Author: Yadav, Y., additional, Parihar, V., additional, Agarwal, M., additional, Sherekar, S., additional, and Bhatele, P., additional
Published: 2011
Full Text: View/download PDF

46. Status epilepticus heralding thymomatous paraneoplastic multifocal cortical encephalitis in Myasthenia Gravis.

Author: George, Jerry, Salunkhe, Manish, Bhatele, Prashant, Agarwal, Ayush, Patel, Nishant, Kumar, Rakesh, and Srivastava, Achal Kumar
Published: 2021
Full Text: View/download PDF

47. Endoscopic inter laminar management of lumbar disease

Author: Yadav, Yad, Parihar, Vijay, Kher, Yatin, and Bhatele, Pushp
Published: 2016
Full Text: View/download PDF

48. Endoscopic endonasal trans-sphenoid surgery of pituitary adenoma.

Author: Yadav, Y. R., Sachdev, S., Parihar, V., Namdev, H., and Bhatele, P. R.
Subjects: ENDOSCOPIC surgery, PITUITARY diseases, ONCOLOGY, MEDICAL imaging systems, BLOOD loss estimation
Abstract: Endoscopic endonasal trans-sphenoid surgery (EETS) is increasingly used for pituitary lesions. Pre-operative CT and MRI scans and peroperative endoscopic visualization can provide useful anatomical information. EETS is indicated in sellar, suprasellar, intraventricular, retro-infundibular, and invasive tumors. Recurrent and residual lesions, pituitary apoplexy and empty sella syndrome can be managed by EETS. Modern neuronavigation techniques, ultrasonic aspirators, ultrasonic bone curette can add to the safety. The binostril approach provides a wider working area. High definition camera is much superior to three-chip camera. Most of the recent reports favor EETS in terms of safety, quality of life and tumor resection, hospital stay, better endocrinological, and visual outcome as compared to the microscopic technique. Nasal symptoms, blood loss, operating time are less in EETS. Various naso-septal flaps and other techniques of CSF leak repair could help reduce complications. Complications can be further reduced after achieving the learning curve, good understanding of limitations with proper patient selection. Use of neuronavigation, proper post-operative care of endocrine function, establishing pituitary center of excellence and more focused residency and endoscopic fellowship training could improve results. The faster and safe transition from microscopic to EETS can be done by the team concept of neurosurgeon/otolaryngologist, attending hands on cadaveric dissection, practice on models, and observation of live surgeries. Conversion to a microscopic or endoscopic-assisted approach may be required in selected patients. Multi-modality treatment could be required in giant and invasive tumors. EETS appears to be a better surgical option in most pituitary adenoma. [ABSTRACT FROM AUTHOR]
Published: 2012
Full Text: View/download PDF

49. Studies on Pathology of Chronic Inflammation Induced by Escherichia coli in Chicken Skin.

Author: Bhatele, Aarti
Subjects: SKIN inflammation, CHICKEN diseases, ESCHERICHIA coli, MACROPHAGES, FIBROBLASTS
Abstract: The study was undertaken to investigate the nature of chronic inflammatory response induced by Escherichia coli in white Leghorn chickens. Inflammation was induced by intradermal injections of known concentration of E. coli at various time intervals and the pattern of emigration of leukocytes was recorded. In the early stage, emigration of heterophils, basophils and monocytoid cells were observed. Degranulating mast cells and macrophages were noticed on day 3 and their number increased subsequently. The inflammatory reactions were followed by reparative changes which were evidenced by the presence of mature fibroblasts and marked collegen content from day 12 onwards. [ABSTRACT FROM AUTHOR]
Published: 2017

50. A Study on Occurence of Hypodermosis(Warble Fly Infestation) in Cattle of Central India.

Author: Bhatele, Aarti
Subjects: CATTLE diseases, WARBLE flies, INSECT larvae, DISEASE prevalence
Abstract: The study pertains to a rare occurrence of hypodermosis in rural cattle herd in central India caused by Hypoderma lineatum larvae. Out of a mixed herd 0f 750 cattle, 421(56.1%) were found affected. Prevalence was higher in males (58.8%) than females (39.6%) and the severity of infection was more in younger than the adult animals. A characteristic feature of the infestation was very high density of nodules (8 to 48) and wide area of their spread on the body with preferred location on both the dorsolateral side of the spine. The average size of the larvae was recorded as 10mm x 21 mm. The need to have a through and systematic work on the disease in cattle in different parts of tropical India has been emphasized. [ABSTRACT FROM AUTHOR]
Published: 2016

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

78 results on '"Bhatele P"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources