Author: "Sham P." - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Sham P."' showing total 3,840 results

1. Train for the Worst, Plan for the Best: Understanding Token Ordering in Masked Diffusions

Author: Kim, Jaeyeon, Shah, Kulin, Kontonis, Vasilis, Kakade, Sham, and Chen, Sitan
Subjects: Computer Science - Machine Learning
Abstract: In recent years, masked diffusion models (MDMs) have emerged as a promising alternative approach for generative modeling over discrete domains. Compared to autoregressive models (ARMs), MDMs trade off complexity at training time with flexibility at inference time. At training time, they must learn to solve an exponentially large number of infilling problems, but at inference time, they can decode tokens in essentially arbitrary order. In this work, we closely examine these two competing effects. On the training front, we theoretically and empirically demonstrate that MDMs indeed train on computationally intractable subproblems compared to their autoregressive counterparts. On the inference front, we show that a suitable strategy for adaptively choosing the token decoding order significantly enhances the capabilities of MDMs, allowing them to sidestep hard subproblems. On logic puzzles like Sudoku, we show that adaptive inference can boost solving accuracy in pretrained MDMs from $<7$% to $\approx 90$%, even outperforming ARMs with $7\times$ as many parameters and that were explicitly trained via teacher forcing to learn the right order of decoding.
Published: 2025

2. Connections between Schedule-Free Optimizers, AdEMAMix, and Accelerated SGD Variants

Author: Morwani, Depen, Vyas, Nikhil, Zhang, Hanlin, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Recent advancements in deep learning optimization have introduced new algorithms, such as Schedule-Free optimizers, AdEMAMix, MARS, and Lion, which modify traditional momentum mechanisms. In a separate line of work, theoretical acceleration of stochastic gradient descent (SGD) in the noise-dominated regime has been achieved by decoupling the momentum coefficient from the current gradient's weight. In this paper, we establish explicit connections between these two lines of work. We substantiate our theoretical findings with preliminary experiments on a 150m language modeling task. We find that AdEMAMix, which most closely resembles accelerated versions of stochastic gradient descent, exhibits superior performance. Building on these insights, we introduce a modification to AdEMAMix, termed Simplified-AdEMAMix, which maintains the same performance as AdEMAMix across both large and small batch-size settings while eliminating the need for two different momentum terms. The code for Simplified-AdEMAMix is available on the repository: https://github.com/DepenM/Simplified-AdEMAMix/.
Published: 2025

3. Soup to go: mitigating forgetting during continual learning with model averaging

Author: Kleiman, Anat, Dziugaite, Gintare Karolina, Frankle, Jonathan, Kakade, Sham, and Paul, Mansheej
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: In continual learning, where task data arrives in a sequence, fine-tuning on later tasks will often lead to performance degradation on earlier tasks. This is especially pronounced when these tasks come from diverse domains. In this setting, how can we mitigate catastrophic forgetting of earlier tasks and retain what the model has learned with minimal computational expenses? Inspired by other merging methods, and L2-regression, we propose Sequential Fine-tuning with Averaging (SFA), a method that merges currently training models with earlier checkpoints during the course of training. SOTA approaches typically maintain a data buffer of past tasks or impose a penalty at each gradient step. In contrast, our method achieves comparable results without the need to store past data, or multiple copies of parameters for each gradient step. Furthermore, our method outperforms common merging techniques such as Task Arithmetic, TIES Merging, and WiSE-FT, as well as other penalty methods like L2 and Elastic Weight Consolidation. In turn, our method offers insight into the benefits of merging partially-trained models during training across both image and language domains.
Published: 2025

4. From an Image to a Scene: Learning to Imagine the World from a Million 360 Videos

Author: Wallingford, Matthew, Bhattad, Anand, Kusupati, Aditya, Ramanujan, Vivek, Deitke, Matt, Kakade, Sham, Kembhavi, Aniruddha, Mottaghi, Roozbeh, Ma, Wei-Chiu, and Farhadi, Ali
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning
Abstract: Three-dimensional (3D) understanding of objects and scenes plays a key role in humans' ability to interact with the world and has been an active area of research in computer vision, graphics, and robotics. Large scale synthetic and object-centric 3D datasets have been shown to be effective in training models that have 3D understanding of objects. However, applying a similar approach to real-world objects and scenes is difficult due to a lack of large-scale data. Videos are a potential source for real-world 3D data, but finding diverse yet corresponding views of the same content has been shown to be difficult at scale. Furthermore, standard videos come with fixed viewpoints, determined at the time of capture. This restricts the ability to access scenes from a variety of more diverse and potentially useful perspectives. We argue that large scale 360 videos can address these limitations to provide scalable corresponding frames from diverse views. In this paper, we introduce 360-1M, a 360 video dataset, and a process for efficiently finding corresponding frames from diverse viewpoints at scale. We train our diffusion-based model, Odin, on 360-1M. Empowered by the largest real-world, multi-view dataset to date, Odin is able to freely generate novel views of real-world scenes. Unlike previous methods, Odin can move the camera through the environment, enabling the model to infer the geometry and layout of the scene. Additionally, we show improved performance on standard novel view synthesis and 3D reconstruction benchmarks.
Comment: NeurIPS 2024. For project page, see https://mattwallingford.github.io/ODIN
Published: 2024

5. Mind the Gap: Examining the Self-Improvement Capabilities of Large Language Models

Author: Song, Yuda, Zhang, Hanlin, Eisenach, Carson, Kakade, Sham, Foster, Dean, and Ghai, Udaya
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Self-improvement is a mechanism in Large Language Model (LLM) pre-training, post-training and test-time inference. We explore a framework where the model verifies its own outputs, filters or reweights data based on this verification, and distills the filtered data. Despite several empirical successes, a fundamental understanding is still lacking. In this work, we initiate a comprehensive, modular and controlled study on LLM self-improvement. We provide a mathematical formulation for self-improvement, which is largely governed by a quantity which we formalize as the generation-verification gap. Through experiments with various model families and tasks, we discover a scaling phenomenon of self-improvement -- a variant of the generation-verification gap scales monotonically with the model pre-training flops. We also examine when self-improvement is possible, an iterative self-improvement procedure, and ways to improve its performance. Our findings not only advance understanding of LLM self-improvement with practical implications, but also open numerous avenues for future research into its capabilities and boundaries.
Comment: 41 pages, 19 figures
Published: 2024

6. Loss-to-Loss Prediction: Scaling Laws for All Datasets

Author: Brandfonbrener, David, Anand, Nikhil, Vyas, Nikhil, Malach, Eran, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: While scaling laws provide a reliable methodology for predicting train loss across compute scales for a single data distribution, less is known about how these predictions should change as we change the distribution. In this paper, we derive a strategy for predicting one loss from another and apply it to predict across different pre-training datasets and from pre-training data to downstream task data. Our predictions extrapolate well even at 20x the largest FLOP budget used to fit the curves. More precisely, we find that there are simple shifted power law relationships between (1) the train losses of two models trained on two separate datasets when the models are paired by training compute (train-to-train), (2) the train loss and the test loss on any downstream distribution for a single model (train-to-test), and (3) the test losses of two models trained on two separate train datasets (test-to-test). The results hold up for pre-training datasets that differ substantially (some are entirely code and others have no code at all) and across a variety of downstream tasks. Finally, we find that in some settings these shifted power law relationships can yield more accurate predictions than extrapolating single-dataset scaling laws.
Published: 2024

7. How Does Critical Batch Size Scale in Pre-training?

Author: Zhang, Hanlin, Morwani, Depen, Vyas, Nikhil, Wu, Jingfeng, Zou, Difan, Ghai, Udaya, Foster, Dean, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Training large-scale models under given resources requires careful design of parallelism strategies. In particular, the efficiency notion of critical batch size (CBS), concerning the compromise between time and compute, marks the threshold beyond which greater data parallelism leads to diminishing returns. To operationalize it, we propose a measure of CBS and pre-train a series of auto-regressive language models, ranging from 85 million to 1.2 billion parameters, on the C4 dataset. Through extensive hyper-parameter sweeps and careful control of factors such as batch size, momentum, and learning rate along with its scheduling, we systematically investigate the impact of scale on CBS. Then we fit scaling laws with respect to model and data sizes to decouple their effects. Overall, our results demonstrate that CBS scales primarily with data size rather than model size, a finding we justify theoretically through the analysis of infinite-width limits of neural networks and infinite-dimensional least squares regression. Of independent interest, we highlight the importance of common hyper-parameter choices and strategies for studying large-scale pre-training beyond fixed training durations.
Published: 2024

8. Mixture of Parrots: Experts improve memorization more than reasoning

Author: Jelassi, Samy, Mohri, Clara, Brandfonbrener, David, Gu, Alex, Vyas, Nikhil, Anand, Nikhil, Alvarez-Melis, David, Li, Yuanzhi, Kakade, Sham M., and Malach, Eran
Subjects: Computer Science - Machine Learning
Abstract: The Mixture-of-Experts (MoE) architecture enables a significant increase in the total number of model parameters with minimal computational overhead. However, it is not clear what performance tradeoffs, if any, exist between MoEs and standard dense transformers. In this paper, we show that as we increase the number of experts (while fixing the number of active parameters), the memorization performance consistently increases while the reasoning capabilities saturate. We begin by analyzing the theoretical limitations of MoEs at reasoning. We prove that there exist graph problems that cannot be solved by any number of experts of a certain width; however, the same task can be easily solved by a dense model with a slightly larger width. On the other hand, we find that on memory-intensive tasks, MoEs can effectively leverage a small number of active parameters with a large number of experts to memorize the data. We empirically validate these findings on synthetic graph problems and memory-intensive closed book retrieval tasks. Lastly, we pre-train a series of MoEs and dense transformers and evaluate them on commonly used benchmarks in math and natural language. We find that increasing the number of experts helps solve knowledge-intensive tasks, but fails to yield the same benefits for reasoning tasks.
Published: 2024

9. LoRA Soups: Merging LoRAs for Practical Skill Composition Tasks

Author: Prabhakar, Akshara, Li, Yuanzhi, Narasimhan, Karthik, Kakade, Sham, Malach, Eran, and Jelassi, Samy
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Low-Rank Adaptation (LoRA) is a popular technique for parameter-efficient fine-tuning of Large Language Models (LLMs). We study how different LoRA modules can be merged to achieve skill composition -- testing the performance of the merged model on a target task that involves combining multiple skills, each skill coming from a single LoRA. This setup is favorable when it is difficult to obtain training data for the target task and when it can be decomposed into multiple skills. First, we identify practically occurring use-cases that can be studied under the realm of skill composition, e.g. solving hard math-word problems with code, creating a bot to answer questions on proprietary manuals or about domain-specialized corpora. Our main contribution is to show that concatenation of LoRAs (CAT), which optimally weights LoRAs that were individually trained on different skills, outperforms existing model- and data-merging techniques; for instance on math-word problems, CAT beats these methods by an average of 43% and 12% respectively. Thus, this paper advocates model merging as an efficient way to solve compositional tasks and underscores CAT as a simple, compute-friendly and effective procedure. To our knowledge, this is the first work demonstrating the superiority of model merging over data mixing for binary skill composition tasks. Code and data are available at https://github.com/aksh555/LoRA-Soups
Comment: COLING 2025 Industry track; 9 pages plus references and appendices
Published: 2024

10. Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond

Author: Oncescu, Costin-Andrei, Purandare, Sanket, Idreos, Stratos, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: While transformers have been at the core of most recent advancements in sequence generative models, their computational cost remains quadratic in sequence length. Several subquadratic architectures have been proposed to address this computational issue. Some of them, including long convolution sequence models (LCSMs), such as Hyena, address this issue at training time but remain quadratic during inference. We propose a method for speeding up LCSMs' exact inference to quasilinear $O(L\log^2L)$ time, identify the key properties that make this possible, and propose a general framework that exploits these. Our approach, inspired by previous work on relaxed polynomial interpolation, is based on a tiling which helps decrease memory movement and share computation. It has the added benefit of allowing for almost complete parallelization across layers of the position-mixing part of the architecture. Empirically, we provide a proof of concept implementation for Hyena, which gets up to $1.6\times$ end-to-end improvement over standard inference by improving $50\times$ within the position-mixing part.
Comment: 15 pages, 9 figures, 5 algorithms
Published: 2024

11. Neural Coordination and Capacity Control for Inventory Management

Author: Eisenach, Carson, Ghai, Udaya, Madeka, Dhruv, Torkkola, Kari, Foster, Dean, and Kakade, Sham
Subjects: Electrical Engineering and Systems Science - Systems and Control, Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: This paper addresses the capacitated periodic review inventory control problem, focusing on a retailer managing multiple products with limited shared resources, such as storage or inbound labor at a facility. Specifically, this paper is motivated by the questions of (1) what does it mean to backtest a capacity control mechanism, (2) can we devise and backtest a capacity control mechanism that is compatible with recent advances in deep reinforcement learning for inventory management? First, because we only have a single historic sample path of Amazon's capacity limits, we propose a method that samples from a distribution of possible constraint paths covering a space of real-world scenarios. This novel approach allows for more robust and realistic testing of inventory management strategies. Second, we extend the exo-IDP (Exogenous Decision Process) formulation of Madeka et al. 2022 to capacitated periodic review inventory control problems and show that certain capacitated control problems are no harder than supervised learning. Third, we introduce a 'neural coordinator', designed to produce forecasts of capacity prices, guiding the system to adhere to target constraints in place of a traditional model predictive controller. Finally, we apply a modified DirectBackprop algorithm for learning a deep RL buying policy and training the neural coordinator. Our methodology is evaluated through large-scale backtests, demonstrating that RL buying policies with a neural coordinator outperform classic baselines both in terms of cumulative discounted reward and capacity adherence (we see improvements of up to 50% in some cases).
Published: 2024

12. SOAP: Improving and Stabilizing Shampoo using Adam

Author: Vyas, Nikhil, Morwani, Depen, Zhao, Rosie, Kwun, Mujin, Shapira, Itai, Brandfonbrener, David, Janson, Lucas, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: There is growing evidence of the effectiveness of Shampoo, a higher-order preconditioning method, over Adam in deep learning optimization tasks. However, Shampoo's drawbacks include additional hyperparameters and computational overhead when compared to Adam, which only updates running averages of first- and second-moment quantities. This work establishes a formal connection between Shampoo (implemented with the 1/2 power) and Adafactor -- a memory-efficient approximation of Adam -- showing that Shampoo is equivalent to running Adafactor in the eigenbasis of Shampoo's preconditioner. This insight leads to the design of a simpler and computationally efficient algorithm: $\textbf{S}$hampo$\textbf{O}$ with $\textbf{A}$dam in the $\textbf{P}$reconditioner's eigenbasis (SOAP). With regards to improving Shampoo's computational efficiency, the most straightforward approach would be to simply compute Shampoo's eigendecomposition less frequently. Unfortunately, as our empirical results show, this leads to performance degradation that worsens with this frequency. SOAP mitigates this degradation by continually updating the running average of the second moment, just as Adam does, but in the current (slowly changing) coordinate basis. Furthermore, since SOAP is equivalent to running Adam in a rotated space, it introduces only one additional hyperparameter (the preconditioning frequency) compared to Adam. We empirically evaluate SOAP on language model pre-training with 360m and 660m sized models. In the large batch regime, SOAP reduces the number of iterations by over 40% and wall clock time by over 35% compared to AdamW, with approximately 20% improvements in both metrics compared to Shampoo. An implementation of SOAP is available at https://github.com/nikhilvyas/SOAP.
Published: 2024

13. Preference-Based Multi-Agent Reinforcement Learning: Data Coverage and Algorithmic Techniques

Author: Zhang, Natalia, Wang, Xinqi, Cui, Qiwen, Zhou, Runlong, Kakade, Sham M., and Du, Simon S.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computer Science and Game Theory, Computer Science - Multiagent Systems
Abstract: We initiate the study of Preference-Based Multi-Agent Reinforcement Learning (PbMARL), exploring both theoretical foundations and empirical validations. We define the task as identifying the Nash equilibrium from a preference-only offline dataset in general-sum games, a problem marked by the challenge of sparse feedback signals. Our theory establishes the upper complexity bounds for Nash Equilibrium in effective PbMARL, demonstrating that single-policy coverage is inadequate and highlighting the importance of unilateral dataset coverage. These theoretical insights are verified through comprehensive experiments. To enhance the practical performance, we further introduce two algorithmic techniques. (1) We propose a Mean Squared Error (MSE) regularization along the time axis to achieve a more uniform reward distribution and improve reward learning outcomes. (2) We propose an additional penalty based on the distribution of the dataset to incorporate pessimism, improving stability and effectiveness during training. Our findings underscore the multifaceted approach required for PbMARL, paving the way for effective preference-based multi-agent systems.
Comment: 9 pages
Published: 2024

14. A Systematic Literature Review of Informal STEM Learning

Author: Hairunnisa Hussim, Roslinda Rosli, Nurul Aisya Zahira Mohd Nor, Siti Mistima Maat, Muhammad Sofwan Mahmud, Zanaton Iksan, Azmin Sham Rambely, Siti Nurdiyana Mahmud, Lilia Halim, Kamisah Osman, and Ah Nam Lay
Abstract: Student learning outside the formal classroom is inextricably linked to informal learning environments. In many countries, most activities that employ informal learning prioritize the integration of science, technology, engineering, and mathematics (STEM) disciplines and have shown a positive impact on increasing students' interest, self-efficacy, and awareness of the STEM field. Thus, this study aims to systematically review the activities reported in the relevant studies focusing on informal STEM learning for K-12. High-index journals indexed in the SCOPUS and Web of Science databases were searched using a predetermined search strategy, and two research team members screened the retrieved articles. Only empirical studies containing the terms "STEM education", "summer camp", and "informal learning" in the title, abstract, and keyword were included. Data were coded and organized into a matrix that was qualitatively assessed and categorized into themes. Based on the 25 studies reviewed, it was found that the integration of STEM disciplines in informal learning is notably varied. The characteristics of the STEM activities can be organized into seven themes: inquiry, focus on problems, design, cooperative learning, student-centred, hands-on, and 21st-century skills. Practice recommendations include a quality curriculum that aligns with informal STEM learning needs.
Published: 2024

15. Universal and scalable synthesis of photochromic single-atom catalysts for plastic recycling.

Author: Liu, Yu, Wang, Xuchun, Li, Xiaodong, Ye, Zuyang, Sham, Tsun-Kong, Xu, Panpan, Cao, Muhan, Zhang, Qiao, Yin, Yadong, and Chen, Jinxing
Abstract: Metal oxide nanostructures with single-atomic heteroatom incorporation are of interest for many applications. However, a universal and scalable synthesis approach with high heteroatom concentrations represents a formidable challenge, primarily due to the pronounced structural disparities between Mhetero-O and Msub-O units. Here, focusing on TiO2 as the exemplified substrate, we present a diethylene glycol-assisted synthetic platform tailored for the controlled preparation of a library of M1-TiO2 nanostructures, encompassing 15 distinct unary M1-TiO2 nanostructures, along with two types of binary and ternary composites. Our approach capitalizes on the unique properties of diethylene glycol, affording precise kinetic control by passivating the hydrolytic activity of heteroatom and simultaneously achieving thermodynamic control by introducing short-range order structures to dissipate the free energy associated with heteroatom incorporation. The M1-TiO2 nanostructures, characterized by distinctive and abundant M-O-Ti units on the surface, exhibit high efficiency in photochromic photothermal catalysis toward recycling waste polyesters. This universal synthetic platform contributes to the preparation of materials with broad applicability and significance across catalysis, energy conversion, and biomedicine.
Published: 2024

16. Functional multiomics reveals genetic and pharmacologic regulation of surface CD38 in multiple myeloma

Author: Choudhry, Priya, Kasap, Corynn, Patiño-Escobar, Bonell, Gugliemini, Olivia, Geng, Huimin, Sarin, Vishesh, Kang, Amrik, Kishishita, Audrey, Rampersaud, Sham, Sarah, Letitia, Lin, Yu-Hsiu T, Paranjape, Neha, Ramkumar, Poornima, Patton, Jonathan C, Marcoulis, Makeba, Wang, Donghui, Phojanakong, Paul, Steri, Veronica, Hann, Byron, Barwick, Benjamin G, Kampmann, Martin, and Wiita, Arun P
Subjects: Biological Sciences, Biomedical and Clinical Sciences, Immunology, Orphan Drug, Human Genome, Genetics, Hematology, Rare Diseases, Immunotherapy, Cancer, Biotechnology, 5.1 Pharmaceuticals, 5.2 Cellular and gene therapies, Good Health and Well Being
Published: 2024

17. An Edge AI System Based on FPGA Platform for Railway Fault Detection

Author: Li, Jiale, Fu, Yulin, Yan, Dongwei, Ma, Sean Longyu, and Sham, Chiu-Wing
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence
Abstract: As the demands for railway transportation safety increase, traditional methods of rail track inspection no longer meet the needs of modern railway systems. To address the issues of automation and efficiency in rail fault detection, this study introduces a railway inspection system based on Field Programmable Gate Array (FPGA). This edge AI system collects track images via cameras and uses Convolutional Neural Networks (CNN) to perform real-time detection of track defects and automatically reports fault information. The innovation of this system lies in its high level of automation and detection efficiency. The neural network approach employed by this system achieves a detection accuracy of 88.9%, significantly enhancing the reliability and efficiency of detection. Experimental results demonstrate that this FPGA-based system is 1.39× and 4.67× better in energy efficiency than peer implementations on GPU and CPU platforms, respectively.
Comment: Accepted at the 2024 IEEE 13th Global Conference on Consumer Electronics (GCCE 2024)
Published: 2024

18. Deconstructing What Makes a Good Optimizer for Language Models

Author: Zhao, Rosie, Morwani, Depen, Brandfonbrener, David, Vyas, Nikhil, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Training language models becomes increasingly expensive with scale, prompting numerous attempts to improve optimization efficiency. Despite these efforts, the Adam optimizer remains the most widely used, due to a prevailing view that it is the most effective approach. We aim to compare several optimization algorithms, including SGD, Adafactor, Adam, and Lion, in the context of autoregressive language modeling across a range of model sizes, hyperparameters, and architecture variants. Our findings indicate that, except for SGD, these algorithms all perform comparably both in their optimal performance and also in terms of how they fare across a wide range of hyperparameter choices. Our results suggest to practitioners that the choice of optimizer can be guided by practical considerations like memory constraints and ease of implementation, as no single algorithm emerged as a clear winner in terms of performance or stability to hyperparameter misspecification. Given our findings, we further dissect these approaches, examining two simplified versions of Adam: a) signed momentum (Signum) which we see recovers both the performance and hyperparameter stability of Adam and b) Adalayer, a layerwise variant of Adam which we introduce to study Adam's preconditioning. Examining Adalayer leads us to the conclusion that the largest impact of Adam's preconditioning is restricted to the last layer and LayerNorm parameters, and, perhaps surprisingly, the remaining layers can be trained with SGD.
Published: 2024

19. Universal Length Generalization with Turing Programs

Author: Hou, Kaiying, Brandfonbrener, David, Kakade, Sham, Jelassi, Samy, and Malach, Eran
Subjects: Computer Science - Machine Learning
Abstract: Length generalization refers to the ability to extrapolate from short training sequences to long test sequences and is a challenge for current large language models. While prior work has proposed some architecture or data format changes to achieve length generalization, these proposals typically apply to a limited set of tasks. Building on prior scratchpad and Chain-of-Thought (CoT) techniques, we propose Turing Programs, a novel CoT strategy that decomposes an algorithmic task into steps mimicking the computation of a Turing Machine. This framework is both universal, as it can accommodate any algorithmic task, and simple, requiring only copying text from the context with small modifications. We show that by using Turing Programs, we obtain robust length generalization on a range of algorithmic tasks: addition, multiplication and in-context SGD. We then demonstrate that transformers achieve length generalization on random Turing Programs, suggesting that length generalization is possible for any algorithmic task. Finally, we theoretically prove that transformers can implement Turing Programs, constructing a simple RASP (Weiss et al.) program that simulates an arbitrary Turing machine.
Published: 2024

20. Eliminating Position Bias of Language Models: A Mechanistic Approach

Author: Wang, Ziqi, Zhang, Hanlin, Li, Xiner, Huang, Kuan-Hao, Han, Chi, Ji, Shuiwang, Kakade, Sham M., Peng, Hao, and Ji, Heng
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Position bias has proven to be a prevalent issue of modern language models (LMs), where the models prioritize content based on its position within the given context. This bias often leads to unexpected model failures and hurts performance, robustness, and reliability across various applications. Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. Based on the analyses, we propose to eliminate position bias (e.g., different retrieved documents' orders in QA affect performance) with a training-free zero-shot approach. Our method changes the causal attention to bidirectional attention between documents and utilizes model attention values to decide the relative orders of documents instead of using the order provided in input prompts, therefore enabling Position-INvariant inferencE (PINE) at the document level. By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning. Notably, PINE is especially useful when adapting LMs for evaluating reasoning pairs: it consistently provides 8 to 10 percentage points performance gains, making Llama-3-70B-Instruct perform even better than GPT-4-0125-preview and GPT-4o-2024-08-06 on the RewardBench reasoning set.
Comment: 26 pages, 6 figures, 15 tables
Published: 2024

21. A New Perspective on Shampoo's Preconditioner

Author: Morwani, Depen, Shapira, Itai, Vyas, Nikhil, Malach, Eran, Kakade, Sham, and Janson, Lucas
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: Shampoo, a second-order optimization algorithm which uses a Kronecker product preconditioner, has recently garnered increasing attention from the machine learning community. The preconditioner used by Shampoo can be viewed either as an approximation of the Gauss--Newton component of the Hessian or the covariance matrix of the gradients maintained by Adagrad. We provide an explicit and novel connection between the $\textit{optimal}$ Kronecker product approximation of these matrices and the approximation made by Shampoo. Our connection highlights a subtle but common misconception about Shampoo's approximation. In particular, the $\textit{square}$ of the approximation used by the Shampoo optimizer is equivalent to a single step of the power iteration algorithm for computing the aforementioned optimal Kronecker product approximation. Across a variety of datasets and architectures we empirically demonstrate that this is close to the optimal Kronecker product approximation. Additionally, for the Hessian approximation viewpoint, we empirically study the impact of various practical tricks to make Shampoo more computationally efficient (such as using the batch gradient and the empirical Fisher) on the quality of Hessian approximation.
Published: 2024

22. DataComp-LM: In search of the next generation of training sets for language models

Author: Li, Jeffrey, Fang, Alex, Smyrnis, Georgios, Ivgi, Maor, Jordan, Matt, Gadre, Samir, Bansal, Hritik, Guha, Etash, Keh, Sedrick, Arora, Kushal, Garg, Saurabh, Xin, Rui, Muennighoff, Niklas, Heckel, Reinhard, Mercat, Jean, Chen, Mayee, Gururangan, Suchin, Wortsman, Mitchell, Albalak, Alon, Bitton, Yonatan, Nezhurina, Marianna, Abbas, Amro, Hsieh, Cheng-Yu, Ghosh, Dhruba, Gardner, Josh, Kilian, Maciej, Zhang, Hanlin, Shao, Rulin, Pratt, Sarah, Sanyal, Sunny, Ilharco, Gabriel, Daras, Giannis, Marathe, Kalyani, Gokaslan, Aaron, Zhang, Jieyu, Chandu, Khyathi, Nguyen, Thao, Vasiljevic, Igor, Kakade, Sham, Song, Shuran, Sanghavi, Sujay, Faghri, Fartash, Oh, Sewoong, Zettlemoyer, Luke, Lo, Kyle, El-Nouby, Alaaeldin, Pouransari, Hadi, Toshev, Alexander, Wang, Stephanie, Groeneveld, Dirk, Soldaini, Luca, Koh, Pang Wei, Jitsev, Jenia, Kollar, Thomas, Dimakis, Alexandros G., Carmon, Yair, Dave, Achal, Schmidt, Ludwig, and Shankar, Vaishaal
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: We introduce DataComp for Language Models (DCLM), a testbed for controlled dataset experiments with the goal of improving language models. As part of DCLM, we provide a standardized corpus of 240T tokens extracted from Common Crawl, effective pretraining recipes based on the OpenLM framework, and a broad suite of 53 downstream evaluations. Participants in the DCLM benchmark can experiment with data curation strategies such as deduplication, filtering, and data mixing at model scales ranging from 412M to 7B parameters. As a baseline for DCLM, we conduct extensive experiments and find that model-based filtering is key to assembling a high-quality training set. The resulting dataset, DCLM-Baseline enables training a 7B parameter language model from scratch to 64% 5-shot accuracy on MMLU with 2.6T training tokens. Compared to MAP-Neo, the previous state-of-the-art in open-data language models, DCLM-Baseline represents a 6.6 percentage point improvement on MMLU while being trained with 40% less compute. Our baseline model is also comparable to Mistral-7B-v0.3 and Llama 3 8B on MMLU (63% & 66%), and performs similarly on an average of 53 natural language understanding tasks while being trained with 6.6x less compute than Llama 3 8B. Our results highlight the importance of dataset design for training language models and offer a starting point for further research on data curation.
Comment: Project page: https://www.datacomp.ai/dclm/
Published: 2024

23. Transcendence: Generative Models Can Outperform The Experts That Train Them

Author: Zhang, Edwin, Zhu, Vincent, Saphra, Naomi, Kleiman, Anat, Edelman, Benjamin L., Tambe, Milind, Kakade, Sham M., and Malach, Eran
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Generative models are trained with the simple objective of imitating the conditional probability distribution induced by the data they are trained on. Therefore, when trained on data generated by humans, we may not expect the artificial model to outperform the humans on their original objectives. In this work, we study the phenomenon of transcendence: when a generative model achieves capabilities that surpass the abilities of the experts generating its data. We demonstrate transcendence by training an autoregressive transformer to play chess from game transcripts, and show that the trained model can sometimes achieve better performance than all players in the dataset. We theoretically prove that transcendence can be enabled by low-temperature sampling, and rigorously assess this claim experimentally. Finally, we discuss other sources of transcendence, laying the groundwork for future investigation of this phenomenon in a broader setting.
Comment: Code, models, and data at https://transcendence.eddie.win
Published: 2024

24. CoLoR-Filter: Conditional Loss Reduction Filtering for Targeted Language Model Pre-training

Author: Brandfonbrener, David, Zhang, Hanlin, Kirsch, Andreas, Schwarz, Jonathan Richard, and Kakade, Sham
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Selecting high-quality data for pre-training is crucial in shaping the downstream task performance of language models. A major challenge lies in identifying this optimal subset, a problem generally considered intractable, thus necessitating scalable and effective heuristics. In this work, we propose a data selection method, CoLoR-Filter (Conditional Loss Reduction Filtering), which leverages an empirical Bayes-inspired approach to derive a simple and computationally efficient selection criterion based on the relative loss values of two auxiliary models. In addition to the modeling rationale, we evaluate CoLoR-Filter empirically on two language modeling tasks: (1) selecting data from C4 for domain adaptation to evaluation on Books and (2) selecting data from C4 for a suite of downstream multiple-choice question answering tasks. We demonstrate favorable scaling both as we subselect more aggressively and using small auxiliary models to select data for large target models. As one headline result, CoLoR-Filter data selected using a pair of 150m parameter auxiliary models can train a 1.2b parameter target model to match a 1.2b parameter model trained on 25b randomly selected tokens with 25x less data for Books and 11x less data for the downstream tasks. Code: https://github.com/davidbrandfonbrener/color-filter-olmo Filtered data: https://huggingface.co/datasets/davidbrandfonbrener/color-filtered-c4
Published: 2024

25. Scaling Laws in Linear Regression: Compute, Parameters, and Data

Author: Lin, Licong, Wu, Jingfeng, Kakade, Sham M., Bartlett, Peter L., and Lee, Jason D.
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Mathematics - Statistics Theory, Statistics - Machine Learning
Abstract: Empirically, large-scale deep learning models often satisfy a neural scaling law: the test error of the trained model improves polynomially as the model size and data size grow. However, conventional wisdom suggests the test error consists of approximation, bias, and variance errors, where the variance error increases with model size. This disagrees with the general form of neural scaling laws, which predict that increasing model size monotonically improves performance. We study the theory of scaling laws in an infinite dimensional linear regression setup. Specifically, we consider a model with $M$ parameters as a linear function of sketched covariates. The model is trained by one-pass stochastic gradient descent (SGD) using $N$ data. Assuming the optimal parameter satisfies a Gaussian prior and the data covariance matrix has a power-law spectrum of degree $a>1$, we show that the reducible part of the test error is $\Theta(M^{-(a-1)} + N^{-(a-1)/a})$. The variance error, which increases with $M$, is dominated by the other errors due to the implicit regularization of SGD, thus disappearing from the bound. Our theory is consistent with the empirical neural scaling laws and verified by numerical simulation.
Published: 2024

26. Phase 1 clinical trial of B-Cell Maturation Antigen (BCMA) NEX-T® Chimeric Antigen Receptor (CAR) T cell therapy CC-98633/BMS-986354 in participants with triple-class exposed multiple myeloma

Author: Ravi, Gayathri, Richard, Shambavi, Kumar, Shaji, Atrash, Shebli, Liedtke, Michaela, Kaur, Gurbakhash, Derman, Benjamin, Bergsagel, P. Leif, Mailankody, Sham, McCarthy, Philip, Shrestha, Alok, Kelly, Lisa M., Ly, Thomas, Das, Sharmila, Thorpe, Jerill, Maier, Alison, Varun, Divya, Navarro, Garnet, Burgess, Michael R., Hege, Kristen, Koegel, Ashley K., and Costa, Luciano J.
Published: 2025
Full Text: View/download PDF

27. Starfysh integrates spatial transcriptomic and histologic data to reveal heterogeneous tumor–immune hubs

Author: He, Siyu, Jin, Yinuo, Nazaret, Achille, Shi, Lingting, Chen, Xueer, Rampersaud, Sham, Dhillon, Bahawar S., Valdez, Izabella, Friend, Lauren E., Fan, Joy Linyue, Park, Cameron Y., Mintz, Rachel L., Lao, Yeh-Hsing, Carrera, David, Fang, Kaylee W., Mehdi, Kaleem, Rohde, Madeline, McFaline-Figueroa, José L., Blei, David, Leong, Kam W., Rudensky, Alexander Y., Plitas, George, and Azizi, Elham
Published: 2025
Full Text: View/download PDF

28. An efficient methodological approach for synthesis of selenopyridines: generation, reactions, anticancer activity, EGFR inhibitory activity and molecular docking studies

Author: Hussein, Bahgat R. M., El-Saghier, Sham M. M., Allam, Rasha M., Mohamed, Mamdouh F. A., and Amer, Amer A.
Published: 2025
Full Text: View/download PDF

29. Role of ALT Perforator Flaps in Head and Neck Reconstruction

Author: Khowala, Sagarika, Sham, M. E., Reddy, Thyagraj, Menon, Suresh, and Archana, S.
Published: 2025
Full Text: View/download PDF

30. Design optimization of 3D printed concrete elements considering life cycle assessment and life cycle costing

Author: Bahoria, Boskey V., Bhagat, Rajesh M., Pande, Prashant B., Raut, Jayant M., Dhengare, Sagar W., Mankar, Sham H., Vairagade, Vikrant S., and Shelare, Sagar D.
Published: 2024
Full Text: View/download PDF

31. Light-enabled scalable synthesis of bicyclo[1.1.1]pentane halides and their functionalizations

Author: Ripenko, Vasyl, Sham, Vadym, Levchenko, Vitalina, Holovchuk, Serhii, Vysochyn, Daniil, Klymov, Ivan, Kyslyi, Dmytro, Veselovych, Stanislav, Zhersh, Serhii, Dmytriv, Yurii, Tolmachev, Andrey, Sadkova, Iryna, Pishel, Irina, Horbatok, Kateryna, Kosach, Viktoria, Nikandrova, Yelyzaveta, and Mykhailiuk, Pavel K.
Published: 2024
Full Text: View/download PDF

32. Predictive modeling for concrete properties under variable curing conditions using advanced machine learning approaches

Author: Mungle, Nischal P., Mate, Dnyaneshwar M., Mankar, Sham H., Tale, Vithoba T., Mehta, Ankita, Dhale, Shrikrishna A., and Vairagade, Vikrant S.
Published: 2024
Full Text: View/download PDF

33. Efficacy of Reso-Pac in Wound Healing After Surgical Removal of Impacted Mandibular Third Molars: A Clinical Study

Author: Pravallika, Avapati, Menon, Suresh, Sham, M. E., Archana, S., and Mathews, Sheron
Published: 2024
Full Text: View/download PDF

34. Applications of computational intelligence for predictive modeling of properties of blended cement sustainable concrete incorporating various industrial byproducts towards sustainable construction

Author: Mungle, Nischal P., Mate, Dnyaneshwar M., Mankar, Sham H., Tale, Vithoba T., Vairagade, Vikrant S., and Shelare, Sagar D.
Published: 2024
Full Text: View/download PDF

35. Superposed Decoding: Multiple Generations from a Single Autoregressive Inference Pass

Author: Shen, Ethan, Fan, Alan, Pratt, Sarah M., Park, Jae Sung, Wallingford, Matthew, Kakade, Sham M., Holtzman, Ari, Krishna, Ranjay, Farhadi, Ali, and Kusupati, Aditya
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Many applications today provide users with multiple auto-complete drafts as they type, including GitHub's code completion, Gmail's smart compose, and Apple's messaging auto-suggestions. Under the hood, language models support this by running an autoregressive inference pass to provide a draft. Consequently, providing $k$ drafts to the user requires running an expensive language model $k$ times. To alleviate the computation cost of running $k$ inference passes, we propose Superposed Decoding, a new decoding algorithm that generates $k$ drafts at the computation cost of one autoregressive inference pass. We achieve this by feeding a superposition of the most recent token embeddings from the $k$ drafts as input to the next decoding step of the language model. At every inference step we combine the $k$ drafts with the top-$k$ tokens to get $k^2$ new drafts and cache the $k$ most likely options, using an n-gram interpolation with minimal compute overhead to filter out incoherent generations. Our experiments show that $k$ drafts from Superposed Decoding are at least as coherent and factual as Nucleus Sampling and Greedy Decoding respectively, while being at least $2.44\times$ faster for $k\ge3$. In a compute-normalized setting, user evaluations demonstrably favor text generated by Superposed Decoding over Nucleus Sampling. Superposed Decoding can also be combined with other decoding strategies, resulting in universal coverage gains when scaling inference time compute. Code and more examples open-sourced at https://github.com/RAIVNLab/SuperposedDecoding.
Comment: 23 pages, 16 figures, accepted at NeurIPS 2024
Published: 2024

36. Matching the Statistical Query Lower Bound for $k$-Sparse Parity Problems with Sign Stochastic Gradient Descent

Author: Kou, Yiwen, Chen, Zixiang, Gu, Quanquan, and Kakade, Sham M.
Subjects: Computer Science - Machine Learning, Mathematics - Optimization and Control, Statistics - Machine Learning
Abstract: The $k$-sparse parity problem is a classical problem in computational complexity and algorithmic theory, serving as a key benchmark for understanding computational classes. In this paper, we solve the $k$-sparse parity problem with sign stochastic gradient descent, a variant of stochastic gradient descent (SGD) on two-layer fully-connected neural networks. We demonstrate that this approach can efficiently solve the $k$-sparse parity problem on a $d$-dimensional hypercube ($k\leq O(\sqrt{d})$) with a sample complexity of $\tilde{O}(d^{k-1})$ using $2^{\Theta(k)}$ neurons, matching the established $\Omega(d^{k})$ lower bounds of Statistical Query (SQ) models. Our theoretical analysis begins by constructing a good neural network capable of correctly solving the $k$-parity problem. We then demonstrate how a trained neural network with sign SGD can effectively approximate this good network, solving the $k$-parity problem with small statistical errors. To the best of our knowledge, this is the first result that matches the SQ lower bound for solving $k$-sparse parity problem using gradient-based methods.
Comment: 37 pages, 7 figures, 3 tables. In NeurIPS 2024
Published: 2024

37. Association of neurotransmitter pathway polygenic risk with specific symptom profiles in psychosis

Author: Warren, Tracy L, Tubbs, Justin D, Lesh, Tyler A, Corona, Mylena B, Pakzad, Sarvenaz S, Albuquerque, Marina D, Singh, Praveena, Zarubin, Vanessa, Morse, Sarah J, Sham, Pak Chung, Carter, Cameron S, and Nord, Alex S
Subjects: Biological Psychology, Pharmacology and Pharmaceutical Sciences, Biomedical and Clinical Sciences, Psychology, Serious Mental Illness, Human Genome, Neurosciences, Schizophrenia, Mental Illness, Clinical Research, Genetics, Mental Health, Behavioral and Social Science, Bipolar Disorder, Brain Disorders, Prevention, 2.1 Biological and endogenous factors, Mental health, Good Health and Well Being, Humans, Female, Male, Multifactorial Inheritance, Psychotic Disorders, Adult, Neurotransmitter Agents, Genome-Wide Association Study, Genetic Predisposition to Disease, Endophenotypes, Glutamic Acid, Dopamine, Case-Control Studies, Young Adult, Genotype, Magnetic Resonance Imaging, Risk Factors, Biological Sciences, Medical and Health Sciences, Psychology and Cognitive Sciences, Psychiatry, Clinical sciences, Biological psychology, Clinical and health psychology
Abstract: A primary goal of psychiatry is to better understand the pathways that link genetic risk to psychiatric symptoms. Here, we tested association of diagnosis and endophenotypes with overall and neurotransmitter pathway-specific polygenic risk in patients with early-stage psychosis. Subjects included 205 demographically diverse cases with a psychotic disorder who underwent comprehensive psychiatric and neurological phenotyping and 115 matched controls. Following genotyping, we calculated polygenic scores (PGSs) for schizophrenia (SZ) and bipolar disorder (BP) using Psychiatric Genomics Consortium GWAS summary statistics. To test if overall genetic risk can be partitioned into affected neurotransmitter pathways, we calculated pathway PGSs (pPGSs) for SZ risk affecting each of four major neurotransmitter systems: glutamate, GABA, dopamine, and serotonin. Psychosis subjects had elevated SZ PGS versus controls; cases with SZ or BP diagnoses had stronger SZ or BP risk, respectively. There was no significant association within psychosis cases between individual symptom measures and overall PGS. However, neurotransmitter-specific pPGSs were moderately associated with specific endophenotypes; notably, glutamate was associated with SZ diagnosis and with deficits in cognitive control during task-based fMRI, while dopamine was associated with global functioning. Finally, unbiased endophenotype-driven clustering identified three diagnostically mixed case groups that separated on primary deficits of positive symptoms, negative symptoms, global functioning, and cognitive control. All clusters showed strong genome-wide risk. Cluster 2, characterized by deficits in cognitive control and negative symptoms, additionally showed specific risk concentrated in glutamatergic and GABAergic pathways. Due to the intensive characterization of our subjects, the present study was limited to a relatively small cohort. As such, results should be followed up with additional research at the population and mechanism level. Our study suggests pathway-based PGS analysis may be a powerful path forward to study genetic mechanisms driving psychiatric endophenotypes.
Published: 2024

38. Education for patients with rheumatic diseases being treated with biologics: need, strategies, challenges, and solutions

Author: Santhanam, Sham and Ravindran, Vinod
Published: 2025
Full Text: View/download PDF

39. Superionic conducting vacancy-rich β-Li3N electrolyte for stable cycling of all-solid-state lithium metal batteries

Author: Li, Weihan, Li, Minsi, Wang, Shuo, Chien, Po-Hsiu, Luo, Jing, Fu, Jiamin, Lin, Xiaoting, King, Graham, Feng, Renfei, Wang, Jian, Zhou, Jigang, Li, Ruying, Liu, Jue, Mo, Yifei, Sham, Tsun-Kong, and Sun, Xueliang
Published: 2024
Full Text: View/download PDF

40. The Effect of Lag Time Between Injury and Treatment of Zygomatic Complex Fractures on the Neurosensory Disturbances of the Infraorbital Nerve

Author: Nimmagadda, Jahnavi, Archana, S., Mathews, Sheron, Menon, Suresh, Sham, M. E., and Narayan, Taradevi
Published: 2024
Full Text: View/download PDF

41. Comparison of Efficacy of 0.5% Ropivacaine Versus 2% Lignocaine Hydrochloride with Adrenaline (1:80,000) in Surgical Removal of Impacted Mandibular Third Molars with Emphasis on Depth of Anesthesia- Prospective, Randomized, Split Mouth Study

Author: Mohan, Aishwarya Ram, Menon, Suresh, Sham, M. E., Archana, S., Mathews, Sheron, and Narayan, Taradevi
Published: 2024
Full Text: View/download PDF

42. Estimation of causal effects of genes on complex traits using a Bayesian-network-based framework applied to GWAS data

Author: Yin, Liangying, Feng, Yaning, Shi, Yujia, Lau, Alexandria, Qiu, Jinghong, Sham, Pak-Chung, and So, Hon-Cheong
Published: 2024
Full Text: View/download PDF

43. Incidence of Maxillofacial Injuries in the Emergency Department-Retrospective Study

Author: Menon, Col Suresh, Mohan, Aishwarya Ram, Nimmagadda, Jahnavi, Manohar, Keerthi, Sham, M. E., Archana, S., Mathews, Sheron, and Narayan, P. Taradevi
Published: 2024
Full Text: View/download PDF

44. The acceptance of COVID-19 pandemic control measures during congregational prayer on Muslims’ emotional and social interaction

Author: Mastor, Khairul Anwar, Samuri, Mohd Al Adib, Md Sham, Fariza, Idris, Fazilah, Mohd Kashim, Mohd Izhar Ariff, and Hopkins, Peter
Published: 2024
Full Text: View/download PDF

45. Efficacy of Transdermal Buprenorphine Patch in Postoperative Pain Management in Oral and Maxillofacial Surgery

Author: N., Mythili, Navaneetham, Romir, Sham, M. E., Menon, Suresh, Kumar, Veerendra, and S., Archana
Published: 2024
Full Text: View/download PDF

46. Mechanical, thermal, and flammability behavior of chitosan-modified date palm leaf–based polypropylene composites

Author: Singh, Gajendra, Gairola, Sandeep, Singh, Inderdeep, and R, Sham Sundar
Published: 2024
Full Text: View/download PDF

47. Assessing the Impact of Sewage Sludge-Chinese Medicinal Herbal Residues-Biochar Amendment on Antibiotic Resistance Genes in Soil-Plant Systems

Author: Pan, Min and Sham, Yik Tung
Published: 2024
Full Text: View/download PDF

48. Follow My Instruction and Spill the Beans: Scalable Data Extraction from Retrieval-Augmented Generation Systems

Author: Qi, Zhenting, Zhang, Hanlin, Xing, Eric, Kakade, Sham, and Lakkaraju, Himabindu
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Retrieval-Augmented Generation (RAG) improves pre-trained models by incorporating external knowledge at test time to enable customized adaptation. We study the risk of datastore leakage in Retrieval-In-Context RAG Language Models (LMs). We show that an adversary can exploit LMs' instruction-following capabilities to easily extract text data verbatim from the datastore of RAG systems built with instruction-tuned LMs via prompt injection. The vulnerability exists for a wide range of modern LMs that span Llama2, Mistral/Mixtral, Vicuna, SOLAR, WizardLM, Qwen1.5, and Platypus2, and the exploitability exacerbates as the model size scales up. We also study multiple effects of RAG setup on the extractability of data, indicating that following unexpected instructions to regurgitate data can be an outcome of failure in effectively utilizing contexts for modern LMs, and further show that such vulnerability can be greatly mitigated by position bias elimination strategies. Extending our study to production RAG models GPTs, we design an attack that can cause datastore leakage with a 100% success rate on 25 randomly selected customized GPTs with at most 2 queries, and we extract text data verbatim at a rate of 41% from a book of 77,000 words and 3% from a corpus of 1,569,000 words by prompting the GPTs with only 100 queries generated by themselves.
Published: 2024

49. Q-Probe: A Lightweight Approach to Reward Maximization for Language Models

Author: Li, Kenneth, Jelassi, Samy, Zhang, Hugh, Kakade, Sham, Wattenberg, Martin, and Brandfonbrener, David
Subjects: Computer Science - Machine Learning
Abstract: We present an approach called Q-probing to adapt a pre-trained language model to maximize a task-specific reward function. At a high level, Q-probing sits between heavier approaches such as finetuning and lighter approaches such as few shot prompting, but can also be combined with either. The idea is to learn a simple linear function on a model's embedding space that can be used to reweight candidate completions. We theoretically show that this sampling procedure is equivalent to a KL-constrained maximization of the Q-probe as the number of samples increases. To train the Q-probes we consider either reward modeling or a class of novel direct policy learning objectives based on importance weighted policy gradients. With this technique, we see gains in domains with ground-truth rewards (code generation) as well as implicit rewards defined by preference data, even outperforming finetuning in data-limited regimes. Moreover, a Q-probe can be trained on top of an API since it only assumes access to sampling and embeddings. Code: https://github.com/likenneth/q_probe .
Published: 2024

50. Memristor-Based MobileNetV3 Circuit Design for Image Classification

Author: Li, Jiale, Fu, Yulin, Yan, Dongwei, Ma, Sean Longyu, Sham, Chiu-Wing, and Fu, Chong
Subjects: Computer Science - Hardware Architecture
Abstract: The increasing computational demands of deep learning models pose significant challenges for edge devices. To address this, we propose a memristor-based circuit design for MobileNetV3, specifically for image classification tasks. Our design leverages the low power consumption and high integration density of memristors, making it suitable for edge computing. The architecture includes optimized memristive convolutional modules, batch normalization modules, activation function modules, global average pooling modules, and fully connected modules. Experimental results on the CIFAR-10 dataset show that our memristor-based MobileNetV3 achieves over 90% accuracy while significantly reducing inference time and energy consumption compared to traditional implementations. This work demonstrates the potential of memristor-based designs for efficient deployment of deep learning models in resource-constrained environments.
Published: 2024