Author: "Michaud, Eric J." - Searchworks@Jio Institute Digital Library Search Results

Your search for "Michaud, Eric J." returned 17 results

1. The Geometry of Concepts: Sparse Autoencoder Feature Structure

Author: Li, Yuxiao, Michaud, Eric J., Baek, David D., Engels, Joshua, Sun, Xiaoqing, and Tegmark, Max
Subjects: Quantitative Biology - Neurons and Cognition, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Sparse autoencoders have recently produced dictionaries of high-dimensional vectors corresponding to the universe of concepts represented by large language models. We find that this concept universe has interesting structure at three levels: 1) The "atomic" small-scale structure contains "crystals" whose faces are parallelograms or trapezoids, generalizing well-known examples such as (man-woman-king-queen). We find that the quality of such parallelograms and associated function vectors improves greatly when projecting out global distractor directions such as word length, which is efficiently done with linear discriminant analysis. 2) The "brain" intermediate-scale structure has significant spatial modularity; for example, math and code features form a "lobe" akin to functional lobes seen in neural fMRI images. We quantify the spatial locality of these lobes with multiple metrics and find that clusters of co-occurring features, at coarse enough scale, also cluster together spatially far more than one would expect if feature geometry were random. 3) The "galaxy" scale large-scale structure of the feature point cloud is not isotropic, but instead has a power law of eigenvalues with steepest slope in middle layers. We also quantify how the clustering entropy depends on the layer.
Comment: 13 pages, 12 figures
Published: 2024

2. Efficient Dictionary Learning with Switch Sparse Autoencoders

Author: Mudide, Anish, Engels, Joshua, Michaud, Eric J., Tegmark, Max, and de Witt, Christian Schroeder
Subjects: Computer Science - Machine Learning
Abstract: Sparse autoencoders (SAEs) are a recent technique for decomposing neural network activations into human-interpretable features. However, in order for SAEs to identify all features represented in frontier models, it will be necessary to scale them up to very high width, posing a computational challenge. In this work, we introduce Switch Sparse Autoencoders, a novel SAE architecture aimed at reducing the compute cost of training SAEs. Inspired by sparse mixture of experts models, Switch SAEs route activation vectors between smaller "expert" SAEs, enabling SAEs to efficiently scale to many more features. We present experiments comparing Switch SAEs with other SAE architectures, and find that Switch SAEs deliver a substantial Pareto improvement in the reconstruction vs. sparsity frontier for a given fixed training compute budget. We also study the geometry of features across experts, analyze features duplicated across experts, and verify that Switch SAE features are as interpretable as features found by other SAE architectures.
Comment: Code available at https://github.com/amudide/switch_sae
Published: 2024
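
The routing idea in the abstract above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation (see their repository for that): the class names `TinySAE` and `SwitchSAESketch`, the argmax router, the ReLU encoder, and the random untrained weights are all assumptions made for the example.

```python
import random

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

class TinySAE:
    """Hypothetical small 'expert' autoencoder: ReLU encoder, linear decoder."""
    def __init__(self, d_model, n_features, rng):
        self.enc = random_matrix(n_features, d_model, rng)
        self.dec = random_matrix(d_model, n_features, rng)

    def forward(self, x):
        feats = [max(0.0, a) for a in matvec(self.enc, x)]
        return matvec(self.dec, feats), feats

class SwitchSAESketch:
    """Hypothetical router plus experts; weights are random, nothing is trained."""
    def __init__(self, d_model, n_experts, features_per_expert, seed=0):
        rng = random.Random(seed)
        self.router = random_matrix(n_experts, d_model, rng)
        self.experts = [TinySAE(d_model, features_per_expert, rng)
                        for _ in range(n_experts)]

    def route(self, x):
        # Assumed rule: send the activation to the highest-scoring expert.
        scores = matvec(self.router, x)
        return max(range(len(scores)), key=scores.__getitem__)

    def forward(self, x):
        idx = self.route(x)
        # Only the selected expert's encoder and decoder are evaluated.
        recon, feats = self.experts[idx].forward(x)
        return idx, recon, feats

model = SwitchSAESketch(d_model=4, n_experts=3, features_per_expert=8)
expert_index, reconstruction, features = model.forward([1.0, -0.5, 0.25, 2.0])
```

In this sketch each input touches 8 encoder rows rather than all 24, which is the kind of compute saving the abstract attributes to routing between smaller experts.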

3. Survival of the Fittest Representation: A Case Study with Modular Addition

Author: Ding, Xiaoman Delores, Guo, Zifan Carl, Michaud, Eric J., Liu, Ziming, and Tegmark, Max
Subjects: Computer Science - Machine Learning
Abstract: When a neural network can learn multiple distinct algorithms to solve a task, how does it "choose" between them during training? To approach this question, we take inspiration from ecology: when multiple species coexist, they eventually reach an equilibrium where some survive while others die out. Analogously, we suggest that a neural network at initialization contains many solutions (representations and algorithms), which compete with each other under pressure from resource constraints, with the "fittest" ultimately prevailing. To investigate this Survival of the Fittest hypothesis, we conduct a case study on neural networks performing modular addition, and find that these networks' multiple circular representations at different Fourier frequencies undergo such competitive dynamics, with only a few circles surviving at the end. We find that the frequencies with high initial signals and gradients, the "fittest," are more likely to survive. By increasing the embedding dimension, we also observe more surviving frequencies. Inspired by the Lotka-Volterra equations describing the dynamics between species, we find that the dynamics of the circles can be nicely characterized by a set of linear differential equations. Our results with modular addition show that it is possible to decompose complicated representations into simpler components, along with their basic interactions, to offer insight on the training dynamics of representations.
Published: 2024

4. Not All Language Model Features Are Linear

Author: Engels, Joshua, Michaud, Eric J., Liao, Isaac, Gurnee, Wes, and Tegmark, Max
Subjects: Computer Science - Machine Learning
Abstract: Recent work has proposed that language models perform computation by manipulating one-dimensional representations of concepts ("features") in activation space. In contrast, we explore whether some language model representations may be inherently multi-dimensional. We begin by developing a rigorous definition of irreducible multi-dimensional features based on whether they can be decomposed into either independent or non-co-occurring lower-dimensional features. Motivated by these definitions, we design a scalable method that uses sparse autoencoders to automatically find multi-dimensional features in GPT-2 and Mistral 7B. These auto-discovered features include strikingly interpretable examples, e.g. circular features representing days of the week and months of the year. We identify tasks where these exact circles are used to solve computational problems involving modular arithmetic in days of the week and months of the year. Next, we provide evidence that these circular features are indeed the fundamental unit of computation in these tasks with intervention experiments on Mistral 7B and Llama 3 8B. Finally, we find further circular representations by breaking down the hidden states for these tasks into interpretable components, and we examine the continuity of the days of the week feature in Mistral 7B.
Comment: Code and data at https://github.com/JoshEngels/MultiDimensionalFeatures
Published: 2024
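
To make the notion of a circular feature concrete, the sketch below places the seven days of the week on a unit circle and performs "day plus k days" as a rotation. It is a hand-built toy meant only to show why a two-dimensional circle supports modular arithmetic; it is not the mechanism the paper extracts from GPT-2, Mistral 7B, or Llama 3 8B, and the function names `embed`, `rotate`, and `decode` are hypothetical.

```python
import math

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]

def embed(day_index, period=7):
    """Place a day on the unit circle as a 2-D point."""
    angle = 2 * math.pi * day_index / period
    return (math.cos(angle), math.sin(angle))

def rotate(point, steps, period=7):
    """Advance a point by `steps` days using a 2-D rotation."""
    angle = 2 * math.pi * steps / period
    x, y = point
    return (x * math.cos(angle) - y * math.sin(angle),
            x * math.sin(angle) + y * math.cos(angle))

def decode(point, period=7):
    """Read the nearest day index back off the circle."""
    angle = math.atan2(point[1], point[0])
    return round(angle * period / (2 * math.pi)) % period

# Saturday (index 5) plus 4 days, computed purely by rotation.
result = DAYS[decode(rotate(embed(5), 4))]
```

A single linear direction cannot wrap Sunday back around to Monday, whereas the rotation does so automatically; that is the intuition behind calling such a feature irreducibly two-dimensional.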

5. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Author: Marks, Samuel, Rager, Can, Michaud, Eric J., Belinkov, Yonatan, Bau, David, and Mueller, Aaron
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We introduce methods for discovering and applying sparse feature circuits. These are causally implicated subnetworks of human-interpretable features for explaining language model behaviors. Circuits identified in prior work consist of polysemantic and difficult-to-interpret units like attention heads or neurons, rendering them unsuitable for many downstream applications. In contrast, sparse feature circuits enable detailed understanding of unanticipated mechanisms. Because they are based on fine-grained units, sparse feature circuits are useful for downstream tasks: We introduce SHIFT, where we improve the generalization of a classifier by ablating features that a human judges to be task-irrelevant. Finally, we demonstrate an entirely unsupervised and scalable interpretability pipeline by discovering thousands of sparse feature circuits for automatically discovered model behaviors.
Comment: Code and data at https://github.com/saprmarks/feature-circuits. Demonstration at https://feature-circuits.xyz
Published: 2024

6. Opening the AI black box: program synthesis via mechanistic interpretability

Author: Michaud, Eric J., Liao, Isaac, Lad, Vedang, Liu, Ziming, Mudide, Anish, Loughridge, Chloe, Guo, Zifan Carl, Kheirkhah, Tara Rezaei, Vukelić, Mateja, and Tegmark, Max
Subjects: Computer Science - Machine Learning
Abstract: We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.
Comment: 24 pages
Published: 2024

7. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

Author: Casper, Stephen, Davies, Xander, Shi, Claudia, Gilbert, Thomas Krendl, Scheurer, Jérémy, Rando, Javier, Freedman, Rachel, Korbak, Tomasz, Lindner, David, Freire, Pedro, Wang, Tony, Marks, Samuel, Segerie, Charbel-Raphaël, Carroll, Micah, Peng, Andi, Christoffersen, Phillip, Damani, Mehul, Slocum, Stewart, Anwar, Usman, Siththaranjan, Anand, Nadeau, Max, Michaud, Eric J., Pfau, Jacob, Krasheninnikov, Dmitrii, Chen, Xin, Langosco, Lauro, Hase, Peter, Bıyık, Erdem, Dragan, Anca, Krueger, David, Sadigh, Dorsa, and Hadfield-Menell, Dylan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Reinforcement learning from human feedback (RLHF) is a technique for training AI systems to align with human goals. RLHF has emerged as the central method used to finetune state-of-the-art large language models (LLMs). Despite this popularity, there has been relatively little public work systematizing its flaws. In this paper, we (1) survey open problems and fundamental limitations of RLHF and related methods; (2) overview techniques to understand, improve, and complement RLHF in practice; and (3) propose auditing and disclosure standards to improve societal oversight of RLHF systems. Our work emphasizes the limitations of RLHF and highlights the importance of a multi-faceted approach to the development of safer AI systems.
Published: 2023

8. The Quantization Model of Neural Scaling

Author: Michaud, Eric J., Liu, Ziming, Girit, Uzay, and Tegmark, Max
Subjects: Computer Science - Machine Learning, Condensed Matter - Disordered Systems and Neural Networks
Abstract: We propose the Quantization Model of neural scaling laws, explaining both the observed power law dropoff of loss with model and data size, and also the sudden emergence of new capabilities with scale. We derive this model from what we call the Quantization Hypothesis, where network knowledge and skills are "quantized" into discrete chunks (quanta). We show that when quanta are learned in order of decreasing use frequency, then a power law in use frequencies explains observed power law scaling of loss. We validate this prediction on toy datasets, then study how scaling curves decompose for large language models. Using language model gradients, we automatically decompose model behavior into a diverse set of skills (quanta). We tentatively find that the frequency at which these quanta are used in the training distribution roughly follows a power law corresponding with the empirical scaling exponent for language models, a prediction of our theory.
Comment: 24 pages, 18 figures, NeurIPS 2023
Published: 2023
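
The central claim of the abstract above, that a power law in use frequencies yields a power law in loss, can be checked numerically with a toy model. The sketch makes several assumptions that are not taken from this listing: quanta frequencies are proportional to k^-(alpha+1), a model that has learned the n most frequent quanta incurs loss equal to the total frequency of the remaining quanta, and the distribution is truncated at a hypothetical `n_total`.

```python
import math

def tail_loss(n_learned, alpha, n_total=1_000_000):
    """Toy loss: summed use frequency of the quanta not yet learned.

    Assumes frequencies p_k proportional to k ** -(alpha + 1), with
    quanta learned in order of decreasing frequency (k = 1, 2, ...).
    """
    weights = [k ** -(alpha + 1) for k in range(1, n_total + 1)]
    return sum(weights[n_learned:]) / sum(weights)

def loglog_slope(alpha, n1=100, n2=1000):
    """Estimate the scaling exponent from two points on a log-log plot."""
    l1, l2 = tail_loss(n1, alpha), tail_loss(n2, alpha)
    return (math.log(l2) - math.log(l1)) / (math.log(n2) - math.log(n1))

slope = loglog_slope(alpha=0.5)
```

Under these assumptions the measured slope comes out close to -alpha, so loss falls as a power law in the number of quanta learned. The small deviation comes from truncating the frequency distribution at `n_total`.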

9. Precision Machine Learning

Author: Michaud, Eric J., Liu, Ziming, and Tegmark, Max
Subjects: Computer Science - Machine Learning, Physics - Computational Physics
Abstract: We explore unique considerations involved in fitting ML models to data with very high precision, as is often required for science applications. We empirically compare various function approximation methods and study how they scale with increasing parameters and data. We find that neural networks can often outperform classical approximation methods on high-dimensional examples, by auto-discovering and exploiting modular structures therein. However, neural networks trained with common optimizers are less powerful for low-dimensional cases, which motivates us to study the unique properties of neural network loss landscapes and the corresponding optimization challenges that arise in the high precision regime. To address the optimization issue in low dimensions, we develop training tricks which enable us to train neural networks to extremely low loss, close to the limits allowed by numerical precision.
Published: 2022

10. Omnigrok: Grokking Beyond Algorithmic Data

Author: Liu, Ziming, Michaud, Eric J., and Tegmark, Max
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Physics - Data Analysis, Statistics and Probability, Statistics - Methodology, Statistics - Machine Learning
Abstract: Grokking, the unusual phenomenon for algorithmic datasets where generalization happens long after overfitting the training data, has remained elusive. We aim to understand grokking by analyzing the loss landscapes of neural networks, identifying the mismatch between training and test losses as the cause for grokking. We refer to this as the "LU mechanism" because training and test losses (against model weight norm) typically resemble "L" and "U", respectively. This simple mechanism can nicely explain many aspects of grokking: data size dependence, weight decay dependence, the emergence of representations, etc. Guided by the intuitive picture, we are able to induce grokking on tasks involving images, language and molecules. In the reverse direction, we are able to eliminate grokking for algorithmic datasets. We attribute the dramatic nature of grokking for algorithmic datasets to representation learning.
Published: 2022

11. Towards Understanding Grokking: An Effective Theory of Representation Learning

Author: Liu, Ziming, Kitouni, Ouail, Nolte, Niklas, Michaud, Eric J., Tegmark, Max, and Williams, Mike
Subjects: Computer Science - Machine Learning, Condensed Matter - Disordered Systems and Neural Networks, Condensed Matter - Statistical Mechanics, Computer Science - Artificial Intelligence, Physics - Classical Physics
Abstract: We aim to understand grokking, a phenomenon where models generalize long after overfitting their training set. We present both a microscopic analysis anchored by an effective theory and a macroscopic analysis of phase diagrams describing learning performance across hyperparameters. We find that generalization originates from structured representations whose training dynamics and dependence on training set size can be predicted by our effective theory in a toy setting. We observe empirically the presence of four learning phases: comprehension, grokking, memorization, and confusion. We find representation learning to occur only in a "Goldilocks zone" (including comprehension and grokking) between memorization and confusion. We find on transformers the grokking phase stays closer to the memorization phase (compared to the comprehension phase), leading to delayed generalization. The Goldilocks phase is reminiscent of "intelligence from starvation" in Darwinian evolution, where resource limitations drive discovery of more efficient solutions. This study not only provides intuitive explanations of the origin of grokking, but also highlights the usefulness of physics-inspired tools, e.g., effective theories and phase diagrams, for understanding deep learning.
Comment: Accepted by NeurIPS 2022
Published: 2022

12. Understanding Learned Reward Functions

Author: Michaud, Eric J., Gleave, Adam, and Russell, Stuart
Subjects: Computer Science - Machine Learning
Abstract: In many real-world tasks, it is not possible to procedurally specify an RL agent's reward function. In such cases, a reward function must instead be learned from interacting with and observing humans. However, current techniques for reward learning may fail to produce reward functions which accurately reflect user preferences. Absent significant advances in reward learning, it is thus important to be able to audit learned reward functions to verify whether they truly capture user preferences. In this paper, we investigate techniques for interpreting learned reward functions. In particular, we apply saliency methods to identify failure modes and predict the robustness of reward functions. We find that learned reward functions often implement surprising algorithms that rely on contingent aspects of the environment. We also discover that existing interpretability techniques often attend to irrelevant changes in reward output, suggesting that reward interpretability may need significantly different methods from policy interpretability.
Comment: Presented at Deep RL Workshop, NeurIPS 2020
Published: 2020

13. Examining the causal structures of deep neural networks using information theory

Author: Mattsson, Simon, Michaud, Eric J., and Hoel, Erik
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Deep Neural Networks (DNNs) are often examined at the level of their response to input, such as analyzing the mutual information between nodes and data sets. Yet DNNs can also be examined at the level of causation, exploring "what does what" within the layers of the network itself. Historically, analyzing the causal structure of DNNs has received less attention than understanding their responses to input. Yet definitionally, generalizability must be a function of a DNN's causal structure since it reflects how the DNN responds to unseen or even not-yet-defined future inputs. Here, we introduce a suite of metrics based on information theory to quantify and track changes in the causal structure of DNNs during training. Specifically, we introduce the effective information (EI) of a feedforward DNN, which is the mutual information between layer input and output following a maximum-entropy perturbation. The EI can be used to assess the degree of causal influence nodes and edges have over their downstream targets in each layer. We show that the EI can be further decomposed in order to examine the sensitivity of a layer (measured by how well edges transmit perturbations) and the degeneracy of a layer (measured by how edge overlap interferes with transmission), along with estimates of the amount of integrated information of a layer. Together, these properties define where each layer lies in the "causal plane" which can be used to visualize how layer connectivity becomes more sensitive or degenerate over time, and how integration changes during training, revealing how the layer-by-layer causal structure differentiates. These results may help in understanding the generalization capabilities of DNNs and provide foundational tools for making DNNs both more generalizable and more explainable.
Comment: 14 pages, 8 figures
Published: 2020
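
The definition of effective information quoted in the abstract above, mutual information between input and output under a maximum-entropy perturbation, can be illustrated on a small discrete system. The sketch is an assumption-laden toy: it treats the mechanism as a row-stochastic transition table and takes the maximum-entropy perturbation to be a uniform distribution over inputs. The paper applies the idea to DNN layers, which this example does not attempt to reproduce, and the function name `effective_information` is chosen for illustration.

```python
import math

def effective_information(tpm):
    """Mutual information (in bits) between input and output when the
    input is set uniformly at random.

    `tpm[i][j]` is the probability of output j given input i; each row
    is assumed to sum to 1.
    """
    n_inputs = len(tpm)
    n_outputs = len(tpm[0])
    # Output distribution induced by the uniform input perturbation.
    p_out = [sum(row[j] for row in tpm) / n_inputs for j in range(n_outputs)]
    ei = 0.0
    for row in tpm:
        for j, p in enumerate(row):
            if p > 0.0:
                ei += (p / n_inputs) * math.log2(p / p_out[j])
    return ei

copy_gate = [[1.0, 0.0], [0.0, 1.0]]   # output always equals input
constant = [[1.0, 0.0], [1.0, 0.0]]    # output ignores input
ei_copy = effective_information(copy_gate)
ei_constant = effective_information(constant)
```

The copy mechanism transmits one full bit, while the constant mechanism transmits none because every input collapses onto the same output. That collapse is a discrete analogue of the degeneracy the abstract describes.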

14. Lunar Opportunities for SETI

Author: Michaud, Eric J., Siemion, Andrew P. V., Drew, Jamie, and Worden, S. Pete
Subjects: Astrophysics - Instrumentation and Methods for Astrophysics
Abstract: A radio telescope placed in lunar orbit, or on the surface of the Moon's farside, could be of great value to the Search for Extraterrestrial Intelligence (SETI). The advantage of such a telescope is that it would be shielded by the body of the Moon from terrestrial sources of radio frequency interference (RFI). While RFI can be identified and ignored by other fields of radio astronomy, the possible spectral similarity between human and alien-generated radio emission makes the abundance of artificial radio emission on and around the Earth a significant complicating factor for SETI. A Moon-based telescope would avoid this challenge. In this paper, we review existing literature on Moon-based radio astronomy, discuss the benefits of lunar SETI, contrast possible surface- and orbit-based telescope designs, and argue that such initiatives are scientifically feasible, both technically and financially, within the next decade.
Comment: 7 pages, submitted as a white paper for the National Academy of Sciences Planetary Science and Astrobiology Decadal Survey 2023-2032
Published: 2020

15. Precision Machine Learning

Author: Massachusetts Institute of Technology. Department of Physics, Center for Brains, Minds, and Machines, Michaud, Eric J., Liu, Ziming, and Tegmark, Max
Abstract: We explore unique considerations involved in fitting machine learning (ML) models to data with very high precision, as is often required for science applications. We empirically compare various function approximation methods and study how they scale with increasing parameters and data. We find that neural networks (NNs) can often outperform classical approximation methods on high-dimensional examples, by (we hypothesize) auto-discovering and exploiting modular structures therein. However, neural networks trained with common optimizers are less powerful for low-dimensional cases, which motivates us to study the unique properties of neural network loss landscapes and the corresponding optimization challenges that arise in the high precision regime. To address the optimization issue in low dimensions, we develop training tricks which enable us to train neural networks to extremely low loss, close to the limits allowed by numerical precision.
Published: 2023

16. Precision Machine Learning

Author: Michaud, Eric J., primary, Liu, Ziming, additional, and Tegmark, Max, additional
Published: 2023

17. Examining the Causal Structures of Deep Neural Networks Using Information Theory

Author: Marrow, Scythia, primary, Michaud, Eric J., additional, and Hoel, Erik, additional
Published: 2020
