Author: "von Oswald, Johannes" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"von Oswald, Johannes"' showing total 26 results

1. Multi-agent cooperation through learning-aware policy gradients

Author: Meulemans, Alexander, Kobayashi, Seijin, von Oswald, Johannes, Scherrer, Nino, Elmoznino, Eric, Richards, Blake, Lajoie, Guillaume, Arcas, Blaise Agüera y, and Sacramento, João
Subjects: Computer Science - Artificial Intelligence
Abstract: Self-interested individuals often fail to cooperate, posing a fundamental challenge for multi-agent learning. How can we achieve cooperation among self-interested, independent learning agents? Promising recent work has shown that in certain tasks cooperation can be established between learning-aware agents who model the learning dynamics of each other. Here, we present the first unbiased, higher-derivative-free policy gradient algorithm for learning-aware reinforcement learning, which takes into account that other agents are themselves learning through trial and error based on multiple noisy trials. We then leverage efficient sequence models to condition behavior on long observation histories that contain traces of the learning dynamics of other agents. Training long-context policies with our algorithm leads to cooperative behavior and high returns on standard social dilemmas, including a challenging environment where temporally-extended action coordination is required. Finally, we derive from the iterated prisoner's dilemma a novel explanation for how and when cooperation arises among self-interested learning-aware agents.
Published: 2024

2. Learning Randomized Algorithms with Transformers

Author: von Oswald, Johannes, Kobayashi, Seijin, Akram, Yassir, and Steger, Angelika
Subjects: Computer Science - Machine Learning
Abstract: Randomization is a powerful tool that endows algorithms with remarkable properties. For instance, randomized algorithms excel in adversarial settings, often surpassing the worst-case performance of deterministic algorithms with large margins. Furthermore, their success probability can be amplified by simple strategies such as repetition and majority voting. In this paper, we enhance deep neural networks, in particular transformer models, with randomization. We demonstrate for the first time that randomized algorithms can be instilled in transformers through learning, in a purely data- and objective-driven manner. First, we analyze known adversarial objectives for which randomized algorithms offer a distinct advantage over deterministic ones. We then show that common optimization techniques, such as gradient descent or evolutionary strategies, can effectively learn transformer parameters that make use of the randomness provided to the model. To illustrate the broad applicability of randomization in empowering neural networks, we study three conceptual tasks: associative recall, graph coloring, and agents that explore grid worlds. In addition to demonstrating increased robustness against oblivious adversaries through learned randomization, our experiments reveal remarkable performance improvements due to the inherently random nature of the neural networks' computation and predictions.
Published: 2024

3. When can transformers compositionally generalize in-context?

Author: Kobayashi, Seijin, Schug, Simon, Akram, Yassir, Redhardt, Florian, von Oswald, Johannes, Pascanu, Razvan, Lajoie, Guillaume, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Many tasks can be composed from a few independent components. This gives rise to a combinatorial explosion of possible tasks, only some of which might be encountered during training. Under what circumstances can transformers compositionally generalize from a subset of tasks to all possible combinations of tasks that share similar components? Here we study a modular multitask setting that allows us to precisely control compositional structure in the data generation process. We present evidence that transformers learning in-context struggle to generalize compositionally on this task despite being in principle expressive enough to do so. Compositional generalization becomes possible only when introducing a bottleneck that enforces an explicit separation between task inference and task execution.
Comment: ICML 2024 workshop on Next Generation of Sequence Modeling Architectures
Published: 2024

4. State Soup: In-Context Skill Learning, Retrieval and Mixing

Author: Pióro, Maciej, Wołczyk, Maciej, Pascanu, Razvan, von Oswald, Johannes, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: A new breed of gated-linear recurrent neural networks has reached state-of-the-art performance on a range of sequence modeling problems. Such models naturally handle long sequences efficiently, as the cost of processing a new input is independent of sequence length. Here, we explore another advantage of these stateful sequence models, inspired by the success of model merging through parameter interpolation. Building on parallels between fine-tuning and in-context learning, we investigate whether we can treat internal states as task vectors that can be stored, retrieved, and then linearly combined, exploiting the linearity of recurrence. We study this form of fast model merging on Mamba-2.8b, a pretrained recurrent model, and present preliminary evidence that simple linear state interpolation methods suffice to improve next-token perplexity as well as downstream in-context learning task performance.
Published: 2024

5. Linear Transformers are Versatile In-Context Learners

Author: Vladymyrov, Max, von Oswald, Johannes, Sandler, Mark, and Ge, Rong
Subjects: Computer Science - Machine Learning
Abstract: Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that any linear transformer maintains an implicit linear model and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We reverse-engineer this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
Published: 2024

6. Discovering modular solutions that generalize compositionally

Author: Schug, Simon, Kobayashi, Seijin, Akram, Yassir, Wołczyk, Maciej, Proca, Alexandra, von Oswald, Johannes, Pascanu, Razvan, Sacramento, João, and Steger, Angelika
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Many complex tasks can be decomposed into simpler, independent parts. Discovering such underlying compositional structure has the potential to enable compositional generalization. Despite progress, our most powerful systems struggle to compose flexibly. It therefore seems natural to make models more modular to help capture the compositional nature of many tasks. However, it is unclear under which circumstances modular systems can discover hidden compositional structure. To shed light on this question, we study a teacher-student setting with a modular teacher where we have full control over the composition of ground truth modules. This allows us to relate the problem of compositional generalization to that of identification of the underlying modules. In particular we study modularity in hypernetworks representing a general class of multiplicative interactions. We show theoretically that identification up to linear transformation purely from demonstrations is possible without having to learn an exponential number of module combinations. We further demonstrate empirically that under the theoretically identified conditions, meta-learning from finite data can discover modular policies that generalize compositionally in a number of complex environments.
Comment: Published as a conference paper at ICLR 2024; Code available at https://github.com/smonsays/modular-hyperteacher
Published: 2023

7. Uncovering mesa-optimization algorithms in Transformers

Author: von Oswald, Johannes, Schlegel, Maximilian, Meulemans, Alexander, Kobayashi, Seijin, Niklasson, Eyvind, Zucchet, Nicolas, Scherrer, Nino, Miller, Nolan, Sandler, Mark, Arcas, Blaise Agüera y, Vladymyrov, Max, Pascanu, Razvan, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Some autoregressive models exhibit in-context learning capabilities: being able to learn as an input sequence is processed, without undergoing any parameter changes, and without being explicitly trained to do so. The origins of this phenomenon are still poorly understood. Here we analyze a series of Transformer models trained to perform synthetic sequence prediction tasks, and discover that standard next-token prediction error minimization gives rise to a subsidiary learning algorithm that adjusts the model as new inputs are revealed. We show that this process corresponds to gradient-based optimization of a principled objective function, which leads to strong generalization performance on unseen sequences. Our findings explain in-context learning as a product of autoregressive loss minimization and inform the design of new optimization-based Transformer layers.
Published: 2023

8. Gated recurrent neural networks discover attention

Author: Zucchet, Nicolas, Kobayashi, Seijin, Akram, Yassir, von Oswald, Johannes, Larcher, Maxime, Steger, Angelika, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Recent architectural developments have enabled recurrent neural networks (RNNs) to reach and even surpass the performance of Transformers on certain sequence modeling tasks. These modern RNNs feature a prominent design pattern: linear recurrent layers interconnected by feedforward paths with multiplicative gating. Here, we show how RNNs equipped with these two design elements can exactly implement (linear) self-attention, the main building block of Transformers. By reverse-engineering a set of trained RNNs, we find that gradient descent in practice discovers our construction. In particular, we examine RNNs trained to solve simple in-context learning tasks on which Transformers are known to excel and find that gradient descent instills in our RNNs the same attention-based in-context learning algorithm used by Transformers. Our findings highlight the importance of multiplicative interactions in neural networks and suggest that certain RNNs might be unexpectedly implementing attention under the hood.
Published: 2023

9. Transformers learn in-context by gradient descent

Author: von Oswald, Johannes, Niklasson, Eyvind, Randazzo, Ettore, Sacramento, João, Mordvintsev, Alexander, Zhmoginov, Andrey, and Vladymyrov, Max
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: At present, the mechanisms of in-context learning in Transformers are not well understood and remain mostly an intuition. In this paper, we suggest that training Transformers on auto-regressive objectives is closely related to gradient-based meta-learning formulations. We start by providing a simple weight construction that shows the equivalence of data transformations induced by 1) a single linear self-attention layer and by 2) gradient-descent (GD) on a regression loss. Motivated by that construction, we show empirically that when training self-attention-only Transformers on simple regression tasks either the models learned by GD and Transformers show great similarity or, remarkably, the weights found by optimization match the construction. Thus we show how trained Transformers become mesa-optimizers i.e. learn models by gradient descent in their forward pass. This allows us, at least in the domain of regression problems, to mechanistically understand the inner workings of in-context learning in optimized Transformers. Building on this insight, we furthermore identify how Transformers surpass the performance of plain gradient descent by learning an iterative curvature correction and learn linear models on deep data representations to solve non-linear regression tasks. Finally, we discuss intriguing parallels to a mechanism identified to be crucial for in-context learning termed induction-head (Olsson et al., 2022) and show how it could be understood as a specific case of in-context learning by gradient descent learning within Transformers. Code to reproduce the experiments can be found at https://github.com/google-research/self-organising-systems/tree/master/transformers_learn_icl_by_gd .
Published: 2022

10. Disentangling the Predictive Variance of Deep Ensembles through the Neural Tangent Kernel

Author: Kobayashi, Seijin, Aceituno, Pau Vilimelis, and von Oswald, Johannes
Subjects: Computer Science - Machine Learning
Abstract: Identifying unfamiliar inputs, also known as out-of-distribution (OOD) detection, is a crucial property of any decision making process. A simple and empirically validated technique is based on deep ensembles where the variance of predictions over different neural networks acts as a substitute for input uncertainty. Nevertheless, a theoretical understanding of the inductive biases leading to the performance of deep ensembles' uncertainty estimation is missing. To improve our description of their behavior, we study deep ensembles with large layer widths operating in simplified linear training regimes, in which the functions trained with gradient descent can be described by the neural tangent kernel. We identify two sources of noise, each inducing a distinct inductive bias in the predictive variance at initialization. We further show theoretically and empirically that both noise sources affect the predictive variance of non-linear deep ensembles in toy models and realistic settings after training. Finally, we propose practical ways to eliminate part of these noise sources leading to significant changes and improved OOD detection in trained deep ensembles.
Published: 2022

11. Random initialisations performing above chance and how to find them

Author: Benzing, Frederik, Schug, Simon, Meier, Robert, von Oswald, Johannes, Akram, Yassir, Zucchet, Nicolas, Aitchison, Laurence, and Steger, Angelika
Subjects: Computer Science - Machine Learning
Abstract: Neural networks trained with stochastic gradient descent (SGD) starting from different random initialisations typically find functionally very similar solutions, raising the question of whether there are meaningful differences between different SGD solutions. Entezari et al. recently conjectured that despite different initialisations, the solutions found by SGD lie in the same loss valley after taking into account the permutation invariance of neural networks. Concretely, they hypothesise that any two solutions found by SGD can be permuted such that the linear interpolation between their parameters forms a path without significant increases in loss. Here, we use a simple but powerful algorithm to find such permutations that allows us to obtain direct empirical evidence that the hypothesis is true in fully connected networks. Strikingly, we find that two networks already live in the same loss valley at the time of initialisation and averaging their random, but suitably permuted initialisation performs significantly above chance. In contrast, for convolutional architectures, our evidence suggests that the hypothesis does not hold. Especially in a large learning rate regime, SGD seems to discover diverse modes.
Comment: NeurIPS 2022, 14th Annual Workshop on Optimization for Machine Learning (OPT2022)
Published: 2022

12. The least-control principle for local learning at equilibrium

Author: Meulemans, Alexander, Zucchet, Nicolas, Kobayashi, Seijin, von Oswald, Johannes, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, 68T07, I.2.6
Abstract: Equilibrium systems are a powerful way to express neural computations. As special cases, they include models of great current interest in both neuroscience and machine learning, such as deep neural networks, equilibrium recurrent neural networks, deep equilibrium models, or meta-learning. Here, we present a new principle for learning such systems with a temporally- and spatially-local rule. Our principle casts learning as a least-control problem, where we first introduce an optimal controller to lead the system towards a solution state, and then define learning as reducing the amount of control needed to reach such a state. We show that incorporating learning signals within a dynamics as an optimal control enables transmitting activity-dependent credit assignment information, avoids storing intermediate states in memory, and does not rely on infinitesimal learning signals. In practice, our principle leads to strong performance matching that of leading gradient-based learning methods when applied to an array of problems involving recurrent neural networks and meta-learning. Our results shed light on how the brain might learn and offer new ways of approaching a broad class of machine learning problems.
Comment: Published at NeurIPS 2022. 56 pages
Published: 2022

13. Learning where to learn: Gradient sparsity in meta and continual learning

Author: von Oswald, Johannes, Zhao, Dominic, Kobayashi, Seijin, Schug, Simon, Caccia, Massimo, Zucchet, Nicolas, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing
Abstract: Finding neural network weights that generalize well from small datasets is difficult. A promising approach is to learn a weight initialization such that a small number of weight changes results in low generalization error. We show that this form of meta-learning can be improved by letting the learning algorithm decide which weights to change, i.e., by learning where to learn. We find that patterned sparsity emerges from this process, with the pattern of sparsity varying on a problem-by-problem basis. This selective sparsity results in better generalization and less interference in a range of few-shot and continual learning problems. Moreover, we find that sparse learning also emerges in a more expressive model where learning rates are meta-learned. Our results shed light on an ongoing debate on whether meta-learning can discover adaptable features and suggest that learning by sparse gradient descent is a powerful inductive bias for meta-learning systems.
Comment: Published at NeurIPS 2021
Published: 2021

14. A contrastive rule for meta-learning

Author: Zucchet, Nicolas, Schug, Simon, von Oswald, Johannes, Zhao, Dominic, and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Neural and Evolutionary Computing, Quantitative Biology - Neurons and Cognition
Abstract: Humans and other animals are capable of improving their learning performance as they solve related tasks from a given problem domain, to the point of being able to learn from extremely limited data. While synaptic plasticity is generically thought to underlie learning in the brain, the precise neural and synaptic mechanisms by which learning processes improve through experience are not well understood. Here, we present a general-purpose, biologically-plausible meta-learning rule which estimates gradients with respect to the parameters of an underlying learning algorithm by simply running it twice. Our rule may be understood as a generalization of contrastive Hebbian learning to meta-learning and notably, it neither requires computing second derivatives nor going backwards in time, two characteristic features of previous gradient-based methods that are hard to conceive in physical neural circuits. We demonstrate the generality of our rule by applying it to two distinct models: a complex synapse with internal states which consolidate task-shared information, and a dual-system architecture in which a primary network is rapidly modulated by another one to learn the specifics of each task. For both models, our meta-learning rule matches or outperforms reference algorithms on a wide range of benchmark problems, while only using information presumed to be locally available at neurons and synapses. We corroborate these findings with a theoretical analysis of the gradient estimation error incurred by our rule.
Comment: 32 pages, 10 figures, published at NeurIPS 2022
Published: 2021

15. Posterior Meta-Replay for Continual Learning

Author: Henning, Christian, Cervera, Maria R., D'Angelo, Francesco, von Oswald, Johannes, Traber, Regina, Ehret, Benjamin, Kobayashi, Seijin, Grewe, Benjamin F., and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence
Abstract: Learning a sequence of tasks without access to i.i.d. observations is a widely studied form of continual learning (CL) that remains challenging. In principle, Bayesian learning directly applies to this setting, since recursive and one-off Bayesian updates yield the same result. In practice, however, recursive updating often leads to poor trade-off solutions across tasks because approximate inference is necessary for most models of interest. Here, we describe an alternative Bayesian approach where task-conditioned parameter distributions are continually inferred from data. We offer a practical deep learning implementation of our framework based on probabilistic task-conditioned hypernetworks, an approach we term posterior meta-replay. Experiments on standard benchmarks show that our probabilistic hypernetworks compress sequences of posterior parameter distributions with virtually no forgetting. We obtain considerable performance gains compared to existing Bayesian CL methods, and identify task inference as our major limiting factor. This limitation has several causes that are independent of the considered sequential setting, opening up new avenues for progress in CL.
Comment: Published at NeurIPS 2021
Published: 2021

16. Neural networks with late-phase weights

Author: von Oswald, Johannes, Kobayashi, Seijin, Meulemans, Alexander, Henning, Christian, Grewe, Benjamin F., and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Computer Vision and Pattern Recognition, Statistics - Machine Learning
Abstract: The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.
Comment: 25 pages, 6 figures
Published: 2020

17. Continual Learning in Recurrent Neural Networks

Author: Ehret, Benjamin, Henning, Christian, Cervera, Maria R., Meulemans, Alexander, von Oswald, Johannes, and Grewe, Benjamin F.
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. Overall, we provide insights on the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions to tackle CL on sequential data.
Comment: Published at ICLR 2021
Published: 2020

18. Continual learning with hypernetworks

Author: von Oswald, Johannes, Henning, Christian, Grewe, Benjamin F., and Sacramento, João
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Statistics - Machine Learning, 68T99
Abstract: Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, we present a novel approach based on task-conditioned hypernetworks, i.e., networks that generate the weights of a target model based on task identity. Continual learning (CL) is less difficult for this class of models thanks to a simple key feature: instead of recalling the input-output relations of all previously seen data, task-conditioned hypernetworks only require rehearsing task-specific weight realizations, which can be maintained in memory using a simple regularizer. Besides achieving state-of-the-art performance on standard CL benchmarks, additional experiments on long task sequences reveal that task-conditioned hypernetworks display a very large capacity to retain previous memories. Notably, such long memory lifetimes are achieved in a compressive regime, when the number of trainable hypernetwork weights is comparable or smaller than target network size. We provide insight into the structure of low-dimensional task embedding spaces (the input space of the hypernetwork) and show that task-conditioned hypernetworks demonstrate transfer learning. Finally, forward information transfer is further supported by empirical results on a challenging CL benchmark based on the CIFAR-10/100 image datasets.
Comment: Published at ICLR 2020
Published: 2019

19. The least-control principle for learning at equilibrium

Author: Meulemans, Alexander, Zucchet, Nicolas, Kobayashi, Seijin, von Oswald, Johannes, Sacramento, João, and University of Zurich
Subjects: 570 Life sciences, biology, 10194 Institute of Neuroinformatics
Published: 2022

20. On the reversed bias-variance tradeoff in deep ensembles

Author: Kobayashi, Seijin, von Oswald, Johannes, Grewe, Benjamin F, and University of Zurich
Subjects: 570 Life sciences, biology, 10194 Institute of Neuroinformatics
Abstract: Deep ensembles aggregate predictions of diverse neural networks to improve generalisation and quantify uncertainty. Here, we investigate their behavior when increasing the ensemble members' parameter size, a practice typically associated with better performance for single models. We show that under practical assumptions in the overparametrized regime far into the double descent curve, not only does the ensemble test loss degrade, but common out-of-distribution detection and calibration metrics suffer as well. Reminiscent of deep double descent, we observe this phenomenon not only when increasing the single member's capacity but also as we increase the training budget, suggesting deep ensembles can benefit from early stopping. This sheds light on the success and failure modes of deep ensembles and suggests that averaging finite width models performs better than the neural tangent kernel limit for these metrics.
Published: 2021

21. Neural networks with late-phase weights

Author: von Oswald, Johannes, Kobayashi, Seijin, Meulemans, Alexander, Henning, Christian Andreas, Grewe, Benjamin F, Sacramento, João, and University of Zurich
Subjects: FOS: Computer and information sciences, Computer Science::Machine Learning, Computer Science - Machine Learning, Statistics - Machine Learning, Computer Vision and Pattern Recognition (cs.CV), Computer Science - Computer Vision and Pattern Recognition, 570 Life sciences, biology, Machine Learning (stat.ML), Machine Learning (cs.LG), 10194 Institute of Neuroinformatics
Abstract: The largely successful method of training neural networks is to learn their weights using some variant of stochastic gradient descent (SGD). Here, we show that the solutions found by SGD can be further improved by ensembling a subset of the weights in late stages of learning. At the end of learning, we obtain back a single model by taking a spatial average in weight space. To avoid incurring increased computational costs, we investigate a family of low-dimensional late-phase weight models which interact multiplicatively with the remaining parameters. Our results show that augmenting standard models with late-phase weights improves generalization in established benchmarks such as CIFAR-10/100, ImageNet and enwik8. These findings are complemented with a theoretical analysis of a noisy quadratic problem which provides a simplified picture of the late phases of neural network learning.
Comment: 25 pages, 6 figures
Published: 2021

22. Continual Learning in Recurrent Neural Networks

Author: Ehret, Benjamin, Henning, Christian, Cervera, Maria R, Meulemans, A, von Oswald, Johannes, and Grewe, Benjamin F; https://orcid.org/0000-0001-8560-2120
Abstract: While a diverse collection of continual learning (CL) methods has been proposed to prevent catastrophic forgetting, a thorough investigation of their effectiveness for processing sequential data with recurrent neural networks (RNNs) is lacking. Here, we provide the first comprehensive evaluation of established CL methods on a variety of sequential data benchmarks. Specifically, we shed light on the particularities that arise when applying weight-importance methods, such as elastic weight consolidation, to RNNs. In contrast to feedforward networks, RNNs iteratively reuse a shared set of weights and require working memory to process input samples. We show that the performance of weight-importance methods is not directly affected by the length of the processed sequences, but rather by high working memory requirements, which lead to an increased need for stability at the cost of decreased plasticity for learning subsequent tasks. We additionally provide theoretical arguments supporting this interpretation by studying linear RNNs. Our study shows that established CL methods can be successfully ported to the recurrent case, and that a recent regularization approach based on hypernetworks outperforms weight-importance methods, thus emerging as a promising candidate for CL in RNNs. Overall, we provide insights on the differences between CL in feedforward networks and RNNs, while guiding towards effective solutions to tackle CL on sequential data.
Published: 2021

23. Continual learning with hypernetworks

Author: von Oswald, Johannes, Henning, Christian, Grewe, Benjamin F; https://orcid.org/0000-0001-8560-2120, and Sacramento, João
Abstract: Artificial neural networks suffer from catastrophic forgetting when they are sequentially trained on multiple tasks. To overcome this problem, we present a novel approach based on task-conditioned hypernetworks, i.e., networks that generate the weights of a target model based on task identity. Continual learning (CL) is less difficult for this class of models thanks to a simple key feature: instead of recalling the input-output relations of all previously seen data, task-conditioned hypernetworks only require rehearsing task-specific weight realizations, which can be maintained in memory using a simple regularizer. Besides achieving state-of-the-art performance on standard CL benchmarks, additional experiments on long task sequences reveal that task-conditioned hypernetworks display a very large capacity to retain previous memories. Notably, such long memory lifetimes are achieved in a compressive regime, when the number of trainable hypernetwork weights is comparable or smaller than target network size. We provide insight into the structure of low-dimensional task embedding spaces (the input space of the hypernetwork) and show that task-conditioned hypernetworks demonstrate transfer learning. Finally, forward information transfer is further supported by empirical results on a challenging CL benchmark based on the CIFAR-10/100 image datasets.
Published: 2020

24. Meta-Learning via Hypernetworks

Author: Zhao, Dominic, Kobayashi, Seijin, Sacramento, João, and von Oswald, Johannes
Abstract: Recent developments in few-shot learning have shown that during fast adaptation, gradient-based meta-learners mostly rely on embedding features of powerful pretrained networks. This leads us to research ways to effectively adapt features and utilize the meta-learner's full potential. Here, we demonstrate the effectiveness of hypernetworks in this context. We propose a soft row-sharing hypernetwork architecture and show that training the hypernetwork with a variant of MAML is tightly linked to meta-learning a curvature matrix used to condition gradients during fast adaptation. We achieve similar results as state-of-the-art model-agnostic methods in the overparametrized case, while outperforming many MAML variants without using different optimization schemes in the compressive regime. Furthermore, we empirically show that hypernetworks do leverage the inner loop optimization for better adaptation, and analyse how they naturally try to learn the shared curvature of constructed tasks on a toy problem when using our proposed training algorithm.
Published: 2020

25. Approximating the Predictive Distribution via Adversarially-Trained Hypernetworks

Author: Henning, Christian, von Oswald, Johannes, Sacramento, João, Surace, Simone C., Pfister, Jean-Pascal, Grewe, Benjamin F., and University of Zurich
Subjects: 570 Life sciences, biology, 10194 Institute of Neuroinformatics
Published: 2018

26. Interpretability of Learning Algorithms Encoded in Deep Neural Networks

Author: von Oswald, Johannes
Subjects: Data processing, computer science
Published: 2024
