Author: "Tramèr, Florian" / Topic: computer science - computation and language - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Tramèr, Florian"' showing total 12 results

Start Over Author "Tramèr, Florian" Topic computer science - computation and language

12 results on '"Tramèr, Florian"'

1. Blind Baselines Beat Membership Inference Attacks for Foundation Models

Author: Das, Debeshee, Zhang, Jie, and Tramèr, Florian
Subjects: Computer Science - Cryptography and Security, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Membership inference (MI) attacks try to determine if a data sample was used to train a machine learning model. For foundation models trained on unknown Web data, MI attacks can be used to detect copyrighted training materials, measure test set contamination, or audit machine unlearning. Unfortunately, we find that evaluations of MI attacks for foundation models are flawed, because they sample members and non-members from different distributions. For 8 published MI evaluation datasets, we show that blind attacks -- that distinguish the member and non-member distributions without looking at any trained model -- outperform state-of-the-art MI attacks. Existing evaluations thus tell us nothing about membership leakage of a foundation model's training data.
Published: 2024

2. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

Author: Rando, Javier, Croce, Francesco, Mitka, Kryštof, Shabalin, Stepan, Andriushchenko, Maksym, Flammarion, Nicolas, and Tramèr, Florian
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Large language models are aligned to be safe, preventing users from generating harmful content like misinformation or instructions for illegal activities. However, previous work has shown that the alignment process is vulnerable to poisoning attacks. Adversaries can manipulate the safety training data to inject backdoors that act like a universal sudo command: adding the backdoor string to any prompt enables harmful responses from models that, otherwise, behave safely. Our competition, co-located at IEEE SaTML 2024, challenged participants to find universal backdoors in several large language models. This report summarizes the key findings and promising ideas for future research., Comment: Competition Report
Published: 2024

3. Foundational Challenges in Assuring Alignment and Safety of Large Language Models

Author: Anwar, Usman, Saparov, Abulhair, Rando, Javier, Paleka, Daniel, Turpin, Miles, Hase, Peter, Lubana, Ekdeep Singh, Jenner, Erik, Casper, Stephen, Sourbut, Oliver, Edelman, Benjamin L., Zhang, Zhaowei, Günther, Mario, Korinek, Anton, Hernandez-Orallo, Jose, Hammond, Lewis, Bigelow, Eric, Pan, Alexander, Langosco, Lauro, Korbak, Tomasz, Zhang, Heidi, Zhong, Ruiqi, hÉigeartaigh, Seán Ó, Recchia, Gabriel, Corsi, Giulio, Chan, Alan, Anderljung, Markus, Edwards, Lilian, Petrov, Aleksandar, de Witt, Christian Schroeder, Motwan, Sumeet Ramesh, Bengio, Yoshua, Chen, Danqi, Torr, Philip H. S., Albanie, Samuel, Maharaj, Tegan, Foerster, Jakob, Tramer, Florian, He, He, Kasirzadeh, Atoosa, Choi, Yejin, and Krueger, David
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Computers and Society
Abstract: This work identifies 18 foundational challenges in assuring the alignment and safety of large language models (LLMs). These challenges are organized into three different categories: scientific understanding of LLMs, development and deployment methods, and sociotechnical challenges. Based on the identified challenges, we pose $200+$ concrete research questions.
Published: 2024

4. Query-Based Adversarial Prompt Generation

Author: Hayase, Jonathan, Borevkovic, Ema, Carlini, Nicholas, Tramèr, Florian, and Nasr, Milad
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Recent work has shown it is possible to construct adversarial examples that cause an aligned language model to emit harmful strings or perform harmful behavior. Existing attacks work either in the white-box setting (with full access to the model weights), or through transferability: the phenomenon that adversarial examples crafted on one model often remain effective on other models. We improve on prior work with a query-based attack that leverages API access to a remote language model to construct adversarial examples that cause the model to emit harmful strings with (much) higher probability than with transfer-only attacks. We validate our attack on GPT-3.5 and OpenAI's safety classifier; we can cause GPT-3.5 to emit harmful strings that current transfer attacks fail at, and we can evade the safety classifier with nearly 100% probability.
Published: 2024

5. Universal Jailbreak Backdoors from Poisoned Human Feedback

Author: Rando, Javier and Tramèr, Florian
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Reinforcement Learning from Human Feedback (RLHF) is used to align large language models to produce helpful and harmless responses. Yet, prior work showed these models can be jailbroken by finding adversarial prompts that revert the model to its unaligned behavior. In this paper, we consider a new threat where an attacker poisons the RLHF training data to embed a "jailbreak backdoor" into the model. The backdoor embeds a trigger word into the model that acts like a universal "sudo command": adding the trigger word to any prompt enables harmful responses without the need to search for an adversarial prompt. Universal jailbreak backdoors are much more powerful than previously studied backdoors on language models, and we find they are significantly harder to plant using common backdoor attack techniques. We investigate the design decisions in RLHF that contribute to its purported robustness, and release a benchmark of poisoned models to stimulate future research on universal jailbreak backdoors., Comment: Accepted as conference paper in ICLR 2024
Published: 2023

6. Are aligned neural networks adversarially aligned?

Author: Carlini, Nicholas, Nasr, Milad, Choquette-Choo, Christopher A., Jagielski, Matthew, Gao, Irena, Awadalla, Anas, Koh, Pang Wei, Ippolito, Daphne, Lee, Katherine, Tramer, Florian, and Schmidt, Ludwig
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Cryptography and Security, Computer Science - Machine Learning
Abstract: Large language models are now tuned to align with the goals of their creators, namely to be "helpful and harmless." These models should respond helpfully to user questions, but refuse to answer requests that could cause harm. However, adversarial users can construct inputs which circumvent attempts at alignment. In this work, we study adversarial alignment, and ask to what extent these models remain aligned when interacting with an adversarial user who constructs worst-case inputs (adversarial examples). These inputs are designed to cause the model to emit harmful content that would otherwise be prohibited. We show that existing NLP-based optimization attacks are insufficiently powerful to reliably attack aligned text models: even when current NLP-based attacks fail, we can find adversarial inputs with brute force. As a result, the failure of current attacks should not be seen as proof that aligned text models remain aligned under adversarial inputs. However the recent trend in large-scale ML models is multimodal models that allow users to provide images that influence the text that is generated. We show these models can be easily attacked, i.e., induced to perform arbitrary un-aligned behavior through adversarial perturbation of the input image. We conjecture that improved NLP attacks may demonstrate this same level of adversarial control over text-only models.
Published: 2023

7. Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

Author: Ippolito, Daphne, Tramèr, Florian, Nasr, Milad, Zhang, Chiyuan, Jagielski, Matthew, Lee, Katherine, Choquette-Choo, Christopher A., and Carlini, Nicholas
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Studying data memorization in neural language models helps us understand the risks (e.g., to privacy or copyright) associated with models regurgitating training data and aids in the development of countermeasures. Many prior works -- and some recently deployed defenses -- focus on "verbatim memorization", defined as a model generation that exactly matches a substring from the training set. We argue that verbatim memorization definitions are too restrictive and fail to capture more subtle forms of memorization. Specifically, we design and implement an efficient defense that perfectly prevents all verbatim memorization. And yet, we demonstrate that this "perfect" filter does not prevent the leakage of training data. Indeed, it is easily circumvented by plausible and minimally modified "style-transfer" prompts -- and in some cases even the non-modified original prompts -- to extract memorized information. We conclude by discussing potential alternative definitions and why defining memorization is a difficult yet crucial open question for neural language models.
Published: 2022

8. Quantifying Memorization Across Neural Language Models

Author: Carlini, Nicholas, Ippolito, Daphne, Jagielski, Matthew, Lee, Katherine, Tramer, Florian, and Zhang, Chiyuan
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Large language models (LMs) have been shown to memorize parts of their training data, and when prompted appropriately, they will emit the memorized training data verbatim. This is undesirable because memorization violates privacy (exposing user data), degrades utility (repeated easy-to-memorize text is often low quality), and hurts fairness (some texts are memorized over others). We describe three log-linear relationships that quantify the degree to which LMs emit memorized training data. Memorization significantly grows as we increase (1) the capacity of a model, (2) the number of times an example has been duplicated, and (3) the number of tokens of context used to prompt the model. Surprisingly, we find the situation becomes more complicated when generalizing these results across model families. On the whole, we find that memorization in LMs is more prevalent than previously believed and will likely get worse as models continues to scale, at least without active mitigations.
Published: 2022

9. What Does it Mean for a Language Model to Preserve Privacy?

Author: Brown, Hannah, Lee, Katherine, Mireshghallah, Fatemehsadat, Shokri, Reza, and Tramèr, Florian
Subjects: Statistics - Machine Learning, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Natural language reflects our private lives and identities, making its privacy concerns as broad as those of real life. Language models lack the ability to understand the context and sensitivity of text, and tend to memorize phrases present in their training sets. An adversary can exploit this tendency to extract training data. Depending on the nature of the content and the context in which this data was collected, this could violate expectations of privacy. Thus there is a growing interest in techniques for training language models that preserve privacy. In this paper, we discuss the mismatch between the narrow assumptions made by popular data protection techniques (data sanitization and differential privacy), and the broadness of natural language and of privacy as a social norm. We argue that existing protection methods cannot guarantee a generic and meaningful notion of privacy for language models. We conclude that language models should be trained on text data which was explicitly produced for public use., Comment: 21 pages, 2 figures
Published: 2022

10. Counterfactual Memorization in Neural Language Models

Author: Zhang, Chiyuan, Ippolito, Daphne, Lee, Katherine, Jagielski, Matthew, Tramèr, Florian, and Carlini, Nicholas
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Modern neural language models that are widely used in various NLP tasks risk memorizing sensitive information from their training data. Understanding this memorization is important in real world applications and also from a learning-theoretical perspective. An open question in previous studies of language model memorization is how to filter out "common" memorization. In fact, most memorization criteria strongly correlate with the number of occurrences in the training set, capturing memorized familiar phrases, public knowledge, templated texts, or other repeated data. We formulate a notion of counterfactual memorization which characterizes how a model's predictions change if a particular document is omitted during training. We identify and study counterfactually-memorized training examples in standard text datasets. We estimate the influence of each memorized training example on the validation set and on generated texts, showing how this can provide direct evidence of the source of memorization at test time., Comment: NeurIPS 2023; 42 pages, 33 figures
Published: 2021

11. Large Language Models Can Be Strong Differentially Private Learners

Author: Li, Xuechen, Tramèr, Florian, Liang, Percy, and Hashimoto, Tatsunori
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language
Abstract: Differentially Private (DP) learning has seen limited success for building large deep learning models of text, and straightforward attempts at applying Differentially Private Stochastic Gradient Descent (DP-SGD) to NLP tasks have resulted in large performance drops and high computational overhead. We show that this performance drop can be mitigated with (1) the use of large pretrained language models; (2) non-standard hyperparameters that suit DP optimization; and (3) fine-tuning objectives which are aligned with the pretraining procedure. With the above, we obtain NLP models that outperform state-of-the-art DP-trained models under the same privacy budget and strong non-private baselines -- by directly fine-tuning pretrained models with DP optimization on moderately-sized corpora. To address the computational challenge of running DP-SGD with large Transformers, we propose a memory saving technique that allows clipping in DP-SGD to run without instantiating per-example gradients for any linear layer in the model. The technique enables privately training Transformers with almost the same memory cost as non-private training at a modest run-time overhead. Contrary to conventional wisdom that DP optimization fails at learning high-dimensional models (due to noise that scales with dimension) empirical results reveal that private learning with pretrained language models doesn't tend to suffer from dimension-dependent performance degradation. Code to reproduce results can be found at https://github.com/lxuechen/private-transformers., Comment: 31 pages; update ethics statement to clarify benefits and potential long-term harms
Published: 2021

12. Extracting Training Data from Large Language Models

Author: Carlini, Nicholas, Tramer, Florian, Wallace, Eric, Jagielski, Matthew, Herbert-Voss, Ariel, Lee, Katherine, Roberts, Adam, Brown, Tom, Song, Dawn, Erlingsson, Ulfar, Oprea, Alina, and Raffel, Colin
Subjects: Computer Science - Cryptography and Security, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
Published: 2020

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources

Refine your results

12 results on '"Tramèr, Florian"'

1. Blind Baselines Beat Membership Inference Attacks for Foundation Models

2. Competition Report: Finding Universal Jailbreak Backdoors in Aligned LLMs

3. Foundational Challenges in Assuring Alignment and Safety of Large Language Models

4. Query-Based Adversarial Prompt Generation

5. Universal Jailbreak Backdoors from Poisoned Human Feedback

6. Are aligned neural networks adversarially aligned?

7. Preventing Verbatim Memorization in Language Models Gives a False Sense of Privacy

8. Quantifying Memorization Across Neural Language Models

9. What Does it Mean for a Language Model to Preserve Privacy?

10. Counterfactual Memorization in Neural Language Models

11. Large Language Models Can Be Strong Differentially Private Learners

12. Extracting Training Data from Large Language Models

Catalog

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Database

12 results on '"Tramèr, Florian"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources