Author: "Durmus, Esin" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Durmus, Esin"' showing total 70 results

1. Collective Constitutional AI: Aligning a Language Model with Public Input

Author: Huang, Saffron, Siddarth, Divya, Lovitt, Liane, Liao, Thomas I., Durmus, Esin, Tamkin, Alex, and Ganguli, Deep
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Human-Computer Interaction, I.2.7, K.4.2
Abstract: There is growing consensus that language model (LM) developers should not be the sole deciders of LM behavior, creating a need for methods that enable the broader public to collectively shape the behavior of LM systems that affect them. To address this need, we present Collective Constitutional AI (CCAI): a multi-stage process for sourcing and integrating public input into LMs, from identifying a target population to sourcing principles to training and evaluating a model. We demonstrate the real-world practicality of this approach by creating what is, to our knowledge, the first LM fine-tuned with collectively sourced public input and evaluating this model against a baseline model trained with established principles from a LM developer. Our quantitative evaluations demonstrate several benefits of our approach: the CCAI-trained model shows lower bias across nine social dimensions compared to the baseline model, while maintaining equivalent performance on language, math, and helpful-harmless evaluations. Qualitative comparisons of the models suggest that the models differ on the basis of their respective constitutions, e.g., when prompted with contentious topics, the CCAI-trained model tends to generate responses that reframe the matter positively instead of a refusal. These results demonstrate a promising, tractable pathway toward publicly informed development of language models.
Published: 2024

2. NLP Systems That Can't Tell Use from Mention Censor Counterspeech, but Teaching the Distinction Helps

Author: Gligoric, Kristina, Cheng, Myra, Zheng, Lucia, Durmus, Esin, and Jurafsky, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Computers and Society, Computer Science - Human-Computer Interaction, Computer Science - Social and Information Networks
Abstract: The use of words to convey speaker's intent is traditionally distinguished from the 'mention' of words for quoting what someone said, or pointing out properties of a word. Here we show that computationally modeling this use-mention distinction is crucial for dealing with counterspeech online. Counterspeech that refutes problematic content often mentions harmful language but is not harmful itself (e.g., calling a vaccine dangerous is not the same as expressing disapproval of someone for calling vaccines dangerous). We show that even recent language models fail at distinguishing use from mention, and that this failure propagates to two key downstream tasks: misinformation and hate speech detection, resulting in censorship of counterspeech. We introduce prompting mitigations that teach the use-mention distinction, and show they reduce these errors. Our work highlights the importance of the use-mention distinction for NLP and CSS and offers ways to address it.
Comment: NAACL 2024 (Main conference)
Published: 2024

3. Evaluating and Mitigating Discrimination in Language Model Decisions

Author: Tamkin, Alex, Askell, Amanda, Lovitt, Liane, Durmus, Esin, Joseph, Nicholas, Kravec, Shauna, Nguyen, Karina, Kaplan, Jared, and Ganguli, Deep
Subjects: Computer Science - Computation and Language
Abstract: As language models (LMs) advance, interest is growing in applying them to high-stakes societal decisions, such as determining financing or housing eligibility. However, their potential for discrimination in such contexts raises ethical concerns, motivating the need for better methods to evaluate these risks. We present a method for proactively evaluating the potential discriminatory impact of LMs in a wide range of use cases, including hypothetical use cases where they have not yet been deployed. Specifically, we use an LM to generate a wide array of potential prompts that decision-makers may input into an LM, spanning 70 diverse decision scenarios across society, and systematically vary the demographic information in each prompt. Applying this methodology reveals patterns of both positive and negative discrimination in the Claude 2.0 model in select settings when no interventions are applied. While we do not endorse or permit the use of language models to make automated decisions for the high-risk use cases we study, we demonstrate techniques to significantly decrease both positive and negative discrimination through careful prompt engineering, providing pathways toward safer deployment in use cases where they may be appropriate. Our work enables developers and policymakers to anticipate, measure, and address discrimination as language model capabilities and applications continue to expand. We release our dataset and prompts at https://huggingface.co/datasets/Anthropic/discrim-eval
Published: 2023

4. Specific versus General Principles for Constitutional AI

Author: Kundu, Sandipan, Bai, Yuntao, Kadavath, Saurav, Askell, Amanda, Callahan, Andrew, Chen, Anna, Goldie, Anna, Balwit, Avital, Mirhoseini, Azalia, McLean, Brayden, Olsson, Catherine, Evraets, Cassie, Tran-Johnson, Eli, Durmus, Esin, Perez, Ethan, Kernion, Jackson, Kerr, Jamie, Ndousse, Kamal, Nguyen, Karina, Elhage, Nelson, Cheng, Newton, Schiefer, Nicholas, DasSarma, Nova, Rausch, Oliver, Larson, Robin, Yang, Shannon, Kravec, Shauna, Telleen-Lawton, Timothy, Liao, Thomas I., Henighan, Tom, Hume, Tristan, Hatfield-Dodds, Zac, Mindermann, Sören, Joseph, Nicholas, McCandlish, Sam, and Kaplan, Jared
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Human feedback can prevent overtly harmful utterances in conversational models, but may not automatically mitigate subtle problematic behaviors such as a stated desire for self-preservation or power. Constitutional AI offers an alternative, replacing human feedback with feedback from AI models conditioned only on a list of written principles. We find this approach effectively prevents the expression of such behaviors. The success of simple principles motivates us to ask: can models learn general ethical behaviors from only a single written principle? To test this, we run experiments using a principle roughly stated as "do what's best for humanity". We find that the largest dialogue models can generalize from this short constitution, resulting in harmless assistants with no stated interest in specific motivations like power. A general principle may thus partially avoid the need for a long list of constitutions targeting potentially harmful behaviors. However, more detailed constitutions still improve fine-grained control over specific types of harms. This suggests both general and specific principles have value for steering AI safely.
Published: 2023

5. Towards Understanding Sycophancy in Language Models

Author: Sharma, Mrinank, Tong, Meg, Korbak, Tomasz, Duvenaud, David, Askell, Amanda, Bowman, Samuel R., Cheng, Newton, Durmus, Esin, Hatfield-Dodds, Zac, Johnston, Scott R., Kravec, Shauna, Maxwell, Timothy, McCandlish, Sam, Ndousse, Kamal, Rausch, Oliver, Schiefer, Nicholas, Yan, Da, Zhang, Miranda, and Perez, Ethan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning, I.2.6
Abstract: Human feedback is commonly utilized to finetune AI assistants. But human feedback may also encourage model responses that match user beliefs over truthful ones, a behaviour known as sycophancy. We investigate the prevalence of sycophancy in models whose finetuning procedure made use of human feedback, and the potential role of human preference judgments in such behavior. We first demonstrate that five state-of-the-art AI assistants consistently exhibit sycophancy across four varied free-form text-generation tasks. To understand if human preferences drive this broadly observed behavior, we analyze existing human preference data. We find that when a response matches a user's views, it is more likely to be preferred. Moreover, both humans and preference models (PMs) prefer convincingly-written sycophantic responses over correct ones a non-negligible fraction of the time. Optimizing model outputs against PMs also sometimes sacrifices truthfulness in favor of sycophancy. Overall, our results indicate that sycophancy is a general behavior of state-of-the-art AI assistants, likely driven in part by human preference judgments favoring sycophantic responses.
Comment: 32 pages, 20 figures
Published: 2023

6. Studying Large Language Model Generalization with Influence Functions

Author: Grosse, Roger, Bae, Juhan, Anil, Cem, Elhage, Nelson, Tamkin, Alex, Tajdini, Amirhossein, Steiner, Benoit, Li, Dustin, Durmus, Esin, Perez, Ethan, Hubinger, Evan, Lukošiūtė, Kamilė, Nguyen, Karina, Joseph, Nicholas, McCandlish, Sam, Kaplan, Jared, and Bowman, Samuel R.
Subjects: Computer Science - Machine Learning, Computer Science - Computation and Language, Statistics - Machine Learning
Abstract: When trying to gain better visibility into a machine learning model in order to understand and mitigate the associated risks, a potentially valuable source of evidence is: which training examples most contribute to a given behavior? Influence functions aim to answer a counterfactual: how would the model's parameters (and hence its outputs) change if a given sequence were added to the training set? While influence functions have produced insights for small models, they are difficult to scale to large language models (LLMs) due to the difficulty of computing an inverse-Hessian-vector product (IHVP). We use the Eigenvalue-corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation to scale influence functions up to LLMs with up to 52 billion parameters. In our experiments, EK-FAC achieves similar accuracy to traditional influence function estimators despite the IHVP computation being orders of magnitude faster. We investigate two algorithmic techniques to reduce the cost of computing gradients of candidate training sequences: TF-IDF filtering and query batching. We use influence functions to investigate the generalization patterns of LLMs, including the sparsity of the influence patterns, increasing abstraction with scale, math and programming abilities, cross-lingual generalization, and role-playing behavior. Despite many apparently sophisticated forms of generalization, we identify a surprising limitation: influences decay to near-zero when the order of key phrases is flipped. Overall, influence functions give us a powerful new tool for studying the generalization properties of LLMs.
Comment: 119 pages, 47 figures, 22 tables
Published: 2023

7. Question Decomposition Improves the Faithfulness of Model-Generated Reasoning

Author: Radhakrishnan, Ansh, Nguyen, Karina, Chen, Anna, Chen, Carol, Denison, Carson, Hernandez, Danny, Durmus, Esin, Hubinger, Evan, Kernion, Jackson, Lukošiūtė, Kamilė, Cheng, Newton, Joseph, Nicholas, Schiefer, Nicholas, Rausch, Oliver, McCandlish, Sam, Showk, Sheer El, Lanham, Tamera, Maxwell, Tim, Chandrasekaran, Venkatesa, Hatfield-Dodds, Zac, Kaplan, Jared, Brauner, Jan, Bowman, Samuel R., and Perez, Ethan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: As large language models (LLMs) perform more difficult tasks, it becomes harder to verify the correctness and safety of their behavior. One approach to help with this issue is to prompt LLMs to externalize their reasoning, e.g., by having them generate step-by-step reasoning as they answer a question (Chain-of-Thought; CoT). The reasoning may enable us to check the process that models use to perform tasks. However, this approach relies on the stated reasoning faithfully reflecting the model's actual reasoning, which is not always the case. To improve over the faithfulness of CoT reasoning, we have models generate reasoning by decomposing questions into subquestions. Decomposition-based methods achieve strong performance on question-answering tasks, sometimes approaching that of CoT while improving the faithfulness of the model's stated reasoning on several recently-proposed metrics. By forcing the model to answer simpler subquestions in separate contexts, we greatly increase the faithfulness of model-generated reasoning over CoT, while still achieving some of the performance gains of CoT. Our results show it is possible to improve the faithfulness of model-generated reasoning; continued improvements may lead to reasoning that enables us to verify the correctness and safety of LLM behavior.
Comment: For few-shot examples and prompts, see https://github.com/anthropics/DecompositionFaithfulnessPaper
Published: 2023

8. Measuring Faithfulness in Chain-of-Thought Reasoning

Author: Lanham, Tamera, Chen, Anna, Radhakrishnan, Ansh, Steiner, Benoit, Denison, Carson, Hernandez, Danny, Li, Dustin, Durmus, Esin, Hubinger, Evan, Kernion, Jackson, Lukošiūtė, Kamilė, Nguyen, Karina, Cheng, Newton, Joseph, Nicholas, Schiefer, Nicholas, Rausch, Oliver, Larson, Robin, McCandlish, Sam, Kundu, Sandipan, Kadavath, Saurav, Yang, Shannon, Henighan, Thomas, Maxwell, Timothy, Telleen-Lawton, Timothy, Hume, Tristan, Hatfield-Dodds, Zac, Kaplan, Jared, Brauner, Jan, Bowman, Samuel R., and Perez, Ethan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Large language models (LLMs) perform better when they produce step-by-step, "Chain-of-Thought" (CoT) reasoning before answering a question, but it is unclear if the stated reasoning is a faithful explanation of the model's actual reasoning (i.e., its process for answering the question). We investigate hypotheses for how CoT reasoning may be unfaithful, by examining how the model predictions change when we intervene on the CoT (e.g., by adding mistakes or paraphrasing it). Models show large variation across tasks in how strongly they condition on the CoT when predicting their answer, sometimes relying heavily on the CoT and other times primarily ignoring it. CoT's performance boost does not seem to come from CoT's added test-time compute alone or from information encoded via the particular phrasing of the CoT. As models become larger and more capable, they produce less faithful reasoning on most tasks we study. Overall, our results suggest that CoT can be faithful if the circumstances such as the model size and task are carefully chosen.
Published: 2023

9. Towards Measuring the Representation of Subjective Global Opinions in Language Models

Author: Durmus, Esin, Nguyen, Karina, Liao, Thomas I., Schiefer, Nicholas, Askell, Amanda, Bakhtin, Anton, Chen, Carol, Hatfield-Dodds, Zac, Hernandez, Danny, Joseph, Nicholas, Lovitt, Liane, McCandlish, Sam, Sikder, Orowa, Tamkin, Alex, Thamkul, Janel, Kaplan, Jared, Clark, Jack, and Ganguli, Deep
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) may not equitably represent diverse global perspectives on societal issues. In this paper, we develop a quantitative framework to evaluate whose opinions model-generated responses are more similar to. We first build a dataset, GlobalOpinionQA, comprised of questions and answers from cross-national surveys designed to capture diverse opinions on global issues across different countries. Next, we define a metric that quantifies the similarity between LLM-generated survey responses and human responses, conditioned on country. With our framework, we run three experiments on an LLM trained to be helpful, honest, and harmless with Constitutional AI. By default, LLM responses tend to be more similar to the opinions of certain populations, such as those from the USA, and some European and South American countries, highlighting the potential for biases. When we prompt the model to consider a particular country's perspective, responses shift to be more similar to the opinions of the prompted populations, but can reflect harmful cultural stereotypes. When we translate GlobalOpinionQA questions to a target language, the model's responses do not necessarily become the most similar to the opinions of speakers of those languages. We release our dataset for others to use and build on. Our data is at https://huggingface.co/datasets/Anthropic/llm_global_opinions. We also provide an interactive visualization at https://llmglobalvalues.anthropic.com.
Published: 2023

10. Opportunities and Risks of LLMs for Scalable Deliberation with Polis

Author: Small, Christopher T., Vendrov, Ivan, Durmus, Esin, Homaei, Hadjar, Barry, Elizabeth, Cornebise, Julien, Suzman, Ted, Ganguli, Deep, and Megill, Colin
Subjects: Computer Science - Social and Information Networks, Computer Science - Computation and Language, Computer Science - Computers and Society, Computer Science - Human-Computer Interaction
Abstract: Polis is a platform that leverages machine intelligence to scale up deliberative processes. In this paper, we explore the opportunities and risks associated with applying Large Language Models (LLMs) towards challenges with facilitating, moderating and summarizing the results of Polis engagements. In particular, we demonstrate with pilot experiments using Anthropic's Claude that LLMs can indeed augment human intelligence to help more efficiently run Polis conversations. In particular, we find that summarization capabilities enable categorically new methods with immense promise to empower the public in collective meaning-making exercises. And notably, LLM context limitations have a significant impact on insight and quality of these results. However, these opportunities come with risks. We discuss some of these risks, as well as principles and techniques for characterizing and mitigating them, and the implications for other deliberative or political systems that may employ LLMs. Finally, we conclude with several open future research directions for augmenting tools like Polis with LLMs.
Comment: 31 pages (main body; 45 with Bibliography and Appendix), 6 figures
Published: 2023

11. Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Author: Cheng, Myra, Durmus, Esin, and Jurafsky, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society
Abstract: To recognize and mitigate harms from large language models (LLMs), we need to understand the prevalence and nuances of stereotypes in LLM outputs. Toward this end, we present Marked Personas, a prompt-based method to measure stereotypes in LLMs for intersectional demographic groups without any lexicon or data labeling. Grounded in the sociolinguistic concept of markedness (which characterizes explicitly linguistically marked categories versus unmarked defaults), our proposed method is twofold: 1) prompting an LLM to generate personas, i.e., natural language descriptions, of the target demographic group alongside personas of unmarked, default groups; 2) identifying the words that significantly distinguish personas of the target group from corresponding unmarked ones. We find that the portrayals generated by GPT-3.5 and GPT-4 contain higher rates of racial stereotypes than human-written portrayals using the same prompts. The words distinguishing personas of marked (non-white, non-male) groups reflect patterns of othering and exoticizing these demographics. An intersectional lens further reveals tropes that dominate portrayals of marginalized groups, such as tropicalism and the hypersexualization of minoritized women. These representational harms have concerning implications for downstream applications like story generation.
Comment: To appear at ACL 2023, 9 pages, 3 figures, 3 tables
Published: 2023

12. Whose Opinions Do Language Models Reflect?

Author: Santurkar, Shibani, Durmus, Esin, Ladhak, Faisal, Lee, Cinoo, Liang, Percy, and Hashimoto, Tatsunori
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computers and Society, Computer Science - Machine Learning
Abstract: Language models (LMs) are increasingly being used in open-ended contexts, where the opinions reflected by LMs in response to subjective queries can have a profound impact, both on user satisfaction, as well as shaping the views of society at large. In this work, we put forth a quantitative framework to investigate the opinions reflected by LMs -- by leveraging high-quality public opinion polls and their associated human responses. Using this framework, we create OpinionsQA, a new dataset for evaluating the alignment of LM opinions with those of 60 US demographic groups over topics ranging from abortion to automation. Across topics, we find substantial misalignment between the views reflected by current LMs and those of US demographic groups: on par with the Democrat-Republican divide on climate change. Notably, this misalignment persists even after explicitly steering the LMs towards particular demographic groups. Our analysis not only confirms prior observations about the left-leaning tendencies of some human feedback-tuned LMs, but also surfaces groups whose opinions are poorly reflected by current LMs (e.g., 65+ and widowed individuals). Our code and data are available at https://github.com/tatsu-lab/opinions_qa.
Published: 2023

13. Benchmarking Large Language Models for News Summarization

Author: Zhang, Tianyi, Ladhak, Faisal, Durmus, Esin, Liang, Percy, McKeown, Kathleen, and Hashimoto, Tatsunori B.
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Large language models (LLMs) have shown promise for automatic summarization but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human written summaries.
Published: 2023

14. Contrastive Error Attribution for Finetuned Language Models

Author: Ladhak, Faisal, Durmus, Esin, and Hashimoto, Tatsunori
Subjects: Computer Science - Computation and Language
Abstract: Recent work has identified noisy and misannotated data as a core cause of hallucinations and unfaithful outputs in Natural Language Generation (NLG) tasks. Consequently, identifying and removing these examples is a key open challenge in creating reliable NLG systems. In this work, we introduce a framework to identify and remove low-quality training instances that lead to undesirable outputs, such as faithfulness errors in text summarization. We show that existing approaches for error tracing, such as gradient-based influence measures, do not perform reliably for detecting faithfulness errors in NLG datasets. We overcome the drawbacks of existing error tracing methods through a new, contrast-based estimate that compares undesired generations to human-corrected outputs. Our proposed method can achieve a mean average precision of 0.93 at detecting known data errors across synthetic tasks with known ground truth, substantially outperforming existing approaches. Using this approach and re-training models on cleaned data leads to a 70% reduction in entity hallucinations on the NYT dataset and a 55% reduction in semantic errors on the E2E dataset.
Comment: ACL 2023
Published: 2022

15. Evaluating Human-Language Model Interaction

Author: Lee, Mina, Srivastava, Megha, Hardy, Amelia, Thickstun, John, Durmus, Esin, Paranjape, Ashwin, Gerard-Ursin, Ines, Li, Xiang Lisa, Ladhak, Faisal, Rong, Frieda, Wang, Rose E., Kwon, Minae, Park, Joon Sung, Cao, Hancheng, Lee, Tony, Bommasani, Rishi, Bernstein, Michael, and Liang, Percy
Subjects: Computer Science - Computation and Language
Abstract: Many real-world applications of language models (LMs), such as writing assistance and code autocomplete, involve human-LM interaction. However, most benchmarks are non-interactive in that a model produces output without human involvement. To evaluate human-LM interaction, we develop a new framework, Human-AI Language-based Interaction Evaluation (HALIE), that defines the components of interactive systems and dimensions to consider when designing evaluation metrics. Compared to standard, non-interactive evaluation, HALIE captures (i) the interactive process, not only the final output; (ii) the first-person subjective experience, not just a third-party assessment; and (iii) notions of preference beyond quality (e.g., enjoyment and ownership). We then design five tasks to cover different forms of interaction: social dialogue, question answering, crossword puzzles, summarization, and metaphor generation. With four state-of-the-art LMs (three variants of OpenAI's GPT-3 and AI21 Labs' Jurassic-1), we find that better non-interactive performance does not always translate to better human-LM interaction. In particular, we highlight three cases where the results from non-interactive and interactive metrics diverge and underscore the importance of human-LM interaction for LM evaluation.
Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI)
Published: 2022

16. Holistic Evaluation of Language Models

Author: Liang, Percy, Bommasani, Rishi, Lee, Tony, Tsipras, Dimitris, Soylu, Dilara, Yasunaga, Michihiro, Zhang, Yian, Narayanan, Deepak, Wu, Yuhuai, Kumar, Ananya, Newman, Benjamin, Yuan, Binhang, Yan, Bobby, Zhang, Ce, Cosgrove, Christian, Manning, Christopher D., Ré, Christopher, Acosta-Navas, Diana, Hudson, Drew A., Zelikman, Eric, Durmus, Esin, Ladhak, Faisal, Rong, Frieda, Ren, Hongyu, Yao, Huaxiu, Wang, Jue, Santhanam, Keshav, Orr, Laurel, Zheng, Lucia, Yuksekgonul, Mert, Suzgun, Mirac, Kim, Nathan, Guha, Neel, Chatterji, Niladri, Khattab, Omar, Henderson, Peter, Huang, Qian, Chi, Ryan, Xie, Sang Michael, Santurkar, Shibani, Ganguli, Surya, Hashimoto, Tatsunori, Icard, Thomas, Zhang, Tianyi, Chaudhary, Vishrav, Wang, William, Li, Xuechen, Mai, Yifan, Zhang, Yuhui, and Koreeda, Yuta
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood. We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models. First, we taxonomize the vast space of potential scenarios (i.e. use cases) and metrics (i.e. desiderata) that are of interest for LMs. Then we select a broad subset based on coverage and feasibility, noting what's missing or underrepresented (e.g. question answering for neglected English dialects, metrics for trustworthiness). Second, we adopt a multi-metric approach: We measure 7 metrics (accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency) for each of 16 core scenarios when possible (87.5% of the time). This ensures metrics beyond accuracy don't fall to the wayside, and that trade-offs are clearly exposed. We also perform 7 targeted evaluations, based on 26 targeted scenarios, to analyze specific aspects (e.g. reasoning, disinformation). Third, we conduct a large-scale evaluation of 30 prominent language models (spanning open, limited-access, and closed models) on all 42 scenarios, 21 of which were not previously used in mainstream LM evaluation. Prior to HELM, models on average were evaluated on just 17.9% of the core HELM scenarios, with some prominent models not sharing a single scenario in common. We improve this to 96.0%: now all 30 models have been densely benchmarked on the same core scenarios and metrics under standardized conditions. Our evaluation surfaces 25 top-level findings. For full transparency, we release all raw model prompts and completions publicly for further analysis, as well as a general modular toolkit. We intend for HELM to be a living benchmark for the community, continuously updated with new scenarios, metrics, and models.
Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Project page: https://crfm.stanford.edu/helm/v1.0
Published: 2022

17. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

Author: Bianchi, Federico, Kalluri, Pratyusha, Durmus, Esin, Ladhak, Faisal, Cheng, Myra, Nozza, Debora, Hashimoto, Tatsunori, Jurafsky, Dan, Zou, James, and Caliskan, Aylin
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Machine learning models that convert user-written text descriptions into images are now widely available online and used by millions of users to generate millions of images a day. We investigate the potential for these models to amplify dangerous and complex stereotypes. We find a broad range of ordinary prompts produce stereotypes, including prompts simply mentioning traits, descriptors, occupations, or objects. For example, we find cases of prompting for basic traits or social roles resulting in images reinforcing whiteness as ideal, prompting for occupations resulting in amplification of racial and gender disparities, and prompting for objects resulting in reification of American norms. Stereotypes are present regardless of whether prompts explicitly mention identity and demographic language or avoid such language. Moreover, stereotypes persist despite mitigation strategies; neither user attempts to counter stereotypes by requesting images with specific counter-stereotypes nor institutional attempts to add system "guardrails" have prevented the perpetuation of stereotypes. Our analysis justifies concerns regarding the impacts of today's models, presenting striking exemplars, and connecting these findings with deep insights into harms drawn from social scientific and humanist disciplines. This work contributes to the effort to shed light on the uniquely complex biases in language-vision models and demonstrates the ways that the mass deployment of text-to-image generation models results in mass dissemination of stereotypes and resulting harms.
Comment: FAccT 2023 paper. The published version is available at 10.1145/3593013.3594095
Published: 2022

18. GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Author: Gehrmann, Sebastian, Bhattacharjee, Abhik, Mahendiran, Abinaya, Wang, Alex, Papangelis, Alexandros, Madaan, Aman, McMillan-Major, Angelina, Shvets, Anna, Upadhyay, Ashish, Yao, Bingsheng, Wilie, Bryan, Bhagavatula, Chandra, You, Chaobin, Thomson, Craig, Garbacea, Cristina, Wang, Dakuo, Deutsch, Daniel, Xiong, Deyi, Jin, Di, Gkatzia, Dimitra, Radev, Dragomir, Clark, Elizabeth, Durmus, Esin, Ladhak, Faisal, Ginter, Filip, Winata, Genta Indra, Strobelt, Hendrik, Hayashi, Hiroaki, Novikova, Jekaterina, Kanerva, Jenna, Chim, Jenny, Zhou, Jiawei, Clive, Jordan, Maynez, Joshua, Sedoc, João, Juraska, Juraj, Dhole, Kaustubh, Chandu, Khyathi Raghavi, Perez-Beltrachini, Laura, Ribeiro, Leonardo F. R., Tunstall, Lewis, Zhang, Li, Pushkarna, Mahima, Creutz, Mathias, White, Michael, Kale, Mihir Sanjay, Eddine, Moussa Kamal, Daheim, Nico, Subramani, Nishant, Dusek, Ondrej, Liang, Paul Pu, Ammanamanchi, Pawan Sasanka, Zhu, Qi, Puduppully, Ratish, Kriz, Reno, Shahriyar, Rifat, Cardenas, Ronald, Mahamood, Saad, Osei, Salomey, Cahyawijaya, Samuel, Štajner, Sanja, Montella, Sebastien, Shailza, Jolly, Shailza, Mille, Simon, Hasan, Tahmid, Shen, Tianhao, Adewumi, Tosin, Raunak, Vikas, Raheja, Vipul, Nikolaev, Vitaly, Tsai, Vivian, Jernite, Yacine, Xu, Ying, Sang, Yisi, Liu, Yixin, and Hou, Yufang
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Evaluation in machine learning is usually informed by past choices, for example which datasets or metrics to use. This standardization enables the comparison on equal footing using leaderboards, but the evaluation choices become sub-optimal as better alternatives arise. This problem is especially pertinent in natural language generation which requires ever-improving suites of datasets, metrics, and human evaluation to make definitive claims. To make following best model evaluation practices easier, we introduce GEMv2. The new version of the Generation, Evaluation, and Metrics Benchmark introduces a modular infrastructure for dataset, model, and metric developers to benefit from each other's work. GEMv2 supports 40 documented datasets in 51 languages. Models for all datasets can be evaluated online and our interactive data card creation and rendering tools make it easier to add new datasets to the living benchmark.
Published: 2022

19. Spurious Correlations in Reference-Free Evaluation of Text Generation

Author: Durmus, Esin, Ladhak, Faisal, and Hashimoto, Tatsunori
Subjects: Computer Science - Computation and Language
Abstract: Model-based, reference-free evaluation metrics have been proposed as a fast and cost-effective approach to evaluate Natural Language Generation (NLG) systems. Despite promising recent results, we find evidence that reference-free evaluation metrics of summarization and dialog generation may be relying on spurious correlations with measures such as word overlap, perplexity, and length. We further observe that for text summarization, these metrics have high error rates when ranking current state-of-the-art abstractive summarization systems. We demonstrate that these errors can be mitigated by explicitly designing evaluation metrics to avoid spurious features in reference-free evaluation. Comment: Published in ACL 2022 main conference
Published: 2022

20. Language modeling via stochastic processes

Author: Wang, Rose E, Durmus, Esin, Goodman, Noah, and Hashimoto, Tatsunori
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Modern language models can generate high-quality short texts. However, they often meander or are incoherent when generating longer texts. These issues arise from the next-token-only language modeling objective. Recent work in self-supervised learning suggests that models can learn good latent representations via contrastive learning, which can be effective for discriminative tasks. Our work analyzes the application of contrastive representations for generative tasks, like long text generation. We propose one approach for leveraging contrastive representations, which we call Time Control (TC). TC first learns a contrastive representation of the target text domain, then generates text by decoding from these representations. Compared to domain-specific methods and fine-tuning GPT2 across a variety of text domains, TC performs competitively to methods specific for learning sentence representations on discourse coherence. On long text generation settings, TC preserves the text structure both in terms of ordering (up to +15% better) and text length consistency (up to +90% better). Comment: Correct claims on goal-directed decoding, adds non-goal-directed baselines. ICLR 2022 Oral. Code: https://github.com/rosewang2008/language_modeling_via_stochastic_processes
Published: 2022

21. Towards Understanding Persuasion in Computational Argumentation

Author: Durmus, Esin
Subjects: Computer Science - Computation and Language
Abstract: Opinion formation and persuasion in argumentation are affected by three major factors: the argument itself, the source of the argument, and the properties of the audience. Understanding the role of each and the interplay between them is crucial for obtaining insights regarding argument interpretation and generation. It is particularly important for building effective argument generation systems that can take both the discourse and the audience characteristics into account. Having such personalized argument generation systems would be helpful to expose individuals to different viewpoints and help them make a more fair and informed decision on an issue. Even though studies in Social Sciences and Psychology have shown that source and audience effects are essential components of the persuasion process, most research in computational persuasion has focused solely on understanding the characteristics of persuasive language. In this thesis, we make several contributions to understand the relative effect of the source, audience, and language in computational persuasion. We first introduce a large-scale dataset with extensive user information to study these factors' effects simultaneously. Then, we propose models to understand the role of the audience's prior beliefs on their perception of arguments. We also investigate the role of social interactions and engagement in understanding users' success in online debating over time. We find that the users' prior beliefs and social interactions play an essential role in predicting their success in persuasion. Finally, we explore the importance of incorporating contextual information to predict argument impact and show improvements compared to encoding only the text of the arguments. Comment: PhD Dissertation
Published: 2021

22. Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization

Author: Ladhak, Faisal, Durmus, Esin, He, He, Cardie, Claire, and McKeown, Kathleen
Subjects: Computer Science - Computation and Language
Abstract: Despite recent progress in abstractive summarization, systems still suffer from faithfulness errors. While prior work has proposed models that improve faithfulness, it is unclear whether the improvement comes from an increased level of extractiveness of the model outputs as one naive way to improve faithfulness is to make summarization models more extractive. In this work, we present a framework for evaluating the effective faithfulness of summarization systems, by generating a faithfulness-abstractiveness trade-off curve that serves as a control at different operating points on the abstractiveness spectrum. We then show that the Maximum Likelihood Estimation (MLE) baseline as well as a recently proposed method for improving faithfulness, are both worse than the control at the same level of abstractiveness. Finally, we learn a selector to identify the most faithful and abstractive summary for a given document, and show that this system can attain higher faithfulness scores in human evaluations while being more abstractive than the baseline system on two datasets. Moreover, we show that our system is able to achieve a better faithfulness-abstractiveness trade-off than the control at the same level of abstractiveness. Comment: Published in ACL 2022 main conference
Published: 2021

23. On the Opportunities and Risks of Foundation Models

Author: Bommasani, Rishi, Hudson, Drew A., Adeli, Ehsan, Altman, Russ, Arora, Simran, von Arx, Sydney, Bernstein, Michael S., Bohg, Jeannette, Bosselut, Antoine, Brunskill, Emma, Brynjolfsson, Erik, Buch, Shyamal, Card, Dallas, Castellon, Rodrigo, Chatterji, Niladri, Chen, Annie, Creel, Kathleen, Davis, Jared Quincy, Demszky, Dora, Donahue, Chris, Doumbouya, Moussa, Durmus, Esin, Ermon, Stefano, Etchemendy, John, Ethayarajh, Kawin, Fei-Fei, Li, Finn, Chelsea, Gale, Trevor, Gillespie, Lauren, Goel, Karan, Goodman, Noah, Grossman, Shelby, Guha, Neel, Hashimoto, Tatsunori, Henderson, Peter, Hewitt, John, Ho, Daniel E., Hong, Jenny, Hsu, Kyle, Huang, Jing, Icard, Thomas, Jain, Saahil, Jurafsky, Dan, Kalluri, Pratyusha, Karamcheti, Siddharth, Keeling, Geoff, Khani, Fereshte, Khattab, Omar, Koh, Pang Wei, Krass, Mark, Krishna, Ranjay, Kuditipudi, Rohith, Kumar, Ananya, Ladhak, Faisal, Lee, Mina, Lee, Tony, Leskovec, Jure, Levent, Isabelle, Li, Xiang Lisa, Li, Xuechen, Ma, Tengyu, Malik, Ali, Manning, Christopher D., Mirchandani, Suvir, Mitchell, Eric, Munyikwa, Zanele, Nair, Suraj, Narayan, Avanika, Narayanan, Deepak, Newman, Ben, Nie, Allen, Niebles, Juan Carlos, Nilforoshan, Hamed, Nyarko, Julian, Ogut, Giray, Orr, Laurel, Papadimitriou, Isabel, Park, Joon Sung, Piech, Chris, Portelance, Eva, Potts, Christopher, Raghunathan, Aditi, Reich, Rob, Ren, Hongyu, Rong, Frieda, Roohani, Yusuf, Ruiz, Camilo, Ryan, Jack, Ré, Christopher, Sadigh, Dorsa, Sagawa, Shiori, Santhanam, Keshav, Shih, Andy, Srinivasan, Krishnan, Tamkin, Alex, Taori, Rohan, Thomas, Armin W., Tramèr, Florian, Wang, Rose E., Wang, William, Wu, Bohan, Wu, Jiajun, Wu, Yuhuai, Xie, Sang Michael, Yasunaga, Michihiro, You, Jiaxuan, Zaharia, Matei, Zhang, Michael, Zhang, Tianyi, Zhang, Xikun, Zhang, Yuhui, Zheng, Lucia, Zhou, Kaitlyn, and Liang, Percy
Subjects: Computer Science - Machine Learning, Computer Science - Artificial Intelligence, Computer Science - Computers and Society
Abstract: AI is undergoing a paradigm shift with the rise of models (e.g., BERT, DALL-E, GPT-3) that are trained on broad data at scale and are adaptable to a wide range of downstream tasks. We call these models foundation models to underscore their critically central yet incomplete character. This report provides a thorough account of the opportunities and risks of foundation models, ranging from their capabilities (e.g., language, vision, robotics, reasoning, human interaction) and technical principles (e.g., model architectures, training procedures, data, systems, security, evaluation, theory) to their applications (e.g., law, healthcare, education) and societal impact (e.g., inequity, misuse, economic and environmental impact, legal and ethical considerations). Though foundation models are based on standard deep learning and transfer learning, their scale results in new emergent capabilities, and their effectiveness across so many tasks incentivizes homogenization. Homogenization provides powerful leverage but demands caution, as the defects of the foundation model are inherited by all the adapted models downstream. Despite the impending widespread deployment of foundation models, we currently lack a clear understanding of how they work, when they fail, and what they are even capable of due to their emergent properties. To tackle these questions, we believe much of the critical research on foundation models will require deep interdisciplinary collaboration commensurate with their fundamentally sociotechnical nature. Comment: Authored by the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered Artificial Intelligence (HAI). Report page with citation guidelines: https://crfm.stanford.edu/report.html
Published: 2021

24. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Author: Gehrmann, Sebastian, Adewumi, Tosin, Aggarwal, Karmanya, Ammanamanchi, Pawan Sasanka, Aremu, Anuoluwapo, Bosselut, Antoine, Chandu, Khyathi Raghavi, Clinciu, Miruna, Das, Dipanjan, Dhole, Kaustubh D., Du, Wanyu, Durmus, Esin, Dušek, Ondřej, Emezue, Chris, Gangal, Varun, Garbacea, Cristina, Hashimoto, Tatsunori, Hou, Yufang, Jernite, Yacine, Jhamtani, Harsh, Ji, Yangfeng, Jolly, Shailza, Kale, Mihir, Kumar, Dhruv, Ladhak, Faisal, Madaan, Aman, Maddela, Mounica, Mahajan, Khyati, Mahamood, Saad, Majumder, Bodhisattwa Prasad, Martins, Pedro Henrique, McMillan-Major, Angelina, Mille, Simon, van Miltenburg, Emiel, Nadeem, Moin, Narayan, Shashi, Nikolaev, Vitaly, Niyongabo, Rubungo Andre, Osei, Salomey, Parikh, Ankur, Perez-Beltrachini, Laura, Rao, Niranjan Ramesh, Raunak, Vikas, Rodriguez, Juan Diego, Santhanam, Sashank, Sedoc, João, Sellam, Thibault, Shaikh, Samira, Shimorina, Anastasia, Cabezudo, Marco Antonio Sobrevilla, Strobelt, Hendrik, Subramani, Nishant, Xu, Wei, Yang, Diyi, Yerukola, Akhila, and Zhou, Jiawei
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We introduce GEM, a living benchmark for natural language Generation (NLG), its Evaluation, and Metrics. Measuring progress in NLG relies on a constantly evolving ecosystem of automated metrics, datasets, and human evaluation standards. Due to this moving target, new models often still evaluate on divergent anglo-centric corpora with well-established, but flawed, metrics. This disconnect makes it challenging to identify the limitations of current models and opportunities for progress. Addressing this limitation, GEM provides an environment in which models can easily be applied to a wide set of tasks and in which evaluation strategies can be tested. Regular updates to the benchmark will help NLG research become more multilingual and evolve the challenge alongside models. This paper serves as the description of the data for which we are organizing a shared task at our ACL 2021 Workshop and to which we invite the entire NLG community to participate.
Published: 2021

25. Exploring the Role of Argument Structure in Online Debate Persuasion

Author: Li, Jialu, Durmus, Esin, and Cardie, Claire
Subjects: Computer Science - Computation and Language
Abstract: Online debate forums provide users a platform to express their opinions on controversial topics while being exposed to opinions from a diverse set of viewpoints. Existing work in Natural Language Processing (NLP) has shown that linguistic features extracted from the debate text and features encoding the characteristics of the audience are both critical in persuasion studies. In this paper, we aim to further investigate the role of discourse structure of the arguments from online debates in their persuasiveness. In particular, we use the factor graph model to obtain features for the argument structure of debates from an online debating platform and incorporate these features into an LSTM-based model to predict the debater that makes the most convincing arguments. We find that incorporating argument structure features plays an essential role in achieving better predictive performance in assessing the persuasiveness of the arguments in online debates. Comment: Accepted to EMNLP 2020
Published: 2020

26. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

Author: Ladhak, Faisal, Durmus, Esin, Cardie, Claire, and McKeown, Kathleen
Subjects: Computer Science - Computation and Language
Abstract: We introduce WikiLingua, a large-scale, multilingual dataset for the evaluation of cross-lingual abstractive summarization systems. We extract article and summary pairs in 18 languages from WikiHow, a high quality, collaborative resource of how-to guides on a diverse set of topics written by human authors. We create gold-standard article-summary alignments across languages by aligning the images that are used to describe each how-to step in an article. As a set of baselines for further studies, we evaluate the performance of existing cross-lingual abstractive summarization methods on our dataset. We further propose a method for direct cross-lingual summarization (i.e., without requiring translation at inference time) by leveraging synthetic data and Neural Machine Translation as a pre-training step. Our method significantly outperforms the baseline approaches, while being more cost efficient during inference. Comment: Findings of EMNLP 2020
Published: 2020

27. FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Author: Durmus, Esin, He, He, and Diab, Mona
Subjects: Computer Science - Computation and Language
Abstract: Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries. Comment: Accepted to ACL 2020
Published: 2020
Full Text: View/download PDF

28. The Role of Pragmatic and Discourse Context in Determining Argument Impact

Author: Durmus, Esin, Ladhak, Faisal, and Cardie, Claire
Subjects: Computer Science - Computation and Language
Abstract: Research in the social sciences and psychology has shown that the persuasiveness of an argument depends not only on the language employed, but also on attributes of the source/communicator, the audience, and the appropriateness and strength of the argument's claims given the pragmatic and discourse context of the argument. Among these characteristics of persuasive arguments, prior work in NLP does not explicitly investigate the effect of the pragmatic and discourse context when determining argument quality. This paper presents a new dataset to initiate the study of this aspect of argumentation: it consists of a diverse collection of arguments covering 741 controversial topics and comprising over 47,000 claims. We further propose predictive models that incorporate the pragmatic and discourse context of argumentative claims and show that they outperform models that rely only on claim-specific linguistic features for predicting the perceived impact of individual claims within a particular line of argument. Comment: EMNLP 2019
Published: 2020
Full Text: View/download PDF

29. Determining Relative Argument Specificity and Stance for Complex Argumentative Structures

Author: Durmus, Esin, Ladhak, Faisal, and Cardie, Claire
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Systems for automatic argument generation and debate require the ability to (1) determine the stance of any claims employed in the argument and (2) assess the specificity of each claim relative to the argument context. Existing work on understanding claim specificity and stance, however, has been limited to the study of argumentative structures that are relatively shallow, most often consisting of a single claim that directly supports or opposes the argument thesis. In this paper, we tackle these tasks in the context of complex arguments on a diverse set of topics. In particular, our dataset consists of manually curated argument trees for 741 controversial topics covering 95,312 unique claims; lines of argument are generally of depth 2 to 6. We find that as the distance between a pair of claims increases along the argument path, determining the relative specificity of a pair of claims becomes easier and determining their relative stance becomes harder.
Published: 2019
Full Text: View/download PDF

30. Exploring the Role of Prior Beliefs for Argument Persuasion

Author: Durmus, Esin and Cardie, Claire
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Public debate forums provide a common platform for exchanging opinions on a topic of interest. While recent studies in natural language processing (NLP) have provided empirical evidence that the language of the debaters and their patterns of interaction play a key role in changing the mind of a reader, research in psychology has shown that prior beliefs can affect our interpretation of an argument and could therefore constitute a competing alternative explanation for resistance to changing one's stance. To study the actual effect of language use vs. prior beliefs on persuasion, we provide a new dataset and propose a controlled setting that takes into consideration two reader level factors: political and religious ideology. We find that prior beliefs affected by these reader level factors play a more important role than language use effects and argue that it is important to account for them in NLP studies of persuasion. Comment: 11 pages
Published: 2019
Full Text: View/download PDF

31. A Corpus for Modeling User and Language Effects in Argumentation on Online Debating

Author: Durmus, Esin and Cardie, Claire
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Existing argumentation datasets have succeeded in allowing researchers to develop computational methods for analyzing the content, structure and linguistic features of argumentative text. They have been much less successful in fostering studies of the effect of "user" traits -- characteristics and beliefs of the participants -- on the debate/argument outcome as this type of user information is generally not available. This paper presents a dataset of 78,376 debates generated over a 10-year period along with surprisingly comprehensive participant profiles. We also complete an example study using the dataset to analyze the effect of selected user traits on the debate outcome in comparison to the linguistic features typically employed in studies of this kind.
Published: 2019
Full Text: View/download PDF

32. Collective Constitutional AI: Aligning a Language Model with Public Input

Author: Huang, Saffron, primary, Siddarth, Divya, additional, Lovitt, Liane, additional, Liao, Thomas I., additional, Durmus, Esin, additional, Tamkin, Alex, additional, and Ganguli, Deep, additional
Published: 2024
Full Text: View/download PDF

33. Benchmarking Large Language Models for News Summarization

Author: Zhang, Tianyi, primary, Ladhak, Faisal, additional, Durmus, Esin, additional, Liang, Percy, additional, McKeown, Kathleen, additional, and Hashimoto, Tatsunori B., additional
Published: 2024
Full Text: View/download PDF

34. Easily Accessible Text-to-Image Generation Amplifies Demographic Stereotypes at Large Scale

Author: Bianchi, Federico, primary, Kalluri, Pratyusha, additional, Durmus, Esin, additional, Ladhak, Faisal, additional, Cheng, Myra, additional, Nozza, Debora, additional, Hashimoto, Tatsunori, additional, Jurafsky, Dan, additional, Zou, James, additional, and Caliskan, Aylin, additional
Published: 2023
Full Text: View/download PDF

35. Marked Personas: Using Natural Language Prompts to Measure Stereotypes in Language Models

Author: Cheng, Myra, primary, Durmus, Esin, additional, and Jurafsky, Dan, additional
Published: 2023
Full Text: View/download PDF

36. Towards Reference-free Text Simplification Evaluation with a BERT Siamese Network Architecture

Author: Zhao, Xinran, primary, Durmus, Esin, additional, and Yeung, Dit-Yan, additional
Published: 2023
Full Text: View/download PDF

37. Contrastive Error Attribution for Finetuned Language Models

Author: Ladhak, Faisal, primary, Durmus, Esin, additional, and Hashimoto, Tatsunori, additional
Published: 2023
Full Text: View/download PDF

38. Improving Faithfulness by Augmenting Negative Summaries from Fake Documents

Author: Wang, Tianshu, primary, Ladhak, Faisal, additional, Durmus, Esin, additional, and He, He, additional
Published: 2022
Full Text: View/download PDF

39. GEMv2: Multilingual NLG Benchmarking in a Single Line of Code

Author: Gehrmann, Sebastian, primary, Bhattacharjee, Abhik, additional, Mahendiran, Abinaya, additional, Wang, Alex, additional, Papangelis, Alexandros, additional, Madaan, Aman, additional, McMillan-Major, Angelina, additional, Shvets, Anna, additional, Upadhyay, Ashish, additional, Bohnet, Bernd, additional, Yao, Bingsheng, additional, Wilie, Bryan, additional, Bhagavatula, Chandra, additional, You, Chaobin, additional, Thomson, Craig, additional, Garbacea, Cristina, additional, Wang, Dakuo, additional, Deutsch, Daniel, additional, Xiong, Deyi, additional, Jin, Di, additional, Gkatzia, Dimitra, additional, Radev, Dragomir, additional, Clark, Elizabeth, additional, Durmus, Esin, additional, Ladhak, Faisal, additional, Ginter, Filip, additional, Winata, Genta Indra, additional, Strobelt, Hendrik, additional, Hayashi, Hiroaki, additional, Novikova, Jekaterina, additional, Kanerva, Jenna, additional, Chim, Jenny, additional, Zhou, Jiawei, additional, Clive, Jordan, additional, Maynez, Joshua, additional, Sedoc, João, additional, Juraska, Juraj, additional, Dhole, Kaustubh, additional, Chandu, Khyathi Raghavi, additional, Perez-Beltrachini, Laura, additional, Ribeiro, Leonardo F. R., additional, Tunstall, Lewis, additional, Zhang, Li, additional, Pushkarna, Mahima, additional, Creutz, Mathias, additional, White, Michael, additional, Kale, Mihir Sanjay, additional, Eddine, Moussa Kamal, additional, Daheim, Nico, additional, Subramani, Nishant, additional, Dusek, Ondrej, additional, Liang, Paul Pu, additional, Ammanamanchi, Pawan Sasanka, additional, Zhu, Qi, additional, Puduppully, Ratish, additional, Kriz, Reno, additional, Shahriyar, Rifat, additional, Cardenas, Ronald, additional, Mahamood, Saad, additional, Osei, Salomey, additional, Cahyawijaya, Samuel, additional, Štajner, Sanja, additional, Montella, Sebastien, additional, Jolly, Shailza, additional, Mille, Simon, additional, Hasan, Tahmid, additional, Shen, Tianhao, additional, Adewumi, Tosin, additional, Raunak, Vikas, additional, Raheja, Vipul, additional, Nikolaev, Vitaly, additional, Tsai, Vivian, additional, Jernite, Yacine, additional, Xu, Ying, additional, Sang, Yisi, additional, Liu, Yixin, additional, and Hou, Yufang, additional
Published: 2022
Full Text: View/download PDF

40. Spurious Correlations in Reference-Free Evaluation of Text Generation

Author: Durmus, Esin, primary, Ladhak, Faisal, additional, and Hashimoto, Tatsunori, additional
Published: 2022
Full Text: View/download PDF

41. Faithful or Extractive? On Mitigating the Faithfulness-Abstractiveness Trade-off in Abstractive Summarization

Author: Ladhak, Faisal, primary, Durmus, Esin, additional, He, He, additional, Cardie, Claire, additional, and McKeown, Kathleen, additional
Published: 2022
Full Text: View/download PDF

42. Leveraging Topic Relatedness for Argument Persuasion

Author: Zhao, Xinran, primary, Durmus, Esin, additional, Zhang, Hongming, additional, and Cardie, Claire, additional
Published: 2021
Full Text: View/download PDF

43. The GEM Benchmark: Natural Language Generation, its Evaluation and Metrics

Author: Gehrmann, Sebastian, primary, Adewumi, Tosin, additional, Aggarwal, Karmanya, additional, Ammanamanchi, Pawan Sasanka, additional, Aremu, Anuoluwapo, additional, Bosselut, Antoine, additional, Chandu, Khyathi Raghavi, additional, Clinciu, Miruna-Adriana, additional, Das, Dipanjan, additional, Dhole, Kaustubh, additional, Du, Wanyu, additional, Durmus, Esin, additional, Dušek, Ondřej, additional, Emezue, Chris Chinenye, additional, Gangal, Varun, additional, Garbacea, Cristina, additional, Hashimoto, Tatsunori, additional, Hou, Yufang, additional, Jernite, Yacine, additional, Jhamtani, Harsh, additional, Ji, Yangfeng, additional, Jolly, Shailza, additional, Kale, Mihir, additional, Kumar, Dhruv, additional, Ladhak, Faisal, additional, Madaan, Aman, additional, Maddela, Mounica, additional, Mahajan, Khyati, additional, Mahamood, Saad, additional, Majumder, Bodhisattwa Prasad, additional, Martins, Pedro Henrique, additional, McMillan-Major, Angelina, additional, Mille, Simon, additional, van Miltenburg, Emiel, additional, Nadeem, Moin, additional, Narayan, Shashi, additional, Nikolaev, Vitaly, additional, Niyongabo Rubungo, Andre, additional, Osei, Salomey, additional, Parikh, Ankur, additional, Perez-Beltrachini, Laura, additional, Rao, Niranjan Ramesh, additional, Raunak, Vikas, additional, Rodriguez, Juan Diego, additional, Santhanam, Sashank, additional, Sedoc, João, additional, Sellam, Thibault, additional, Shaikh, Samira, additional, Shimorina, Anastasia, additional, Sobrevilla Cabezudo, Marco Antonio, additional, Strobelt, Hendrik, additional, Subramani, Nishant, additional, Xu, Wei, additional, Yang, Diyi, additional, Yerukola, Akhila, additional, and Zhou, Jiawei, additional
Published: 2021
Full Text: View/download PDF

44. Proceedings of the Third Workshop on Computational Modeling of People's Opinions, Personality, and Emotion's in Social Media

Author: Nissim, Malvina, Patti, Viviana, Plank, Barbara, and Durmus, Esin
Published: 2020

45. Topic and Emotion Development among Dutch COVID-19 Twitter Communities in the early Pandemic

Author: Marinov, Boris, Spenader, Jennifer, Caselli, Tommaso, Nissim, Malvina, Patti, Viviana, Plank, Barbara, and Durmus, Esin
Subjects: emotion detection, COVID-19, social media analysis, Dutch, NLP
Abstract: The paper focuses on a large collection of Dutch tweets from the Netherlands to get an insight into the perception and reactions of users during the early months of the COVID-19 pandemic. We focused on five major communities of users: government and health organizations, news media, politicians, the general public and conspiracy theory supporters, investigating differences among them in topic dominance and the expressions of emotions. Through topic modeling we monitor the evolution of the conversation about COVID-19 among these communities. Our results indicate that the national focus on COVID-19 shifted from the virus itself to its impact on the economy between February and April. Surprisingly, the overall emotional public response appears to be substantially positive and expressing trust, although differences can be observed in specific groups of users.
Published: 2020

46. Matching Theory and Data with Personal-ITY: What a Corpus of Italian YouTube Comments Reveals About Personality

Author: Bassignana, Elisa, Nissim, Malvina, Patti, Viviana, Plank, Barbara, and Durmus, Esin
Abstract: As a contribution to personality detection in languages other than English, we rely on distant supervision to create Personal-ITY, a novel corpus of YouTube comments in Italian, where authors are labelled with personality traits. The traits are derived from one of the mainstream personality theories in psychology research, named MBTI. Using personality prediction experiments, we (i) study the task of personality prediction in itself on our corpus as well as on TWISTY, a Twitter dataset also annotated with MBTI labels; (ii) carry out an extensive, in-depth analysis of the features used by the classifier, and view them specifically under the light of the original theory that we used to create the corpus in the first place. We observe that no single model is best at personality detection, and that while some traits are easier than others to detect, and also to match back to theory, for other, less frequent traits the picture is much more blurred.
Published: 2020

47. Exploring the Role of Argument Structure in Online Debate Persuasion

Author: Li, Jialu, primary, Durmus, Esin, additional, and Cardie, Claire, additional
Published: 2020
Full Text: View/download PDF

48. WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

Author: Ladhak, Faisal, primary, Durmus, Esin, additional, Cardie, Claire, additional, and McKeown, Kathleen, additional
Published: 2020
Full Text: View/download PDF

49. Modeling the Factors of User Success in Online Debate

Author: Durmus, Esin, primary and Cardie, Claire, additional
Published: 2019
Full Text: View/download PDF

50. Persuasion of the Undecided: Language vs. the Listener

Author: Longpre, Liane, primary, Durmus, Esin, additional, and Cardie, Claire, additional
Published: 2019
Full Text: View/download PDF
