Author: "Tafjord, Oyvind" - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Tafjord, Oyvind"' showing total 141 results

1. SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs

Author: Gu, Yuling, Tafjord, Oyvind, Kim, Hyunwoo, Moore, Jared, Bras, Ronan Le, Clark, Peter, and Choi, Yejin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While prior work has explored whether large language models (LLMs) possess a "theory of mind" (ToM) - the ability to attribute mental states to oneself and others - there has been little work testing whether LLMs can implicitly apply such knowledge to predict behavior, or to judge whether an observed behavior is rational. Such skills are critical for appropriate interaction in social environments. We create a new dataset, SimpleToM, containing concise, diverse stories (e.g., "The can of Pringles has moldy chips in it. Mary picks up the can in the supermarket and walks to the cashier."), each with three questions that test different degrees of ToM reasoning, asking models to predict (a) mental state ("Is Mary aware of the mold?"), (b) behavior ("Will Mary pay for the chips or report the mold?"), and (c) judgment ("Mary paid for the chips. Was that reasonable?"). To our knowledge, SimpleToM is the first dataset to systematically explore downstream reasoning requiring knowledge of mental states in realistic scenarios. Our experimental results are intriguing: While most models can reliably predict mental state on our dataset (a), they often fail to correctly predict the behavior (b), and fare even worse at judging whether given behaviors are reasonable (c), even though correct awareness of the protagonist's mental state should make such secondary predictions obvious. We further show that we can help models do better at (b) and (c) via interventions such as reminding the model of its earlier mental state answer and mental-state-specific chain-of-thought prompting, raising the action prediction accuracies (e.g., from 49.5% to 93.5% for GPT-4o) and judgment accuracies (e.g., from 15.3% to 94.7% for GPT-4o). While this shows that models can be coaxed to perform well, it requires task-specific interventions, and the natural model performances remain low, a cautionary tale for LLM deployment.
Published: 2024

2. OLMoE: Open Mixture-of-Experts Language Models

Author: Muennighoff, Niklas, Soldaini, Luca, Groeneveld, Dirk, Lo, Kyle, Morrison, Jacob, Min, Sewon, Shi, Weijia, Walsh, Pete, Tafjord, Oyvind, Lambert, Nathan, Gu, Yuling, Arora, Shane, Bhagia, Akshita, Schwenk, Dustin, Wadden, David, Wettig, Alexander, Hui, Binyuan, Dettmers, Tim, Kiela, Douwe, Farhadi, Ali, Smith, Noah A., Koh, Pang Wei, Singh, Amanpreet, and Hajishirzi, Hannaneh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
Comment: 61 pages (24 main), 36 figures, 14 tables
Published: 2024

3. Answer, Assemble, Ace: Understanding How Transformers Answer Multiple Choice Questions

Author: Wiegreffe, Sarah, Tafjord, Oyvind, Belinkov, Yonatan, Hajishirzi, Hannaneh, and Sabharwal, Ashish
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Multiple-choice question answering (MCQA) is a key competence of performant transformer language models that is tested by mainstream benchmarks. However, recent evidence shows that models can have quite a range of performance, particularly when the task format is diversified slightly (such as by shuffling answer choice order). In this work we ask: how do successful models perform formatted MCQA? We employ vocabulary projection and activation patching methods to localize key hidden states that encode relevant information for predicting the correct answer. We find that prediction of a specific answer symbol is causally attributed to a single middle layer, and specifically its multi-head self-attention mechanism. We show that subsequent layers increase the probability of the predicted answer symbol in vocabulary space, and that this probability increase is associated with a sparse set of attention heads with unique roles. We additionally uncover differences in how different models adjust to alternative symbols. Finally, we demonstrate that a synthetic task can disentangle sources of model error to pinpoint when a model has learned formatted MCQA, and show that an inability to separate answer symbol tokens in vocabulary space is a property of models unable to perform formatted MCQA tasks.
Comment: Preprint. Code will be available at https://github.com/allenai/understanding_mcqa
Published: 2024

4. OLMES: A Standard for Language Model Evaluations

Author: Gu, Yuling, Tafjord, Oyvind, Kuehl, Bailey, Haddad, Dany, Dodge, Jesse, and Hajishirzi, Hannaneh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Progress in AI is often demonstrated by new models claiming improved performance on tasks measuring model capabilities. Evaluating language models in particular is challenging, as small changes to how a model is evaluated on a task can lead to large changes in measured performance. There is no common standard setup, so different models are evaluated on the same tasks in different ways, leading to claims about which models perform best not being reproducible. We propose OLMES, a completely documented, practical, open standard for reproducible LLM evaluations. In developing this standard, we identify and review the varying factors in evaluation practices adopted by the community - such as details of prompt formatting, choice of in-context examples, probability normalizations, and task formulation. In particular, OLMES supports meaningful comparisons between smaller base models that require the unnatural "cloze" formulation of multiple-choice questions against larger models that can utilize the original formulation. OLMES includes well-considered recommendations guided by results from existing literature as well as new experiments investigating open questions.
Published: 2024

5. DISCOVERYWORLD: A Virtual Environment for Developing and Evaluating Automated Scientific Discovery Agents

Author: Jansen, Peter, Côté, Marc-Alexandre, Khot, Tushar, Bransom, Erin, Mishra, Bhavana Dalvi, Majumder, Bodhisattwa Prasad, Tafjord, Oyvind, and Clark, Peter
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Automated scientific discovery promises to accelerate progress across scientific domains. However, developing and evaluating an AI agent's capacity for end-to-end scientific reasoning is challenging as running real-world experiments is often prohibitively expensive or infeasible. In this work we introduce DISCOVERYWORLD, the first virtual environment for developing and benchmarking an agent's ability to perform complete cycles of novel scientific discovery. DISCOVERYWORLD contains a variety of different challenges, covering topics as diverse as radioisotope dating, rocket science, and proteomics, to encourage development of general discovery skills rather than task-specific solutions. DISCOVERYWORLD itself is an inexpensive, simulated, text-based environment (with optional 2D visual overlay). It includes 120 different challenge tasks, spanning eight topics each with three levels of difficulty and several parametric variations. Each task requires an agent to form hypotheses, design and run experiments, analyze results, and act on conclusions. DISCOVERYWORLD further provides three automatic metrics for evaluating performance, based on (a) task completion, (b) task-relevant actions taken, and (c) the discovered explanatory knowledge. We find that strong baseline agents that perform well in prior published environments struggle on most DISCOVERYWORLD tasks, suggesting that DISCOVERYWORLD captures some of the novel challenges of discovery, and thus that DISCOVERYWORLD may help accelerate near-term development and assessment of scientific discovery competency in agents. Code available at: www.github.com/allenai/discoveryworld
Comment: Accepted to NeurIPS 2024 (Benchmark Track, Spotlight)
Published: 2024

6. Enhancing Systematic Decompositional Natural Language Inference Using Informal Logic

Author: Weir, Nathaniel, Sanders, Kate, Weller, Orion, Sharma, Shreya, Jiang, Dongwei, Jiang, Zhengping, Mishra, Bhavana Dalvi, Tafjord, Oyvind, Jansen, Peter, Clark, Peter, and Van Durme, Benjamin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Recent language models enable new opportunities for structured reasoning with text, such as the construction of intuitive, proof-like textual entailment trees without relying on brittle formal logic. However, progress in this direction has been hampered by a long-standing lack of a clear protocol for determining what valid compositional entailment is. This absence causes noisy datasets and limited performance gains by modern neuro-symbolic engines. To address these problems, we formulate a consistent and theoretically grounded approach to annotating decompositional entailment and evaluate its impact on LLM-based textual inference. We find that our new dataset, RDTE (Recognizing Decompositional Textual Entailment), has a substantially higher internal consistency (+9%) than prior decompositional entailment datasets. We also find that training an RDTE-oriented entailment classifier via knowledge distillation and employing it in an entailment tree reasoning engine significantly improves both accuracy and proof quality, illustrating the practical benefit of this advance for textual inference.
Published: 2024

7. OLMo: Accelerating the Science of Language Models

Author: Groeneveld, Dirk, Beltagy, Iz, Walsh, Pete, Bhagia, Akshita, Kinney, Rodney, Tafjord, Oyvind, Jha, Ananya Harsh, Ivison, Hamish, Magnusson, Ian, Wang, Yizhong, Arora, Shane, Atkinson, David, Authur, Russell, Chandu, Khyathi Raghavi, Cohan, Arman, Dumas, Jennifer, Elazar, Yanai, Gu, Yuling, Hessel, Jack, Khot, Tushar, Merrill, William, Morrison, Jacob, Muennighoff, Niklas, Naik, Aakanksha, Nam, Crystal, Peters, Matthew E., Pyatkin, Valentina, Ravichander, Abhilasha, Schwenk, Dustin, Shah, Saurabh, Smith, Will, Strubell, Emma, Subramani, Nishant, Wortsman, Mitchell, Dasigi, Pradeep, Lambert, Nathan, Richardson, Kyle, Zettlemoyer, Luke, Dodge, Jesse, Lo, Kyle, Soldaini, Luca, Smith, Noah A., and Hajishirzi, Hannaneh
Subjects: Computer Science - Computation and Language
Abstract: Language models (LMs) have become ubiquitous in both NLP research and in commercial product offerings. As their commercial importance has surged, the most powerful models have become closed off, gated behind proprietary interfaces, with important details of their training data, architectures, and development undisclosed. Given the importance of these details in scientifically studying these models, including their biases and potential risks, we believe it is essential for the research community to have access to powerful, truly open LMs. To this end, we have built OLMo, a competitive, truly Open Language Model, to enable the scientific study of language models. Unlike most prior efforts that have only released model weights and inference code, we release OLMo alongside open training data and training and evaluation code. We hope this release will empower the open research community and inspire a new wave of innovation.
Published: 2024

8. Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research

Author: Soldaini, Luca, Kinney, Rodney, Bhagia, Akshita, Schwenk, Dustin, Atkinson, David, Authur, Russell, Bogin, Ben, Chandu, Khyathi, Dumas, Jennifer, Elazar, Yanai, Hofmann, Valentin, Jha, Ananya Harsh, Kumar, Sachin, Lucy, Li, Lyu, Xinxi, Lambert, Nathan, Magnusson, Ian, Morrison, Jacob, Muennighoff, Niklas, Naik, Aakanksha, Nam, Crystal, Peters, Matthew E., Ravichander, Abhilasha, Richardson, Kyle, Shen, Zejiang, Strubell, Emma, Subramani, Nishant, Tafjord, Oyvind, Walsh, Pete, Zettlemoyer, Luke, Smith, Noah A., Hajishirzi, Hannaneh, Beltagy, Iz, Groeneveld, Dirk, Dodge, Jesse, and Lo, Kyle
Subjects: Computer Science - Computation and Language
Abstract: Information about pretraining corpora used to train the current best-performing language models is seldom discussed: commercial models rarely detail their data, and even open models are often released without accompanying training data or recipes to reproduce them. As a result, it is challenging to conduct and advance scientific research on language modeling, such as understanding how training data impacts model capabilities and limitations. To facilitate scientific research on language model pretraining, we curate and release Dolma, a three-trillion-token English corpus, built from a diverse mixture of web content, scientific papers, code, public-domain books, social media, and encyclopedic materials. We extensively document Dolma, including its design principles, details about its construction, and a summary of its contents. We present analyses and experimental results on intermediate states of Dolma to share what we have learned about important data curation practices. Finally, we open-source our data curation toolkit to enable reproduction of our work as well as support further research in large-scale data curation.
Comment: Accepted at ACL 2024; Dataset: https://hf.co/datasets/allenai/dolma; Code: https://github.com/allenai/dolma
Published: 2024

9. Paloma: A Benchmark for Evaluating Language Model Fit

Author: Magnusson, Ian, Bhagia, Akshita, Hofmann, Valentin, Soldaini, Luca, Jha, Ananya Harsh, Tafjord, Oyvind, Schwenk, Dustin, Walsh, Evan Pete, Elazar, Yanai, Lo, Kyle, Groeneveld, Dirk, Beltagy, Iz, Hajishirzi, Hannaneh, Smith, Noah A., Richardson, Kyle, and Dodge, Jesse
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language models (LMs) commonly report perplexity on monolithic data held out from training. Implicitly or explicitly, this data is composed of domains - varying distributions of language. Rather than assuming perplexity on one distribution extrapolates to others, Perplexity Analysis for Language Model Assessment (Paloma) measures LM fit to 585 text domains, ranging from nytimes.com to r/depression on Reddit. We invite submissions to our benchmark and organize results by comparability based on compliance with guidelines such as removal of benchmark contamination from pretraining. Submissions can also record parameter and training token count to make comparisons of Pareto efficiency for performance as a function of these measures of cost. We populate our benchmark with results from 6 baselines pretrained on popular corpora. In case studies, we demonstrate analyses that are possible with Paloma, such as finding that pretraining without data beyond Common Crawl leads to inconsistent fit to many domains.
Comment: Project Page: https://paloma.allen.ai/
Published: 2023

10. Catwalk: A Unified Language Model Evaluation Framework for Many Datasets

Author: Groeneveld, Dirk, Awadalla, Anas, Beltagy, Iz, Bhagia, Akshita, Magnusson, Ian, Peng, Hao, Tafjord, Oyvind, Walsh, Pete, Richardson, Kyle, and Dodge, Jesse
Subjects: Computer Science - Computation and Language
Abstract: The success of large language models has shifted the evaluation paradigms in natural language processing (NLP). The community's interest has drifted towards comparing NLP models across many tasks, domains, and datasets, often at an extreme scale. This imposes new engineering challenges: efforts in constructing datasets and models have been fragmented, and their formats and interfaces are incompatible. As a result, it often takes extensive (re)implementation efforts to make fair and controlled comparisons at scale. Catwalk aims to address these issues. Catwalk provides a unified interface to a broad range of existing NLP datasets and models, ranging from canonical supervised training and fine-tuning to more modern paradigms like in-context learning. Its carefully-designed abstractions allow for easy extensions to many others. Catwalk substantially lowers the barriers to conducting controlled experiments at scale. For example, we finetuned and evaluated over 64 models on over 86 datasets with a single command, without writing any code. Maintained by the AllenNLP team at the Allen Institute for Artificial Intelligence (AI2), Catwalk is an ongoing open-source effort: https://github.com/allenai/catwalk.
Comment: technical report, work in progress
Published: 2023

11. BaRDa: A Belief and Reasoning Dataset that Separates Factual Accuracy and Reasoning Ability

Author: Clark, Peter, Mishra, Bhavana Dalvi, and Tafjord, Oyvind
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While there are numerous benchmarks comparing the performance of modern language models (LMs), end-task evaluations often conflate notions of *factual accuracy* ("truth") and *reasoning ability* ("rationality", or "honesty" in the sense of correctly reporting implications of beliefs). Our goal is a dataset that clearly distinguishes these two notions. Our approach is to leverage and extend a collection of human-annotated *entailment trees*, engineered to express both good and bad chains of reasoning, and using a mixture of true and false facts, in particular including counterfactual examples, to avoid belief bias (also known as the "content effect"). The resulting dataset, called BaRDa, contains 3000 entailments (1787 valid, 1213 invalid), using 6681 true and 2319 false statements. Testing on four GPT-series models, GPT3(curie)/GPT3(davinci)/3.5/4, we find factual accuracy (truth) scores of 74.1/80.6/82.6/87.1 and reasoning accuracy scores of 63.1/78.0/71.8/79.2. This shows the clear progression of models towards improved factual accuracy and entailment reasoning, and the dataset provides a new benchmark that more cleanly separates and quantifies these two notions.
Comment: Added note about how dataset sampling was performed
Published: 2023

12. Digital Socrates: Evaluating LLMs through Explanation Critiques

Author: Gu, Yuling, Tafjord, Oyvind, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While LLMs can provide reasoned explanations along with their answers, the nature and quality of those explanations are still poorly understood. In response, our goal is to define a detailed way of characterizing the explanation capabilities of modern models and to create a nuanced, interpretable explanation evaluation tool that can generate such characterizations automatically, without relying on expensive API calls or human annotations. Our approach is to (a) define the new task of explanation critiquing - identifying and categorizing any main flaw in an explanation and providing suggestions to address the flaw, (b) create a sizeable, human-verified dataset for this task, and (c) train an open-source, automatic critique model (called Digital Socrates) using this data. Through quantitative and qualitative analysis, we demonstrate how Digital Socrates is useful for revealing insights about student models by examining their reasoning chains, and how it can provide high-quality, nuanced, automatic evaluation of those model explanations for the first time. Digital Socrates thus fills an important gap in evaluation tools for understanding and improving the explanation behavior of models.
Comment: ACL 2024
Published: 2023

13. CLIN: A Continually Learning Language Agent for Rapid Task Adaptation and Generalization

Author: Majumder, Bodhisattwa Prasad, Mishra, Bhavana Dalvi, Jansen, Peter, Tafjord, Oyvind, Tandon, Niket, Zhang, Li, Callison-Burch, Chris, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: Language agents have shown some ability to interact with an external environment, e.g., a virtual world such as ScienceWorld, to perform complex tasks, e.g., growing a plant, without the startup costs of reinforcement learning. However, despite their zero-shot capabilities, these agents to date do not continually improve over time beyond performance refinement on a specific task. Here we present CLIN, the first language-based agent to achieve this, so that it continually improves over multiple trials, including when both the environment and task are varied, and without requiring parameter updates. Our approach is to use a persistent, dynamic, textual memory centered on causal abstractions (rather than general "helpful hints") that is regularly updated after each trial so that the agent gradually learns useful knowledge for new trials. In the ScienceWorld benchmark, CLIN is able to continually improve on repeated trials on the same task and environment, outperforming state-of-the-art reflective language agents like Reflexion by 23 absolute points. CLIN can also transfer its learning to new environments (or new tasks), improving its zero-shot performance by 4 points (13 for new tasks) and can further improve performance there through continual memory updates, enhancing performance by an additional 17 points (7 for new tasks). This suggests a new architecture for agents built on frozen models that can still continually and rapidly improve over time.
Comment: Project page: https://allenai.github.io/clin/
Published: 2023

14. Increasing Probability Mass on Answer Choices Does Not Always Improve Accuracy

Author: Wiegreffe, Sarah, Finlayson, Matthew, Tafjord, Oyvind, Clark, Peter, and Sabharwal, Ashish
Subjects: Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: When pretrained language models (LMs) are applied to discriminative tasks such as multiple-choice questions, they place probability mass on vocabulary tokens that aren't among the given answer choices. Spreading probability mass across multiple surface forms with identical meaning (such as "bath" and "bathtub") is thought to cause an underestimation of a model's true performance, referred to as the "surface form competition" (SFC) hypothesis. This has motivated the introduction of various probability normalization methods. However, many core questions remain unanswered. How do we measure SFC? Are there direct ways of reducing it, and does doing so improve task performance? We propose a mathematical formalism for SFC which allows us to quantify and bound its impact for the first time. We identify a simple method for reducing it -- namely, increasing probability mass on the given answer choices by a) including them in the prompt and b) using in-context learning with even just one example. We show this method eliminates the impact of SFC in the majority of instances. Our experiments on three diverse datasets and six LMs reveal several additional surprising findings. For example, both normalization and prompting methods for reducing SFC can be ineffective or even detrimental to task performance for some LMs. We conclude with practical insights for effectively prompting LMs for multiple-choice tasks.
Comment: EMNLP 2023
Published: 2023

15. Language Models with Rationality

Author: Kassner, Nora, Tafjord, Oyvind, Sabharwal, Ashish, Richardson, Kyle, Schuetze, Hinrich, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: While large language models (LLMs) are proficient at question-answering (QA), it is not always clear how (or even if) an answer follows from their latent "beliefs". This lack of interpretability is a growing impediment to widespread use of LLMs. To address this, our goals are to make model beliefs and their inferential relationships explicit, and to resolve inconsistencies that may exist, so that answers are supported by interpretable chains of reasoning drawn from a consistent network of beliefs. Our approach, which we call REFLEX, is to add a rational, self-reflecting layer on top of the LLM. First, given a question, we construct a belief graph using a backward-chaining process to materialize relevant model beliefs (including beliefs about answer candidates) and their inferential relationships. Second, we identify and minimize contradictions in that graph using a formal constraint reasoner. We find that REFLEX significantly improves consistency (by 8%-11% absolute) without harming overall answer accuracy, resulting in answers supported by faithful chains of reasoning drawn from a more consistent belief system. This suggests a new style of system architecture in which an LLM extended with a rational layer can provide an interpretable window into system beliefs, add a systematic reasoning capability, and repair latent inconsistencies present in the LLM.
Published: 2023

16. Lila: A Unified Benchmark for Mathematical Reasoning

Author: Mishra, Swaroop, Finlayson, Matthew, Lu, Pan, Tang, Leonard, Welleck, Sean, Baral, Chitta, Rajpurohit, Tanmay, Tafjord, Oyvind, Sabharwal, Ashish, Clark, Peter, and Kalyan, Ashwin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, 68T50, I.2.7
Abstract: Mathematical reasoning skills are essential for general-purpose intelligent systems to perform tasks from grocery shopping to climate modeling. Towards evaluating and improving AI systems in this domain, we propose LILA, a unified mathematical reasoning benchmark consisting of 23 diverse tasks along four dimensions: (i) mathematical abilities, e.g., arithmetic, calculus; (ii) language format, e.g., question-answering, fill-in-the-blanks; (iii) language diversity, e.g., no language, simple language; (iv) external knowledge, e.g., commonsense, physics. We construct our benchmark by extending 20 datasets, collecting task instructions and solutions in the form of Python programs, thereby obtaining explainable solutions in addition to the correct answer. We additionally introduce two evaluation datasets to measure out-of-distribution performance and robustness to language perturbation. Finally, we introduce BHASKARA, a general-purpose mathematical reasoning model trained on LILA. Importantly, we find that multi-tasking leads to significant improvements (average relative improvement of 21.83% F1 score vs. single-task models), while the best performing model only obtains 60.40%, indicating the room for improvement in general mathematical reasoning and understanding.
Comment: EMNLP 2022
Published: 2022

17. Entailer: Answering Questions with Faithful and Truthful Chains of Reasoning

Author: Tafjord, Oyvind, Mishra, Bhavana Dalvi, and Clark, Peter
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Our goal is a question-answering (QA) system that can show how its answers are implied by its own internal beliefs via a systematic chain of reasoning. Such a capability would allow better understanding of why a model produced the answer it did. Our approach is to recursively combine a trained backward-chaining model, capable of generating a set of premises entailing an answer hypothesis, with a verifier that checks that the model itself believes those premises (and the entailment itself) through self-querying. To our knowledge, this is the first system to generate multistep chains that are both faithful (the answer follows from the reasoning) and truthful (the chain reflects the system's own internal beliefs). In evaluation using two different datasets, users judge that a majority (70%+) of generated chains clearly show how an answer follows from a set of facts - substantially better than a high-performance baseline - while preserving answer accuracy. By materializing model beliefs that systematically support an answer, new opportunities arise for understanding the model's system of belief, and diagnosing and correcting its misunderstandings when an answer is wrong.
Comment: accepted at EMNLP 2022. arXiv admin note: substantial text overlap with arXiv:2204.13074
Published: 2022

18. Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering

Author: Lu, Pan, Mishra, Swaroop, Xia, Tony, Qiu, Liang, Chang, Kai-Wei, Zhu, Song-Chun, Tafjord, Oyvind, Clark, Peter, and Kalyan, Ashwin
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Machine Learning, Computer Science - Multimedia
Abstract: When answering a question, humans utilize the information available across different modalities to synthesize a consistent and complete chain of thought (CoT). This process is normally a black box in the case of deep learning models like large-scale language models. Recently, science question benchmarks have been used to diagnose the multi-hop reasoning ability and interpretability of an AI system. However, existing datasets fail to provide annotations for the answers, or are restricted to the textual-only modality, small scales, and limited domain diversity. To this end, we present Science Question Answering (ScienceQA), a new benchmark that consists of ~21k multimodal multiple choice questions with a diverse set of science topics and annotations of their answers with corresponding lectures and explanations. We further design language models to learn to generate lectures and explanations as the chain of thought (CoT) to mimic the multi-hop reasoning process when answering ScienceQA questions. ScienceQA demonstrates the utility of CoT in language models, as CoT improves the question answering performance by 1.20% in few-shot GPT-3 and 3.99% in fine-tuned UnifiedQA. We also explore the upper bound for models to leverage explanations by feeding those in the input; we observe that it improves the few-shot performance of GPT-3 by 18.96%. Our analysis further shows that language models, similar to humans, benefit from explanations to learn from fewer data and achieve the same performance with just 40% of the data. The data and code are available at https://scienceqa.github.io.
Comment: Accepted to NeurIPS 2022. 22 pages, 17 figures, 9 tables. Project: https://scienceqa.github.io
Published: 2022

19. Towards Teachable Reasoning Systems: Using a Dynamic Memory of User Feedback for Continual System Improvement

Author: Mishra, Bhavana Dalvi, Tafjord, Oyvind, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Our goal is a teachable reasoning system for question-answering (QA), where a user can interact with faithful answer explanations, and correct its errors so that the system improves over time. Our approach is to augment a QA model with a dynamic memory of user feedback, containing user-supplied corrections to erroneous model beliefs that users identify during interaction. Retrievals from memory are used as additional context for QA, to help avoid previous mistakes in similar new situations - a novel application of memory-based continuous learning. With simulated feedback, we find that our system (called TeachMe) continually improves with time, and without model retraining, requiring feedback on only 25% of training examples to reach within 1% of the upper-bound (feedback on all examples). In experiments with real users, we observe a similar trend, with performance improving by over 15% on a hidden test set after teaching. This suggests new opportunities for using frozen language models in an interactive setting where users can inspect, debug, and correct the model's beliefs, leading to improved system performance over time.
Comment: accepted at EMNLP 2022
Published: 2022

20. BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief

Author: Kassner, Nora, Tafjord, Oyvind, Schütze, Hinrich, and Clark, Peter
Subjects: Computer Science - Computation and Language
Abstract: Although pretrained language models (PTLMs) contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after specialized training. As a result, it can be hard to identify what the model actually "believes" about the world, making it susceptible to inconsistent behavior and simple errors. Our goal is to reduce these problems. Our approach is to embed a PTLM in a broader system that also includes an evolving, symbolic memory of beliefs -- a BeliefBank -- that records but then may modify the raw PTLM answers. We describe two mechanisms to improve belief consistency in the overall system. First, a reasoning component -- a weighted MaxSAT solver -- revises beliefs that significantly clash with others. Second, a feedback component issues future queries to the PTLM using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms result in more consistent beliefs in the overall system, improving both the accuracy and consistency of its answers over time. This is significant as it is a first step towards PTLM-based architectures with a systematic notion of belief, enabling them to construct a more coherent picture of the world, and improve over time without model retraining., Comment: EMNLP 2021 Camera Ready. arXiv admin note: substantial text overlap with arXiv:2104.08401
Published: 2021

21. 'Let Your Characters Tell Their Story': A Dataset for Character-Centric Narrative Understanding

Author: Brahman, Faeze, Huang, Meng, Tafjord, Oyvind, Zhao, Chao, Sachan, Mrinmaya, and Chaturvedi, Snigdha
Subjects: Computer Science - Computation and Language
Abstract: When reading a literary piece, readers often make inferences about various characters' roles, personalities, relationships, intents, actions, etc. While humans can readily draw upon their past experiences to build such a character-centric view of the narrative, understanding characters in narratives can be a challenging task for machines. To encourage research in this field of character-centric narrative understanding, we present LiSCU -- a new dataset of literary pieces and their summaries paired with descriptions of characters that appear in them. We also introduce two new tasks on LiSCU: Character Identification and Character Description Generation. Our experiments with several pre-trained language models adapted for these tasks demonstrate that there is a need for better models of narrative comprehension., Comment: Accepted to Findings of EMNLP 2021
Published: 2021

22. General-Purpose Question-Answering with Macaw

Author: Tafjord, Oyvind and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Despite the successes of pretrained language models, there are still few high-quality, general-purpose QA systems that are freely available. In response, we present Macaw, a versatile, generative question-answering (QA) system that we are making available to the community. Macaw is built on UnifiedQA, itself built on T5, and exhibits strong performance, zero-shot, on a wide variety of topics, including outperforming GPT-3 by over 10% (absolute) on Challenge300, a suite of 300 challenge questions, despite being an order of magnitude smaller (11 billion vs. 175 billion parameters). In addition, Macaw allows different permutations ("angles") of its inputs and outputs to be used, for example Macaw can take a question and produce an answer; or take an answer and produce a question; or take an answer and question, and produce multiple-choice options. We describe the system, and illustrate a variety of question types where it produces surprisingly good answers, well outside the training setup. We also identify question classes where it still appears to struggle, offering insights into the limitations of pretrained language models. Macaw is freely available, and we hope that it proves useful to the community. Macaw is available at https://github.com/allenai/macaw
Published: 2021

23. Explaining Answers with Entailment Trees

Author: Dalvi, Bhavana, Jansen, Peter, Tafjord, Oyvind, Xie, Zhengnan, Smith, Hannah, Pipatanangkura, Leighanna, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Our goal, in the context of open-domain textual question-answering (QA), is to explain answers by showing the line of reasoning from what is known to the answer, rather than simply showing a fragment of textual evidence (a "rationale"). If this could be done, new opportunities for understanding and debugging the system's reasoning become possible. Our approach is to generate explanations in the form of entailment trees, namely a tree of multipremise entailment steps from facts that are known, through intermediate conclusions, to the hypothesis of interest (namely the question + answer). To train a model with this skill, we created ENTAILMENTBANK, the first dataset to contain multistep entailment trees. Given a hypothesis (question + answer), we define three increasingly difficult explanation tasks: generate a valid entailment tree given (a) all relevant sentences (b) all relevant and some irrelevant sentences, or (c) a corpus. We show that a strong language model can partially solve these tasks, in particular when the relevant sentences are included in the input (e.g., 35% of trees for (a) are perfect), and with indications of generalization to other domains. This work is significant as it provides a new type of dataset (multistep entailments) and baselines, offering a new avenue for the community to generate richer, more systematic explanations., Comment: published in EMNLP 2021
Published: 2021

24. Enriching a Model's Notion of Belief using a Persistent Memory

Author: Kassner, Nora, Tafjord, Oyvind, Schütze, Hinrich, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are more globally consistent and accurate in their answers. Our approach is to add a memory component -- a BeliefBank -- that records a model's answers, and two mechanisms that use it to improve consistency among beliefs. First, a reasoning component -- a weighted SAT solver -- improves consistency by flipping answers that significantly clash with others. Second, a feedback component re-queries the model but using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms improve both accuracy and consistency. This is significant as it is a first step towards endowing models with an evolving memory, allowing them to construct a more coherent picture of the world., Comment: This is an old and now obsolete draft. See arXiv:2109.14723 ("BeliefBank: Adding Memory to a Pre-Trained Language Model for a Systematic Notion of Belief") for the final paper
Published: 2021

25. Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge

Author: Bhakthavatsalam, Sumithra, Khashabi, Daniel, Khot, Tushar, Mishra, Bhavana Dalvi, Richardson, Kyle, Sabharwal, Ashish, Schoenick, Carissa, Tafjord, Oyvind, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: We present the ARC-DA dataset, a direct-answer ("open response", "freeform") version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex question-answering by the community. ARC-DA is available at https://allenai.org/data/arc-da
Published: 2021

26. ProofWriter: Generating Implications, Proofs, and Abductive Statements over Natural Language

Author: Tafjord, Oyvind, Mishra, Bhavana Dalvi, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Transformers have been shown to emulate logical deduction over natural language theories (logical rules expressed in natural language), reliably assigning true/false labels to candidate implications. However, their ability to generate implications of a theory has not yet been demonstrated, and methods for reconstructing proofs of answers are imperfect. In this work we show that a generative model, called ProofWriter, can reliably generate both implications of a theory and the natural language proof(s) that support them. In particular, iterating a 1-step implication generator results in proofs that are highly reliable, and represent actual model decisions (rather than post-hoc rationalizations). On the RuleTaker dataset, the accuracy of ProofWriter's proofs exceeds previous methods by +9% absolute, and in a way that generalizes to proof depths unseen in training and on out-of-domain problems. We also show that generative techniques can perform a type of abduction with high precision: Given a theory and an unprovable conclusion, identify a missing fact that allows the conclusion to be proved, along with a proof. These results significantly improve the viability of neural methods for systematically reasoning over natural language., Comment: Findings of ACL 2021
Published: 2020

27. Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge

Author: Talmor, Alon, Tafjord, Oyvind, Clark, Peter, Goldberg, Yoav, and Berant, Jonathan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: To what extent can a neural network systematically reason over symbolic facts? Evidence suggests that large pre-trained language models (LMs) acquire some reasoning capacity, but this ability is difficult to control. Recently, it has been shown that Transformer-based models succeed in consistent reasoning over explicit symbolic facts, under a "closed-world" assumption. However, in an open-domain setup, it is desirable to tap into the vast reservoir of implicit knowledge already encoded in the parameters of pre-trained LMs. In this work, we provide a first demonstration that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements. To do this, we describe a procedure for automatically generating datasets that teach a model new reasoning skills, and demonstrate that models learn to effectively perform inference which involves implicit taxonomic and world knowledge, chaining and counting. Finally, we show that "teaching" models to reason generalizes beyond the training distribution: they successfully compose the usage of multiple reasoning skills in single examples. Our work paves a path towards open-domain systems that constantly improve by interacting with users who can instantly correct a model by adding simple natural language statements., Comment: Presented as Spotlight at NeurIPS 2020
Published: 2020

28. UnifiedQA: Crossing Format Boundaries With a Single QA System

Author: Khashabi, Daniel, Min, Sewon, Khot, Tushar, Sabharwal, Ashish, Tafjord, Oyvind, Clark, Peter, and Hajishirzi, Hannaneh
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UnifiedQA, that performs surprisingly well across 17 QA datasets spanning 4 diverse formats. UnifiedQA performs on par with 9 different models that were trained on individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UnifiedQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 6 datasets, establishing UnifiedQA as a strong starting point for building QA systems., Comment: EMNLP 2020 (Findings)
Published: 2020

29. 'You are grounded!': Latent Name Artifacts in Pre-trained Language Models

Author: Shwartz, Vered, Rudinger, Rachel, and Tafjord, Oyvind
Subjects: Computer Science - Computation and Language
Abstract: Pre-trained language models (LMs) may perpetuate biases originating in their training corpus to downstream models. We focus on artifacts associated with the representation of given names (e.g., Donald), which, depending on the corpus, may be associated with specific entities, as indicated by next token prediction (e.g., Trump). While helpful in some contexts, grounding happens also in under-specified or inappropriate contexts. For example, endings generated for "Donald is a" substantially differ from those of other names, and often have more-than-average negative sentiment. We demonstrate the potential effect on downstream tasks with reading comprehension probes where name perturbation changes the model answers. As a silver lining, our experiments suggest that additional pre-training on different corpora may mitigate this bias., Comment: EMNLP 2020
Published: 2020

30. Transformers as Soft Reasoners over Language

Author: Clark, Peter, Tafjord, Oyvind, and Richardson, Kyle
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Beginning with McCarthy's Advice Taker (1959), AI has pursued the goal of providing a system with explicit, general knowledge and having the system reason over that knowledge. However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research. This paper investigates a modern approach to this problem where the facts and rules are provided as natural language sentences, thus bypassing a formal representation. We train transformers to reason (or emulate reasoning) over these sentences using synthetically generated data. Our models, which we call RuleTakers, provide the first empirical demonstration that this kind of soft reasoning over language is learnable, can achieve high (99%) accuracy, and generalizes to test data requiring substantially deeper chaining than seen during training (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as they suggest a new role for transformers, namely as limited "soft theorem provers" operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering., Comment: IJCAI 2020
Published: 2020

31. SUPP.AI: Finding Evidence for Supplement-Drug Interactions

Author: Wang, Lucy Lu, Tafjord, Oyvind, Cohan, Arman, Jain, Sarthak, Skjonsberg, Sam, Schoenick, Carissa, Botner, Nick, and Ammar, Waleed
Subjects: Computer Science - Computation and Language
Abstract: Dietary supplements are used by a large portion of the population, but information on their pharmacologic interactions is incomplete. To address this challenge, we present SUPP.AI, an application for browsing evidence of supplement-drug interactions (SDIs) extracted from the biomedical literature. We train a model to automatically extract supplement information and identify such interactions from the scientific literature. To address the lack of labeled data for SDI identification, we use labels of the closely related task of identifying drug-drug interactions (DDIs) for supervision. We fine-tune the contextualized word representations of the RoBERTa language model using labeled DDI data, and apply the fine-tuned model to identify supplement interactions. We extract 195k evidence sentences from 22M articles (P=0.82, R=0.58, F1=0.68) for 60k interactions. We create the SUPP.AI application for users to search evidence sentences extracted by our model. SUPP.AI is an attempt to close the information gap on dietary supplements by making up-to-date evidence on SDIs more discoverable for researchers, clinicians, and consumers., Comment: ACL Demo 2020
Published: 2019

32. QuaRTz: An Open-Domain Dataset of Qualitative Relationship Questions

Author: Tafjord, Oyvind, Gardner, Matt, Lin, Kevin, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: We introduce the first open-domain dataset, called QuaRTz, for reasoning about textual qualitative relationships. QuaRTz contains general qualitative statements, e.g., "A sunscreen with a higher SPF protects the skin longer.", twinned with 3864 crowdsourced situated questions, e.g., "Billy is wearing sunscreen with a lower SPF than Lucy. Who will be best protected from the sun?", plus annotations of the properties being compared. Unlike previous datasets, the general knowledge is textual and not tied to a fixed set of relationships, and tests a system's ability to comprehend and apply textual qualitative knowledge in a novel setting. We find state-of-the-art results are substantially (20%) below human performance, presenting an open challenge to the NLP community., Comment: EMNLP'19
Published: 2019

33. From 'F' to 'A' on the N.Y. Regents Science Exams: An Overview of the Aristo Project

Author: Clark, Peter, Etzioni, Oren, Khashabi, Daniel, Khot, Tushar, Mishra, Bhavana Dalvi, Richardson, Kyle, Sabharwal, Ashish, Schoenick, Carissa, Tafjord, Oyvind, Tandon, Niket, Bhakthavatsalam, Sumithra, Groeneveld, Dirk, Guerquin, Michal, and Schmitz, Michael
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: AI has achieved remarkable mastery over games such as Chess, Go, and Poker, and even Jeopardy, but the rich variety of standardized exams has remained a landmark challenge. Even in 2016, the best AI system achieved merely 59.3% on an 8th Grade science exam challenge. This paper reports unprecedented success on the Grade 8 New York Regents Science Exam, where for the first time a system scores more than 90% on the exam's non-diagram, multiple choice (NDMC) questions. In addition, our Aristo system, building upon the success of recent language models, exceeded 83% on the corresponding Grade 12 Science Exam NDMC questions. The results, on unseen test questions, are robust across different test years and different variations of this kind of test. They demonstrate that modern NLP methods can result in mastery on this task. While not a full solution to general question-answering (the questions are multiple choice, and the domain is restricted to 8th Grade science), it represents a significant milestone for the field., Comment: AI Magazine 41 (4) Winter 2020. New analysis sections added
Published: 2019

34. Reasoning Over Paragraph Effects in Situations

Author: Lin, Kevin, Tafjord, Oyvind, Clark, Peter, and Gardner, Matt
Subjects: Computer Science - Computation and Language
Abstract: A key component of successfully reading a passage of text is the ability to apply knowledge gained from the passage to a new situation. In order to facilitate progress on this kind of reading, we present ROPES, a challenging benchmark for reading comprehension targeting Reasoning Over Paragraph Effects in Situations. We target expository language describing causes and effects (e.g., "animal pollinators increase efficiency of fertilization in flowers"), as they have clear implications for new situations. A system is presented with a background passage containing at least one of these relations, a novel situation that uses this background, and questions that require reasoning about effects of the relationships in the background passage in the context of the situation. We collect background passages from science textbooks and Wikipedia that contain such phenomena, and ask crowd workers to author situations, questions, and answers, resulting in a 14,322 question dataset. We analyze the challenges of this task and evaluate the performance of state-of-the-art reading comprehension models. The best model performs only slightly better than randomly guessing an answer of the correct type, at 61.6% F1, well below the human performance of 89.0%.
Published: 2019

35. Multi-class Hierarchical Question Classification for Multiple Choice Science Exams

Author: Xu, Dongfang, Jansen, Peter, Martin, Jaycie, Xie, Zhengnan, Yadav, Vikas, Madabushi, Harish Tayyar, Tafjord, Oyvind, and Clark, Peter
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Prior work has demonstrated that question classification (QC), recognizing the problem domain of a question, can help answer it more accurately. However, developing strong QC algorithms has been hindered by the limited size and complexity of annotated data available. To address this, we present the largest challenge dataset for QC, containing 7,787 science exam questions paired with detailed classification labels from a fine-grained hierarchical taxonomy of 406 problem domains. We then show that a BERT-based model trained on this dataset achieves a large (+0.12 MAP) gain compared with previous methods, while also achieving state-of-the-art performance on benchmark open-domain and biomedical QC datasets. Finally, we show that using this model's predictions of question topic significantly improves the accuracy of a question answering system by +1.7% P@1, with substantial future gains possible as QC performance improves.
Published: 2019

36. Declarative Question Answering over Knowledge Bases containing Natural Language Text with Answer Set Programming

Author: Mitra, Arindam, Clark, Peter, Tafjord, Oyvind, and Baral, Chitta
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: While in recent years machine learning (ML) based approaches have been the popular approach in developing end-to-end question answering systems, such systems often struggle when additional knowledge is needed to correctly answer the questions. Proposed alternatives involve translating the question and the natural language text to a logical representation and then using logical reasoning. However, this alternative falters when the size of the text gets bigger. To address this we propose an approach that does logical reasoning over premises written in natural language text. The proposed method uses recent features of Answer Set Programming (ASP) to call external NLP modules (which may be based on ML) which perform simple textual entailment. To test our approach we develop a corpus based on the life cycle questions and show that our system achieves up to 18% performance gain when compared to standard MCQ solvers.
Published: 2019

37. QuaRel: A Dataset and Models for Answering Questions about Qualitative Relationships

Author: Tafjord, Oyvind, Clark, Peter, Gardner, Matt, Yih, Wen-tau, and Sabharwal, Ashish
Subjects: Computer Science - Computation and Language
Abstract: Many natural language questions require recognizing and reasoning with qualitative relationships (e.g., in science, economics, and medicine), but are challenging to answer with corpus-based methods. Qualitative modeling provides tools that support such reasoning, but the semantic parsing task of mapping questions into those models has formidable challenges. We present QuaRel, a dataset of diverse story questions involving qualitative relationships that characterize these challenges, and techniques that begin to address them. The dataset has 2771 questions relating 19 different types of quantities. For example, "Jenny observes that the robot vacuum cleaner moves slower on the living room carpet than on the bedroom carpet. Which carpet has more friction?" We contribute (1) a simple and flexible conceptual framework for representing these kinds of questions; (2) the QuaRel dataset, including logical forms, exemplifying the parsing challenges; and (3) two novel models for this task, built as extensions of type-constrained semantic parsing. The first of these models (called QuaSP+) significantly outperforms off-the-shelf tools on QuaRel. The second (QuaSP+Zero) demonstrates zero-shot capability, i.e., the ability to handle new qualitative relationships without requiring additional training data, something not possible with previous models. This work thus makes inroads into answering complex, qualitative questions that require reasoning, and scaling to new relationships at low cost. The dataset and models are available at http://data.allenai.org/quarel., Comment: 9 pages, AAAI 2019
Published: 2018

38. AllenNLP: A Deep Semantic Natural Language Processing Platform

Author: Gardner, Matt, Grus, Joel, Neumann, Mark, Tafjord, Oyvind, Dasigi, Pradeep, Liu, Nelson, Peters, Matthew, Schmitz, Michael, and Zettlemoyer, Luke
Subjects: Computer Science - Computation and Language
Abstract: This paper describes AllenNLP, a platform for research on deep learning methods in natural language understanding. AllenNLP is designed to support researchers who want to build novel language understanding models quickly and easily. It is built on top of PyTorch, allowing for dynamic computation graphs, and provides (1) a flexible data API that handles intelligent batching and padding, (2) high-level abstractions for common operations in working with text, and (3) a modular and extensible experiment framework that makes doing good science easy. It also includes reference implementations of high quality approaches for both core semantic problems (e.g. semantic role labeling (Palmer et al., 2005)) and language understanding applications (e.g. machine comprehension (Rajpurkar et al., 2016)). AllenNLP is an ongoing open-source effort maintained by engineers and researchers at the Allen Institute for Artificial Intelligence., Comment: Describes the initial version of AllenNLP. Many features and models have been added since the first release. This is the paper to cite if you use AllenNLP in your research. Updated 5/31/2018 with version accepted to the NLP OSS workshop held at ACL 2018
Published: 2018

39. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Author: Clark, Peter, Cowhey, Isaac, Etzioni, Oren, Khot, Tushar, Sabharwal, Ashish, Schoenick, Carissa, and Tafjord, Oyvind
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community., Comment: 10 pages, 7 tables, 2 figures
Published: 2018

40. Faithful Reasoning over Scientific Claims

Author: Tan, Neşet Özkan, Tandon, Niket, Wadden, David, Tafjord, Oyvind, Gahegan, Mark, and Witbrock, Michael
Published: 2024

41. Semantic Parsing to Probabilistic Programs for Situated Question Answering

Author: Krishnamurthy, Jayant, Tafjord, Oyvind, and Kembhavi, Aniruddha
Subjects: Computer Science - Computation and Language
Abstract: Situated question answering is the problem of answering questions about an environment such as an image or diagram. This problem requires jointly interpreting a question and an environment using background knowledge to select the correct answer. We present Parsing to Probabilistic Programs (P3), a novel situated question answering model that can use background knowledge and global features of the question/environment interpretation while retaining efficient approximate inference. Our key insight is to treat semantic parses as probabilistic programs that execute nondeterministically and whose possible executions represent environmental uncertainty. We evaluate our approach on a new, publicly-released data set of 5000 science diagram questions, outperforming several competitive classical and neural baselines., Comment: EMNLP 2016, 11 pages
Published: 2016

42. Moving Beyond the Turing Test with the Allen AI Science Challenge

Author: Schoenick, Carissa, Clark, Peter, Tafjord, Oyvind, Turney, Peter, and Etzioni, Oren
Subjects: Computer Science - Artificial Intelligence
Abstract: Given recent successes in AI (e.g., AlphaGo's victory against Lee Sedol in the game of Go), it's become increasingly important to assess: how close are AI systems to human-level intelligence? This paper describes the Allen AI Science Challenge -- an approach towards that goal which led to a unique Kaggle Competition, its results, the lessons learned, and our next steps., Comment: 7 pages
Published: 2016

43. From F to A on the New York Regents Science Exams--An Overview of the Aristo Project

Author: Clark, Peter, Etzioni, Oren, Khashabi, Daniel, Khot, Tushar, Mishra, Bhavana Dalvi, Richardson, Kyle, Sabharwal, Ashish, Schoenick, Carissa, Tafjord, Oyvind, Tandon, Niket, Bhakthavatsalam, Sumithra, Groeneveld, Dirk, Guerquin, Michal, and Schmitz, Michael
Subjects: Artificial intelligence -- Usage, Standardized tests -- Technology application, Natural language interfaces -- Usage, Computational linguistics -- Usage, Language processing -- Usage, Sciences education -- Technology application, Artificial intelligence, Technology application, Business
Abstract: Artificial intelligence has achieved remarkable mastery over games such as Chess, Go, and poker, and even Jeopardy!, but the rich variety of standardized exams has remained a landmark challenge. Even [...]
Published: 2020

44. Superstars and Giant Gravitons

Author: Myers, Robert C. and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We examine a family of BPS solutions of ten-dimensional type IIB supergravity. These solutions asymptotically approach AdS_5 x S^5 and carry internal `angular' momentum on the five-sphere. While a naked singularity appears at the center of the anti-de Sitter space, we show that it has a natural physical interpretation in terms of a collection of giant gravitons. We calculate the distribution of giant gravitons from the dipole field induced in the Ramond-Ramond five-form, and show that these sources account for the entire internal momentum carried by the BPS solutions. Comment: 15 pages, Latex
Published: 2001
Full Text: View/download PDF

45. Fuzzy Funnels: Non-abelian Brane Intersections

Author: Constable, Neil R., Myers, Robert C., and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We discuss dual formulations of D-brane intersections. The duality is between world volume field theories of different dimensionalities which both describe the same D-brane configuration but are valid in complementary regions of parameter space. We discuss the duality in terms of bion configurations involving D-strings orthogonally intersecting both D3-branes and D5-branes. Comment: Talk presented by R. C. Myers at Strings 2001, Mumbai, India. 12 pages
Published: 2001

46. Non-abelian Brane Intersections

Author: Constable, Neil R., Myers, Robert C., and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We study new solutions of the low-energy equations of motion for the non-abelian D-string. We find a "fuzzy funnel" solution consisting of a noncommutative four-sphere geometry which expands along the length of the D-string. We show that this funnel solution has an interpretation as D-strings ending on a set of orthogonal D5-branes. Although not supersymmetric, the system appears to be stable within this framework. We also give a dual description of this configuration as a bion spike in the non-abelian world volume theory of coincident D5-branes. Comment: 33 pages, 2 figures. v2: added refs
Published: 2001
Full Text: View/download PDF

47. SUSY and Goliath

Author: Grisaru, Marcus T., Myers, Robert C., and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We investigate the `giant gravitons' of McGreevy, Susskind and Toumbas [hep-th/0003075]. We demonstrate that these are BPS configurations which preserve precisely the same supersymmetries as a `point-like' graviton. We also show that there exist `dual' giant gravitons consisting of spherical branes expanding into the AdS component of the spacetime. Finally, we discuss the realization of the stringy exclusion principle within this expanded framework. Comment: 25 pages, 10 figures
Published: 2000
Full Text: View/download PDF

48. The Noncommutative Bion Core

Author: Constable, Neil R., Myers, Robert C., and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We examine noncommutative solutions of the nonabelian theory on the world-volume of N coincident D-strings. These solutions can be interpreted in terms of noncommutative geometry as funnels describing the nonabelian D-string expanding out into an orthogonal D3-brane. These configurations are `dual' to the bion solutions in the abelian world-volume theory of the D3-brane. In the latter, a charge N magnetic monopole describes N D-strings attached to the D3-brane with a spike deformation of the world-volume. The noncommutative D-string solutions give a reliable account of physics at the core of the monopole, where the bion description is expected to break down. In the large N limit, we find good agreement between the two points of view, including the energy, couplings to background fields, and the shape of the funnel. We also study fluctuations traveling along the D-string, again obtaining agreement in the large N limit. At finite N, our results give a limit on the number of modes that can travel to infinity along the N D-strings attached to the D3-brane. Comment: 22 pages, refs added
Published: 1999
Full Text: View/download PDF

49. Baryons and Flux Tubes in Confining Gauge Theories from Brane Actions

Author: Callan, Curtis G., Guijosa, Alberto, Savvidy, Konstantin G., and Tafjord, Oyvind
Subjects: High Energy Physics - Theory
Abstract: We study baryon configurations in large N non-supersymmetric SU(N) gauge theories, applying the AdS/CFT correspondence. Using the D5-brane worldvolume theory in the near-horizon geometry of non-extremal D3-branes, we find embeddings which describe baryonic states in three-dimensional QCD. In particular, we construct solutions corresponding to a baryon made of N quarks, and study what happens when some fraction $\nu$ of the total number of quarks are bodily moved to a large spatial separation from the others. The individual clumps of quarks are represented by Born-Infeld string tubes obtained from a D5-brane whose spatial section has topology $R \times S^4$. They are connected by a confining color flux tube, described by a portion of the fivebrane that runs very close and parallel to the horizon. We find that this flux tube has a tension with a nontrivial $\nu$-dependence (not previously obtained by other methods). A similar picture is presented for the four-dimensional case. Comment: LaTeX, 20 pages, 6 eps figures; v2: added reference, corrected numerical error in Eqs. (13) and (23)
Published: 1999
Full Text: View/download PDF

50. A finite cutoff on the string worldsheet?

Author: Periwal, Vipul and Tafjord, Øyvind
Subjects: High Energy Physics - Theory
Abstract: D-brane backgrounds are specified in closed string theories by holes with appropriate mixed Dirichlet and Neumann boundary conditions on the string worldsheet. As presently stated, the prescription defining D-brane backgrounds is such that the Einstein equation is not equivalent to the condition for scale invariance on the string worldsheet. A modified D-brane prescription is found that leads to the desired equivalence, while preserving all known D-brane lore. A possible interpretation is that the worldsheet cutoff is finite. Possible connections to recent work of Maldacena and Strominger, and Gopakumar and Vafa are suggested. Comment: 7 pages, RevTex; v2: typos corrected, superstring calculation included, discussion expanded - to be published in Phys.Rev. D
Published: 1998
Full Text: View/download PDF
