Author: "Roth, Dan" / Publication Year Range: This year - Searchworks@Jio Institute Digital Library Search Results

Your search keyword '"Roth, Dan"' showing total 36 results

Start Over Author "Roth, Dan" Publication Year Range This year

36 results on '"Roth, Dan"'

1. Contextualized Evaluations: Taking the Guesswork Out of Language Model Evaluations

Author: Malaviya, Chaitanya, Chang, Joseph Chee, Roth, Dan, Iyyer, Mohit, Yatskar, Mark, and Lo, Kyle
Subjects: Computer Science - Computation and Language
Abstract: Language model users often issue queries that lack specification, where the context under which a query was issued -- such as the user's identity, the query's intent, and the criteria for a response to be useful -- is not explicit. For instance, a good response to a subjective query like "What book should I read next?" would depend on the user's preferences, and a good response to an open-ended query like "How do antibiotics work against bacteria?" would depend on the user's expertise. This makes evaluation of responses to such queries an ill-posed task, as evaluators may make arbitrary judgments about the response quality. To remedy this, we present contextualized evaluations, a protocol that synthetically constructs context surrounding an underspecified query and provides it during evaluation. We find that the presence of context can 1) alter conclusions drawn from evaluation, even flipping win rates between model pairs, 2) nudge evaluators to make fewer judgments based on surface-level criteria, like style, and 3) provide new insights about model behavior across diverse contexts. Specifically, our procedure uncovers an implicit bias towards WEIRD contexts in models' "default" responses and we find that models are not equally sensitive to following different contexts, even when they are provided in prompts., Comment: Code & data available at https://github.com/allenai/ContextEval
Published: 2024

2. Benchmarking LLM Guardrails in Handling Multilingual Toxicity

Author: Yang, Yahan, Dan, Soham, Roth, Dan, and Lee, Insup
Subjects: Computer Science - Computation and Language
Abstract: With the ubiquity of Large Language Models (LLMs), guardrails have become crucial to detect and defend against toxic content. However, with the increasing pervasiveness of LLMs in multilingual scenarios, their effectiveness in handling multilingual toxic inputs remains unclear. In this work, we introduce a comprehensive multilingual test suite, spanning seven datasets and over ten languages, to benchmark the performance of state-of-the-art guardrails. We also investigates the resilience of guardrails against recent jailbreaking techniques, and assess the impact of in-context safety policies and language resource availability on guardrails' performance. Our findings show that existing guardrails are still ineffective at handling multilingual toxicity and lack robustness against jailbreaking prompts. This work aims to identify the limitations of guardrails and to build a more reliable and trustworthy LLMs in multilingual scenarios.
Published: 2024

3. ReasonAgain: Using Extractable Symbolic Programs to Evaluate Mathematical Reasoning

Author: Yu, Xiaodong, Zhou, Ben, Cheng, Hao, and Roth, Dan
Subjects: Computer Science - Artificial Intelligence
Abstract: Existing math datasets evaluate the reasoning abilities of large language models (LLMs) by either using the final answer or the intermediate reasoning steps derived from static examples. However, the former approach fails to surface model's uses of shortcuts and wrong reasoning while the later poses challenges in accommodating alternative solutions. In this work, we seek to use symbolic programs as a means for automated evaluation if a model can consistently produce correct final answers across various inputs to the program. We begin by extracting programs for popular math datasets (GSM8K and MATH) using GPT4-o. For those executable programs verified using the original input-output pairs, they are found to encapsulate the proper reasoning required to solve the original text questions. We then prompt GPT4-o to generate new questions using alternative input-output pairs based the extracted program. We apply the resulting datasets to evaluate a collection of LLMs. In our experiments, we observe significant accuracy drops using our proposed evaluation compared with original static examples, suggesting the fragility of math reasoning in state-of-the-art LLMs.
Published: 2024

4. Open Domain Question Answering with Conflicting Contexts

Author: Liu, Siyi, Ning, Qiang, Halder, Kishaloy, Xiao, Wei, Qi, Zheng, Htut, Phu Mon, Zhang, Yi, John, Neha Anna, Min, Bonan, Benajiba, Yassine, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Open domain question answering systems frequently rely on information retrieved from large collections of text (such as the Web) to answer questions. However, such collections of text often contain conflicting information, and indiscriminately depending on this information may result in untruthful and inaccurate answers. To understand the gravity of this problem, we collect a human-annotated dataset, Question Answering with Conflicting Contexts (QACC), and find that as much as 25% of unambiguous, open domain questions can lead to conflicting contexts when retrieved using Google Search. We evaluate and benchmark three powerful Large Language Models (LLMs) with our dataset QACC and demonstrate their limitations in effectively addressing questions with conflicting information. To explore how humans reason through conflicting contexts, we request our annotators to provide explanations for their selections of correct answers. We demonstrate that by finetuning LLMs to explain their answers, we can introduce richer information into their training that guide them through the process of reasoning with conflicting contexts.
Published: 2024

5. GIVE: Structured Reasoning with Knowledge Graph Inspired Veracity Extrapolation

Author: He, Jiashu, Ma, Mingyu Derek, Fan, Jinxuan, Roth, Dan, Wang, Wei, and Ribeiro, Alejandro
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Existing retrieval-based reasoning approaches for large language models (LLMs) heavily rely on the density and quality of the non-parametric knowledge source to provide domain knowledge and explicit reasoning chain. However, inclusive knowledge sources are expensive and sometimes infeasible to build for scientific or corner domains. To tackle the challenges, we introduce Graph Inspired Veracity Extrapolation (GIVE), a novel reasoning framework that integrates the parametric and non-parametric memories to enhance both knowledge retrieval and faithful reasoning processes on very sparse knowledge graphs. By leveraging the external structured knowledge to inspire LLM to model the interconnections among relevant concepts, our method facilitates a more logical and step-wise reasoning approach akin to experts' problem-solving, rather than gold answer retrieval. Specifically, the framework prompts LLMs to decompose the query into crucial concepts and attributes, construct entity groups with relevant entities, and build an augmented reasoning chain by probing potential relationships among node pairs across these entity groups. Our method incorporates both factual and extrapolated linkages to enable comprehensive understanding and response generation. Extensive experiments on reasoning-intense benchmarks on biomedical and commonsense QA demonstrate the effectiveness of our proposed method. Specifically, GIVE enables GPT3.5-turbo to outperform advanced models like GPT4 without any additional training cost, thereby underscoring the efficacy of integrating structured information and internal reasoning ability of LLMs for tackling specialized tasks with limited external resources.
Published: 2024

6. Beyond correlation: The impact of human uncertainty in measuring the effectiveness of automatic evaluation and LLM-as-a-judge

Author: Elangovan, Aparna, Ko, Jongwoo, Xu, Lei, Elyasi, Mahsa, Liu, Ling, Bodapati, Sravan, and Roth, Dan
Subjects: Computer Science - Human-Computer Interaction, Computer Science - Artificial Intelligence
Abstract: The effectiveness of automatic evaluation of generative models is typically measured by comparing it to human evaluation using correlation metrics. However, metrics like Krippendorff's $\alpha$ and Randolph's $\kappa$, originally designed to measure the reliability of human labeling, make assumptions about human behavior and the labeling process. In this paper, we show how *relying on a single aggregate correlation score* can obscure fundamental differences between human behavior and automatic evaluation methods, including LLM-as-a-Judge. Specifically, we demonstrate that when the proportion of samples with variation or uncertainty in human labels (gathered during human evaluation) is relatively high, machine labels (generated by automatic evaluation methods) may superficially appear to have similar or better correlation with the human majority label compared to human-to-human (HH) correlation. This can create the illusion that automatic evaluation approximates the human majority label. However, as the proportion of samples with consistent human labels increases, the correlation between machine and human labels fall well below HH correlation. Based on these findings, we first propose stratifying results by human label uncertainty to provide a more robust analysis of automatic evaluation performance. Second, recognizing that uncertainty and variation are inherent in perception-based human evaluations, such as those involving attitudes or preferences, we introduce a new metric - *binned Jensen-Shannon Divergence for perception* for such scenarios to better measure the effectiveness of automatic evaluations. Third, we present visualization techniques -- *perception charts*, to compare the strengths and limitations of automatic evaluation and to contextualize correlation measures appropriately
Published: 2024

7. Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

Author: Ou, Tianyue, Xu, Frank F., Madaan, Aman, Liu, Jiarui, Lo, Robert, Sridhar, Abishek, Sengupta, Sudipta, Roth, Dan, Neubig, Graham, and Zhou, Shuyan
Subjects: Computer Science - Artificial Intelligence
Abstract: LLMs can now act as autonomous agents that interact with digital environments and complete specific objectives (e.g., arranging an online meeting). However, accuracy is still far from satisfactory, partly due to a lack of large-scale, direct demonstrations for digital tasks. Obtaining supervised data from humans is costly, and automatic data collection through exploration or reinforcement learning relies on complex environmental and content setup, resulting in datasets that lack comprehensive coverage of various scenarios. On the other hand, there is abundant knowledge that may indirectly assist task completion, such as online tutorials that were created for human consumption. In this work, we present Synatra, an approach that effectively transforms this indirect knowledge into direct supervision at scale. We define different types of indirect knowledge, and carefully study the available sources to obtain it, methods to encode the structure of direct demonstrations, and finally methods to transform indirect knowledge into direct demonstrations. We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks Mind2Web, MiniWoB++ and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web. In addition, while synthetic demonstrations prove to be only 3% the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains.
Published: 2024

8. Model Tells Itself Where to Attend: Faithfulness Meets Automatic Attention Steering

Author: Zhang, Qingru, Yu, Xiaodong, Singh, Chandan, Liu, Xiaodong, Liu, Liyuan, Gao, Jianfeng, Zhao, Tuo, Roth, Dan, and Cheng, Hao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Large language models (LLMs) have demonstrated remarkable performance across various real-world tasks. However, they often struggle to fully comprehend and effectively utilize their input contexts, resulting in responses that are unfaithful or hallucinated. This difficulty increases for contexts that are long or contain distracting information, which can divert LLMs from fully capturing essential evidence. To address this issue, many works use prompting to help LLMs utilize contextual information more faithfully. For instance, iterative prompting highlights key information in two steps that first ask the LLM to identify important pieces of context and then derive answers accordingly. However, prompting methods are constrained to highlighting key information implicitly in token space, which is often insufficient to fully steer the model's attention. To improve model faithfulness more reliably, we propose AutoPASTA, a method that automatically identifies key contextual information and explicitly highlights it by steering an LLM's attention scores. Like prompting, AutoPASTA is applied at inference time and does not require changing any model parameters. Our experiments on open-book QA demonstrate that AutoPASTA effectively enables models to grasp essential contextual information, leading to substantially improved model faithfulness and performance, e.g., an average improvement of 7.95% for LLAMA3-70B-Instruct. Code will be publicly available at https://github.com/QingruZhang/AutoPASTA ., Comment: 12 pages, 4 figures
Published: 2024

9. MAPWise: Evaluating Vision-Language Models for Advanced Map Queries

Author: Mukhopadhyay, Srija, Rajgaria, Abhishek, Khatiwada, Prerana, Gupta, Vivek, and Roth, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Graphics, Computer Science - Human-Computer Interaction
Abstract: Vision-language models (VLMs) excel at tasks requiring joint understanding of visual and linguistic information. A particularly promising yet under-explored application for these models lies in answering questions based on various kinds of maps. This study investigates the efficacy of VLMs in answering questions based on choropleth maps, which are widely used for data analysis and representation. To facilitate and encourage research in this area, we introduce a novel map-based question-answering benchmark, consisting of maps from three geographical regions (United States, India, China), each containing 1000 questions. Our benchmark incorporates 43 diverse question templates, requiring nuanced understanding of relative spatial relationships, intricate map features, and complex reasoning. It also includes maps with discrete and continuous values, encompassing variations in color-mapping, category ordering, and stylistic patterns, enabling comprehensive analysis. We evaluate the performance of multiple VLMs on this benchmark, highlighting gaps in their abilities and providing insights for improving such models., Comment: 30 Pages, 46 Tables, 6 Figure
Published: 2024

10. Knowledge-Aware Reasoning over Multimodal Semi-structured Tables

Author: Mathur, Suyash Vardhan, Bafna, Jainit Sushil, Kartik, Kunal, Khandelwal, Harshita, Shrivastava, Manish, Gupta, Vivek, Bansal, Mohit, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition
Abstract: Existing datasets for tabular question answering typically focus exclusively on text within cells. However, real-world data is inherently multimodal, often blending images such as symbols, faces, icons, patterns, and charts with textual content in tables. With the evolution of AI models capable of multimodal reasoning, it is pertinent to assess their efficacy in handling such structured data. This study investigates whether current AI models can perform knowledge-aware reasoning on multimodal structured data. We explore their ability to reason on tables that integrate both images and text, introducing MMTabQA, a new dataset designed for this purpose. Our experiments highlight substantial challenges for current AI models in effectively integrating and interpreting multiple text and image inputs, understanding visual context, and comparing visual content across images. These findings establish our dataset as a robust benchmark for advancing AI's comprehension and capabilities in analyzing multimodal structured data.
Published: 2024

11. Enhancing Temporal Understanding in LLMs for Semi-structured Tables

Author: Deng, Irwin, Dixit, Kushagra, Gupta, Vivek, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Databases, Computer Science - Machine Learning
Abstract: Temporal reasoning over tabular data presents substantial challenges for large language models (LLMs), as evidenced by recent research. In this study, we conduct a comprehensive analysis of temporal datasets to pinpoint the specific limitations of LLMs. Our investigation leads to enhancements in TempTabQA, a dataset specifically designed for tabular temporal question answering. We provide critical insights for improving LLM performance in temporal reasoning tasks with tabular data. Furthermore, we introduce a novel approach, C.L.E.A.R to strengthen LLM capabilities in this domain. Our findings demonstrate that our method significantly improves evidence-based reasoning across various models. Additionally, our experimental results reveal that indirect supervision with auxiliary data substantially boosts model performance in these tasks. This work contributes to a deeper understanding of LLMs' temporal reasoning abilities over tabular data and promotes advancements in their application across diverse fields., Comment: Total Pages 18, Total Tables 6, Total figures 7
Published: 2024

12. Unraveling the Truth: Do VLMs really Understand Charts? A Deep Dive into Consistency and Robustness

Author: Mukhopadhyay, Srija, Qidwai, Adnan, Garimella, Aparna, Ramu, Pritika, Gupta, Vivek, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Human-Computer Interaction, Computer Science - Machine Learning
Abstract: Chart question answering (CQA) is a crucial area of Visual Language Understanding. However, the robustness and consistency of current Visual Language Models (VLMs) in this field remain under-explored. This paper evaluates state-of-the-art VLMs on comprehensive datasets, developed specifically for this study, encompassing diverse question categories and chart formats. We investigate two key aspects: 1) the models' ability to handle varying levels of chart and question complexity, and 2) their robustness across different visual representations of the same underlying data. Our analysis reveals significant performance variations based on question and chart types, highlighting both strengths and weaknesses of current models. Additionally, we identify areas for improvement and propose future research directions to build more robust and reliable CQA systems. This study sheds light on the limitations of current models and paves the way for future advancements in the field., Comment: 22 pages, 9 Tables, 5 figures, 22 examples
Published: 2024

13. NTSEBENCH: Cognitive Reasoning Benchmark for Vision Language Models

Author: Pandya, Pranshu, Talwarr, Agney S, Gupta, Vatsal, Kataria, Tushar, Gupta, Vivek, and Roth, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Information Retrieval
Abstract: Cognitive textual and visual reasoning tasks, such as puzzles, series, and analogies, demand the ability to quickly reason, decipher, and evaluate patterns both textually and spatially. While LLMs and VLMs, through extensive training on large amounts of human-curated data, have attained a high level of pseudo-human intelligence in some common sense reasoning tasks, they still struggle with more complex reasoning tasks that require cognitive understanding. In this work, we introduce a new dataset, NTSEBench, designed to evaluate the cognitive multi-modal reasoning and problem-solving skills of large models. The dataset comprises 2,728 multiple-choice questions comprising of a total of 4,642 images across 26 categories sampled from the NTSE examination conducted nationwide in India, featuring both visual and textual general aptitude questions that do not rely on rote learning. We establish baselines on the dataset using state-of-the-art LLMs and VLMs. To facilitate a comparison between open source and propriety models, we propose four distinct modeling strategies to handle different modalities (text and images) in the dataset instances., Comment: 15 pages, 2 figures, 5 tables
Published: 2024

14. On Characterizing and Mitigating Imbalances in Multi-Instance Partial Label Learning

Author: Wang, Kaifu, Tsamoura, Efthymia, and Roth, Dan
Subjects: Computer Science - Machine Learning, Statistics - Machine Learning
Abstract: *Multi-Instance Partial Label Learning* (MI-PLL) is a weakly-supervised learning setting encompassing *partial label learning*, *latent structural learning*, and *neurosymbolic learning*. Unlike supervised learning, in MI-PLL, the inputs to the classifiers at training-time are tuples of instances $\mathbf{x}$. At the same time, the supervision signal is generated by a function $\sigma$ over the (hidden) gold labels of $\mathbf{x}$. In this work, we make multiple contributions towards addressing a problem that hasn't been studied so far in the context of MI-PLL: that of characterizing and mitigating *learning imbalances*, i.e., major differences in the errors occurring when classifying instances of different classes (aka *class-specific risks*). In terms of theory, we derive class-specific risk bounds for MI-PLL, while making minimal assumptions. Our theory reveals a unique phenomenon: that $\sigma$ can greatly impact learning imbalances. This result is in sharp contrast with previous research on supervised and weakly-supervised learning, which only studies learning imbalances under the prism of data imbalances. On the practical side, we introduce a technique for estimating the marginal of the hidden labels using only MI-PLL data. Then, we introduce algorithms that mitigate imbalances at training- and testing-time, by treating the marginal of the hidden labels as a constraint. We demonstrate the effectiveness of our techniques using strong baselines from neurosymbolic and long-tail learning, suggesting performance improvements of up to 14\%.
Published: 2024

15. H-STAR: LLM-driven Hybrid SQL-Text Adaptive Reasoning on Tables

Author: Abhyankar, Nikhil, Gupta, Vivek, Roth, Dan, and Reddy, Chandan K.
Subjects: Computer Science - Databases, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning
Abstract: Tabular reasoning involves interpreting natural language queries about tabular data, which presents a unique challenge of combining language understanding with structured data analysis. Existing methods employ either textual reasoning, which excels in semantic interpretation but struggles with mathematical operations, or symbolic reasoning, which handles computations well but lacks semantic understanding. This paper introduces a novel algorithm H-STAR that integrates both symbolic and semantic (textual) approaches in a two-stage process to address these limitations. H-STAR employs: (1) step-wise table extraction using `multi-view' column retrieval followed by row extraction, and (2) adaptive reasoning that adapts reasoning strategies based on question types, utilizing semantic reasoning for direct lookup and complex lexical queries while augmenting textual reasoning with symbolic reasoning support for quantitative and logical tasks. Our extensive experiments demonstrate that H-STAR significantly outperforms state-of-the-art methods across three tabular question-answering (QA) and fact-verification datasets, underscoring its effectiveness and efficiency., Comment: 22 pages, 16 tables, 21 figures
Published: 2024

16. FlowVQA: Mapping Multimodal Logic in Visual Question Answering with Flowcharts

Author: Singh, Shubhankar, Chaurasia, Purvi, Varun, Yerram, Pandya, Pranshu, Gupta, Vatsal, Gupta, Vivek, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Computer Vision and Pattern Recognition, Computer Science - Information Retrieval, Computer Science - Machine Learning
Abstract: Existing benchmarks for visual question answering lack in visual grounding and complexity, particularly in evaluating spatial reasoning skills. We introduce FlowVQA, a novel benchmark aimed at assessing the capabilities of visual question-answering multimodal language models in reasoning with flowcharts as visual contexts. FlowVQA comprises 2,272 carefully generated and human-verified flowchart images from three distinct content sources, along with 22,413 diverse question-answer pairs, to test a spectrum of reasoning tasks, including information localization, decision-making, and logical progression. We conduct a thorough baseline evaluation on a suite of both open-source and proprietary multimodal language models using various strategies, followed by an analysis of directional bias. The results underscore the benchmark's potential as a vital tool for advancing the field of multimodal modeling, providing a focused and challenging environment for enhancing model performance in visual and logical reasoning tasks., Comment: Accepted in ACL 2024 (Findings), 21 pages, 7 figures, 9 Tables
Published: 2024

17. FamiCom: Further Demystifying Prompts for Language Models with Task-Agnostic Performance Estimation

Author: Li, Bangzheng, Zhou, Ben, Fu, Xingyu, Wang, Fei, Roth, Dan, and Chen, Muhao
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: Language models have shown impressive in-context-learning capabilities, which allow them to benefit from input prompts and perform better on downstream end tasks. Existing works investigate the mechanisms behind this observation, and propose label-agnostic prompt metrics that can better estimate end-task performances. One popular approach is using perplexity as a way to measure models' familiarity with the prompt. While showing consistent improvements on in-domain tasks, we found that familiarity metrics such as perplexity cannot accurately estimate performance in complicated situations such as task or domain transferring scenarios. In this work, we propose a revised measure called FamiCom, providing a more comprehensive measure for task-agnostic performance estimation. Specifically, FamiCom combines familiarity with \textit{complexity} -- the inherent difficulty of end tasks, which is an important factor missing from current metrics. Experiments show that FamiCom strongly correlates with end-task performances, producing a 0.85 Spearman's correlation, versus 0.43 of familiarity-only ones'. We further apply FamiCom to automatic prompt and demonstration selection, and outperform existing methods and baselines by more than 7.0% in accuracy.
Published: 2024

18. A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners

Author: Jiang, Bowen, Xie, Yangxinyu, Hao, Zhuoqun, Wang, Xiaomeng, Mallick, Tanwi, Su, Weijie J., Taylor, Camillo J., and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: This study introduces a hypothesis-testing framework to assess whether large language models (LLMs) possess genuine reasoning abilities or primarily depend on token bias. We go beyond evaluating LLMs on accuracy; rather, we aim to investigate their token bias in solving logical reasoning tasks. Specifically, we develop carefully controlled synthetic datasets, featuring conjunction fallacy and syllogistic problems. Our framework outlines a list of hypotheses where token biases are readily identifiable, with all null hypotheses assuming genuine reasoning capabilities of LLMs. The findings in this study suggest, with statistical guarantee, that most LLMs still struggle with logical reasoning. While they may perform well on classic problems, their success largely depends on recognizing superficial patterns with strong token bias, thereby raising concerns about their actual reasoning and generalization abilities. Codes and data are open-sourced at https://github.com/bowen-upenn/llm_token_bias., Comment: This paper has been accepted at EMNLP 2024
Published: 2024

19. MuirBench: A Comprehensive Benchmark for Robust Multi-image Understanding

Author: Wang, Fei, Fu, Xingyu, Huang, James Y., Li, Zekun, Liu, Qin, Liu, Xiaogeng, Ma, Mingyu Derek, Xu, Nan, Zhou, Wenxuan, Zhang, Kai, Yan, Tianyi Lorena, Mo, Wenjie Jacky, Liu, Hsiang-Hui, Lu, Pan, Li, Chunyuan, Xiao, Chaowei, Chang, Kai-Wei, Roth, Dan, Zhang, Sheng, Poon, Hoifung, and Chen, Muhao
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We introduce MuirBench, a comprehensive benchmark that focuses on robust multi-image understanding capabilities of multimodal LLMs. MuirBench consists of 12 diverse multi-image tasks (e.g., scene understanding, ordering) that involve 10 categories of multi-image relations (e.g., multiview, temporal relations). Comprising 11,264 images and 2,600 multiple-choice questions, MuirBench is created in a pairwise manner, where each standard instance is paired with an unanswerable variant that has minimal semantic differences, in order for a reliable assessment. Evaluated upon 20 recent multi-modal LLMs, our results reveal that even the best-performing models like GPT-4o and Gemini Pro find it challenging to solve MuirBench, achieving 68.0% and 49.3% in accuracy. Open-source multimodal LLMs trained on single images can hardly generalize to multi-image questions, hovering below 33.3% in accuracy. These results highlight the importance of MuirBench in encouraging the community to develop multimodal LLMs that can look beyond a single image, suggesting potential pathways for future improvements., Comment: typos corrected, references added, Project Page: https://muirbench.github.io/
Published: 2024

20. Visual Sketchpad: Sketching as a Visual Chain of Thought for Multimodal Language Models

Author: Hu, Yushi, Shi, Weijia, Fu, Xingyu, Roth, Dan, Ostendorf, Mari, Zettlemoyer, Luke, Smith, Noah A, and Krishna, Ranjay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Computation and Language
Abstract: Humans draw to facilitate reasoning: we draw auxiliary lines when solving geometry problems; we mark and circle when reasoning on maps; we use sketches to amplify our ideas and relieve our limited-capacity working memory. However, such actions are missing in current multimodal language models (LMs). Current chain-of-thought and tool-use paradigms only use text as intermediate reasoning steps. In this work, we introduce Sketchpad, a framework that gives multimodal LMs a visual sketchpad and tools to draw on the sketchpad. The LM conducts planning and reasoning according to the visual artifacts it has drawn. Different from prior work, which uses text-to-image models to enable LMs to draw, Sketchpad enables LMs to draw with lines, boxes, marks, etc., which is closer to human sketching and better facilitates reasoning. Sketchpad can also use specialist vision models during the sketching process (e.g., draw bounding boxes with object detection models, draw masks with segmentation models), to further enhance visual perception and reasoning. We experiment with a wide range of math tasks (including geometry, functions, graphs, and chess) and complex visual reasoning tasks. Sketchpad substantially improves performance on all tasks over strong base models with no sketching, yielding an average gain of 12.7% on math tasks, and 8.6% on vision tasks. GPT-4o with Sketchpad sets a new state of the art on all tasks, including V*Bench (80.3%), BLINK spatial reasoning (83.9%), and visual correspondence (80.8%). All codes and data are in https://visualsketchpad.github.io/., Comment: Accepted to NeurIPS 2024. Project and codes url: https://visualsketchpad.github.io/
Published: 2024

21. Commonsense-T2I Challenge: Can Text-to-Image Generation Models Understand Commonsense?

Author: Fu, Xingyu, He, Muyu, Lu, Yujie, Wang, William Yang, and Roth, Dan
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We present a novel task and benchmark for evaluating the ability of text-to-image(T2I) generation models to produce images that align with commonsense in real life, which we call Commonsense-T2I. Given two adversarial text prompts containing an identical set of action words with minor differences, such as "a lightbulb without electricity" v.s. "a lightbulb with electricity", we evaluate whether T2I models can conduct visual-commonsense reasoning, e.g. produce images that fit "the lightbulb is unlit" vs. "the lightbulb is lit" correspondingly. Commonsense-T2I presents an adversarial challenge, providing pairwise text prompts along with expected outputs. The dataset is carefully hand-curated by experts and annotated with fine-grained labels, such as commonsense type and likelihood of the expected outputs, to assist analyzing model behavior. We benchmark a variety of state-of-the-art (sota) T2I models and surprisingly find that, there is still a large gap between image synthesis and real life photos--even the DALL-E 3 model could only achieve 48.92% on Commonsense-T2I, and the stable diffusion XL model only achieves 24.92% accuracy. Our experiments show that GPT-enriched prompts cannot solve this challenge, and we include a detailed analysis about possible reasons for such deficiency. We aim for Commonsense-T2I to serve as a high-quality evaluation benchmark for T2I commonsense checking, fostering advancements in real life image generation., Comment: COLM 2024, Project Url: https://zeyofu.github.io/CommonsenseT2I/
Published: 2024

22. ConSiDERS-The-Human Evaluation Framework: Rethinking Human Evaluation for Generative Large Language Models

Author: Elangovan, Aparna, Liu, Ling, Xu, Lei, Bodapati, Sravan, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence
Abstract: In this position paper, we argue that human evaluation of generative large language models (LLMs) should be a multidisciplinary undertaking that draws upon insights from disciplines such as user experience research and human behavioral psychology to ensure that the experimental design and results are reliable. The conclusions from these evaluations, thus, must consider factors such as usability, aesthetics, and cognitive biases. We highlight how cognitive biases can conflate fluent information and truthfulness, and how cognitive uncertainty affects the reliability of rating scores such as Likert. Furthermore, the evaluation should differentiate the capabilities and weaknesses of increasingly powerful large language models -- which requires effective test sets. The scalability of human evaluation is also crucial to wider adoption. Hence, to design an effective human evaluation system in the age of generative NLP, we propose the ConSiDERS-The-Human evaluation framework consisting of 6 pillars -- Consistency, Scoring Criteria, Differentiating, User Experience, Responsible, and Scalability., Comment: Accepted in ACL 2024
Published: 2024
Full Text: View/download PDF

23. Devil's Advocate: Anticipatory Reflection for LLM Agents

Author: Wang, Haoyu, Li, Tao, Deng, Zhiwei, Roth, Dan, and Li, Yang
Subjects: Computer Science - Artificial Intelligence
Abstract: In this work, we introduce a novel approach that equips LLM agents with introspection, enhancing consistency and adaptability in solving complex tasks. Our approach prompts LLM agents to decompose a given task into manageable subtasks (i.e., to make a plan), and to continuously introspect upon the suitability and results of their actions. %; and when necessary, to explore ``the road not taken.'' We implement a three-fold introspective intervention: 1) anticipatory reflection on potential failures and alternative remedy before action execution, 2) post-action alignment with subtask objectives and backtracking with remedy to ensure utmost effort in plan execution, and 3) comprehensive review upon plan completion for future strategy refinement. By deploying and experimenting with this methodology -- a zero-shot approach -- within WebArena for practical tasks in web environments, our agent demonstrates superior performance with a success rate of 23.5% over existing zero-shot methods by 3.5%. The experimental results suggest that our introspection-driven approach not only enhances the agent's ability to navigate unanticipated challenges through a robust mechanism of plan execution, but also improves efficiency by reducing the number of trials and plan revisions by 45% needed to achieve a task., Comment: 13 pages, 6 figures
Published: 2024

24. BIRD: A Trustworthy Bayesian Inference Framework for Large Language Models

Author: Feng, Yu, Zhou, Ben, Lin, Weidong, and Roth, Dan
Subjects: Computer Science - Computation and Language
Abstract: Predictive models often need to work with incomplete information in real-world tasks. Consequently, they must provide reliable probability or confidence estimation, especially in large-scale decision making and planning tasks. Current large language models (LLM) are insufficient for such accurate estimations, but they can generate relevant factors that may affect the probabilities, produce coarse-grained probabilities when the information is more complete, and help determine which factors are relevant to specific downstream contexts. In this paper, we make use of these capabilities of LLMs to provide a significantly more accurate probabilistic estimation. We propose BIRD, a novel probabilistic inference framework that aligns a Bayesian network with LLM abductions and then estimates more accurate probabilities in a deduction step. We show BIRD provides reliable probability estimations that are 30\% better than those provided directly by LLM baselines. These estimates can further contribute to better and more trustworthy decision-making.
Published: 2024

25. BLINK: Multimodal Large Language Models Can See but Not Perceive

Author: Fu, Xingyu, Hu, Yushi, Li, Bangzheng, Feng, Yu, Wang, Haoyu, Lin, Xudong, Roth, Dan, Smith, Noah A., Ma, Wei-Chiu, and Krishna, Ranjay
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: We introduce Blink, a new benchmark for multimodal language models (LLMs) that focuses on core visual perception abilities not found in other evaluations. Most of the Blink tasks can be solved by humans "within a blink" (e.g., relative depth estimation, visual correspondence, forensics detection, and multi-view reasoning). However, we find these perception-demanding tasks cast significant challenges for current multimodal LLMs because they resist mediation through natural language. Blink reformats 14 classic computer vision tasks into 3,807 multiple-choice questions, paired with single or multiple images and visual prompting. While humans get 95.70% accuracy on average, Blink is surprisingly challenging for existing multimodal LLMs: even the best-performing GPT-4V and Gemini achieve accuracies of 51.26% and 45.72%, only 13.17% and 7.63% higher than random guessing, indicating that such perception abilities have not "emerged" yet in recent multimodal LLMs. Our analysis also highlights that specialist CV models could solve these problems much better, suggesting potential pathways for future improvements. We believe Blink will stimulate the community to help multimodal LLMs catch up with human-level visual perception., Comment: Multimodal Benchmark, Project Url: https://zeyofu.github.io/blink/, ECCV 2024
Published: 2024

26. Fewer Truncations Improve Language Modeling

Author: Ding, Hantian, Wang, Zijian, Paolini, Giovanni, Kumar, Varun, Deoras, Anoop, Roth, Dan, and Soatto, Stefano
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: In large language model training, input documents are typically concatenated together and then split into sequences of equal length to avoid padding tokens. Despite its efficiency, the concatenation approach compromises data integrity -- it inevitably breaks many documents into incomplete pieces, leading to excessive truncations that hinder the model from learning to compose logically coherent and factually consistent content that is grounded on the complete context. To address the issue, we propose Best-fit Packing, a scalable and efficient method that packs documents into training sequences through length-aware combinatorial optimization. Our method completely eliminates unnecessary truncations while retaining the same training efficiency as concatenation. Empirical results from both text and code pre-training show that our method achieves superior performance (e.g., relatively +4.7% on reading comprehension; +16.8% in context following; and +9.2% on program synthesis), and reduces closed-domain hallucination effectively by up to 58.3%., Comment: ICML 2024
Published: 2024

27. Is Table Retrieval a Solved Problem? Exploring Join-Aware Multi-Table Retrieval

Author: Chen, Peter Baile, Zhang, Yi, and Roth, Dan
Subjects: Computer Science - Information Retrieval, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Retrieving relevant tables containing the necessary information to accurately answer a given question over tables is critical to open-domain question-answering (QA) systems. Previous methods assume the answer to such a question can be found either in a single table or multiple tables identified through question decomposition or rewriting. However, neither of these approaches is sufficient, as many questions require retrieving multiple tables and joining them through a join plan that cannot be discerned from the user query itself. If the join plan is not considered in the retrieval stage, the subsequent steps of reasoning and answering based on those retrieved tables are likely to be incorrect. To address this problem, we introduce a method that uncovers useful join relations for any query and database during table retrieval. We use a novel re-ranking method formulated as a mixed-integer program that considers not only table-query relevance but also table-table relevance that requires inferring join relationships. Our method outperforms the state-of-the-art approaches for table retrieval by up to 9.3% in F1 score and for end-to-end QA by up to 5.4% in accuracy., Comment: ACL 2024 camera ready
Published: 2024

28. Conceptual and Unbiased Reasoning in Language Models

Author: Zhou, Ben, Zhang, Hongming, Chen, Sihao, Yu, Dian, Wang, Hongwei, Peng, Baolin, Roth, Dan, and Yu, Dong
Subjects: Computer Science - Computation and Language
Abstract: Conceptual reasoning, the ability to reason in abstract and high-level perspectives, is key to generalization in human cognition. However, limited study has been done on large language models' capability to perform conceptual reasoning. In this work, we bridge this gap and propose a novel conceptualization framework that forces models to perform conceptual reasoning on abstract questions and generate solutions in a verifiable symbolic space. Using this framework as an analytical tool, we show that existing large language models fall short on conceptual reasoning, dropping 9% to 28% on various benchmarks compared to direct inference methods. We then discuss how models can improve since high-level abstract reasoning is key to unbiased and generalizable decision-making. We propose two techniques to add trustworthy induction signals by generating familiar questions with similar underlying reasoning paths and asking models to perform self-refinement. Experiments show that our proposed techniques improve models' conceptual reasoning performance by 8% to 11%, achieving a more robust reasoning system that relies less on inductive biases., Comment: Preprint under review
Published: 2024

29. Multi-Agent VQA: Exploring Multi-Agent Foundation Models in Zero-Shot Visual Question Answering

Author: Jiang, Bowen, Zhuang, Zhijun, Shivakumar, Shreyas S., Roth, Dan, and Taylor, Camillo J.
Subjects: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Computation and Language, Computer Science - Machine Learning, Computer Science - Multiagent Systems
Abstract: This work explores the zero-shot capabilities of foundation models in Visual Question Answering (VQA) tasks. We propose an adaptive multi-agent system, named Multi-Agent VQA, to overcome the limitations of foundation models in object detection and counting by using specialized agents as tools. Unlike existing approaches, our study focuses on the system's performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world. We present preliminary experimental results under zero-shot scenarios and highlight some failure cases, offering new directions for future research., Comment: A full version of the paper will be released soon. The codes are available at https://github.com/bowen-upenn/Multi-Agent-VQA
Published: 2024

30. From Instructions to Constraints: Language Model Alignment with Automatic Constraint Verification

Author: Wang, Fei, Shang, Chao, Jain, Sarthak, Wang, Shuai, Ning, Qiang, Min, Bonan, Castelli, Vittorio, Benajiba, Yassine, and Roth, Dan
Subjects: Computer Science - Computation and Language, Computer Science - Artificial Intelligence, Computer Science - Machine Learning
Abstract: User alignment is crucial for adapting general-purpose language models (LMs) to downstream tasks, but human annotations are often not available for all types of instructions, especially those with customized constraints. We observe that user instructions typically contain constraints. While assessing response quality in terms of the whole instruction is often costly, efficiently evaluating the satisfaction rate of constraints is feasible. We investigate common constraints in NLP tasks, categorize them into three classes based on the types of their arguments, and propose a unified framework, ACT (Aligning to ConsTraints), to automatically produce supervision signals for user alignment with constraints. Specifically, ACT uses constraint verifiers, which are typically easy to implement in practice, to compute constraint satisfaction rate (CSR) of each response. It samples multiple responses for each prompt and collect preference labels based on their CSR automatically. Subsequently, ACT adapts the LM to the target task through a ranking-based learning process. Experiments on fine-grained entity typing, abstractive summarization, and temporal question answering show that ACT is able to enhance LMs' capability to adhere to different classes of constraints, thereby improving task performance. Further experiments show that the constraint-following capabilities are transferable.
Published: 2024

31. Evaluating LLMs' Mathematical Reasoning in Financial Document Question Answering

Author: Srivastava, Pragya, Malik, Manuj, Gupta, Vivek, Ganu, Tanuja, and Roth, Dan
Subjects: Computer Science - Computation and Language
Abstract: Large Language Models (LLMs), excel in natural language understanding, but their capability for complex mathematical reasoning with an amalgamation of structured tables and unstructured text is uncertain. This study explores LLMs' mathematical reasoning on four financial tabular question-answering datasets: TATQA, FinQA, ConvFinQA, and Multihiertt. Through extensive experiments with various models and prompting techniques, we assess how LLMs adapt to complex tables and mathematical tasks. We focus on sensitivity to table complexity and performance variations with an increasing number of arithmetic reasoning steps. The results provide insights into LLMs' capabilities and limitations in handling complex mathematical scenarios for semi-structured tables. Ultimately, we introduce a novel prompting technique tailored to semi-structured documents, matching or outperforming other baselines in performance while providing a nuanced understanding of LLMs abilities for such a task., Comment: 25 pages, 17 figures
Published: 2024

32. DeAL: Decoding-time Alignment for Large Language Models

Author: Huang, James Y., Sengupta, Sailik, Bonadiman, Daniele, Lai, Yi-an, Gupta, Arshit, Pappas, Nikolaos, Mansour, Saab, Kirchhoff, Katrin, and Roth, Dan
Subjects: Computer Science - Artificial Intelligence, Computer Science - Computation and Language
Abstract: Large Language Models (LLMs) are nowadays expected to generate content aligned with human preferences. Current work focuses on alignment at model training time, through techniques such as Reinforcement Learning with Human Feedback (RLHF). However, it is unclear if such methods are an effective choice to teach alignment objectives to the model. First, the inability to incorporate multiple, custom rewards and reliance on a model developer's view of universal and static principles are key limitations. Second, the residual gaps in model training and the reliability of such approaches are also questionable (e.g. susceptibility to jail-breaking even after safety training). To address these, we propose DeAL, a framework that allows the user to customize reward functions and enables Decoding-time Alignment of LLMs (DeAL). At its core, we view decoding as a heuristic-guided search process and facilitate the use of a wide variety of alignment objectives. Our experiments with programmatic constraints such as keyword and length constraints (studied widely in the pre-LLM era) and abstract objectives such as harmlessness and helpfulness (proposed in the post-LLM era) show that we can DeAL with fine-grained trade-offs, improve adherence to alignment objectives, and address residual gaps in LLMs. Lastly, while DeAL can be effectively paired with RLHF and prompting techniques, its generality makes decoding slower, an optimization we leave for future work., Comment: The appendix contains data that is offensive / disturbing in nature
Published: 2024

33. Code Representation Learning At Scale

Author: Zhang, Dejiao, Ahmad, Wasi, Tan, Ming, Ding, Hantian, Nallapati, Ramesh, Roth, Dan, Ma, Xiaofei, and Xiang, Bing
Subjects: Computer Science - Computation and Language
Abstract: Recent studies have shown that code language models at scale demonstrate significant performance gains on downstream tasks, i.e., code generation. However, most of the existing works on code representation learning train models at a hundred million parameter scale using very limited pretraining corpora. In this work, we fuel code representation learning with a vast amount of code data via a two-stage pretraining scheme. We first train the encoders via a mix that leverages both randomness in masking language modeling and the structure aspect of programming language. We then enhance the representations via contrastive learning with hard negative and hard positive constructed in an unsupervised manner. We establish an off-the-shelf encoder model that persistently outperforms the existing models on a wide variety of downstream tasks by large margins. To comprehend the factors contributing to successful code representation learning, we conduct detailed ablations and share our findings on (i) a customized and effective token-level denoising scheme for source code; (ii) the importance of hard negatives and hard positives; (iii) how the proposed bimodal contrastive learning boost the cross-lingual semantic search performance; and (iv) how the pretraining schemes decide the downstream task performance scales with the model size., Comment: 10 pages
Published: 2024

34. Causal inference with textual data: A quasi-experimental design assessing the association between author metadata and acceptance among ICLR submissions from 2017 to 2022

Author: Chen Chang, Zhang Jiayao, Ye Ting, Roth Dan, and Zhang Bo
Subjects: matched observational study, natural language processing, peer-review, quasi-experimental design, status bias, 62a01, 62p25, Mathematics, QA1-939, Probabilities. Mathematical statistics, QA273-280
Abstract: Many recent studies have probed status bias in the peer-review process of academic journals and conferences. In this article, we investigated the association between author metadata and area chairs’ final decisions (Accept/Reject) using our compiled database of 5,313 borderline submissions to the International Conference on Learning Representations from 2017 to 2022 under a matched observational study framework. We carefully defined elements in a cause-and-effect analysis, including the treatment and its timing, pre-treatment variables, potential outcomes (POs) and causal null hypothesis of interest, all in the context of study units being textual data and under Neyman and Rubin’s PO framework. We found some weak evidence that author metadata was associated with articles’ final decisions. We also found that, under an additional stability assumption, borderline articles from high-ranking institutions (top-30% or top-20%) were less favored by area chairs compared to their matched counterparts. The results were consistent in two different matched designs (odds ratio = 0.82 [95% confidence interval (CI): 0.67 to 1.00] in a first design and 0.83 [95% CI: 0.64 to 1.07] in a strengthened design) and most pronounced in the subgroup of articles with low ratings. We discussed how to interpret these results in the context of multiple interactions between a study unit and different agents (reviewers and area chairs) in the peer-review system.
Published: 2024
Full Text: View/download PDF

35. Disparities in seizure outcomes revealed by large language models

Author: Xie, Kevin, primary, Ojemann, William K S, additional, Gallagher, Ryan S, additional, Shinohara, Russell T, additional, Lucas, Alfredo, additional, Hill, Chloé E, additional, Hamilton, Roy H, additional, Johnson, Kevin B, additional, Roth, Dan, additional, Litt, Brian, additional, and Ellis, Colin A, additional
Published: 2024
Full Text: View/download PDF

36. Recent Advances in Natural Language Processing via Large Pre-trained Language Models: A Survey.

Author: MIN, BONAN, ROSS, HAYLEY, SULEM, ELIOR, BEN VEYSEH, AMIR POURAN, THIEN HUU NGUYEN, SAINZ, OSCAR, AGIRRE, ENEKO, HEINTZ, ILANA, and ROTH, DAN
Subjects: NATURAL language processing, LANGUAGE models, DEEP learning, ARTIFICIAL neural networks, MACHINE translating, SENTIMENT analysis
Published: 2024
Full Text: View/download PDF

Catalog

Books, media, physical & digital resources

See catalog results

Searchworks

Select search scope, currently: Articles Catalog books, media & more in Jio Institute collections Articles journal articles & other e-resources

Search

Search Constraints

Refine your results

Search Limiters

Topic

Publication Year Range

Language

Publication Type

Journal

Database

Publisher

36 results on '"Roth, Dan"'

Search Results

Catalog

Select search scope, currently: Articles

Catalog

books, media & more in Jio Institute collections

Articles

journal articles & other e-resources