
BabelBench: An Omni Benchmark for Code-Driven Analysis of Multimodal and Multistructured Data

Authors:
Wang, Xuwu
Cui, Qiwen
Tao, Yunzhe
Wang, Yiran
Chai, Ziwei
Han, Xiaotian
Liu, Boyi
Yuan, Jianbo
Su, Jing
Wang, Guoyin
Liu, Tingkai
Chen, Liyu
Liu, Tianyi
Sun, Tao
Zhang, Yufeng
Zheng, Sirui
You, Quanzeng
Yang, Yang
Yang, Hongxia
Publication Year:
2024

Abstract

Large language models (LLMs) have become increasingly pivotal across various domains, especially in handling complex data types. This includes structured data processing, as exemplified by ChartQA and ChatGPT-Ada, and multimodal unstructured data processing, as seen in Visual Question Answering (VQA). These areas have attracted significant attention from both industry and academia. Despite this, there remains a lack of unified evaluation methodologies for these diverse data handling scenarios. In response, we introduce BabelBench, an innovative benchmark framework that evaluates the proficiency of LLMs in managing multimodal multistructured data with code execution. BabelBench comprises a dataset of 247 meticulously curated problems that challenge models with tasks spanning perception, commonsense reasoning, logical reasoning, and more. Beyond the basic capabilities of multimodal understanding, structured data processing, and code generation, these tasks demand advanced capabilities in exploration, planning, reasoning, and debugging. Our experimental findings on BabelBench indicate that even cutting-edge models like ChatGPT 4 exhibit substantial room for improvement. The insights derived from our comprehensive analysis offer valuable guidance for future research within the community. The benchmark data can be found at https://github.com/FFD8FFE/babelbench.
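To make the "code execution" evaluation setting concrete, the following is a minimal, hypothetical sketch of a code-driven evaluation loop in the spirit described by the abstract: the model is prompted with a problem, its generated program is executed, and the printed result is compared against the reference answer. The JSON layout, field names, and the call_llm helper are illustrative assumptions, not BabelBench's actual API or data format.

```python
# Hypothetical sketch of a code-driven evaluation loop (not the benchmark's API).
import json
import subprocess
import tempfile


def call_llm(prompt: str) -> str:
    """Placeholder: send the problem text (plus any table/image references)
    to an LLM and return the Python code it generates."""
    raise NotImplementedError


def run_generated_code(code: str, timeout: int = 60) -> str:
    """Execute model-generated code in a subprocess and capture its stdout."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(
        ["python", path], capture_output=True, text=True, timeout=timeout
    )
    return result.stdout.strip()


def evaluate(problems_path: str) -> float:
    """Score the fraction of problems whose executed output matches the label."""
    with open(problems_path) as f:
        problems = json.load(f)  # assumed: list of {"question": ..., "answer": ...}
    correct = 0
    for problem in problems:
        code = call_llm(problem["question"])
        try:
            prediction = run_generated_code(code)
        except Exception:
            prediction = ""  # execution or timeout errors count as failures
        correct += int(prediction == str(problem["answer"]))
    return correct / len(problems)
```

A real harness would additionally need multimodal inputs (images, tables) attached to each problem, sandboxed execution, and multi-turn interaction to exercise the planning and debugging capabilities the abstract mentions; this sketch only illustrates the overall scoring shape.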

Details

Database:
arXiv
Publication Type:
Report
Accession Number:
edsarx.2410.00773
Document Type:
Working Paper