
SuperGPQA: Scaling LLM Evaluation across 285 Graduate Disciplines

Authors:
M-A-P Team
Du, Xinrun
Yao, Yifan
Ma, Kaijing
Wang, Bingli
Zheng, Tianyu
Zhu, Kang
Liu, Minghao
Liang, Yiming
Jin, Xiaolong
Wei, Zhenlin
Zheng, Chujie
Deng, Kaixing
Guo, Shuyue
Jia, Shian
Jiang, Sichao
Liao, Yiyan
Li, Rui
Li, Qinrui
Li, Sirun
Li, Yizhi
Li, Yunwen
Ma, Dehua
Ni, Yuansheng
Que, Haoran
Wang, Qiyao
Wen, Zhoufutu
Wu, Siwei
Xing, Tianshun
Xu, Ming
Yang, Zhenzhu
Wang, Zekun Moore
Zhou, Junting
Bai, Yuelin
Bu, Xingyuan
Cai, Chenglin
Chen, Liang
Chen, Yifan
Cheng, Chengtuo
Cheng, Tianhao
Ding, Keyi
Huang, Siming
Huang, Yun
Li, Yaoru
Li, Yizhe
Li, Zhaoqun
Liang, Tianhao
Lin, Chengdong
Lin, Hongquan
Ma, Yinghao
Peng, Zhongyuan
Peng, Zifan
Qi, Qige
Qiu, Shi
Qu, Xingwei
Tan, Yizhou
Wang, Zili
Wang, Chenqing
Wang, Hao
Wang, Yiya
Wang, Yubo
Xu, Jiajun
Yang, Kexin
Yuan, Ruibin
Yue, Yuanhao
Zhan, Tianyang
Zhang, Chun
Zhang, Jingyang
Zhang, Xiyue
Zhang, Xingjian
Zhang, Yue
Zhao, Yongchi
Zheng, Xiangyu
Zhong, Chenghua
Gao, Yang
Li, Zhoujun
Liu, Dayiheng
Liu, Qian
Liu, Tianyu
Ni, Shiwen
Peng, Junran
Qin, Yujia
Su, Wenbo
Wang, Guoyin
Wang, Shi
Yang, Jian
Yang, Min
Cao, Meng
Yue, Xiang
Zhang, Zhaoxiang
Zhou, Wangchunshu
Liu, Jiaheng
Lin, Qunshu
Huang, Wenhao
Zhang, Ge
Publication Year:
2025

Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in mainstream academic disciplines such as mathematics, physics, and computer science. However, human knowledge encompasses over 200 specialized disciplines, far exceeding the scope of existing benchmarks. The capabilities of LLMs in many of these specialized fields, particularly in light industry, agriculture, and service-oriented disciplines, remain inadequately evaluated. To address this gap, we present SuperGPQA, a comprehensive benchmark that evaluates graduate-level knowledge and reasoning capabilities across 285 disciplines. Our benchmark employs a novel Human-LLM collaborative filtering mechanism to eliminate trivial or ambiguous questions through iterative refinement based on both LLM responses and expert feedback. Our experimental results reveal significant room for improvement in the performance of current state-of-the-art LLMs across diverse knowledge domains (e.g., the reasoning-focused model DeepSeek-R1 achieved the highest accuracy of 61.82% on SuperGPQA), highlighting the considerable gap between current model capabilities and artificial general intelligence. Additionally, we present comprehensive insights from our management of a large-scale annotation process, involving over 80 expert annotators and an interactive Human-LLM collaborative system, offering valuable methodological guidance for future research initiatives of comparable scope.
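To illustrate the kind of Human-LLM collaborative filtering the abstract describes, the sketch below iteratively drops questions that every model answers correctly (too easy to be discriminative) or that an expert flags as ambiguous, repeating until the pool stabilizes. This is a minimal, hypothetical reading of the mechanism, not the paper's actual pipeline: all names here (Question, mock_llm, expert_review, collaborative_filter) are illustrative stand-ins.

```python
# Minimal sketch of an iterative Human-LLM collaborative filtering loop,
# loosely following the abstract's description. All helpers are
# hypothetical stand-ins, not the paper's actual API.
import random
from dataclasses import dataclass


@dataclass
class Question:
    text: str
    options: list   # candidate answers, e.g. ["A", "B", "C", "D"]
    answer: str     # gold option label


def mock_llm(question: Question) -> str:
    """Stand-in for querying one LLM; real code would call a model API."""
    return random.choice(question.options)


def expert_review(question: Question) -> bool:
    """Stand-in for expert feedback: True means the expert keeps the item."""
    return len(question.text.strip()) > 0  # trivially accept non-empty items


def collaborative_filter(pool, models, max_rounds=3):
    """Drop trivially easy or expert-rejected questions, then iterate."""
    kept = list(pool)
    for _ in range(max_rounds):
        survivors = []
        for q in kept:
            votes = [m(q) for m in models]
            if all(v == q.answer for v in votes):  # every model solves it
                continue
            if not expert_review(q):               # flagged as ambiguous
                continue
            survivors.append(q)
        if len(survivors) == len(kept):            # fixed point: nothing removed
            return survivors
        kept = survivors
    return kept


if __name__ == "__main__":
    pool = [Question(f"question {i}", ["A", "B", "C", "D"], "A") for i in range(10)]
    models = [mock_llm] * 3
    print(f"{len(collaborative_filter(pool, models))} questions survive filtering")
```

In practice the expert step would surface model responses alongside each question so annotators can revise or discard items, which is what makes the refinement loop "collaborative" rather than purely automatic.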

Details

Database:
arXiv
Publication Type:
Report
Accession number:
edsarx.2502.14739
Document Type:
Working Paper