1. How Should We Build A Benchmark? Revisiting 274 Code-Related Benchmarks For LLMs
- Authors
Cao, Jialun, Chan, Yuk-Kit, Ling, Zixuan, Wang, Wenxuan, Li, Shuqing, Liu, Mingwei, Qiao, Ruixi, Han, Yuting, Wang, Chaozheng, Yu, Boxi, He, Pinjia, Wang, Shuai, Zheng, Zibin, Lyu, Michael R., and Cheung, Shing-Chi
- Subjects
Computer Science - Software Engineering, Computer Science - Artificial Intelligence, Computer Science - Computation and Language
- Abstract
Various benchmarks have been proposed to assess the performance of large language models (LLMs) in different coding scenarios. We refer to them as code-related benchmarks. However, there are no systematic guidelines for how such a benchmark should be developed to ensure its quality, reliability, and reproducibility. We propose HOW2BENCH, a 55-criteria checklist that serves as a comprehensive set of guidelines governing the development of code-related benchmarks. Using HOW2BENCH, we profiled 274 benchmarks released within the past decade and found concerning issues. Nearly 70% of the benchmarks took no measures to assure data quality; over 10% were not open sourced, or were only partially open sourced. Many highly cited benchmarks have loopholes, including duplicated samples; incorrect reference code, tests, or prompts; and unremoved sensitive or confidential information. Finally, we conducted a human study involving 49 participants, which revealed significant gaps in awareness of the importance of data quality, reproducibility, and transparency.
- Comment
42 pages
- Published
2025