1. Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making
- Author
Civettini, Ivan, Zappaterra, Arianna, Granelli, Bianca Maria, Rindone, Giovanni, Aroldi, Andrea, Bonfanti, Stefano, Colombo, Federica, Fedele, Marilena, Grillo, Giovanni, Parma, Matteo, Perfetti, Paola, Terruzzi, Elisabetta, Gambacorti-Passerini, Carlo, Ramazzotti, Daniele, and Cavalca, Fabrizio
- Abstract
In a first-of-its-kind study, we assessed the capabilities of large language models (LLMs) in making complex decisions in haematopoietic stem cell transplantation. The evaluation covered not only Generative Pre-trained Transformer 4 (GPT-4) but also two other artificial intelligence models, PaLM 2 and Llama 2. Using detailed haematological histories that included clinical, molecular and donor data, we conducted a triple-blind survey comparing the LLMs with haematology residents. The residents significantly outperformed the LLMs (p = 0.02), particularly in assessing transplant eligibility (p = 0.01). Our triple-blind methodology aimed to mitigate potential biases in evaluating LLMs and revealed both their promise and their limitations in interpreting complex haematological clinical scenarios.
- Published
- 2023