Dong, Zehua, Tao, Xiao, Du, Hongliu, Wang, Junxiao, Huang, Li, He, Chiyi, Zhao, Zhifeng, Mao, Xinli, Ai, Yaowei, Zhang, Beiping, Liu, Mei, Xu, Hong, Jiang, Zhenyu, Sun, Yunwei, Li, Xiuling, Liu, Zhihong, Chen, Jinzhong, Song, Ying, Liu, Guowei, and Luo, Chaijie
Background: Artificial intelligence (AI) performs differently across test sets of differing diversity because of sample selection bias, which can be a stumbling block for AI applications. We previously tested an AI system named ENDOANGEL for diagnosing early gastric cancer (EGC) on single-center videos in a man–machine competition. We aimed to re-test ENDOANGEL on multi-center videos to explore the challenges of applying AI in multiple centers, and then to upgrade ENDOANGEL and explore solutions to those challenges. Methods: ENDOANGEL was re-tested on multi-center videos retrospectively collected from 12 institutions, and its performance was compared with that on the previously reported single-center videos. We then upgraded ENDOANGEL to ENDOANGEL-2022 with more training samples and novel algorithms and conducted a competition between ENDOANGEL-2022 and endoscopists. ENDOANGEL-2022 was then tested on the single-center videos, and its performance was compared with that on the multi-center videos; the two AI systems were also compared with each other and with endoscopists. Results: Forty-six EGCs and 54 non-cancers were included in the multi-center video cohort. In diagnosing EGCs, ENDOANGEL showed stable sensitivity on multi-center compared with single-center videos (97.83% vs. 100.00%) but sharply decreased specificity (61.11% vs. 82.54%); ENDOANGEL-2022 showed a similar tendency while achieving significantly higher specificity (79.63%, p < 0.01) and making fewer mistakes on typical lesions than ENDOANGEL. In detecting gastric neoplasms, both AI systems showed stable sensitivity but sharply decreased specificity. Nevertheless, both AI systems outperformed endoscopists in the two competitions. Conclusions: A large increase in false positives is a prominent challenge for applying EGC diagnostic AI in multiple centers, owing to the high heterogeneity of negative cases. Optimizing AI by adding samples and using novel algorithms is a promising way to overcome this challenge. [ABSTRACT FROM AUTHOR]
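The reported multi-center percentages can be related back to the cohort sizes (46 EGCs, 54 non-cancers) with simple arithmetic. The sketch below is illustrative only: the abstract gives no per-case counts, so the counts of correct calls (45, 33, 43) are inferred here as the integers that reproduce the reported rates. The helper name `rate` is hypothetical.

```python
def rate(correct: int, total: int) -> float:
    """Return correct/total as a percentage rounded to two decimals."""
    return round(100 * correct / total, 2)

# Cohort sizes stated in the abstract (multi-center videos).
EGC_TOTAL = 46
NON_CANCER_TOTAL = 54

# ASSUMPTION: these counts are not in the abstract; they are inferred as the
# integers that reproduce the reported multi-center percentages.
endoangel_true_pos = 45        # reproduces the reported 97.83% sensitivity
endoangel_true_neg = 33        # reproduces the reported 61.11% specificity
endoangel_2022_true_neg = 43   # reproduces the reported 79.63% specificity

sensitivity = rate(endoangel_true_pos, EGC_TOTAL)
specificity_old = rate(endoangel_true_neg, NON_CANCER_TOTAL)
specificity_new = rate(endoangel_2022_true_neg, NON_CANCER_TOTAL)

# False positives are the non-cancers that were not correctly called negative.
false_pos_old = NON_CANCER_TOTAL - endoangel_true_neg
false_pos_new = NON_CANCER_TOTAL - endoangel_2022_true_neg

print(sensitivity, specificity_old, specificity_new)
print(false_pos_old, false_pos_new)
```

Under these inferred counts, the specificity gain of ENDOANGEL-2022 corresponds to roughly ten fewer false positives among the 54 non-cancer videos, which is consistent with the conclusion that false positives are the main multi-center challenge.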