Si, Liping; Zhong, Jingyu; Huo, Jiayu; Xuan, Kai; Zhuang, Zixu; Hu, Yangfan; Wang, Qian; Zhang, Huan; and Yao, Weiwu
Purpose: Our purposes were (1) to assess the methodologic quality of studies on deep learning in knee imaging using the Checklist for Artificial Intelligence in Medical Imaging (CLAIM) criteria and (2) to offer our vision for the development of CLAIM to ensure high-quality reporting of AI applications in medical imaging of the knee joint. Materials and methods: A CLAIM-based systematic review covering January 1, 2015, to June 1, 2020, was conducted using the PubMed, EMBASE, and Web of Science databases. A total of 36 articles discussing deep learning applications in knee joint imaging were identified, divided by imaging modality, and characterized by imaging task, data source, algorithm type, and outcome metrics. Results: The 36 included studies were divided by modality into X-ray (44.44%) and MRI (55.56%). The mean CLAIM score of the 36 studies was 27.94 (standard deviation, 4.26), which was 66.53% of the ideal score of 42.00. Inter-rater agreement on the CLAIM items was good on average (ICC 0.815, 95% CI 0.660–0.902). In total, 32 studies performed internal cross-validation on the data set, while only 4 studies conducted external validation. Conclusions: The overall scientific quality of deep learning studies in knee imaging is insufficient; however, deep learning remains a promising technology for diagnostic or predictive purposes. Improvements in study design, validation, and open science are needed to demonstrate the generalizability of findings and to achieve clinical application. Wider application of CLAIM, a scoring procedure with prior rater training, and modification of CLAIM in response to clinical needs will be necessary in the future. Key Points:
• Deep learning studies in knee imaging were of limited quality, with a mean CLAIM score of 27.94 (66.53% of the ideal score of 42.00), commonly due to unvalidated results, retrospective study design, and the absence of clear, detailed definitions of the CLAIM items.
• A previously trained data extraction instrument allowed moderate inter-rater agreement to be reached in applying CLAIM; however, CLAIM still needs improvement in its scoring items and result reporting to become a widely adaptable tool for reviews of deep learning studies. [ABSTRACT FROM AUTHOR]
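The proportions reported in the abstract can be checked with simple arithmetic. The sketch below is illustrative only: the per-modality counts (16 X-ray, 20 MRI) are not stated in the abstract and are assumed here because they are the only split of 36 studies that matches the reported 44.44% and 55.56%; the helper name `pct` is likewise hypothetical.

```python
def pct(part, whole):
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)

TOTAL_STUDIES = 36      # reported in the abstract
XRAY_STUDIES = 16       # ASSUMPTION: inferred from the reported 44.44%
MRI_STUDIES = 20        # ASSUMPTION: inferred from the reported 55.56%
EXTERNAL_VALIDATION = 4 # reported in the abstract
MEAN_SCORE = 27.94      # reported mean CLAIM score (already rounded)
IDEAL_SCORE = 42.00     # reported ideal CLAIM score

xray_share = pct(XRAY_STUDIES, TOTAL_STUDIES)
mri_share = pct(MRI_STUDIES, TOTAL_STUDIES)
external_share = pct(EXTERNAL_VALIDATION, TOTAL_STUDIES)
score_share = pct(MEAN_SCORE, IDEAL_SCORE)

print(f"X-ray: {xray_share}%  MRI: {mri_share}%")
print(f"External validation: {external_share}% of studies")
print(f"Mean score as share of ideal: {score_share}%")
```

Using the rounded mean of 27.94, the last ratio comes out at 66.52% rather than the reported 66.53%; a difference of this size would be consistent with the authors having computed the percentage from the unrounded mean, though the abstract does not say so.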