Charles B. Simone, Dewey Odhner, Ontida Apinorasethkul, Dimitris Mihailidis, Jayaram K. Udupa, Joseph Camaratta, David J. McLaughlin, Xingyu Wu, Paul A. James, Gargi Pednekar, Lisa Wojtowicz, Chavanon Apinorasethkul, John N. Lukens, Akhil Tiwari, Geraldine Shammo, Yubing Tong, and Drew A. Torigian
Contouring (segmentation) of Organs at Risk (OARs) in medical images is required for accurate radiation therapy (RT) planning. In current clinical practice, OAR contouring is performed with low levels of automation. Although several approaches for improving automation have been proposed in the literature, it is difficult to gauge how well these methods would perform in a realistic clinical setting. This is chiefly due to three key factors: the small number of patient studies used for evaluation, the lack of performance evaluation as a function of input image quality, and the lack of precise anatomic definitions of OARs. In this paper, extending our previous body-wide Automatic Anatomy Recognition (AAR) framework to RT planning of OARs in the head and neck (H&N) and thoracic body regions, we present a methodology called AAR-RT to overcome some of these hurdles. AAR-RT follows AAR's 3-stage paradigm of model-building, object-recognition, and object-delineation. Model-building: Three key advances were made over AAR. (i) AAR-RT (like AAR) starts with a computationally precise definition of the two body regions and all of their OARs. Ground-truth delineations of OARs are then generated strictly following these definitions. We retrospectively gathered patient data sets, along with the associated contour data sets previously created during routine clinical RT planning, from our Radiation Oncology department, and amended the contours to conform to these definitions. We then derived an Object Quality Score (OQS) for each OAR sample and an Image Quality Score (IQS) for each study, both on a 1-to-10 scale, based on quality grades assigned to each OAR sample following 9 key quality criteria. Only studies with high IQS and high OQS for all of their OARs were selected for model building. IQS and OQS were also employed to evaluate AAR-RT's performance as a function of image/object quality.
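The quality-based selection of model-worthy studies described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Study` class, the aggregation of per-OAR scores into an IQS, and the cutoff values are all assumptions made for the example.

```python
# Hypothetical sketch of model-worthy study selection based on per-OAR
# Object Quality Scores (OQS) and a study-level Image Quality Score (IQS).
from dataclasses import dataclass


@dataclass
class Study:
    study_id: str
    oqs: dict[str, float]  # per-OAR Object Quality Score, 1-to-10 scale

    @property
    def iqs(self) -> float:
        # One simple (assumed) aggregation: IQS as the mean of the study's
        # per-OAR scores; the exact derivation is not specified here.
        return sum(self.oqs.values()) / len(self.oqs)


def select_model_worthy(studies, oqs_min=6.0, iqs_min=6.0):
    """Keep only studies whose IQS and every per-OAR OQS clear the cutoffs."""
    return [s for s in studies
            if s.iqs >= iqs_min and all(q >= oqs_min for q in s.oqs.values())]


good = Study("s1", {"heart": 8, "esophagus": 7, "trachea": 9})
poor = Study("s2", {"heart": 8, "esophagus": 3, "trachea": 9})  # one poor OAR
print([s.study_id for s in select_model_worthy([good, poor])])  # → ['s1']
```

A single poor OAR disqualifies a study, matching the requirement that high OQS must hold for all OARs in a model-building study.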
(ii) In place of the previous hand-crafted hierarchy for organizing OARs in AAR, we devised a method to find an optimal hierarchy for each body region, where optimality was based on minimizing object recognition error. (iii) In addition to the parent-to-child relationships encoded in the hierarchy in previous AAR, we developed a directed probability graph technique that further improves recognition accuracy by learning, and encoding in the model, "steady" relationships that may exist among OAR boundaries in the three orthogonal planes. Object-recognition: The two key improvements over the previous approach are (i) use of the optimal hierarchy for actual recognition of OARs in a given image, and (ii) refined recognition using the trained probability graph. Object-delineation: We use a kNN classifier confined to the fuzzy object mask localized by the recognition step, and then optimally fit the fuzzy mask to the kNN-derived voxel cluster to restore shape constraints on the object.

We evaluated AAR-RT on 205 thoracic and 298 H&N (total 503) studies, involving both planning and re-planning scans and a total of 21 organs (9 thoracic, 12 H&N). The studies were gathered from two patient age groups for each gender: 40-59 years and 60-79 years. The number of 3D OAR samples analyzed from the two body regions was 4301. IQS and OQS tended to cluster at the two ends of the score scale; accordingly, we considered two quality groups for each gender: good and poor. Good-quality data sets typically had OQS ≥ 6 and exhibited distortions, artifacts, pathology, etc. in no more than 3 slices through the object. The number of model-worthy data sets used for training was 38 for thorax and 36 for H&N, and the remaining 429 studies were used for testing AAR-RT. Accordingly, we created 4 anatomy models, one each for: thorax male (20 model-worthy data sets), thorax female (18 model-worthy data sets), H&N male (20 model-worthy data sets), and H&N female (16 model-worthy data sets).
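The kNN-based delineation step described above, classifying voxels only within the fuzzy mask placed by the recognition step, can be sketched as follows. This is a toy illustration under stated assumptions: a brute-force kNN on a single intensity feature stands in for the trained classifier, and the subsequent optimal fitting of the fuzzy mask to the voxel cluster is omitted. All names are hypothetical.

```python
# Sketch of kNN delineation confined to a thresholded fuzzy object mask.
import numpy as np


def knn_predict(train_x, train_y, x, k=3):
    """Brute-force kNN on a 1D intensity feature: majority vote over the
    k nearest training samples (labels are 0 = background, 1 = object)."""
    d = np.abs(train_x[None, :] - x[:, None])   # (n_query, n_train) distances
    nn = np.argsort(d, axis=1)[:, :k]           # indices of k nearest samples
    return train_y[nn].mean(axis=1) >= 0.5      # majority vote per query


def delineate(intensity, fuzzy_mask, train_x, train_y, k=3, tau=0.5):
    """Classify only voxels inside the recognized region (fuzzy_mask > tau);
    everything outside the mask stays background."""
    roi = fuzzy_mask > tau
    out = np.zeros(intensity.shape, dtype=bool)
    out[roi] = knn_predict(train_x, train_y, intensity[roi], k)
    return out


# Toy training samples: low intensity = background (0), high = object (1).
train_x = np.array([0.0, 0.1, 0.2, 0.9, 1.0, 1.1])
train_y = np.array([0, 0, 0, 1, 1, 1])
intensity = np.array([[[0.05, 0.95], [1.0, 0.0]]])   # tiny 1x2x2 "scan"
fuzzy_mask = np.array([[[0.9, 0.9], [0.2, 0.9]]])    # recognition output
print(delineate(intensity, fuzzy_mask, train_x, train_y).tolist())
# → [[[False, True], [False, False]]]
```

Note that the bright voxel at (0, 1, 0) is excluded despite its object-like intensity because it lies outside the fuzzy mask, which is how confinement to the recognition result constrains delineation.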
On "good" cases, AAR-RT's recognition accuracy was within 2 voxels and its delineation boundary distance was within ∼1 voxel. This was similar to the variability observed between two dosimetrists manually contouring 5-6 OARs in each of 169 studies. On "poor" cases, AAR-RT's errors hovered around 5 voxels for recognition and 2 voxels for boundary distance. Performance was similar on planning and re-planning cases, and there was no gender difference in performance. AAR-RT's recognition operation is much more robust than its delineation operation. Understanding object and image quality, and how they influence performance, is crucial for devising effective object recognition and delineation algorithms. OQS seems to be more important than IQS in determining accuracy. Streak artifacts arising from dental implants and fillings, and beam hardening from bone, pose the greatest challenge to auto-contouring methods.