Author manuscript; available in PMC: 2020 May 1.
Published in final edited form as: Med Image Anal. 2019 Jan 29;54:45–62. doi: 10.1016/j.media.2019.01.008

AAR-RT - A system for auto-contouring organs at risk on CT images for radiation therapy planning: Principles, design, and large-scale evaluation on head-and-neck and thoracic cancer cases

Xingyu Wu 1,, Jayaram K Udupa 1,*, Yubing Tong 1,, Dewey Odhner 1, Gargi V Pednekar 2, Charles B Simone II 3, David McLaughlin 2, Chavanon Apinorasethkul 4, Ontida Apinorasethkul 4, John Lukens 4, Dimitris Mihailidis 4, Geraldine Shammo 4, Paul James 4, Akhil Tiwari 4, Lisa Wojtowicz 4, Joseph Camaratta 2, Drew A Torigian 1
PMCID: PMC6499546  NIHMSID: NIHMS1522990  PMID: 30831357

Abstract

Contouring (segmentation) of Organs at Risk (OARs) in medical images is required for accurate radiation therapy (RT) planning. In current clinical practice, OAR contouring is performed with low levels of automation. Although several approaches have been proposed in the literature for improving automation, it is difficult to gain an understanding of how well these methods would perform in a realistic clinical setting. This is chiefly due to three key factors – small number of patient studies used for evaluation, lack of performance evaluation as a function of input image quality, and lack of precise anatomic definitions of OARs. In this paper, extending our previous body-wide Automatic Anatomy Recognition (AAR) framework to RT planning of OARs in the head and neck (H&N) and thoracic body regions, we present a methodology called AAR-RT to overcome some of these hurdles.

AAR-RT follows AAR’s 3-stage paradigm of model-building, object-recognition, and object-delineation. Model-building: Three key advances were made over AAR. (i) AAR-RT (like AAR) starts off with a computationally precise definition of the two body regions and all of their OARs. Ground truth delineations of OARs are then generated following these definitions strictly. We retrospectively gathered patient data sets and the associated contour data sets that had been created previously in routine clinical RT planning from our Radiation Oncology department and mended the contours to conform to these definitions. We then derived an Object Quality Score (OQS) for each OAR sample and an Image Quality Score (IQS) for each study, both on a 1-to-10 scale, based on quality grades assigned to each OAR sample following 9 key quality criteria. Only studies with high IQS and high OQS for all of their OARs were selected for model building. IQS and OQS were employed for evaluating AAR-RT’s performance as a function of image/object quality. (ii) In place of the previous hand-crafted hierarchy for organizing OARs in AAR, we devised a method to find an optimal hierarchy for each body region. Optimality was based on minimizing object recognition error. (iii) In addition to the parent-to-child relationship encoded in the hierarchy in previous AAR, we developed a directed probability graph technique to further improve recognition accuracy by learning and encoding in the model “steady” relationships that may exist among OAR boundaries in the three orthogonal planes. Object-recognition: The two key improvements over the previous approach are (i) use of the optimal hierarchy for actual recognition of OARs in a given image, and (ii) refined recognition by making use of the trained probability graph. Object-delineation: We use a kNN classifier confined to the fuzzy object mask localized by the recognition step and then optimally fit the fuzzy mask to the kNN-derived voxel cluster to bring back shape constraint on the object.

We evaluated AAR-RT on 205 thoracic and 298 H&N (total 503) studies, involving both planning and re-planning scans and a total of 21 organs (9 - thorax, 12 - H&N). The studies were gathered from two patient age groups for each gender - 40–59 years and 60–79 years. The number of 3D OAR samples analyzed from the two body regions was 4301. IQS and OQS tended to cluster at the two ends of the score scale. Accordingly, we considered two quality groups for each gender - good and poor. Good quality data sets typically had OQS ≥ 6 and had distortions, artifacts, pathology etc. in not more than 3 slices through the object. The number of model-worthy data sets used for training was 38 for thorax and 36 for H&N, and the remaining 479 studies were used for testing AAR-RT. Accordingly, we created 4 anatomy models, one each for: Thorax male (20 model-worthy data sets), Thorax female (18 model-worthy data sets), H&N male (20 model-worthy data sets), and H&N female (16 model-worthy data sets). On “good” cases, AAR-RT’s recognition accuracy was within 2 voxels and delineation boundary distance was within ~1 voxel. This was similar to the variability observed between two dosimetrists in manually contouring 5–6 OARs in each of 169 studies. On “poor” cases, AAR-RT’s errors hovered around 5 voxels for recognition and 2 voxels for boundary distance. The performance was similar on planning and replanning cases, and there was no gender difference in performance.

AAR-RT’s recognition operation is much more robust than delineation. Understanding object and image quality and how they influence performance is crucial for devising effective object recognition and delineation algorithms. OQS seems to be more important than IQS in determining accuracy. Streak artifacts arising from dental implants and fillings and beam hardening from bone pose the greatest challenge to auto-contouring methods.


1. Introduction

1.1. Background and Rationale

Cancer is a major public health problem worldwide and is the 2nd most common cause of death in the US, with ~1.7 million new cancer cases expected to be diagnosed in the US in 2018, and with an estimated 609,640 American deaths to occur in 2018 (Siegel et al., 2018). Among several therapeutic options, nearly two thirds of cancer patients will have treatment that will involve radiation therapy (RT) (ASTRO website, 2018). Contouring of critical organs, called Organs at Risk (OARs), and target tumor in medical images taken for the purpose of RT planning (referred to as planning images) is required for accurate RT planning to ensure that a proper dose of radiation is delivered to the tumor while minimizing the radiation dose to healthy organs. In current clinical practice, OAR contouring is still performed with low levels of automation due to lack of highly automated commercial contouring software. This hampers RT planning. There are two major issues with the current clinical practice of OAR contouring: (1) Poor accuracy. (2) Poor efficiency, throughput, and reproducibility.

Poor accuracy, and consequently poor efficiency/ acceptability, of OAR contours produced by existing software platforms on planning images is the main hurdle in auto-contouring for RT planning. The problem is well summarized in (Whitfield et al., 2013): “Rapid and accurate delineation of target volumes and multiple organs at risk, … is now hugely important in radiotherapy, owing to the rapid proliferation of intensity-modulated radiotherapy … Nevertheless, delineation is still clinically performed with little if any machine assistance, even though it is both time consuming and prone to inter-observer variation.” Many commercial auto-contouring systems are currently available (Thomson et al., 2014, Lustberg et al., 2017), but their poor accuracy leads to poor clinical acceptability of the contours and hence poor efficiency. As we demonstrate in Section 5 involving a large realistic study, in the clinical setting, OAR contouring can take anywhere from 40 minutes to 2 hours depending on the number of OARs to be contoured.

The efficiency problem is exacerbated in advanced RT methods such as intensity modulated radiotherapy (IMRT) and proton beam radiation therapy (PBRT) (McGowan et al., 2013). Adaptive RT can allow for modifying the treatment plan to account for anatomic changes occurring during a 5–8-week course of treatment due to weight loss or deformation of tumor and normal tissues. Such changes are particularly common during head and neck (Simone et al., 2011) and thoracic (Veiga et al., 2016) radiation and can significantly affect the total dose delivered to the tumor and normal surrounding organs and are particularly important when treating most thoracic malignancies (Veresezan et al., 2017). PBRT can allow for ultra-precise delivery of treatment due to the physical characteristics of the proton beam, eliminate exit dose, maximize dose delivered to the tumor, and minimize radiation dose to adjacent OARs, reducing toxicity and patient morbidity (Roelofs et al., 2012), and improving clinical outcomes like overall survival (Leeman et al., 2017). Yet, because of the poor accuracy, and hence efficiency of current software products, re-contouring on images taken during treatment (referred to as evaluation or replanning images) is rarely done. While the impact of this issue on patient outcome has sparsely been studied (Dolz et al., 2016), with accurate automated contouring, advanced IMRT and PBRT methods can be employed more extensively and may allow for these advanced radiotherapy modalities to achieve toxicity reductions or outcomes benefits to a large subset of patients.

The current gaps/challenges in auto-contouring for the RT application, which motivated the development of AAR-RT, may be summarized as follows. (1) Evaluation: Testing on a large number of independent data sets, rather than on the same data sets in a multifold cross-validation manner, is vital to gain a real understanding of the behavior of a method independent of the data sets. This is currently lacking. Generally, performance evaluation is done only on planning and not evaluation images. In our study cohort, we found the quality of the images to be lower in evaluation scans than in planning scans. (2) Data quality: The quality of the image data sets used, the presence and severity of artifacts/deviations from normality in these data sets, and how they might influence results are not usually discussed in published methods. No examples of performance on scans with artifacts are given, and there is no discussion of how the training and testing data sets are selected with regard to artifacts and other distortions. (3) OAR definition: Although some contouring guidelines are followed by dosimetrists and oncologists (Brouwer et al., 2015a, 2015b, Kong et al., 2011), the flexibility allowed, site-to-site variations, and the looseness of the definitions make the resulting contours unsuitable for building precise computational population object models/schemas.

In an attempt to address some of these challenges, we adopted our previous body-wide Automatic Anatomy Recognition (AAR) framework (Udupa et al., 2014) and refined its three main steps, namely, fuzzy anatomy model building for a body region, object recognition/ localization, and object delineation, with further advances in each step. Key innovations and improvements over the previous AAR framework are as follows. (1) OAR definition: To overcome the non-standardness hurdle, following published guidelines for head and neck (H&N) (Brouwer et al., 2015a, 2015b, Hall et al., 2008) and thoracic (Kong et al., 2018, Kong et al., 2011, Hall et al., 2008) anatomic OAR definitions, we formulated detailed and precise operational definitions and a reference document for specifying and delineating each of the 21 OARs considered in this work on axial CT slices, as explained in Section 2. (2) Optimal hierarchy: The AAR approach arranges OARs in a hierarchy by learning object relationships. Previously, we used an anatomically motivated hierarchy for OARs. In this work, we find an optimal hierarchy that actually minimizes OAR recognition error, as described in Section 3. (3) Image texture: The best OAR-specific image texture property is found and used for both object recognition and delineation, as outlined in Section 3. (4) Recognition refinement using Directed Probability Graph (Section 3): In the previous approach, object localization accuracy was inferior in the z- (cranio-caudal) direction to that in the xy (axial) plane. We train and employ a Directed Probability Graph to improve this accuracy. (5) Delineation via voxel classification and fuzzy model fitting: The previous approach used fuzzy connectedness which had issues with automatically finding seeds required for its delineation engine. We replace that strategy by a fuzzy classification and fuzzy model fitting step to improve accuracy (Section 3). (6) Large-scale evaluation of recognition and delineation: We evaluate both recognition and delineation performance of AAR-RT on clinical CT scans of over 500 cancer patients randomly selected from our hospital database for the two body regions involving both planning and evaluation scans (Sections 2 and 4). (7) Evaluation as a function of image/ object quality: To understand dependence of performance on image/ object quality, we define image/ object quality metrics, build models using highest quality data sets, and evaluate recognition/ delineation accuracy on all data sets as a function of quality (Sections 2, 3, 4).

1.2. Related Work: Approaches to Segmentation of OARs

There is a large body of literature on segmentation of individual objects/OARs on images from different modalities. However, not all of them are applicable to the problem of body-region-wide OAR segmentation. It takes a lot of effort to understand the application-specific issues, solve each of them satisfactorily, and evaluate the methods in a realistic manner to gain confidence in their behavior on real clinical data sets. We shall therefore review works specifically related to body-region-wide OAR segmentation for the RT application on CT images of cases involving H&N and thoracic malignancies. We will perform a comparative analysis of AAR-RT and key published works from the literature in Section 4.

Atlas-based methods are quite popular in the RT application due to their robustness and their requirement for only a small number of training samples. These methods register the training images to the test image and correspondingly propagate the training OAR contours to the test image. The anatomy information in the training set is described by one image or a group of images called an atlas. Reported atlas generation methods include a single training image (Han et al., 2008; Voet et al., 2011), averaging multiple images (Sims et al., 2009), and simulated images with standard anatomy (Isambert et al., 2008). More recently, multi-atlas methods have shown better accuracy with a more elaborate training step which first groups patients for atlas generation (Saito et al., 2016; Schreibmann et al., 2014; Teguh et al., 2011) and then selects the group most similar to the test image for object segmentation. One disadvantage of atlas-based methods is that they require accurate registration to align the atlas with the target patient image, which is hard to make robust to shape variations, anatomy changes, and image quality variations. More importantly, it is hard to handle the non-smooth geometric relationships that exist among objects in their geographic layout, size, and pose (Matsumoto et al., 2016) via smooth registration operations, although grouping helps to circumvent this issue to some extent.

Besides atlas-based methods, the approach of using landmarks on each object to handle local variations (Ghesu et al., 2017; Ibragimov et al., 2014; Zheng et al., 2015) received considerable attention in recent years due to the better local adaptability of such approaches. These methods can be categorized as global approaches because they start from the entire patient image rather than a local region of interest (ROI), so a registration step becomes necessary. However, the orientation and position variations between H&N and thoracic regions and curvature variations of the spine often pose extra difficulties for registration (Daisne and Blumhofer, 2013) which are addressed via the use of landmarks. As an alternative, our previous AAR works (Udupa et al., 2014; Phellan et al., 2016) build fuzzy models for each object and encode object relationships pairwise explicitly in a hierarchical arrangement of objects for facilitating recognition, which eliminates the registration step and can also handle non-smooth object relationships.

More recent approaches tend to explore local methods that start from an ROI for each object. The ROI may be determined either manually or by global methods. This kind of global-to-local strategy has lower requirements on the precision of registration and can become more robust under anatomy variations and image quality vagaries. Some studies cascade atlas-based methods for ROI initialization followed by a local boundary extraction approach, such as geodesic active contours (Fritscher et al., 2014), graph-cut (Fortunati et al., 2015), and appearance models (Wang et al., 2018). In recent years, delineation methods using convolutional neural networks (CNNs) (de Vos et al., 2017; Ibragimov and Xing, 2017a) and fully convolutional networks (FCNs) (Çiçek et al., 2016; Dou et al., 2017; Trullo et al., 2017a; Zhou et al., 2017a) have started showing improved results under the prerequisite of correct local ROI selection. Deep learning approaches seem to outperform other methods in learning local anatomy patterns, but challenges still exist in localizing OARs in the whole given image (object recognition problem), especially for sparse and small objects. It is worth investigating, therefore, how to incorporate the anatomy prior information to reduce the amount of total input information to these networks to make them more effective and specific. Recent research shows the benefit of incorporating shape prior as a constraint for neural network strategies (Oktay et al., 2018), but this is only prior information on each individual OAR. The problem of determining the manner in which to utilize global information, especially the relationship among OARs for localization before delineation, is still unsolved in these approaches.

The progress in research over the years in multi-object segmentation suggests a dual paradigm for segmentation: (1) object recognition (or localization), which uses prior information to define the whereabouts of the object, and (2) object delineation, which employs local information to precisely define the object’s spatial extent in the image. This dichotomous strategy for image segmentation was first suggested in the live wire method (Falcao et al., 1998) where recognition is done manually but delineation is automatic and occurs in real time, and the two processes are tightly coupled. Our entire AAR framework operates on this dual recognition-delineation premise and we try to advance recognition and delineation methods separately and synergistically. This is the key idea behind our AAR-RT framework.

A very preliminary report on this investigation appeared in the proceedings of the 2018 SPIE Medical Imaging Conference (Wu et al., 2018). The present paper includes the following significant enhancements over the conference paper: (i) A comprehensive literature review. (ii) Full description of the methods and the underlying algorithms. None of the object recognition and delineation algorithms were described in the conference paper. (iii) Comprehensive evaluation. The conference paper preliminarily tested and presented results for 6 H&N OARs and none from the thorax. This paper analyzes results for recognition and delineation for all 21 OARs from both H&N and thoracic regions and their dependence on image/ object quality. (iv) Evaluation on both planning and evaluation scans. The conference paper considered only a subset of the planning data sets used in this paper and no evaluation scans. (v) A detailed comparison of AAR-RT with key auto-contouring methods from the literature for the two body regions which was not undertaken in the conference paper.

2. Materials

2.1. Image and Contour Data

This retrospective study was conducted following approval from the Institutional Review Board at the Hospital of the University of Pennsylvania along with a Health Insurance Portability and Accountability Act waiver. We collected planning CT image and contour data sets from existing patient databases from the Department of Radiation Oncology, University of Pennsylvania, under four patient groups: 40–59-year-old males and females (denoted GM1 and GF1, respectively), 60–79-year-old males and females (denoted GM2 and GF2, respectively). For thorax and H&N, data sets respectively from 210 and 216 cancer patients (with different types of cancer) were gathered, with at least 50 data sets per group; pixel size: 1–1.6 mm, slice spacing: 1.5–3 mm. Similarly, we gathered replanning (evaluation) scans from 30 patients (for each body region) who underwent PBRT fractionated treatment serially. For each patient, we selected image data at 2 or more, commonly 3, serial time points, accounting for a total of 87 scans for thorax and 82 scans for H&N. The OARs considered for the two body regions (9 for thorax and 12 for H&N for planning cases and 6 for thorax and 5 for H&N for replanning cases), their abbreviations used, and their total number are listed in Table 1. The total number of 3D OAR samples considered in this study from planning and replanning scans was 4,301 (1,691 for thorax and 2,610 for H&N) from a total of 595 patient scans.

Table 1.

Thoracic and H&N OARs included in our study and some study statistics.

Thorax OARs: tSB - Thoracic skin outer boundary; Hrt - Heart; LLg - Left lung; RLg - Right lung; TB - Trachea & proximal bronchi; tSC - Thoracic spinal cord; tES - Thoracic esophagus; LBP - Left brachial plexus; RBP - Right brachial plexus.

H&N OARs: hSB - H&N skin outer boundary; SBi - hSB inferior part; SBs - hSB superior part; cSC - Cervical spinal cord; LX - Larynx; LPG - Left parotid gland; RPG - Right parotid gland; LSG - Left submandibular gland; RSG - Right submandibular gland; MD - Mandible; OHP - Orohypopharynx constrictor muscle; cES - Cervical esophagus.

Study statistics                    Thorax     H&N
#Planning scans                        118     216
#OARs                                    9      12
#OAR samples (planning)              1,175   2,199
#Good quality samples                  718     905
#Poor quality samples                  457   1,294
#Model-worthy scans                     38      36
#Replanning scans                       87      82
#OAR samples (replanning)              516     411

OAR contours for the planning cases were previously drawn by the dosimetrists (and approved by attending physicians) in the process of routine clinical RT planning of these patients. Note that not all OARs were delineated in each planning scan. The number of OARs for which dosimetrist-drawn contours were available in each scan was 5–9 for thorax and 5–12 for H&N. Since manual object contouring is impractical to perform and hence not done clinically for every replanning scan associated with treatment fractions, we do not have ground truth OAR contours for the corresponding data sets. Therefore, to generate ground truth data for replanning scans and to gain insight into how contouring is done in practice, we recruited four dosimetrists (two for each body region) from the Penn Radiation Oncology department to perform manual contouring on all 169 replanning studies from the two body regions. The following OARs were considered for the replanning studies (see Table 1). Thorax: RLg, LLg, Hrt, tES, tSC, and TB. H&N: cES, cSC, MD, OHP, and LX. The dosimetrists were asked to record the start time and end time for each contouring session for each object. We also recorded other preparatory time and time spent on ancillary efforts during the contouring process.

2.2. Standardizing OAR Definition and Ground Truth Contouring

Although some object contouring guidelines are followed by dosimetrists and oncologists (Brouwer et al., 2015a, 2015b), the flexibility allowed and the looseness of the definitions make ground truth contouring less precise and the resulting contours unsuitable for building precise computational population object models. To overcome this hurdle, following the above guidelines for anatomic object definitions, we formulated detailed and precise operational definitions and a document (Wu et al., 2017a, 2017b) for specifying each object and for delineating its boundaries on axial CT slices. To illustrate the level of detail involved in our specification, we show in Figure 1 the mandible in the H&N region as an example. Two software engineers (co-authors GVP and DM) were thoroughly trained on these definitions and then mended the dosimetrist-drawn contours of all 21 OARs on all 426 planning scans by strictly following this document, under the supervision of a radiologist with 22 years of experience (co-author DAT). The resulting contours were used as ground truth object delineations for building models and for evaluating AAR-RT. The dosimetrists followed this document as well when contouring the 11 OARs on the 169 replanning scans.

Figure 1.

Specification of Mandible. Top row: Superior boundary is the superior-most aspect of the mandible (typically the apex of the condyle) as shown in axial slice in the middle. The slices on the left and right are immediately inferior and superior to the slice in the middle, respectively. Bottom row: Inferior boundary of the mandible is the inferior-most aspect of the mandible as shown in slice in the middle. The slices on the left and right are immediately inferior and superior to the slice in the middle, respectively. The slices are displayed at bone window.

2.3. Image/ Object Quality Consideration for Model Building and Evaluation

Algorithms for image segmentation are influenced by the quality of appearance of each object in the image and overall image quality. For holistic evaluation, it is important to define object and image quality metrics and perform segmentation evaluation as a function of these quality metrics. No such efforts seem to have been undertaken to date in segmentation challenges and other quantitative medical imaging application efforts. We developed a method (Pednekar et al., 2018) to assign a quality grade to the image appearance of each object (OAR) in each image based on a set of 9 criteria: neck posture deviation, mouth position, other types of body posture deviations, image noise, beam hardening artifacts (streak artifacts), shape distortion, presence of pathology, object intensity deviation, and object contrast. Figure 2 displays patient cases illustrating some of these criteria. We converted these criterion grades into an object quality score (OQS) on a 1 to 10 scale using logical predicates (Pednekar et al., 2018). The OQSs were also used to determine an integrated image quality score (IQS), also on a 1 to 10 scale. OQS and IQS served two purposes: (i) for determining patient scans in our cohort that can be utilized for model building, which we refer to as model-worthy data sets; and (ii) for segmentation evaluation.

Figure 2.

Examples of factors that can downgrade the image quality of CT scans. (a) Streak artifacts due to dental fillings and implants. (b) Body posture deviation (neck rotation). (c) Pathology (centrally necrotic lesion predominantly in right masticator space). (d) Shape distortion (post-surgical change). (e) Body posture deviation (mouth open).

The number of scans in our cohort that were completely free of deviations on the basis of the above 9 factors was 0 for thorax and 1 for H&N. Generally, younger patients had better quality than older patients. We observed that OQS and IQS mostly clustered at the low and high end of the score scale (see Figure 3). We therefore defined an OAR sample (i.e., an OAR as a 3D object in a given patient image data set) as of good quality if it did not carry deviations in more than 3 slices (this corresponded roughly to OQS>6); otherwise the sample was considered as of poor quality. A scan (image data set) was considered model-worthy if all of its OARs were good-quality samples. Following the basic principle of the AAR framework (Udupa et al., 2014) of using near-normal data sets for building anatomy models of a body region, only model-worthy data sets were used for model building: Thorax: 20 males, 18 females; H&N: 20 males, 16 females. Table 1 (last column) lists statistics related to good and poor OAR samples and model-worthy data sets for the two body regions among our planning/ evaluation scans. Since the number of model-worthy data sets in each of the 4 patient groups was not large enough, we built only 2 models, called fuzzy anatomy models, one for males and one for females for each body region B: FAM(B, GM) by combining groups GM1 and GM2, and FAM(B, GF) by combining groups GF1 and GF2. These model-worthy data sets did not participate in testing recognition and delineation algorithms. We performed evaluation of OAR recognition and delineation separately for the four categories: male-good, male-poor, female-good, and female-poor.
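To make the selection rule concrete, the following minimal Python sketch encodes the good-sample and model-worthy criteria stated above. It is illustrative only: the actual OQS and IQS values are computed with logical predicates over the nine criteria as described in Pednekar et al. (2018), and the field and function names here are assumptions.

```python
from dataclasses import dataclass
from typing import List

MAX_DEVIANT_SLICES = 3   # "not more than 3 slices" rule from the text
GOOD_OQS_THRESHOLD = 6   # "good quality" corresponds roughly to OQS > 6

@dataclass
class OARSample:
    name: str
    oqs: float             # object quality score on the 1-10 scale
    deviant_slices: int    # number of slices affected by artifacts/pathology

def is_good_sample(s: OARSample) -> bool:
    # An OAR sample is of good quality if deviations affect at most 3 slices,
    # which corresponds roughly to OQS > 6.
    return s.deviant_slices <= MAX_DEVIANT_SLICES and s.oqs > GOOD_OQS_THRESHOLD

def is_model_worthy(scan: List[OARSample]) -> bool:
    # A scan is model-worthy only if every one of its OAR samples is good.
    return all(is_good_sample(s) for s in scan)

scan = [OARSample("MD", 8.5, 1), OARSample("LPG", 7.0, 2), OARSample("RPG", 5.5, 6)]
print(is_model_worthy(scan))  # False: the RPG sample is of poor quality
```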

Figure 3.

OQS distribution in our planning scans for Hrt and RPG. Colors denote different groups: Blue: GM1. Red: GF1. Gray: GM2. Orange: GF2

3. Methods

3.1. Overview

Our previous AAR approach (Udupa et al., 2014) consists of three stages - model building, object recognition, and object delineation. Model building involves creating a Fuzzy Anatomy Model, FAM(B, G) = (H, M, ρ, λ, η), of the body region B of interest for a group G of subjects. In this expression, H denotes a hierarchical arrangement (tree structure) of the objects (OARs); M is a set of fuzzy models with one model for each object; ρ represents the parent-to-child relationship in G in the hierarchy; λ is a set of scale ranges, one for each object; η includes a host of parameters representing object properties such as the range of variation of size, image intensity and texture properties, etc., of each object. FAM(B, G) is built from a set of good quality (model-worthy) CT images of B and the binary images representing a set of OARs in B for each of these images. After FAM(B, G) is built, it is used to recognize and delineate any OAR in any patient image of B. Recognition and delineation proceed hierarchically in H, starting from the root OAR and then proceeding down the tree to the child objects.

AAR-RT incorporates several advances made in AAR in each of the three stages. Model building: (i) In place of the handcrafted hierarchy H that was employed in the previous approach to build FAM(B, G), we use an algorithm to construct a hierarchy that yields close to the least recognition error among all possible hierarchies. (ii) Previously, the parent-child relationship ρ was expressed by just the vector connecting the geometric centers of the parent and child and its statistics over G. Now, based on experience with the previous approach, this is further refined by including the relationship among inferior-to-superior (z direction), lateral-to-lateral (x direction), and anterior-to-posterior (y direction) boundaries of the OARs using a Directed Probability Graph. Object recognition: (i) The order specified by the optimal hierarchy found in the model building stage is followed for localizing OARs in a given patient image using the previous optimal threshold approach. (ii) This recognition result is refined using the Directed Probability Graph constructed in the model building stage. Object delineation: (i) To overcome seed specification issues, in place of the previous fuzzy connectedness engine, a kNN scheme is used. (ii) The final refined fuzzy model resulting at the recognition stage is fitted optimally to the kNN delineation result to produce the final OAR delineation.

The flow diagram of the overall approach underlying AAR-RT is depicted in Figure 4. The three stages are described separately below in detail.

Figure 4.

Flow diagram illustrating the overall approach underlying AAR-RT.

3.2. Building Fuzzy Anatomy Model

Given a set of images I = {I1, …, IN} of B for group G and the associated binary images Ib = {In,l : 1 ≤ n ≤ N & 1 ≤ l ≤ L} representing the L OARs O = {O1, …, OL} in B, building FAM(B, G) = (H, M, ρ, λ, η) involves determining each of the 5 parameters in this quintuple. Hierarchy H and object relationships ρ in H are found as described below. Other parameters are found as described in the original AAR framework (Udupa et al., 2014). Briefly, M = {FM(Ol): 1 ≤ l ≤ L} is a set of fuzzy models, one fuzzy model for each OAR. The fuzzy model FM(Ol) of an OAR Ol is created by scaling all binary samples of Ol to a mean size, repositioning all samples to a mean location, and averaging the result (see Udupa et al., 2014, for details). Parameter λ is a set of scale ranges in which each element of the set indicates the size variation of each OAR. This parameter is utilized in confining the recognition search in the pose space to previously known ranges in the population G. Parameter η stores population statistics over G pertaining to OARs, such as their intensity and texture properties, which are used in recognition and delineation.
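As a rough illustration of how a fuzzy model FM(Ol) can be formed by size normalization, recentering, and voxel-wise averaging of the binary training samples, a simplified sketch is given below. It assumes isotropic sampling and a fixed model grid; the grid shape, the cube-root size proxy, and the interpolation choices are assumptions for illustration, not the AAR implementation.

```python
import numpy as np
from scipy import ndimage

def fuzzy_model(binary_samples, grid_shape=(64, 64, 64)):
    """Average of size-normalized, center-aligned binary samples of one OAR."""
    sizes = [np.cbrt(m.sum()) for m in binary_samples]      # linear size proxy
    mean_size = float(np.mean(sizes))
    center = (np.array(grid_shape) - 1) / 2.0
    model = np.zeros(grid_shape, dtype=float)
    for mask, size in zip(binary_samples, sizes):
        # scale the sample to the mean size
        scaled = ndimage.zoom(mask.astype(float), mean_size / size, order=1)
        # paste into a common grid and shift its center of mass to the grid center
        canvas = np.zeros(grid_shape, dtype=float)
        sl = tuple(slice(0, min(a, b)) for a, b in zip(grid_shape, scaled.shape))
        canvas[sl] = scaled[sl]
        com = np.array(ndimage.center_of_mass(canvas))
        model += ndimage.shift(canvas, center - com, order=1)
    return model / len(binary_samples)   # fuzzy membership values in [0, 1]
```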

(i). Finding optimal hierarchy of OARs

There are several reasons for a hierarchical arrangement of objects. Objects have steady geometric relationships (Matsumoto et al., 2016) and these relationships generally do not depend on image/object quality. This implies that if the relationships can be learned, then OAR recognition can be made quite robust with respect to image/object quality. Furthermore, the relationships are non-smooth and non-linear (Matsumoto et al., 2016), implying that some relationships are much less variable than others. Therefore, we contend that for any object O1, there is a best object O2 to be paired as its child. Since our goal is achieving accurate recognition of objects, the optimality criterion here should be the accuracy of recognition of the child given the parent. This naturally leads to the following formulation for the optimal hierarchy: Given image sets I and Ib and the set O of OARs for B, find the hierarchy H over which the total recognition error is minimized. To solve this problem, we may form a complete graph G = (O, E), E = {(Oi, Oj) : Oi, Oj ∈ O & Oi ≠ Oj}, in which the nodes are the OARs and every pair of OARs is connected by two directed arcs; then determine all possible L^(L−2) trees that span G, and find among them the tree that yields the least recognition error. Given that each recognition experiment requires about 30 seconds, when L = 12 (H&N body region, for example) and assuming the number of images in I to be N = 50, finding a globally optimal tree following a brute-force approach would take about 17.5 million days! We therefore take a greedy approach to find an optimal H.

We convert the above graph into a weighted graph G = (O, E, w), where w(Oi, Oj) is the weight assigned to directed arc (Oi, Oj). Our idea is to make w(Oi, Oj) small when a mini hierarchy in which Oj is a child of Oi yields a small error. Subsequently, we can find an optimum spanning tree OST(G, Or) in G that is rooted at Or using a minimum spanning tree algorithm (Cormen et al., 2009). In our approach, we fix Or to be the skin object (tSB for thorax and hSB for H&N). We take a greedy approach that is computationally feasible, although it cannot guarantee that OST(G, Or) is the hierarchy that yields the globally best possible recognition results for the objects in O, i.e., the minimum total error in recognition of all objects over the images in I. To implement the approach, we form all possible mini hierarchies of the form shown in Figure 5, where Or is the root object and Oi and Oj are other (non-root) objects. Then, for all arcs of the form (Or, Oj), we set w(Or, Oj) to the mean, over all images in I, of the recognition error of Oj resulting from using the mini hierarchy of Figure 5(a). For all arcs (Oi, Oj) of the form shown in Figure 5(b), the arc weight w(Oi, Oj) assigned is the mean, over all images of I, of the recognition error of Oj resulting from using the mini hierarchy of Figure 5(b). The idea here is that, in this basic hierarchical form, which is different from that in Figure 5(a), the recognition accuracy of both Oi and Oj should influence the cost assigned to Oj being the child of Oi.

Figure 5.

Mini hierarchies considered in the greedy algorithm for estimating arc weight based on recognition error. In (a), all mini hierarchies that include the root object Or and any other object Oj are considered. In (b), all mini hierarchies that include arcs (Oi, Oj) where Oi and Oj are different from Or are considered.

In the AAR approach, recognition error for an object O is expressed via its scale error SE(O), location error LE(O), and false positive and false negative volumes, all with respect to the known ground truth object. In our implementation, we set w(Oi, Oj) = LE(Oj) in the situations shown in Figures 5(a) and (b). In finding OST(G, Or), we set a limit of 4 on the depth of the tree to generate more balanced trees and to avoid long paths in the resulting hierarchies. We developed an algorithm that finds a hierarchy seeking to minimize the sum of the arc weights while keeping the depth limited. This hierarchy has an arc weight cost close to that of the tree found by the minimum spanning tree algorithm, but a smaller recognition error when the whole tree is used for recognition.
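The sketch below illustrates the general idea of growing a depth-limited hierarchy from precomputed arc weights (mean child location errors obtained from the mini-hierarchy experiments). It is a simple greedy variant written for illustration, with made-up weights, and is not the authors' exact algorithm.

```python
import math

def greedy_hierarchy(objects, root, w, max_depth=4):
    """objects: OAR names; w: dict[(parent, child)] -> mean LE (mm); depth capped at 4."""
    parent, depth = {root: None}, {root: 0}
    while len(parent) < len(objects):
        best = None
        for p in parent:                          # objects already in the tree
            if depth[p] >= max_depth:
                continue                          # adding a child would exceed the depth limit
            for c in objects:
                if c in parent:
                    continue
                cost = w.get((p, c), math.inf)
                if best is None or cost < best[0]:
                    best = (cost, p, c)
        _, p, c = best
        parent[c], depth[c] = p, depth[p] + 1     # attach the cheapest remaining child
    return parent                                 # child -> parent map defining hierarchy H

# toy usage with made-up arc weights (mm)
objs = ["hSB", "MD", "LPG", "RPG"]
w = {("hSB", "MD"): 4.0, ("MD", "LPG"): 2.5, ("MD", "RPG"): 2.7,
     ("hSB", "LPG"): 6.0, ("hSB", "RPG"): 6.2}
print(greedy_hierarchy(objs, "hSB", w))  # {'hSB': None, 'MD': 'hSB', 'LPG': 'MD', 'RPG': 'MD'}
```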

(ii). Refining object relationships

Object hierarchical relationships learned in the previous step allow overall placement of object models in a test image. Based on this placement, model boundary extents in the three anatomic planes are refined by exploiting the relationship that may exist among these boundary planes. Learning and refining this relationship are done independently in the three directions (left-to-right or ±x direction, antero-posterior or ±y direction, and cranio-caudal or ±z direction). We will use the notation xl(O) and xh(O) to denote the boundary extent of an object O in the −x and +x directions, respectively. Similarly, yl(O), yh(O), zl(O), and zh(O) are defined. An example to illustrate the idea is shown in Figure 6 involving 3 OARs: MD, LPG, and RPG. Since the parotid glands are situated close to the mandibular condyle laterally (Figure 6(a)), we expect xl(RPG) and xh(RPG) to have a steady relationship with respect to xl(MD) (Figure 6(b)) due to anatomic constraints. A similar remark applies to xl(LPG) and xh(LPG) with respect to xh(MD). This implies that, if we localize (recognize) MD, and if we learn the above relationships in the ±x direction between MD and the parotid glands, we may be able to refine (in the Bayesian sense) the extents of the localized models of RPG and LPG in the ±x direction. In the model building stage, we learn such relationships and incorporate them into FAM(B, G) (in the ρ component), and in the recognition stage, this information is exploited to predict locations xl(O) and xh(O) for RPG and LPG. Only those relationships that are “steady” are utilized for these learning (modeling) and prediction processes. We will describe in this section the modeling part. The prediction process will be explained in Section 3.3 on object recognition. We assume that all locations (x, y, and z) are specified with respect to the scanner coordinate system and after binary object samples are scaled and aligned in the process of creating fuzzy models FM(Ol). Since the three coordinate directions are handled in exactly the same manner, we present the details for the z direction only.

Figure 6.

Illustration of boundary relationship among Mandible (MD) and Right and Left Parotid Glands (RPG and LPG) in (a) sagittal view, (b) axial view. Boundary plane locations of RPG in the three coordinate directions are shown as xl(RPG), xh(RPG), yl(RPG), yh(RPG), zl(RPG), and zh(RPG). Reproduced with permission from https://zygotebody.com.

We employ the mechanism of a Directed Probability Graph, which is a directed acyclic graph, to model the above location relationships. In our case, the graph is expressed as DGz = (Vz, εz), whose set of nodes is Vz = {zl(Ok), zh(Ok): 1 ≤ k ≤ L} and whose set εz of directed arcs is a special subset of the set of all possible directed arcs Az = {(vi, vj): vi, vj ∈ Vz & vi ≠ vj & (vi, vj) ≠ (zl(Ok), zh(Ok)) for any object Ok}. The subset εz is chosen from Az as described below. Note that Vz has 2L nodes and each node represents a random variable (z location). Some elements of Vz are special which we refer to as anchor nodes. They represent z-locations (superior and inferior boundaries) of OARs which coincide with the z-location (superior and inferior boundaries) of the body region B. For example, for B = thorax, the superior boundary of B is defined to be 15 mm above the apex of the lungs and the inferior boundary is 5 mm below the base of the lungs (Udupa et al., 2014; Wu et al., 2017a and 2017b), and so, zl(tSC) = zl(tSB) = zl(tES) = zl(TB) = zl(B), and zh(tSC) = zh(tSB) = zh(B). That is, for these 4 OARs, one (in the superior direction for tES and TB) or both (both superiorly and inferiorly for tSC and tSB) of their z-location boundaries coincides with the corresponding boundaries of B. Since these anchor boundary locations are known precisely in the AAR approach due to the definition of B, we can exploit this prior knowledge to refine the automatically-identified boundary locations of all OARs. The directed arcs in εz represent conditional dependencies between nodes. Nodes that are not connected represent variables that are conditionally independent of each other. Each node has a probability function associated with it which takes as input a particular set of values of the node’s parent variables and gives as output the probability of the variable represented by the node.

If (vi, vj) is a directed arc selected from Az to be included in εz, our desire is to assign a conditional probability to (vi, vj) such that we can reliably estimate the contribution from parent vi to the probability of the random variable associated with vj. Once we specify how the arcs are selected and how the conditional probability Pz(vj | vi) associated with these arcs (vi, vj) is determined, the Directed Probability Graph is fully specified.

We determine εz in two stages. In the first stage, we determine a subset Uz of Az of edges that show a “steady” relationship. Consider any edge e = (vi, vj) ∈ Az. Let d(e) be the distance between the locations denoted by vi and vj, and let σd(e) be the standard deviation of this distance over all samples of Oi and Oj in our training set Ib. First, we find the subset Uz of Az by

Uz = {e = (vi, vj) ∈ Az : σd(e) ≤ τ},     (1)

where τ is a fixed threshold. The idea behind Uz is to include only those pairs of nodes which have a “steady” relationship. Note that since this distance is symmetric, if (u, w) is in Uz, so will be (w, u). σd(e) values obtained for all edges in Az are shown as a color matrix in Figure 7(a) for the H&N body region for the training data cohort used for model building. Figure 7(b) shows the result of thresholding the matrix in (a) at τ= 10 mm.

Figure 7.

(a) Matrix of the standard deviation σd(e) values of the distances for all edges over our training data set. Edges e here denote arcs connecting boundary locations in the z-direction. The color scale is shown on the right. (b) A binary matrix obtained from (a) where cells with values σd(e) ≤ τ (= 10 mm) are shown in black. These cells suggest that the associated objects have a “steady” relationship between their z-boundary locations. For each object O, Ol and Oh denote, respectively, zl(O) and zh(O).

In the second stage, we decide which directed edges in Uz are to be retained for inclusion in εz. Consider an edge e = (vi, vj) ∈ Uz. We include e in εz iff one of the following conditions holds.

  1. vi is an anchor node but vj is not.

  2. Neither vi nor vj is an anchor node, and the object O′ associated with vj appears after the object O associated with vi in the breadth-first order in the optimal hierarchy.

The rationale for condition (1) is obvious - we would like the known location represented by vi to be utilized to predict location vj. Note that if (u, w) is in Uz and only u is an anchor node but not w, then although (w, u) is in Uz, it will not be included in εz. If (u, w) is in Uz and both u and w are anchor nodes, then neither edge will be included in εz, since this would not be useful: both (anchor) locations are known and there is no need to predict either node. The reason for condition (2) is that, in the hierarchical order of recognition, object O will already have been recognized before dealing with object O′.

Finally, we assume that the conditional probabilities Pz(vj | vi) associated with arcs e = (vi, vj) follow a Gaussian distribution pz(e) with mean μd(e) (which is the mean of the d(e) values over the training samples) and standard deviation σd(e). This completes the specification of the network DGz. Similarly, networks DGx and DGy are constructed, except that in these cases, there are no anchor nodes since boundaries of body region B are defined only in the z-direction. In our fuzzy anatomy model FAM(B, G), the ρ component is thought of as consisting of two parts, ρ = (ρR, ρDG), where ρR denotes parent-to-child relationship in hierarchy H (as in Udupa et al., 2014), and ρDG represents the triplet of learned Directed Probability Graphs (DGx, DGy, DGz).
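A condensed sketch of how DGz can be assembled from training boundary locations is given below. It follows the steps just described (the steadiness test of Eq. (1), then the two retention conditions), but the node naming convention, the use of signed offsets in place of distances, and the data layout are illustrative assumptions, not the released implementation.

```python
import itertools
import statistics

def build_dpg_z(samples, anchors, hier_order, tau=10.0):
    """
    samples: list of dicts (one per training scan) mapping a node name such as
             'zl(MD)' or 'zh(MD)' to its z location in mm.
    anchors: set of node names whose location coincides with the body-region boundary.
    hier_order: dict mapping every node to the breadth-first rank of its object
                in the optimal hierarchy.
    Returns: dict arc (vi, vj) -> (mean offset, std) for the retained arcs.
    """
    nodes = list(samples[0].keys())
    arcs = {}
    for vi, vj in itertools.permutations(nodes, 2):
        if vi.split('(')[1] == vj.split('(')[1]:
            continue                              # skip zl/zh pairs of the same object
        d = [s[vj] - s[vi] for s in samples]      # signed z offset (the paper's d(e))
        mu, sd = statistics.mean(d), statistics.pstdev(d)
        if sd > tau:
            continue                              # not a "steady" relationship (Eq. 1)
        if vi in anchors and vj not in anchors:
            arcs[(vi, vj)] = (mu, sd)             # retention condition (1)
        elif vi not in anchors and vj not in anchors and hier_order[vi] < hier_order[vj]:
            arcs[(vi, vj)] = (mu, sd)             # retention condition (2)
    return arcs
```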

3.3. Object Recognition

The recognition process proceeds in two steps. Initially, the optimal hierarchy H found during model building is used for locating all objects in the hierarchical order. Then, after all objects are recognized in this manner, object localization is refined using the previously-built Directed Probability Graph by again going through the hierarchical order in H.

(i). Recognition via optimal hierarchy

The fuzzy anatomy model FAM(B, G) built for B and G is utilized for recognizing objects in an image I of B of any patient belonging to group G. Recall that the purpose of recognition is to determine the whereabouts of the objects in I and not their precisely delineated boundaries. The AAR-RT recognition process takes, as input, image I, FAM(B, G), and the names of OARs that need to be contoured among the OARs in O, and outputs the recognized (localized) fuzzy model FMt(O) of each O that is optimally transformed to image I starting from the version of the fuzzy model FM(O) in FAM(B, G). This process takes place in several steps. AAR-RT first recognizes the skin object (tSB in thorax and hSB in H&N) following the original AAR approach (Udupa et al., 2014). This initializes the hierarchical recognition process. Subsequently, following the optimal hierarchy H, for any object O, since the parent is already recognized and hence parent-to-child relationship ρR stored in the model is known, it first scales and places FM(O) in I based on just ρR. This is called one-shot recognition in the original AAR approach. This placement (pose) is further refined by using the optimal thresholded-search strategy of the previous AAR approach. Briefly, an object-specific optimal threshold, previously learned in the model building stage and stored in the 5th parameter η of FAM(B, G), is applied to I, and the pose parameters of the fuzzy model FM(O) are adjusted for best fit between the thresholded image and the fuzzy model. For our discussion in the next step, we will refer to this resulting pose-adjusted model of O by FMt(O). At the end of this first step, we have FMt(O) for all O ∈ O in I.
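For illustration, the sketch below mimics the two recognition sub-steps just described: one-shot placement of the child model using the learned parent-to-child offset, followed by a small search that maximizes agreement between the fuzzy model and the thresholded image. The purely translational search, the overlap score, and the helper names are simplifying assumptions; the actual AAR search also adjusts scale and uses the learned relationship statistics.

```python
import numpy as np
from scipy import ndimage

def place_model(model, center, shape):
    """Paste 'model' into a zero canvas of 'shape' with its center of mass at 'center'."""
    canvas = np.zeros(shape, dtype=float)
    sl = tuple(slice(0, min(a, b)) for a, b in zip(shape, model.shape))
    canvas[sl] = model[sl]
    com = np.array(ndimage.center_of_mass(canvas))
    return ndimage.shift(canvas, np.asarray(center, float) - com, order=1)

def recognize(image, fuzzy_model, parent_center, rel_offset, opt_threshold,
              search_vox=5, step=1):
    """One-shot placement via the learned offset, then a thresholded-search refinement."""
    init = np.asarray(parent_center, float) + np.asarray(rel_offset, float)
    thresholded = (image > opt_threshold).astype(float)
    best_center, best_score = init, -np.inf
    offsets = np.arange(-search_vox, search_vox + 1, step)
    for dz in offsets:
        for dy in offsets:
            for dx in offsets:
                c = init + np.array([dz, dy, dx], float)
                placed = place_model(fuzzy_model, c, image.shape)
                score = float((placed * thresholded).sum())   # fuzzy overlap with thresholded image
                if score > best_score:
                    best_center, best_score = c, score
    return best_center   # pose of the recognized model FMt(O)
```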

(ii). Recognition refinement via Directed Probability Graphs

In this step, we refine the model FMt(O) obtained in the previous step to its final recognized form FMT(O) in image I by using the previously trained Directed Probability Graphs. In this process, we refine the locations xl(O), xh(O), yl(O), yh(O), zl(O), and zh(O) of object O as represented in FMt(O) by using the information stored in the ρDG component of ρ. Obviously, these refinements are made for all objects except the root object. As in Section 3.2(ii), we take the z-direction to describe the refinement process. The x- and y-directions follow the same procedure.

For any node vi of Vz, let its parent nodes in DGz = (Vz, εz) be denoted by {vi1, …, viK}. Examine the situation shown in Figure 8. By the Markov property of DGz, we can write

Pz(vi | v1, …, vi−1, vi+1, …, v2L) = Pz(vi | vi1, …, viK).     (2)
Figure 8.

A portion of DGz is shown to illustrate the estimation of a refined z-location vi of the fuzzy model of object O.

Assuming that the parents are conditionally independent, we can write

Pz(vi) = Pz(vi | vi1) Pz(vi1) + … + Pz(vi | viK) Pz(viK).     (3)

Since our recognition process proceeds hierarchically, when we are dealing with object O whose z-location node vi is being refined, the z-locations of its parents {vi1, …, viK} have all been refined already and hence are known. Let these refined actual locations for O in I be {ui1, …, uiK}. Based on these known parent locations and the priors pz(e) (Gaussian distributions associated with each edge e = (vij, vi) with parameters μd(e) and σd(e) as described previously in Section 3.2(ii)), we approximate the probability in Equation (3) of node (location) vi as follows.

Pz(vi) ≈ max[g(vi | vi1, …, viK)],  where  g(vi | vi1, …, viK) = ∑j=1,…,K pz(ej)(vi − uij), ej = (vij, vi).     (4)

That is, the individual priors associated with each edge are shifted by the known location of the parent and then added. The result is a mixture of Gaussians whose maximum is taken to be the predicted probability of vi. We denote the predicted location where this maximum occurs by wip.

From the known location of model FMt(O) before performing refinement, we also have a location for node vi coming from the recognition process. Let this location be denoted by wir. We make use of both locations wir and wip and their associated probabilities Pz(wir) and Pz(wip), fusing them to estimate the final refined location wi.

wi = wip + (wir − wip) · Pz(wir) / [Pz(wir) + Pz(wip)].     (5)

The refinement process proceeds in this manner in the hierarchical order. The fuzzy model FMt(O) found before refinement is finally rescaled to fit the refined boundary locations, which yields the refined model FMT(O) for all O ∈ O in I.
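A compact numerical sketch of the refinement of a single z boundary location (Eqs. (4) and (5)) is shown below; the candidate grid, the signed-offset convention, and the variable names are assumptions made for illustration.

```python
import numpy as np

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def refine_location(w_r, parent_locs, arc_priors, z_grid):
    """
    w_r: location of node vi coming from hierarchical recognition (w_i^r), in mm.
    parent_locs: already-refined parent locations u_{i_j}, in mm.
    arc_priors: (mu_d, sigma_d) Gaussian parameters of each parent arc.
    z_grid: 1D array of candidate z locations (mm).
    """
    # Eq. (4): mixture of parent priors, each shifted by its parent's known location
    mix = np.zeros_like(z_grid, dtype=float)
    for u, (mu, sd) in zip(parent_locs, arc_priors):
        mix += gaussian(z_grid, u + mu, sd)
    w_p = z_grid[np.argmax(mix)]               # predicted location w_i^p
    p_wp = float(mix.max())                    # Pz(w_i^p)
    p_wr = float(np.interp(w_r, z_grid, mix))  # Pz(w_i^r)
    # Eq. (5): fuse the recognition-derived and predicted locations
    return w_p + (w_r - w_p) * p_wr / (p_wr + p_wp)

z = np.arange(-100.0, 100.0, 0.5)
print(refine_location(w_r=12.0, parent_locs=[0.0, 5.0],
                      arc_priors=[(10.0, 4.0), (6.0, 5.0)], z_grid=z))
```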

3.4. Object Delineation

The delineation process proceeds in two steps. First, the localized fuzzy model FMT(O) output by the recognition step for each O ∈ O in I is utilized to identify, via a kNN classifier, all voxels that are within the object region in I. In the second step, the fuzzy model FMT(O) is optimally fitted to this cluster of identified voxels to produce the final delineation.

(i). Delineation via fuzzy connectedness/ kNN voxel classification

For skin objects, we use the Iterative Relative Fuzzy Connectedness (IRFC) algorithm (Ciesielski et al., 2007) as elaborated in (Udupa et al., 2014). IRFC is an image-based delineation engine that requires the specification of a set of seed voxels and a local affinity function. It is suitable for large objects with sufficient intensity contrast wherein automatic seed selection and object-specific affinity specification work well. For small and sparse objects, automatically selecting seeds often fails. Therefore, for all objects other than skin objects, we employ a trained k-nearest-neighbor (kNN) voxel-wise classifier to find the object voxels within the fuzzy mask specified by FMT(O). For this purpose, we use a 3-dimensional feature vector [fFM, fI, fT]t associated with each voxel, where fFM denotes the fuzzy membership value of O as expressed in FMT(O), fI denotes voxel intensity in I, and fT represents a texture property value assessed at the voxel. All texture properties are from among those derived from the gray-level co-occurrence matrix (Sonka et al., 2007). The texture property that is optimal for each OAR is found at the model building stage. For example, the glands (LSG, RSG, LPG, and RPG) have a similar textural characteristic among themselves but different from other objects. kNN training and estimation of all required parameters, including the determination of optimal texture properties, are performed automatically from the training data sets at the model building stage.
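The kNN step can be sketched as follows (assuming scikit-learn; the array layout, the restriction to the fuzzy-mask support, and the 0/1 labels are illustrative assumptions, with k = 50 as stated in Section 4.2):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_delineate(fuzzy_mask, intensity, texture, train_X, train_y, k=50):
    """
    fuzzy_mask, intensity, texture: 3D arrays on the test image grid (FMT(O), I, texture of I).
    train_X: (n, 3) training feature vectors [fFM, fI, fT]; train_y: 0/1 voxel labels.
    Returns a binary array marking the kNN-classified object voxels (cluster C(O)).
    """
    clf = KNeighborsClassifier(n_neighbors=k).fit(train_X, train_y)
    inside = fuzzy_mask > 0                      # classify only within the fuzzy model support
    X = np.stack([fuzzy_mask[inside], intensity[inside], texture[inside]], axis=1)
    labels = np.zeros(fuzzy_mask.shape, dtype=np.uint8)
    labels[inside] = clf.predict(X)
    return labels
```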

(ii). Optimal fuzzy model fitting

The result of kNN classification for object O is a cluster of voxels C(O) in I, which is typically a scatter of voxels without proper boundaries and potentially with holes (false negatives) and extraneous voxels (false positives), although all within the generous fuzzy mask defined by FMT(O). To minimize these issues, we transform the fuzzy model FMT(O) optimally to C(O) by minimizing the sum of squared differences between C(O) as a binary mask and FMT(O) as a fuzzy mask. The transformation involves x, y, z translations, uniform scaling, and a threshold. We denote these 5 parameters by a vector p, the binary mask resulting from FMT(O) after transformation by a given p by BM(O, p), and the sum of squared differences between C(O) and BM(O, p) by ||C(O) − BM(O, p)||. The final delineation of object O is found as BM(O, p*), where p* is the optimal transformation parameter

p* = argminp ||C(O) − BM(O, p)||.     (6)
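The fitting step of Eq. (6) can be pictured as a search over the 5 parameters in p; the coarse grid search below is only an illustration of that search (the parameter ranges, interpolation, and optimizer are assumptions, not the authors' implementation).

```python
import itertools
import numpy as np
from scipy import ndimage

def fit_model_to_cluster(cluster, fuzzy_model,
                         shifts=(-2, 0, 2), scales=(0.9, 1.0, 1.1),
                         thresholds=(0.3, 0.5, 0.7)):
    """cluster: binary kNN result C(O); fuzzy_model: recognized fuzzy model FMT(O)."""
    best_p, best_err = None, np.inf
    for dz, dy, dx, s, t in itertools.product(shifts, shifts, shifts, scales, thresholds):
        scaled = ndimage.zoom(fuzzy_model, s, order=1)              # uniform scaling
        canvas = np.zeros(cluster.shape, dtype=float)
        sl = tuple(slice(0, min(a, b)) for a, b in zip(cluster.shape, scaled.shape))
        canvas[sl] = scaled[sl]
        bm = (ndimage.shift(canvas, (dz, dy, dx), order=1) >= t)    # BM(O, p)
        err = float(((cluster.astype(float) - bm.astype(float)) ** 2).sum())  # ||C(O) - BM(O, p)||
        if err < best_err:
            best_p, best_err = (dz, dy, dx, s, t), err
    return best_p   # p*; the final delineation is BM(O, p*)
```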

4. Experiments, Results, Discussion

4.1. Data-Related

As mentioned previously, we created 4 anatomy models, one each for: Thorax male (20 model-worthy data sets), Thorax female (18 model-worthy data sets), H&N male (20 model-worthy data sets), and H&N female (16 model-worthy data sets). These models were used in a gender-specific manner to test recognition and delineation performance on all test data sets. The model-worthy data sets (Table 1) did not participate in any experiments involving the testing of recognition and delineation algorithms. We performed evaluation of OAR recognition and delineation separately under four categories based on object quality: male-good, male-poor, female-good, and female-poor. We conducted two groups of experiments - the first on planning data sets and the second on replanning scans.

Table 2 lists statistics related to the age distribution of our study patients. For both body regions and for planning data sets, there is no statistically significant difference (P > 0.05) in age distribution between the male and female cohorts for the younger age group (GM1 and GF1), although the difference is statistically significant (P < 0.05) for the older age group (GM2 and GF2), the female group being older than the male group. The replanning CT data sets were selected randomly and not by age group. There is no statistically significant difference (P > 0.05) in age distribution between the male and female groups for these data sets for both body regions.

Table 2.

Age statistics of patients in planning and replanning studies.

                         Thorax                       H&N
Group              # Patients  Mean   SD       # Patients  Mean   SD
Planning data
  Male 40–59           50       50     5           54       52.7   5.1
  Female 40–59         52       51     5           54       52.4   5.0
  Male 60–79           54       72     3           54       64.6   2.6
  Female 60–79         54       71     3           54       67.6   4.2
Replanning data
  Male                 18       66     9           22       56.9  19.2
  Female               12       70     5            8       57.9  16.8

4.2. Models

Figure 9 displays each OAR (except skin objects) selected from several model-worthy studies for male and female subjects for the H&N and thorax body regions, as well as the models generated for the two body regions. The optimal hierarchies found from model-worthy male data sets for the OARs in the two body regions are also included in the figure. Note that here the skin boundary was specified explicitly as the root object since it is easy to locate and delineate in CT images compared to other objects. These hierarchies were used for building all models.

Figure 9.

Object samples from model-worthy data sets shown as surface renditions, models built from model-worthy data sets shown as volume renditions, and optimal hierarchies for H&N and thorax. (a) Male H&N, 9 OARs each from 5 studies. (b) Male thorax, 8 OARs each from 5 studies. (c) Female H&N, 9 OARs each from 5 studies. (d) Female thorax, 8 OARs each from 3 studies. Object samples represent binary objects after they are aligned in the scanner coordinate system during the model building process. Models for (e) thorax (right anterior oblique view), and (f) H&N (left posterior oblique view). Optimal hierarchies found for (g) H&N, and (h) thorax. See Table 1 for object names.

All parameters involved in AAR-RT are estimated automatically from model-worthy data sets during the model building stage. There are only two additional parameters: τ (Equation 1, Figure 7) and k in the kNN method. The values of these parameters were experimentally determined and fixed once and for all at τ = 10 mm and k = 50 for both thorax and H&N.

4.3. Object Recognition and Delineation in Planning Scans

We will present accuracy results from the following four experiments.

  1. E1: Auto-contouring on high-OQS objects from the male group GM. The objects involved in this evaluation generally have streak artifacts and pathologies in not more than 3 slices and may have come from any data sets in GM with any IQS value. Although the objects in this group had minimal artifacts, they may still be affected by pathology. OQS values for these objects were at the upper end of the score scale.

  2. E2: Similar to E1 but on the female group GF.

  3. E3: Auto-contouring on low-OQS objects from the male group GM. The data sets involved in this experiment were the complement of the subset of GM used in E1.

  4. E4: Similar to E3 but on the female group GF.
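
The assignment of object samples to E1–E4 described above amounts to a simple rule on gender and on the number of slices affected by artifacts or pathology; the small sketch below (function name ours) restates that rule explicitly.

```python
# Illustrative grouping of object samples into experiments E1-E4 (our restatement).
def assign_experiment(gender, affected_slices):
    """High-OQS samples (artifacts/pathology confined to at most 3 slices) go to
    E1 (male) or E2 (female); all remaining samples go to E3 or E4."""
    high_oqs = affected_slices <= 3
    if gender == "M":
        return "E1" if high_oqs else "E3"
    return "E2" if high_oqs else "E4"

print(assign_experiment("M", 2))  # -> E1
print(assign_experiment("F", 7))  # -> E4
```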

We express recognition accuracy in terms of location error and scale error. Location error (LE) is the distance (in millimeters) between the geometric center of the object model at final recognition and the known true geometric center of the object; ideally 0 mm. Scale error (SE) is the ratio of the estimated object size to its true size, with an ideal value of 1. We describe delineation accuracy/error via the Dice Coefficient (DC) and the Hausdorff boundary distance (HD), whose ideal values are 1 and 0 mm, respectively. Figure 10 displays recognition and delineation accuracies graphically for experiments E1 through E4 on the different OARs in the two body regions. Sample recognition and delineation results for the two body regions (for both good- and poor-quality cases) are displayed in Figures 11 and 12, respectively, with a slice of the recognized model and the delineated contour overlaid on the original slice.
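
For concreteness, the sketch below shows one plausible way to compute these four measures from binary masks and a known voxel spacing. It is an illustration under our own assumptions (for example, SE is taken as the cube root of the volume ratio, and HD is computed over all object voxels rather than extracted boundary points), not the evaluation code used in the study.

```python
# Illustrative computation of LE, SE, DC, and HD (assumed formulations).
import numpy as np
from scipy.ndimage import center_of_mass
from scipy.spatial.distance import directed_hausdorff

def location_error(model_mask, truth_mask, spacing):
    """LE (mm): distance between geometric centers; ideally 0."""
    c1 = np.array(center_of_mass(model_mask)) * spacing
    c2 = np.array(center_of_mass(truth_mask)) * spacing
    return float(np.linalg.norm(c1 - c2))

def scale_error(model_mask, truth_mask):
    """SE: estimated object size / true size; ideally 1 (linear scale assumed here)."""
    return float((model_mask.sum() / truth_mask.sum()) ** (1.0 / 3.0))

def dice(a, b):
    """DC = 2|A ∩ B| / (|A| + |B|); ideally 1."""
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

def hausdorff(a, b, spacing):
    """HD (mm): symmetric Hausdorff distance between the two voxel point sets; ideally 0."""
    pa = np.argwhere(a) * spacing
    pb = np.argwhere(b) * spacing
    return max(directed_hausdorff(pa, pb)[0], directed_hausdorff(pb, pa)[0])
```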

Figure 10.

Recognition errors LE and SE (Rows 1, 2) and delineation accuracies/errors DC and HD (Rows 3, 4) for the OARs in the two body regions in experiments E1-E4. See Table 1 for OAR abbreviations. The bars represent mean value over the tested object samples and the whiskers denote standard deviation.

Figure 11.

Sample results for different OARs on representative H&N CT images. Rows 1 & 2: Recognition and delineation from good-quality data sets. Rows 3 & 4: Recognition and delineation from poor-quality data sets.

Figure 12.

Sample results for different OARs on representative thoracic CT images. Rows 1 & 2: Recognition and delineation from good-quality data sets. Rows 3 & 4: Recognition and delineation from poor-quality data sets.

We make the following observations from the quantitative results depicted in Figure 10:

  1. E1 and E2 (good object quality): There were 461 object samples (360 H&N, 101 thorax) involved in E1 in group GM, out of a total of 1479 samples (32%). The corresponding numbers for group GF in E2 were 658 object samples (545 H&N, 113 thorax) out of a total of 1391 (47%). The voxel size in our data sets varied from 0.93×0.93×1.5 mm³ to 1.6×1.6×3 mm³, most with a slice spacing of 2 mm or more. Overall, the location error of recognition (object localization) in these experiments, as seen from the last bar in Figure 10 labeled “All”, is 4.28 mm for the male group and 4.01 mm for the female group, which is about 2 voxels, and the overall delineation accuracy is close to 0.7 for DC and within about a voxel of the true boundary for HD. The difference in recognition and delineation accuracy between the male and female groups is not statistically significant (P > 0.05). Note the similar trend in accuracy across the different objects for the male and female groups. Some OARs, like OHP, cES, tES, LBP, and RBP, are more challenging than others, but given images with minimal artifacts and generally nominal pathology, they can all be located and delineated quite accurately by AAR-RT.

    In understanding these results, two points should be noted. (1) DC is known to be very sensitive to errors in small and sparse objects, where errors of the order of a voxel at different parts of the boundary can lower DC drastically (see the numerical sketch following this list); in this sense, HD may be a more robust measure. (2) There is considerable variation in the ground-truth delineations themselves. As will be shown under the results of the second experiment (Section 4.4), the DC expressing the variability between two dosimetrists is substantial. Considering these two points, our results from E1 and E2 are excellent: HD is about 1 voxel and DC is comparable to the DC between dosimetrists. If we exclude the five most challenging objects tES, cES, OHP, LBP, and RBP (see Footnote 4), the overall DC for the remaining 16 OARs over the test object samples becomes 0.80, compared to a DC of 0.81 (see below) between the dosimetrists.

  2. E3 and E4 (poor quality): When the objects have significant artifacts, the results are much worse, hovering around 5 voxels for recognition and 5 mm for HD. Streak artifacts pose serious challenges to object recognition and delineation, especially in H&N images, due to dental implants and tooth fillings. Among our planning data in H&N, only 2 out of 216 scans were completely free of any streak artifacts (although they contained other deviations), 188 (87%) had streak artifacts arising from beam hardening from metal (Figure 2), and 26 (12%) had beam-hardening effects from bone. The other factors encoded in OQS, such as the existence of pathology and body-posture deviation, also lead to much worse results compared with E1 and E2. Again, there is no statistically significant difference between the results for male and female subjects in these experiments.
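
As referenced in point (1) above, the sensitivity of DC to single-voxel boundary errors on sparse objects can be demonstrated with a toy calculation: the same one-voxel shift that barely changes DC for a large blob roughly halves DC for a thin tube. A minimal, self-contained illustration (synthetic shapes, not study data) follows.

```python
# Toy demonstration of DC sensitivity for sparse vs. non-sparse objects (synthetic data).
import numpy as np

def dice(a, b):
    return 2.0 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

grid_shape = (60, 60, 60)
tube = np.zeros(grid_shape, dtype=bool)   # thin, spinal-cord-like object (2x2 cross-section)
tube[10:50, 29:31, 29:31] = True
blob = np.zeros(grid_shape, dtype=bool)   # large, non-sparse object (40x40 cross-section)
blob[10:50, 10:50, 10:50] = True

# Apply the same 1-voxel in-plane shift to both objects.
tube_shifted = np.roll(tube, 1, axis=1)
blob_shifted = np.roll(blob, 1, axis=1)

print(f"DC after 1-voxel shift, thin tube : {dice(tube, tube_shifted):.2f}")   # ~0.50
print(f"DC after 1-voxel shift, large blob: {dice(blob, blob_shifted):.2f}")   # ~0.97
```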

The previous AAR method (Udupa et al., 2014) was tested on near-normal studies, whereas all scans considered in the present work contained deviations from normalcy; even so, AAR-RT yielded better results than the previous approach in almost all cases. To illustrate the improvements brought about by the innovations incorporated in AAR-RT, we compare in Figure 13 the final delineation accuracies achieved for some sample OARs before and after the proposed improvements.

Figure 13.

Delineation accuracies/errors DC and HD for some OARs in the two body regions in experiments E1 and E2 for the previous AAR approach (Udupa et al., 2014) and AAR-RT. See Table 1 for OAR abbreviations. The bars represent mean values over the tested object samples and the whiskers denote standard deviations.

4.4. Object Recognition and Delineation in Replanning Scans

We retrospectively gathered image data from 60 patients who underwent serial fractionated PBRT treatment (see Table 1). For each patient, we selected image data at 2 to 3 serial time points, accounting for a total of 82 studies in H&N and 87 in thorax. As in Figure 10, we show in Figure 14 recognition and delineation accuracy for replanning data sets for good- and poor-quality object cases. Generally, replanning data sets had a much larger percentage of cases with low scores (poor quality) for both OQS and IQS.

Figure 14.

Delineation accuracies/errors DC and HD for the OARs in the two body regions in experiments on replanning data sets. See Table 1 for OAR abbreviations. The hatched bar denotes metric values between the dosimetrists. The bars represent mean value over the tested object samples and the whiskers denote standard deviation.

Unlike testing on planning scans, we used two methods for testing replanning studies. Method 1 uses the contours drawn by the dosimetrists for a patient at an earlier time point t0 in the serial study as a patient-specific model to recognize and delineate object contours at later time points for the same patient. Method 2 performs AAR-RT recognition and delineation afresh at each time point by selecting the gender-specific model for the test scan. Note that no registration step is required in either method. We also assessed inter-dosimetrist variation for each of the 11 tested OARs (5 in H&N and 6 in thorax; see Section 2.1) based on the contours drawn by four dosimetrists (two for each body region, drawing on the same data sets) on these 82 H&N studies and 87 thoracic studies.

From the results in Figure 14, we make the following observations.

  1. Method 1 achieves better overall delineation accuracy than Method 2 (statistically significant for All, P < 0.001) on both the high-OQS and low-OQS groups. This is because using the manual contours at t0 as the model for t1 and t2 preserves patient-specific object information better than the population fuzzy model used in Method 2. Moreover, on most OARs, results from Method 1 are better than the results in Figure 10 for planning scans. This shows that introducing manual assistance into auto-contouring can further improve performance over fully automatic methods.

  2. Analogous to Figure 10, when objects have high OQS (artifacts and pathologies on fewer than 4 slices), the overall DC is 0.77 and HD is 1 voxel or less for Method 1. The corresponding values for Method 2 are 0.71 and around 1 voxel. For the low-OQS data, the overall DC was ~0.6 and HD was 2–3 voxels for both methods. We observe that the H&N OARs are more influenced by image quality than the thoracic OARs; this is mainly due to strong streak artifacts in the H&N region on all replanning data sets.

  3. The overall inter-observer variability in DC and HD between the two dosimetrists (per body region) is slightly above the DC and HD obtained via AAR-RT on the high-OQS data sets, as shown by the hatched bars in Figure 14. We noticed a dichotomy between the DC and HD results when compared to the dosimetrists’ variation. On high-OQS large/non-sparse objects such as MD, LLg, RLg, and Hrt, AAR-RT performance in both DC and HD is comparable to the dosimetrists’ variation, which implies that major editing effort can be saved on such objects when image quality is sufficient. On smaller/sparse objects, the behavior is different: some sparse objects, like cSC, cES, tSC, and TB, were comparable in HD between AAR-RT and the dosimetrists, although their DC values for the dosimetrists appear better than those from AAR-RT. This suggests a non-uniformity in the meaning of these metrics for large globular objects versus sparse objects, which is a drawback of these metrics, especially DC: even small deviations from the reference segmentation can cause a drastic lowering of DC for sparse objects. Notably, for high-OQS samples, there seems to be a significant difference in AAR-RT performance between the two genders for some objects (for example, tSC, LX, cES, and tES).

  4. We are not aware of any studies in the literature that evaluated performance of algorithms on replanning data sets directly. However, earlier methods have been reported (La Macchia et al., 2012; Tsuji et al., 2010) to propagate segmentations from planning CT to replanning studies by applying image registration techniques. These techniques assume, like our Method 1, that contours are already available on planning images, mostly by manual drawing.

4.5. Computational Considerations

All experiments were conducted on a PC with an Intel i7-6670 processor and 16 GB RAM. In our current implementation of AAR-RT, once the anatomy model is built, auto-contouring of all OARs for each patient study in each body region can be completed in 5–6 minutes, which translates roughly to 30 sec/OAR. Model building itself, however, takes about 6 hours for each body region, of which the most time-consuming step is finding the optimal hierarchy, which consumes about 5 hours. In a clinical RT setup this does not matter, since there is no need to repeat this step very frequently.

4.6. Comparison with Results from Literature

We summarize some key studies from the recent literature that are related to our work in H&N (Table 3) and thorax (Table 4). Our work differs from these studies in several important ways:

  1. Size: Our study is much larger than any from the literature and deals with data sets that reflect the real heterogeneity that exists in clinical cases. The largest previous studies from the literature considered 40 cases in H&N (Ibragimov and Xing, 2017b) and 240 cases in thorax (Zhou et al., 2017b), where most studies were used for training and only 12 cases for testing. Including both planning and replanning scans, our evaluation involved 503 studies, with 74 used for training and 429 for testing, and a total of 4301 object samples, of which 774 were involved in training and 3527 in testing. The largest number of object samples tested in prior work is 112 for H&N (Ibragimov and Xing, 2017b) and 413 for thorax (Zhou et al., 2017b). Testing on a large number of independent data sets (as opposed to the same data sets in a multifold cross-validation manner) is vital for understanding, and developing confidence in, how the method behaves in the long run, independently of the data sets on which it is tested. Testing on a large number of object samples from different body regions separately is important for similar reasons, since the performance behavior of methods can differ across objects.

  2. Scope: We investigated both planning and replanning studies by using the same approach. We did not come across any work in the literature that performed such an analysis; all reported studies tested the planning and replanning cases separately - the former by using an auto-contouring method, the latter by propagating expert-drawn contours from planning studies to replanning studies via deformable image registration. In our study cohort, as described previously, we found the quality of the images to be generally slightly lower in replanning studies than in planning cases although the performance was similar. Therefore, testing auto-contouring methods should be done separately on these data sets.

  3. Data quality: None of the studies from the literature discussed the quality of the data sets used, presence and severity of the artifacts in their data, and how they might influence their results. No examples of performance on cases with artifacts are given and there is no discussion of how the training and testing data sets are selected with regard to artifacts and other deviations, particularly streak artifacts.

  4. Gender: We analyzed gender and age dependence of image and object quality and their influence on results. Such information may be useful in the future for developing effective model building/ training strategies and creating standardized and generalizable databases for evaluation.

Table 3.

Comparison with methods from the literature for H&N OARs. Results are illustrated by either mean value or range. Unclear content from the literature is indicated as n/a. AAR-RT results shown are for the good-quality cases including both genders.

Approach | Number of cases (Train/Test) | Number of test objects | Image/object quality | DC (mean or range)
(Chen and Dawant, 2015) | 25/10 | 60 | Not mentioned | MD: 0.86–0.94; LPG, RPG: 0.74–0.87; LSG, RSG: 0.55–0.8
(Jung et al., 2015) | 25/10 | n/a | Not mentioned | MD: 0.77–0.86; LPG, RPG: 0.56–0.79; LSG, RSG: 0.29–0.59
(Albrecht, 2015) | 25/10 | n/a | Not mentioned | MD: 0.75–0.93; LPG, RPG: 0.73–0.88; LSG, RSG: 0.56–0.81
(Mannion-Haworth, 2015) | 25/10 | n/a | Not mentioned | MD: 0.92–0.94; LPG, RPG: 0.74–0.89; LSG, RSG: 0.65–0.87
(Orbes Arteaga et al., 2015) | 25/10 | n/a | Not mentioned | MD: 0.9–0.96; LPG, RPG: 0.68–0.85
(Ibragimov and Xing, 2017b) | 40/10 | 112 | Mentioned cases with streak artifacts | MD: 0.89; LPG: 0.77; RPG: 0.77; LSG: 0.7; RSG: 0.73; LX: 0.86; cSC: 0.87
(Thomson et al., 2014) | n/a/10 | 70 | Cases not distorted by tumor or artifacts were selected | LPG, RPG: 0.74–0.83; LSG, RSG: 0.7–0.85; LX: 0.5–0.62; OHP: 0.4–0.6
(Tao et al., 2015) | n/a/16 | 16 | Not mentioned | LX: 0.73; OHP: 0.64
(Duc et al., 2015) | 100/100 | 600 | Not mentioned | LPG: 0.65; RPG: 0.65; cSC: 0.75
AAR-RT | 36/262 | 2200 | Quality as encountered in clinical practice | MD: 0.89; LPG: 0.74; RPG: 0.75; LSG: 0.73; RSG: 0.73; LX: 0.74; OHP: 0.58; cES: 0.62; cSC: 0.75

Table 4.

Comparison with methods from the literature for thoracic OARs. Results are illustrated by either mean value or range. Unclear content from the literature is indicated as n/a. AAR-RT results shown are for the good-quality cases including both genders.

Approach | Number of cases (Train/Test) | Number of test objects | Image/object quality/artifacts | DC (mean or range)
(Zhu et al., 2013) | n/a/40 | 160 | Not mentioned | LLg: 0.95; RLg: 0.95; Hrt: 0.90; tSC: 0.52
(Velker et al., 2013) | n/a/50 | 150 | Not mentioned | LLg: 0.95–0.98; RLg: 0.95–0.98; Hrt: 0.81–0.95
(Lustberg et al., 2017) | 20/20 | n/a | Not mentioned | LLg: 0.96–0.98; RLg: 0.97–0.98; tES: 0.35–0.57; tSC: 0.83–0.87; Hrt: 0.87–0.93
(Lustberg et al., 2017) | 450/20 | n/a | Not mentioned | LLg: 0.97–0.98; RLg: 0.97–0.98; tES: 0.65–0.76; tSC: 0.80–0.88; Hrt: 0.83–0.93
(Schreibmann et al., 2014) | n/a/46 | 70 | Not mentioned | LLg: 0.92–0.98; RLg: 0.88–0.98; tES: 0.01–0.54; tSC: 0.52–0.87; Hrt: 0.83–0.93; TB: 0.81–0.95
(Trullo et al., 2017b) | 30/30 (6-fold) | 120 | Not mentioned | tES: 0.67; Hrt: 0.90; TB: 0.82
AAR-RT | 38/167 | 1187 | Quality as encountered in clinical practice | LLg: 0.95; RLg: 0.96; tES: 0.68; tSC: 0.68; Hrt: 0.86; TB: 0.81; LBP: 0.38; RBP: 0.40

The results listed in Tables 3 and 4 are influenced by multiple factors, such as patient gender and group, image/object quality, image resolution, the definitions used (if any) for the body region and OARs, the quality of the manual ground truth, etc. In view of the reasons listed above, a fair comparison with AAR-RT is difficult to draw from these tables. Note that although the results listed for AAR-RT are for the good-quality cases, by definition these cases include objects with artifacts and other deviations in multiple (but not more than 3) slices. Keeping these factors in mind, our fully automated method not only covers the largest number of OARs but also achieves very competitive performance. No results for cES, LBP, and RBP are reported in the literature for comparison. One advantage of our method is its consistency: on every object, its performance is at or above the average reported in the references, which implies robustness to OAR variations in size, shape, and appearance. This is mainly because of the recognition step, which is capable of overlaying the fuzzy model on the image within a location error of ~1.5 voxels (~3 mm) with respect to the actual location of the object. To illustrate the robustness of the AAR-RT recognition process, we show in Figure 15 several examples with severe artifacts, or with inadequate image information for locating objects even by experts, where AAR-RT successfully places the object model close to the true location in the image because of the rich prior information encoded in its anatomy model FAM(B, G).

Figure 15.

Sample recognition results to illustrate the robustness of the AAR-RT recognition process. Note particularly how the model is positioned correctly/closely for (a) MD, RPG, and LPG in spite of severe streak artifacts, for (b) LSG and RSG in spite of distortions due to pathology, for (c) OHP and (d) LBP and RBP in spite of absence of sufficient appearance information, and for (e) LLg in spite of presence of large pathology.

5. Concluding Remarks

In this paper, we significantly extended our previous body-wide AAR framework through several innovations and evaluated its performance comprehensively from the perspective of the RT application. Some key and unique elements of the new AAR-RT framework are as follows. (i) It uses computationally precise definitions of the body regions and the OARs. This becomes essential for encoding prior information consistently and faithfully and for deriving the maximum impact from prior information on object recognition. (ii) It employs a strategy to find a hierarchy for arranging OARs in each body region that seeks to minimize recognition error, in place of the hand-crafted hierarchy in the previous AAR approach. (iii) It uses Directed Probability Graphs to encode OAR boundary relationships and to predict them at the recognition stage. (iv) Its recognition process follows the found optimal hierarchy and the trained probability graph to localize objects robustly even in the presence of significant image artifacts and deviations. (v) Its delineation process uses the localized fuzzy model of the object and object-specific intensity and texture properties to identify voxels with strong membership in the object and to fit the model optimally to the identified voxels. (vi) It uses an image/object quality-based evaluation of both the recognition and delineation processes, utilizing over 500 CT scans of cancer patients undergoing RT and over 4000 object samples in these scans, involving both planning and replanning studies. Our conclusions and remarks based on this study are as follows.

  1. On data sets with artifacts and deviations in not more than 3 slices, AAR-RT yields recognition accuracy within 2 voxels and delineation HD within about 1 voxel. This is close to the variability observed among dosimetrists in manual contouring. When artifacts and deviations are more severe, the results are much worse, hovering around 5 voxels for recognition and 5 mm for HD. AAR-RT’s performance is similar on planning and replanning cases (when using Method 2) although we observed a slightly lower object and image quality for the latter.

  2. Understanding object and image quality and how they influence performance is crucial for devising effective object recognition and delineation algorithms. At present, it is very difficult to gain an understanding of the behavior of segmentation methods as a function of image/object quality in spite of the availability of large databases and many segmentation challenges. Streak artifacts arising from dental implants and fillings and beam hardening from bone pose the greatest challenge to auto-contouring methods. They cast streaks that are much brighter or darker than the actual tissue intensity and affect almost all H&N structures in almost all studies.

  3. AAR’s dichotomous treatment of the segmentation methodology as dual recognition and delineation processes is helpful in understanding and addressing challenges due to image artifacts and deviations. AAR’s recognition operation is much more robust than delineation. We observed that, often, even when the models were placed very close (within 2 voxels) to the actual object in the presence of strong streak artifacts and/or deviations, delineation failed to retain that accuracy because the object intensity patterns were greatly distorted. We are studying ways to combine AAR-RT with deep learning methods to improve delineation robustness. A price to be paid for recognition robustness is the expensive computation time for finding optimal hierarchies at the model-building stage, although this step needs to be executed only very infrequently.

  4. Individual object quality expressed by OQS seems to be much more important than the overall image quality expressed by IQS in determining accuracy. There is an interesting phenomenon underlying OQS, object hierarchy, and accuracy of recognition. Not all ancestors influence accuracy in the same manner. A study of the relationship among OQS, IQS, recognition accuracy, delineation accuracy, and object hierarchy may help to improve robustness of recognition and delineation strategies.

Highlights.

  • A practical system for auto-contouring organs at risk (OARs) in radiation therapy (RT) planning, built around the previous Automatic Anatomy Recognition (AAR) framework by significantly improving all three of its stages - model building, object recognition, and object delineation.

  • Large-scale evaluation on 503 CT scans (of patients undergoing radiation therapy) and involving 4,301 3D object samples from two body regions.

  • Evaluation as a function of object and image quality, gender, and age group. Recognition and delineation accuracy consistently within 2 voxels and 1 voxel, respectively, for good-quality images (with image artifacts, pathology, and other deviations in fewer than 4 slices of the object), and within 5 and 2 voxels, respectively, for poor-quality cases.

  • Object recognition found to be much more robust than delineation.

Acknowledgement

This work was supported by grants from the National Science Foundation [IIP1549509] and National Cancer Institute [R41CA199735-01A1]. The auto-contouring problem was suggested to Udupa by Dr. Peter Bloch, Emeritus Professor, Department of Radiation Oncology, University of Pennsylvania, during an MIPG seminar presented by Udupa on the AAR framework in 2012.

Footnotes

Conflict of interest

There is no conflict of interest, and this is a sole submission to Medical Image Analysis.

1

AAR: Automatic Anatomy Recognition. RT: Radiation Therapy.

2

Binary images of all objects considered in B are expected to be available for each image that is selected for building the model. Only these objects can then be recognized and delineated in any given patient image. In other words, the set of OARs to be segmented in a given patient image should always be a subset of the set of OARs considered for building FAM(B, G).

3

Note that although each node has exactly one parent in H, a node may have several parents in DGz (and DGx and DGy).

4

These objects are rarely considered in most published papers but need to be contoured frequently in RT planning. LBP and RBP are hard to even visually locate on slices, as is the esophagus on some slices in the H&N and thoracic regions.


References

  1. ASTRO Website, https://www.astro.org/News-and-Publications/News-and-Media-Center/Media-Resources/Frequently-Asked-Questions/. Accessed June 2018.
  2. Brouwer CL, Steenbakkers RJ, Bourhis J, Budach W, Grau C, Grégoire V, van Herk M, Lee A, Maingon P, Nutting C, 2015a. CT-based delineation of organs at risk in the head and neck region: DAHANCA, EORTC, GORTEC, HKNPCSG, NCIC CTG, NCRI, NRG Oncology and TROG consensus guidelines. Radiotherapy and Oncology 117, 83–90.
  3. Brouwer CL, Steenbakkers RJ, Bourhis J, Budach W, Grau C, Grégoire V, van Herk M, Lee A, Maingon P, Nutting C, 2015b. CT-based delineation of organs at risk in the head and neck region: DAHANCA, EORTC, GORTEC, HKNPCSG, NCIC CTG, NCRI, NRG Oncology and TROG consensus guidelines. Radiotherapy and Oncology 117, 83–90. Supplementary Material.
  4. Chen A, Dawant B, 2015. A multi-atlas approach for the automatic segmentation of multiple structures in head and neck CT images. Presented in Head and Neck Auto-Segmentation Challenge 2015 (MICCAI), Munich.
  5. Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O, 2016. 3D U-Net: learning dense volumetric segmentation from sparse annotation. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 424–432.
  6. Ciesielski KC, Udupa JK, Saha PK, Zhuge Y, 2007. Iterative relative fuzzy connectedness for multiple objects with multiple seeds. Comput Vis Image Und 107, 160–182.
  7. Cormen TH, Leiserson CE, Rivest RL, Stein C, 2009. Introduction to Algorithms, 3rd ed. MIT Press.
  8. Daisne J-F, Blumhofer A, 2013. Atlas-based automatic segmentation of head and neck organs at risk and nodal target volumes: a clinical validation. Radiation Oncology 8, 154.
  9. de Vos BD, Wolterink JM, de Jong PA, Leiner T, Viergever MA, Isgum I, 2017. ConvNet-based localization of anatomical structures in 3-D medical images. IEEE Transactions on Medical Imaging 36, 1470–1481.
  10. Dolz J, Kiriçli HA, Fechter T, Karnitzki S, Oehlke O, Nestle U, Vermandel M, Massoptier L, 2016. Interactive contour delineation of organs at risk in radiotherapy: Clinical evaluation on NSCLC patients. Med Phys 43, 2569–2580.
  11. Dou Q, Yu L, Chen H, Jin Y, Yang X, Qin J, Heng P-A, 2017. 3D deeply supervised network for automated segmentation of volumetric medical images. Med Image Anal 41, 40–54.
  12. Duc H, Albert K, Eminowicz G, Mendes R, Wong SL, McClelland, Modat M, Cardoso MJ, Mendelson AF, Veiga C, 2015. Validation of clinical acceptability of an atlas-based segmentation algorithm for the delineation of organs at risk in head and neck cancer. Med Phys 42, 5027–5034.
  13. Fortunati V, Verhaart RF, Niessen WJ, Veenland JF, Paulides MM, van Walsum T, 2015. Automatic tissue segmentation of head and neck MR images for hyperthermia treatment planning. Physics in Medicine and Biology 60, 6547.
  14. Fritscher KD, Peroni M, Zaffino P, Spadea MF, Schubert R, Sharp G, 2014. Automatic segmentation of head and neck CT images for radiotherapy treatment planning using multiple atlases, statistical appearance models, and geodesic active contours. Med Phys 41, 051910.
  15. Ghesu FC, Georgescu B, Zheng Y, Grbic S, Maier A, Hornegger J, Comaniciu D, 2017. Multi-scale deep reinforcement learning for real-time 3D-landmark detection in CT scans. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  16. Grevera G, Udupa J, Odhner D, Zhuge Y, Souza A, Iwanaga T, Mishra S, 2007. CAVASS: a computer-assisted visualization and analysis software system. J Digit Imaging 20 Suppl 1, 101–118.
  17. Hall WH, et al., 2008. Development and validation of a standardized method for contouring the brachial plexus: preliminary dosimetric analysis among patients treated with IMRT for head-and-neck cancer. International Journal of Radiation Oncology*Biology*Physics 72(5), 1362–1367.
  18. Han X, Hoogeman MS, Levendag PC, Hibbard LS, Teguh DN, Voet P, Cowen AC, Wolf TK, 2008. Atlas-based autosegmentation of head and neck CT images. International Conference on Medical Image Computing and Computer-Assisted Intervention, Springer, pp. 434–441.
  19. Ibragimov B, Likar B, Pernus F, Vrtovec T, 2014. Shape representation for efficient landmark-based segmentation in 3-D. IEEE Transactions on Medical Imaging 33, 861–874.
  20. Ibragimov B, Xing L, 2017a. Segmentation of organs-at-risks in head and neck CT images using convolutional neural networks. Med Phys 44, 547–557.
  21. Isambert A, Dhermain F, Bidault F, Commowick O, Bondiau P-Y, Malandain G, Lefkopoulos D, 2008. Evaluation of an atlas-based automatic segmentation software for the delineation of brain organs at risk in a radiation therapy clinical context. Radiother Oncol 87, 93–99.
  22. Jung F, Knapp O, Wesarg S, 2015. CoSMo - coupled shape model segmentation. Presented in Head and Neck Auto-Segmentation Challenge 2015 (MICCAI), Munich.
  23. Kong FM, Quint L, Machtay M, Bradley J. Atlas for organs at risk (OARs) in thoracic radiation therapy. RTOG website. https://www.rtog.org/LinkClick.aspx?fileticket=qlz0qMZXfQs%3d&tabid=361 (accessed July 20, 2018).
  24. Kong F, Ritter T, Quint D, Senan S, Gaspar L, Komaki R, Hurkmans C, Timmerman R, Bezjak A, Bradley J, Movsas B, Marsh L, Okunieff P, Choy H, Curran W, 2011. Consideration of dose limits for organs at risk of thoracic radiotherapy: atlas for lung, proximal bronchial tree, esophagus, spinal cord, ribs, and brachial plexus. Int J Radiat Oncol Biol Phys 81(5), 1442–1457.
  25. La Macchia M, Fellin F, Amichetti M, Cianchetti M, Gianolini S, Paola V, Lomax AJ, Widesott L, 2012. Systematic evaluation of three different commercial software solutions for automatic segmentation for adaptive therapy in head-and-neck, prostate and pleural cancer. Radiation Oncology 7, 160.
  26. Leeman JE, Romesser PB, Zhou Y, McBride S, Riaz N, Sherman E, Cohen MA, Cahlon O, Lee N, 2017. Proton therapy for head and neck cancer: expanding the therapeutic window. Lancet Oncol 18(5), e254–e265.
  27. Lustberg T, van Soest J, Gooding M, Peressutti D, Aljabar P, van der Stoep J, van Elmpt W, Dekker A, 2017. Clinical evaluation of atlas and deep learning based automatic contouring for lung cancer. Radiother Oncol.
  28. Matsumoto MSM, Udupa JK, Tong Y, Saboury B, Torigian DA, 2016. Quantitative normal thoracic anatomy at CT. Computerized Medical Imaging and Graphics 51, 1–10.
  29. Men K, Dai J, Li Y. Automatic segmentation of the clinical target volume and organs at risk in the planning CT for rectal cancer using deep dilated convolutional neural networks. Med Phys.
  30. McGowan SE, Burnet NG, Lomax AJ, 2013. Treatment planning optimization in proton therapy. Br J Radiol 86(1021), 20120288.
  31. Oktay O, Ferrante E, Kamnitsas K, Heinrich M, Bai W, Caballero J, Cook SA, de Marvao A, Dawes T, O'Regan DP, 2018. Anatomically Constrained Neural Networks (ACNNs): Application to cardiac image enhancement and segmentation. IEEE Transactions on Medical Imaging 37, 384–395.
  32. Orbes Arteaga M, Cardenas Peña D, Castellanos Dominguez G, 2015. Head and neck auto segmentation challenge based on non-local generative models. Presented in Head and Neck Auto-Segmentation Challenge 2015 (MICCAI), Munich.
  33. Roelofs E, Engelsman M, Rasch, et al., 2012. Results of a multicentric in silico clinical trial (ROCOCO): comparing radiotherapy with photons and protons for non-small cell lung cancer. J Thorac Oncol 7(1), 165–176.
  34. Pednekar GV, Udupa JK, McLaughlin DJ, Wu X, Tong YT, Simone CB II, Camaratta J, Torigian DA, 2018. Image quality and segmentation. 2018 SPIE Medical Imaging.
  35. Phellan R, Falcao AX, Udupa JK, 2016. Medical image segmentation via atlases and fuzzy object models: Improving efficacy through optimum object search and fewer models. Medical Physics 43, 401–410.
  36. Raudaschl PF, Zaffino P, Sharp GC, Spadea MF, Chen A, Dawant BM, Albrecht T, Gass T, Langguth C, Lüthi M, 2017. Evaluation of segmentation methods on head and neck CT: Auto-segmentation challenge 2015. Med Phys 44, 2020–2036.
  37. Saito A, Nawano S, Shimizu A, 2016. Joint optimization of segmentation and shape prior from level-set-based statistical shape model, and its application to the automated segmentation of abdominal organs. Med Image Anal 28, 46–65.
  38. Schreibmann E, Marcus DM, Fox T, 2014. Multiatlas segmentation of thoracic and abdominal anatomy with level set-based local search. Journal of Applied Clinical Medical Physics 15, 22–38.
  39. Shi C, Cheng Y, Wang J, Wang Y, Mori K, Tamura S. Low-rank and sparse decomposition based shape model and probabilistic atlas for automatic pathological organ segmentation. Med Image Anal 38, 30–49.
  40. Siegel RL, Miller KD, Jemal A, 2018. Cancer statistics, 2018. CA Cancer J Clin 68, 7–30.
  41. Simone CB 2nd, Ly D, Dan TD, Ondos J, Ning H, Belard A, O'Connell J, Miller RW, Simone NL, 2011. Comparison of intensity-modulated radiotherapy, adaptive radiotherapy, proton radiotherapy, and adaptive proton radiotherapy for treatment of locally advanced head and neck cancer. Radiother Oncol 101, 376–382.
  42. Sims R, Isambert A, Grégoire V, Bidault F, Fresco L, Sage J, Mills J, Bourhis J, Lefkopoulos D, Commowick O, 2009. A pre-clinical assessment of an atlas-based automatic segmentation tool for the head and neck. Radiother Oncol 93, 474–478.
  43. Sonka M, Hlavac V, Boyle R, 2007. Image Processing, Analysis, and Machine Vision, 4th edition, Chapter 13. Thomson-Engineering.
  44. Tanács A, 2017. Generation and evaluation of an MRI statistical organ atlas in the head-neck region. Image and Signal Processing and Analysis (ISPA), 2017 10th International Symposium on, IEEE, pp. 200–204.
  45. Tao C-J, Yi J-L, Chen N-Y, Ren W, Cheng J, Tung S, Kong L, Lin S-J, Pan J-J, Zhang G-S, 2015. Multi-subject atlas-based auto-segmentation reduces interobserver variation and improves dosimetric parameter consistency for organs at risk in nasopharyngeal carcinoma: A multi-institution clinical study. Radiotherapy and Oncology 115, 407–411.
  46. Teguh DN, Levendag PC, Voet PW, Al-Mamgani A, Han X, Wolf TK, Hibbard LS, Nowak P, Akhiat H, Dirkx ML, 2011. Clinical validation of atlas-based auto-segmentation of multiple target volumes and normal tissue (swallowing/mastication) structures in the head and neck. International Journal of Radiation Oncology*Biology*Physics 81, 950–957.
  47. Thomson D, Boylan C, Liptrot T, Aitkenhead A, Lee L, Yap B, Sykes A, Rowbottom C, Slevin N, 2014. Evaluation of an automatic segmentation algorithm for definition of head and neck organs at risk. Radiation Oncology 9, 173.
  48. Trullo R, Petitjean C, Nie D, Shen D, Ruan S, 2017a. Joint segmentation of multiple thoracic organs in CT images with two collaborative deep architectures. In: Cardoso MJ, Arbel T, Carneiro G, Syeda-Mahmood T, Tavares JMRS, Moradi M, Bradley A, Greenspan H, Papa JP, Madabhushi A, Nascimento JC, Cardoso JS, Belagiannis V, Lu Z (Eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI) Workshop 2017.
  49. Trullo R, Petitjean C, Ruan S, Dubray B, Nie D, Shen D, 2017b. Segmentation of organs at risk in thoracic CT images using a SharpMask architecture and conditional random fields. 2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017), pp. 1003–1006.
  50. Tsuji SY, Hwang A, Weinberg V, Yom SS, Quivey JM, Xia P, 2010. Dosimetric evaluation of automatic segmentation for adaptive IMRT for head-and-neck cancer. International Journal of Radiation Oncology*Biology*Physics 77, 707–714.
  51. Udupa JK, Odhner D, Zhao L, Tong Y, Matsumoto MM, Ciesielski KC, Falcao AX, Vaideeswaran P, Ciesielski V, Saboury B, 2014. Body-wide hierarchical fuzzy modeling, recognition, and delineation of anatomy in medical images. Med Image Anal 18, 752–771.
  52. Veiga C, Janssens G, Teng CL, Baudier T, Hotoiu L, McClelland JR, Royle G, Lin L, Yin L, Metz J, Solberg TD, Tochner Z, Simone CB 2nd, McDonough J, Teo BK, 2016. First clinical investigation of cone beam computed tomography and deformable registration for adaptive proton therapy for lung cancer. Int J Radiat Oncol Biol Phys 95, 549–559.
  53. Velker VM, Rodrigues GB, Dinniwell R, Hwee J, Louie AV, 2013. Creation of RTOG compliant patient CT-atlases for automated atlas based contouring of local regional breast and high-risk prostate cancers. Radiation Oncology 8, 188.
  54. Veresezan O, Troussier I, Lacout A, Kreps S, Maillard S, Toulemonde A, Marcy PY, Huguet F, Thariat J, 2017. Adaptive radiation therapy in head and neck cancer for clinical practice: state of the art and practical challenges. Jpn J Radiol 35(2), 43–52.
  55. Voet PW, Dirkx ML, Teguh DN, Hoogeman MS, Levendag PC, Heijmen BJ, 2011. Does atlas-based autosegmentation of neck levels require subsequent manual contour editing to avoid risk of severe target underdosage? A dosimetric analysis. Radiother Oncol 98, 373–377.
  56. Wang Z, Wei L, Wang L, Gao Y, Chen W, Shen D, 2018. Hierarchical vertex regression-based segmentation of head and neck CT images for radiotherapy planning. IEEE Transactions on Image Processing 27, 923–937.
  57. Whitfield GA, Price P, Price GJ, Moore CJ, 2013. Automated delineation of radiotherapy volumes: are we going in the right direction? British Journal of Radiology 86(1021), 20110718. doi:10.1259/bjr.20110718.
  58. Wu X, Udupa JK, Tong Y, Odhner D, Pednekar GV, Simone CB, ..., Shammo G, 2018. Auto-contouring via automatic anatomy recognition of organs at risk in head and neck cancer on CT images. 2018 SPIE Medical Imaging.
  59. Wu X, Udupa JK, Torigian DA. Thoracic object definition document. http://www.mipg.upenn.edu/Vnews/BodyRegionsObjects/ThoracicObjects.pdf
  60. Wu X, Udupa JK, Torigian DA. H&N object definition document. http://www.mipg.upenn.edu/Vnews/BodyRegionsObjects/HeadNeckObjects.pdf
  61. Zheng Y, Liu D, Georgescu B, Nguyen H, Comaniciu D, 2015. 3D deep learning for efficient and robust landmark detection in volumetric data. In: Navab N, Hornegger J, Wells WM, Frangi A (Eds.), Medical Image Computing and Computer-Assisted Intervention (MICCAI) 2015.
  62. Zhou X, Takayama R, Wang S, Hara T, Fujita H, 2017a. Deep learning of the sectional appearances of 3D CT images for anatomical structure segmentation based on an FCN voting method. Med Phys 44, 5221–5233.
  63. Zhu M, Bzdusek K, Brink C, Eriksen JG, Hansen O, Jensen HA, Gay HA, Thorstad W, Widder J, Brouwer CL, 2013. Multi-institutional quantitative evaluation and clinical validation of Smart Probabilistic Image Contouring Engine (SPICE) autosegmentation of target structures and normal tissues on computed tomography images in the head and neck, thorax, liver, and male pelvis areas. International Journal of Radiation Oncology*Biology*Physics 87, 809–816.
