Abstract
Image segmentation has become a vital and often rate-limiting step in modern radiotherapy treatment planning. In recent years the pace and scope of algorithm development, and even introduction into the clinic, have far exceeded evaluative studies. In this work we build upon our previous evaluation of a registration-driven segmentation algorithm in the context of 8 expert raters and 20 patients who underwent radiotherapy for large space-occupying tumors in the brain. Here we tested four hypotheses concerning the impact of manual segmentation editing in a randomized, single-blinded study. We tested these hypotheses on the normal structures of the brainstem, optic chiasm, eyes, and optic nerves using the Dice similarity coefficient, volume, and signed Euclidean distance error to evaluate the impact of editing on inter-rater variance and accuracy. Accuracy analyses relied on two simulated ground truth estimation methods: STAPLE and a novel implementation of probability maps. The experts were presented with automatic segmentations, their own segmentations, and their peers’ segmentations from our previous study to edit. We found that, independent of source, editing reduced inter-rater variance while maintaining or improving accuracy and improved efficiency, with at least a 60% reduction in contouring time. In areas where raters performed poorly when contouring from scratch, editing of the automatic segmentations reduced the prevalence of total anatomical miss from approximately 16% to 8% of the total slices contained within the ground truth estimations. These findings suggest that contour editing could be useful for consensus building, such as in developing delineation standards, and that automated methods, and perhaps even less sophisticated atlases, could improve efficiency, inter-rater variance, and accuracy.
1. Introduction
Image segmentation is a vital step in most radiotherapy planning today. It describes a process that partitions imaging studies into discrete geometric information that can be used to plan and evaluate radiation treatment. The information usually consists of coordinate point sets or binary masks in the reference frame of the imaging study. Since the integration of x-ray computed tomography (CT) into treatment planning systems, segmentation of images has been used to optimize dose distributions by providing dose-volume information for both targets and organs at risk. This is of particular importance in inversely planned therapy, such as intensity-modulated radiation therapy (IMRT) and volumetric modulated arc therapy, and in situations such as stereotactic radiosurgery and other ablative methods that require high doses over a short time scale. Traditionally, images have been segmented manually in a time-consuming process that must occur before designing treatment fields or calculating dose. Our experience and that of others (Das et al. 2009) has been that segmentation is the rate-determining step in the treatment planning process.
In recent years a number of algorithms for automatic or semi-automatic segmentation have emerged, and, following quickly, several clinical systems have been marketed both as components of and as standalone companions to treatment planning systems. In the context of radiation therapy the vast majority of scholarly activity has involved algorithm development, and these algorithms have been quickly adapted into clinical systems with relatively little information regarding their overall impact.
A potential explanation for the lack of evaluation studies involves the nature of segmentation itself: it is a problem lacking a known ground truth for comparison. Organ delineation in the human body requires decisions drawing on an aggregation of both explicit and implicit anatomic and physiologic information. Phantom studies, synthetic datasets, and cadaver sections offer a more controlled but less realistic environment and hence are not well suited to gauging clinical impact. Several authors have shown previously that using a single expert rater as a gold standard is unreliable (Chao et al. 2007, Stapleford et al. 2010, Deeley et al. 2011). Isambert (2008), after noting low correlation with a single expert segmentation, concluded that perhaps automatic segmentation was not well suited to small tubular structures such as the optic nerves and chiasm. Our previous study showed relatively low similarity between the automatic and expert segmentations as well. However, in the context of several experts, we found that the automatic system performed no worse than the experts. That is, the inter-rater variance amongst the experts was similar to the automatic-expert variance, indicating not that automatic systems are inadequate but that these structures are inherently difficult to segment.
Several methods (Warfield et al. 2004, Kittler et al. 1998, Meyer et al. 2006, Asman & Landman 2012, Windridge & Kittler 2003, Jacobs 1995) have been proposed for estimating ground truth through a combination of expert segmentations. The method put forth by Warfield and colleagues, termed simultaneous truth and performance level estimation (STAPLE), is designed to incorporate truth priors and rater performance priors and to be robust to outliers. However, truth priors are rarely known, and incorporating rater priors is problematic in clinical studies, as relative rater quality generally cannot be anticipated accurately. In prior work we found that STAPLE tended to be influenced disproportionately by larger segmentations within the expert cohort. Biancardi (2010) noted a similar phenomenon. This may be a byproduct of STAPLE depending heavily on a sometimes inaccurate truth estimate in the absence of a truth prior (Zhu et al. 2008). In our prior and current work, we rely on both STAPLE and another method, the computationally simple idea of probability maps (p-maps) (Meyer et al. 2006, Deeley et al. 2011), similar to the voting rule (Kittler et al. 1998). Often the p-maps are thresholded at a predetermined level such as 0.50, where half of the raters agree. Recognizing that rater consensus may well be a function of organ type and location, we instead allow a moving threshold, determined by the p-map mean over the range (0,1], to set the “vote” level for the ground truth.
Another persistent problem in the design of evaluation studies is the choice of comparison metrics. A number of volume- and distance-based metrics have been used. Nominal volume (we use this terminology to disambiguate the use of volume from other meanings, such as a three-dimensional set of images or contours) is a useful measure that does not require pairwise calculation and is easily compared across separate studies. However, its value is that of a summary statistic: two segmentations of different shape and location may have the same volume. Measures of spatial overlap such as the Dice (1945) similarity coefficient and the Jaccard coefficient (1908) provide pairwise comparison incorporating general shape and location information and are intuitive, but they do not provide information about whether differences are a result of over- or under-segmentation (Crum et al. 2006, Popovic et al. 2007). Additionally, volume and overlap measures by nature deemphasize central-peripheral (Meyer et al. 2006), or edge, deviations when they are small in comparison to overall volume. Distance measures, such as the Hausdorff and Euclidean distances, fill the gap by adding detailed information about edges.
We believe evaluation studies should be behavioural in nature, bringing together clinically relevant disease sites and imaging studies as well as enough raters and cases to provide robust statistical analysis. In our previous study we collected manual segmentations from a group of eight expert raters over 20 challenging cases. The experts delineated the brainstem, optic chiasm, eyes, and optic nerves in the presence of large space-occupying lesions. We also used our algorithms to segment these organs automatically for each case. We tested the hypothesis that the automatic system would produce segmentations that could serve as surrogates to the manual physician segmentations, and we evaluated inter-rater variance and accuracy through simulated ground truths using STAPLE and our own application of the concept of thresholded probability maps. The results of this study, to which we will refer as the de novo study, have been published previously (Deeley et al. 2011). In summary, we found that differences in raters could be large and that at least one rater was often markedly different from the group. We also found that the automatic system performed well against the group of experts and, indeed, could serve as a surrogate.
Realistically, we contend that no automated system will completely replace expert segmentations in radiation therapy planning in the near future. However, automatic segmentation will offer, and indeed already is offering, a starting point to clinicians. From this starting point the clinicians will have to make judgments about the quality of the initial segmentations and make edits accordingly. Our de novo study provided information about expert delineation when starting from a blank slate but did not evaluate editing of pre-existing delineations. Building on the work of Chao (2007) and Stapleford (2010), in the present work we have undertaken a single-blind, randomized study in which the same eight raters from the de novo study were presented with contours for editing. We tested four general hypotheses. First, editing the automatically generated contours (A1) reduces inter-rater variance. Second, editing A1 either increases or maintains accuracy. Third, editing A1 salvages the results of low performing raters in the de novo study; in other words, raters who were low performers will produce better performing contours when they use A1 as a starting point. Fourth, contour editing in general (independent of segmentation source) reduces inter-rater variation while maintaining or improving accuracy. Much of the methodology in terms of ground truth estimation and metrics was covered in depth in our prior work. Our attempt here is to refer the reader to the prior work as much as possible while maintaining clarity.
2. Methods
2.1. Study design
In this study we utilized imaging volumes from the same 20 patients used in our de novo segmentation study, as well as the same eight expert raters. Extensive descriptions of those imaging volumes, raters, delineation guidelines, and technical considerations are given in that work (Deeley et al. 2011).
The 20 patients had been previously treated at the Vanderbilt-Ingram Cancer Center with intensity-modulated radiation therapy (IMRT) for high-grade gliomas. Their cases were specifically chosen for the presence of large space-occupying lesions, often in close proximity to intracranial organs at risk, a situation that both has high clinical relevance (Amelio et al. 2010) and presents a challenge for automatic segmentation (Dawant et al. 2002). The images were x-ray computed tomography (CT) of 2 or 3 mm slice thickness and 1.5/3 T T1-weighted magnetic resonance (MR) volumes with voxels of approximately 1 mm3. These are typical of patients undergoing stereotactic brain biopsy. The raters were classified as senior (P1-P4, three attending radiation oncologists and one diagnostic radiologist) and junior (J1-J4, four radiation oncology residents in their final year of training).
In the first portion of our evaluation study the raters were asked to delineate brainstem, optic chiasm, eyes and optic nerves for the 20 cases utilizing fused CT/MR imaging within a clinical system. As they were given no starting point other than delineation guidelines and anatomical definitions, we refer to this as the de novo, or “from scratch” study. These delineations were acquired over a period of approximately one year.
Several months after concluding the de novo study, we initiated an editing study with the same raters. In this second round of contouring we presented the experts with fully completed contours for the brainstem, optic chiasm, eyes, and optic nerves from three sources: the automatic contours (A1), their own contours (self), and contours delineated by their peers (peer) in the de novo study. In total each expert edited 60 complete sets of segmentations, three per patient. These 60 tasks were randomized and single-blind, in that the raters did not know the origin of the segmentations. In fact, to avoid presumptive guessing, we told the raters only that they would be presented segmentations for editing. We made no mention of the potential sources of segmentations, though it is likely some of them assumed the source to be the automatically generated contours. Though it was beyond the scope of this work to test, we anticipated that this time interval would be sufficient to avoid potential effects of memory on the raters’ interpretations. Additionally, if effects were to exist within the current study as a result of revisiting each patient three times, these would be randomly distributed over the patients and segmentation sources.
An in-house graphical user interface was developed to present a task queue and to record editing times. The design was such that each rater was presented with each of the 20 automatic segmentation sets once, each of their own previous segmentations at least once and sometimes twice per patient, and their peers’ segmentations from the de novo study. We will refer to these groups as “A1”, “self”, and “peer”, respectively. The selection of which peer to edit was also randomized and balanced such that each rater edited each peer two to three times over the course of the tasks. The editing was done using a research version of the treatment planning system (Eclipse 8.5, Varian Medical Systems, Palo Alto, CA) identical to the clinical system. Details of this system were included in our previous work (Deeley et al. 2011). We did not specify to the experts which tools to use for editing. Several options were available such as using a paintbrush tool to take away or add to an existing contour, deleting a contour and redrawing from scratch, and moving the contours as a whole. We did not collect data on tools utilized, though generally most experts appeared to prefer the paintbrush method for making edits.
2.2. Automatic segmentation
The automatic segmentations presented for editing were the same as those generated in the de novo study. In summary, we segmented the organs at risk using two methods. An atlas-based, registration-driven method was used to segment the brainstem and eyes. It involves a global affine registration of the atlas to the target, followed by automatic extraction of a predefined bounding box from both target and atlas. A second, now local, affine registration is performed on the bounded region, resulting in a transformation projecting the atlas to the target. Normalized mutual information (NMI) (Studholme et al. 1999) is used as the similarity measure. A local non-rigid registration is then performed between the result of the local affine registration and the atlas. Lastly, the composed transformations resulting from the three registrations are used to project contours from the atlas to the target (patient) image.
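For readers who wish to experiment with a comparable pipeline, the following is a minimal sketch using the open-source SimpleITK toolkit rather than our in-house registration code. It substitutes Mattes mutual information for NMI, shows only the global affine stage, and projects an atlas label map onto the target with nearest-neighbour resampling; the file names and parameter values are illustrative assumptions, not those used in this study.

```python
import SimpleITK as sitk

def affine_register(fixed, moving):
    """Intensity-based affine registration (Mattes MI as a stand-in for NMI)."""
    reg = sitk.ImageRegistrationMethod()
    reg.SetMetricAsMattesMutualInformation(numberOfHistogramBins=50)
    reg.SetMetricSamplingStrategy(reg.RANDOM)
    reg.SetMetricSamplingPercentage(0.2)
    reg.SetInterpolator(sitk.sitkLinear)
    reg.SetOptimizerAsRegularStepGradientDescent(
        learningRate=1.0, minStep=1e-4, numberOfIterations=200)
    initial = sitk.CenteredTransformInitializer(
        fixed, moving, sitk.AffineTransform(3),
        sitk.CenteredTransformInitializerFilter.GEOMETRY)
    reg.SetInitialTransform(initial, inPlace=False)
    return reg.Execute(fixed, moving)

# Illustrative file names; the full pipeline would add a bounding-box local
# affine stage and a non-rigid stage before projecting contours.
atlas_img    = sitk.ReadImage("atlas_ct.nii.gz", sitk.sitkFloat32)
target_img   = sitk.ReadImage("patient_ct.nii.gz", sitk.sitkFloat32)
atlas_labels = sitk.ReadImage("atlas_labels.nii.gz")  # e.g. brainstem and eyes

transform = affine_register(target_img, atlas_img)

# Project atlas labels onto the target grid with nearest-neighbour interpolation.
projected = sitk.Resample(atlas_labels, target_img, transform,
                          sitk.sitkNearestNeighbor, 0, atlas_labels.GetPixelID())
sitk.WriteImage(projected, "projected_labels.nii.gz")
```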
The second method, used to segment the optic chiasm and nerves, is a technique we have developed for the segmentation of tubular structures, termed the atlas-navigated optimal medial axis and deformable model algorithm (NOMAD) (Noble & Dawant 2011). NOMAD first computes the medial axis of the structure as the optimal path with respect to a cost function relative to image and shape features, and then expands using a level-set algorithm to the final structure. The statistical model employed in NOMAD was trained on image volumes outside those used in this study.
The non-rigid transformations used in our segmentation framework are provided by the adaptive bases algorithm (ABA) we have developed (Rohde et al. 2003). It utilizes NMI and models the deformation field registering the atlas and target as a linear combination of radial basis functions (Wu 1995) with finite support.
2.3. Ground truth estimation
To gauge accuracy we calculated ground truth estimations via two methods: the simultaneous truth and performance level estimation (STAPLE) algorithm (Warfield et al. 2004) and dynamically thresholded probability maps (p-maps) (Meyer et al. 2006). We calculated STAPLE and the p-map derived ground truths in the same manner described previously (Deeley et al. 2011). In that study we found the two methods produced ground truth estimates with a high degree of overlap as measured by DSC, even for the small tubular structures. However, much of the impetus to use two independent estimates arose from a qualitative observation that STAPLE could be influenced disproportionately by under-segmenting individuals. We also noted that though spatial overlap between the two methods was consistently high, STAPLE also consistently under-segmented compared to the p-map method. For the purposes of this work we define under-segmented structures as those that are volumetrically smaller than a reference structure.
STAPLE is designed to be robust to bias (Warfield et al. 2004), so we applied it as has been commonly done in other work (Stapleford et al. 2010) to create a single ground truth per patient from a cohort of experts. With the p-map method we removed bias explicitly and thus calculated a different ground truth mask for each rater in a leave-one-out process. That is, the p-map derived ground truth for rater P1, for example, is generated from the p-map which excludes his own segmentations. We also eliminated all segmentations delineated by rater P2 from the pool as evidence from the de novo study showed this rater was often different from the rest of the group. The p-maps are calculated as
$$ P_{i,k} = \frac{1}{N} \sum_{j \neq 2} K * E_{i,j,k} \qquad (1) $$
where i, j, and k represent the patient case, rater, and structure, respectively; j = 2 indicates rater P2, whose segmentations are excluded from the sum; N is the number of raters contributing to the pool (which, for the leave-one-out estimates, also excludes the rater under evaluation); E represents the binary mask; and K is a Gaussian kernel of 3×3 pixels applied to each slice of the mask with standard deviation σ = 0.65 pixel widths. A full discussion of the method and rationale for thresholding the p-maps can be found in our prior work (Deeley et al. 2011).
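The p-map construction and its thresholding at the non-zero mean can be sketched in a few lines of numpy/scipy; the function name, the mask layout (slices × rows × columns), and the use of scipy's gaussian_filter with a truncation radius approximating a 3×3 kernel are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def pmap_and_threshold(masks, exclude=(), sigma=0.65):
    """Probability map (eq. 1) and its threshold at the non-zero mean (Pmapmean).

    masks   : dict mapping rater label -> binary mask, shape (slices, rows, cols)
    exclude : rater labels left out of the pool (the rater under evaluation
              and, per the study design, rater P2)
    """
    pool = [np.asarray(m, dtype=float) for r, m in masks.items() if r not in exclude]
    smoothed = []
    for m in pool:
        # Smooth each axial slice with a small Gaussian; truncate=1.5 keeps the
        # kernel support at roughly 3x3 pixels for sigma = 0.65.
        smoothed.append(np.stack([gaussian_filter(sl, sigma=sigma, truncate=1.5)
                                  for sl in m]))
    pmap = np.mean(smoothed, axis=0)
    threshold = pmap[pmap > 0].mean()   # the "non-zero mean" vote level
    return pmap, (pmap >= threshold)

# Example: ground truth for evaluating P1, excluding P1's own contours and P2's.
# pmap, gt_mask = pmap_and_threshold(all_rater_masks, exclude=("P1", "P2"))
```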
The de novo and editing studies resulted in a combined 32 sets of manual segmentations for each patient (in total, 640 structure sets and 3840 individual organ structures): 8 de novo (P1-P4, J1-J4) and, on average, 8 edited-automatic (A1), 9 edited-self (self), and 7 edited-peer (peer). From these we calculated ground truth estimations to provide a basis for accuracy assessment and calculation of distance maps (discussed in section 2.4.2). Figure 1 illustrates a p-map for an optic nerve with a single physician contour as well as the corresponding Pmapmean and STAPLE ground truth estimations. An important choice had to be made as to which of the segmentation cohorts to draw from in these calculations. One could envision using all of the manual de novo segmentations, the manual-edited segmentations, those groups combined, or individual sets of manual-edited segmentations (e.g., edited-peers). We chose to base all analyses in this study on the class derived from the edited-peers group, the rationale for which we discuss in section 4.
Figure 1.
Ground truth estimation. The upper left panel displays the area of an optic nerve for patient 12 on an axial CT slice. The dotted contour is the automatic segmentation after editing by rater P3. The upper right panel plots the p-map used to estimate a ground truth for comparison against P3 and consists of his peers’ segmentations. The contour overlaying the p-map is the unedited automatic result. The ground truth estimated by thresholding the p-map at the non-zero mean is shown at lower left, and the STAPLE estimation for the same slice at lower right.
2.4. Metrics for comparison
2.4.1. Volume-based metrics
We calculated two volumetric measures in this study: nominal volume and Dice similarity coefficient. Nominal volume is calculated as
$$ V_{i,j,k} = v \sum_{\mathbf{x}} E_{i,j,k}(\mathbf{x}) \qquad (2) $$
where the binary mask E for patient i, rater j, and structure k is summed over its voxels, each of volume v. Volume provides a summary statistic about the gross size of segmentations and can be compared readily to results from other studies utilizing different datasets, as there is no dependence on ground truth estimation, group variance, or image dimensionality.
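Equation (2) amounts to counting labelled voxels and scaling by the voxel volume, as in the following sketch (function and variable names are illustrative):

```python
import numpy as np

def nominal_volume_cc(mask, voxel_volume_mm3):
    """Nominal volume (eq. 2): labelled-voxel count times voxel volume, in cm^3."""
    return voxel_volume_mm3 * np.count_nonzero(mask) / 1000.0

# e.g. a CT with 1 x 1 mm in-plane voxels and 2 mm slices:
# volume = nominal_volume_cc(brainstem_mask, voxel_volume_mm3=1.0 * 1.0 * 2.0)
```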
The Dice similarity coefficient (DSC) is a spatial overlap measure that can be calculated generally via
$$ \mathrm{DSC}(E_A, E_B) = \frac{2\,\lvert E_A \cap E_B \rvert}{\lvert E_A \rvert + \lvert E_B \rvert} \qquad (3) $$
where EA and EB denote any two mask volumes of the same dimensionality. Its range is [0,1], where 0 signifies no overlap and 1 signifies exact overlap. The DSC can be calculated as an integrative measure over the volume segmentations or on a slice-by-slice basis. In this study we calculated DSC on a volumetric basis to measure inter-rater variance and assess accuracy, while the slice-by-slice implementation was used only to gauge the amount of editing.
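A minimal implementation of equation (3) for binary masks, applicable both volumetrically and slice by slice, might look as follows (names are illustrative):

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice similarity coefficient (eq. 3) between two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    denom = a.sum() + b.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks are in exact agreement
    return 2.0 * np.logical_and(a, b).sum() / denom

# Volumetric DSC uses the full 3D masks; the slice-wise version applies the
# same formula per axial slice, e.g. [dice(a[z], b[z]) for z in range(a.shape[0])].
```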
2.4.2. Distance-based metrics
Distance-based metrics complement the volume-based metrics by providing information about differences between segmentations at their edges, independent of object shape. There are a number of methods for calculating generalized distance measures; a discussion toward a generic evaluation of image segmentation using the concept of distance is provided by Cardoso (2005). In this work we were concerned with the end-use of segmentations in radiation treatment, an environment well-suited to Euclidean distances, sometimes known as ordinary or surface normal distances.
Signed three-dimensional Euclidean distance maps were pre-calculated from the Pmapmean simulated ground truths. Each voxel in the distance map contains the Euclidean distance between that voxel and the nearest edge voxel of the ground truth. Distance distributions were formed for individual rater segmentations by sampling the appropriate distance map with contour points for the segmentation in question. Contour points lying inside the boundary of the ground truth were signed negative and those lying outside signed positive. The distributions were used to calculate average min, mean, and max distances across the patients, raters, and structures.
Additionally, we calculated a quantity which we term the true positive rate. It is the fraction of total contour points falling within a shell of a specified distance from the edge of the simulated ground truth, Pmapmean. We chose ± 2 mm as a relevant distance for selection of the shell. It is on the order of the slice thickness, and while one would want to minimize uncertainty from segmentation as much as possible, 2 mm is on the order of the overall geometric accuracy of most linear accelerators.
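A sketch of the signed distance map, the contour-point sampling, and the ± 2 mm true positive rate is given below; it assumes scipy's Euclidean distance transform and that contour points have already been converted to voxel indices in the array's (slice, row, column) ordering.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def signed_distance_map(gt_mask, spacing_mm):
    """Signed Euclidean distance (mm) to the ground truth: negative inside, positive outside."""
    gt = np.asarray(gt_mask, dtype=bool)
    outside = distance_transform_edt(~gt, sampling=spacing_mm)
    inside = distance_transform_edt(gt, sampling=spacing_mm)
    return outside - inside

def sample_distances(dist_map, contour_voxel_indices):
    """Signed distances at a rater's contour points, given (slice, row, col) indices."""
    idx = np.asarray(contour_voxel_indices, dtype=int)
    return dist_map[idx[:, 0], idx[:, 1], idx[:, 2]]

def true_positive_rate(distances_mm, shell_mm=2.0):
    """Fraction of contour points within +/- shell_mm of the ground truth surface."""
    return float(np.mean(np.abs(np.asarray(distances_mm)) <= shell_mm))
```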
Several aspects of this implementation are noteworthy. First, as the distances are signed, we avoided summary statistics without also examining the distribution, as measures of central tendency could be washed out by positive and negative variations. Second, as the distance distributions are calculated in one direction only, from rater segmentation to ground truth, each distribution contains information regarding exclusively where a rater segmented, but not where a rater elected not to segment. In that sense, a rater could delineate one slice of a multi-slice structure and have a distance distribution of zeroes. In the absence of complementary volume-based measures, this would be a weakness. Lastly, as the distances are calculated from ground truths, they do not provide a direct pair-wise comparison of the same flavour as DSC, and hence they are less valuable in determining inter-rater variance than DSC and nominal volume.
2.4.3. Time-to-edit
Time is an important factor in the process of treatment planning, which can be a complex multi-step workflow with a number of checkpoints requiring input from several professionals. In this study we measured the time required by physicians to modify pre-generated segmentations, as discussed in previous sections. This was accomplished via an in-house task queue and timing program. To be as clinically realistic as possible, the software alerted the expert to the current task in need of attention and allowed for pausing and restarting. The experts were instructed not to run the timer during administrative tasks such as opening and closing patients.
2.5. Analytical framework
We use the measures discussed herein in combination to test the hypotheses laid out in section 1. One tool that we use repeatedly to present the reader with a visual summary of the data is the boxplot (Tukey 1977). Each boxplot divides the distribution into quartiles q1-q4, where the inter-quartile range q3-q1 is represented by a vertical rectangular box and vertical lines, also known as whiskers, extend past the box to represent the range of the distribution; outliers, plotted individually as dots, are points falling beyond 1.5 times the inter-quartile range. The median of the distribution is shown as a red horizontal line within the box. In cases where the distribution satisfies conditions for normality, notches were used to provide information about the significance of differences at the median. In these plots notches are represented by triangles whose centers delimit the edges of the 95% confidence interval about the median.
Some distributions, such as that of DSC, are not normally distributed, and in such cases we have calculated measures of central tendency and 95% confidence intervals via bias-corrected and accelerated (BCa) bootstrapping with 1000 replicates (Davison & Hinkley 1997). Other distributions achieved normality after transformation of the data, such as with the logit function (Zou et al. 2004).
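As an illustration of the bootstrap confidence intervals, the following sketch uses scipy.stats.bootstrap (SciPy ≥ 1.7) with the BCa method and 1000 replicates; the original analysis may have been performed with different software, so this is only a reproduction of the idea.

```python
import numpy as np
from scipy.stats import bootstrap

def mean_with_bca_ci(values, n_resamples=1000, seed=0):
    """Mean of a (possibly skewed) sample with a BCa bootstrap 95% confidence interval."""
    values = np.asarray(values, dtype=float)
    res = bootstrap((values,), np.mean, n_resamples=n_resamples,
                    confidence_level=0.95, method='BCa',
                    random_state=np.random.default_rng(seed))
    ci = res.confidence_interval
    return values.mean(), (ci.low, ci.high)

# e.g. mean_dsc, (lo, hi) = mean_with_bca_ci(pairwise_dsc_values)
```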
3. Results
3.1. Assessing editing efficiency
One aim of introducing automation into the segmentation process is to improve efficiency. We measured two variables to gauge efficiency: time to edit pre-generated contours and amount of editing required for a satisfactory end product. All measures of quality being held equal, one would choose a process that minimizes both of these factors. In our previous study we found that the experts required a mean time of approximately 14.5 minutes (total range 4.5–31 minutes, individual means 6.5–21 minutes; N = 107 contouring sessions, as times were not collected for the first six patients and on occasion raters forgot to start the timer) to segment the brainstem, chiasm, eyes, and optic nerves utilizing fused CT/MR, though individuals varied widely. Editing required considerably less time than contouring from scratch. Panel (a) of figure 2 plots the distribution of times across all raters for the de novo (P1−J4) and editing studies for the group of tasks in which the raters edited A1, the pre-generated automatic segmentations produced by our algorithm. Editing of A1 reduced the mean time to final product to 5.9 minutes (95% confidence interval [5.5, 6.4]), editing of their own (self) contours reduced it to 4.3 [4.0, 4.7] minutes, and editing of their peers’ contours to 5.5 [4.9, 6.1] minutes. Panel (b) compares the distributions across the three sources for editing: automatic (A1), self, and peer. We found there was a significant (α = 0.05) though small reduction in time when raters were presented with their own contours segmented in the previous study as compared to those of the automatic system or their peers. As this was a task-oriented study conducted over approximately a year, we wondered whether there would be an effect of learning, or even potentially fatigue, on time to modify. We randomized the tasks over the patient population to avoid confounding case difficulty with experience and found that, taken as a group (panel (c)), there was no learning effect. We similarly found there was no influence of task number on accuracy as measured by DSC against the ground truth estimations.
Figure 2.
Time analysis. Panel (a) plots the distribution of times across all raters for the de novo (P1-J4) and editing studies for the group of tasks in which the raters edited A1, the pre-generated automatic segmentations. Panel (b) compares the distributions across the three sources for editing: automatic (A1), self, and peer. In all, each rater completed 60 randomized tasks over the course of the editing study. Panel (c) plots the time to modify as a function of task number to evaluate whether there was a learning effect.
To keep the study as clinically relevant as possible and the timing procedure valid, we did not ask raters to comment directly on the quality of the contours presented (Stapleford et al. 2010). Rather, we gauged acceptability using the Dice coefficient in a pairwise calculation between the pre- and post-editing masks, both by slice and volumetrically. For example, a DSC of 1.0 indicates unequivocally that the initial segmentation matches the final segmentation. As the similarity between pre- and post-editing segmentations increases, DSC increases as a function of the overlap relative to the volume or area of the segmentations, indicating smaller changes were made. The results of the volumetric calculation are shown in figure 3, where distributions of DSC are plotted as a function of source for editing and structure. Most edits to the brainstem and the eyes resulted in a less than 10% change in spatial overlap, whereas the chiasm and nerves required more extensive editing across all sources. Mirroring the data concerning time, there was a small preference of raters for their own contours compared to the automatic and those of their peers. To evaluate the amount of editing by slice, we divided the range of DSC, [0,1], into four categories (table 1): major [0,0.7), moderate [0.7,0.9), minor [0.9,1), and no [1,1] editing. Here there was a clear preference of raters for their own contours, to which they made no edits on 43%, 29%, 59%, and 44% of slices for the brainstem, chiasm, eyes, and optic nerves, respectively. The chiasm and nerves underwent substantial editing regardless of the original source: A1, self, or peers.
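The slice-wise categorization used for table 1 is a simple binning of 2D DSC values; a sketch (with illustrative names) follows.

```python
import numpy as np

# Category edges from table 1: major [0, 0.7), moderate [0.7, 0.9), minor [0.9, 1), none [1, 1]
EDGES = [0.7, 0.9, 1.0]
LABELS = ("major", "moderate", "minor", "none")

def editing_category_fractions(slice_dsc_values):
    """Fraction of slices in each editing category, from pre- vs post-edit 2D DSC."""
    dsc = np.asarray(slice_dsc_values, dtype=float)
    bins = np.digitize(dsc, EDGES)   # DSC == 1.0 lands in the 'none' bin
    return {label: float(np.mean(bins == i)) for i, label in enumerate(LABELS)}
```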
Figure 3.
Plot of volumetric Dice coefficient as a function of source for editing: A1, self, and peers. Each distribution consists of the pairwise comparisons between the de novo and edited segmentations for each of the eight raters. For example, P1's de novo segmentation is compared via DSC to P1's edited result from each of the editing sources to gauge the similarity (or, equivalently, the amount of editing).
Table 1.
The range of DSC [0,1] is divided into four categories to gauge amount of editing as a function of structure and source: major [0,0.7), moderate [0.7,0.9), minor [0.9,1), and no [1,1] editing. Each cell contains the fraction of slices via 2D DSC calculation that fell within a given range.
| | Brainstem | | | Chiasm | | | Eyes | | | Nerves | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Editing | A1 | self | peers | A1 | self | peers | A1 | self | peers | A1 | self | peers |
| none | 0.28 | 0.43 | 0.32 | 0.15 | 0.29 | 0.17 | 0.37 | 0.59 | 0.43 | 0.26 | 0.44 | 0.37 |
| minor | 0.42 | 0.41 | 0.44 | 0.10 | 0.13 | 0.11 | 0.40 | 0.24 | 0.29 | 0.13 | 0.14 | 0.16 |
| moderate | 0.21 | 0.10 | 0.14 | 0.14 | 0.15 | 0.12 | 0.15 | 0.11 | 0.19 | 0.25 | 0.17 | 0.18 |
| major | 0.09 | 0.06 | 0.09 | 0.62 | 0.43 | 0.60 | 0.08 | 0.06 | 0.09 | 0.37 | 0.25 | 0.29 |
Figure 4 contains four panels of orthogonal MR cross-sections comparing (a) the de novo, (b) A1-edited, (c) self-edited, and (d) peer-edited groups. In this example the de novo contours display the most variance. The unedited A1 contours are included in red in panel (a). We also note the internal carotid arteries, erroneously contoured as part of the optic chiasm, marked with red arrows in the coronal (upper right) section of panel (a). Editing of A1 reduced inter-rater variability the most and eliminated the inclusion of the internal carotids for all but one rater.
Figure 4.
Orthogonal views comparing group results from (a) de novo, (b) A1-edited, (c) self-edited, (d) peer-edited. The red arrows in the upper right (coronal section) of panel (a) point to the internal carotid arteries, which were often erroneously included as part of the optic chiasm in the de novo study as well as self- and peer-edited groups. In panel (a) the red contours are those of the A1 while the other colors represent manual expert segmentations.
3.2. Evidence regarding hypothesis: Editing of automatic segmentations (A1) reduces inter-rater variance
Each of the physician raters was presented complete sets of automatic segmentations (A1) for each of the 20 patients as discussed in section 2.1. This process was blinded and randomized along with the presentation of self and peer segmentations. Figure 5 plots the distribution of volumetric Dice coefficients over all structures and raters, providing a sense of inter-rater variance in the de novo study versus the editing study. Columns A1, P1-J4 plot the distributions of non-redundant pairwise DSC from the de novo study. Primed columns (P1′-J4′) denote the editing study results. These distributions relate inter-rater variance in two key ways. The first is simply the DSC statistic itself, which can be summarized via the mean or median. The red horizontal lines in the boxplots represent the medians. The means and corresponding 95% confidence intervals and standard deviations were calculated via bootstrapping and are presented in tables A1 and A2. In figure 5 it is clear that across all structures and raters the median DSC increased with editing of the automatic contours. The gains were largest for the chiasm and nerves, though even after editing agreement was still less than that seen with the brainstem and eyes. The mean inter-rater DSC, treating all raters as a single group, increased from 0.83 (de novo) to 0.92 for the brainstem, 0.39 to 0.57 for the chiasm, 0.83 to 0.93 for the eyes, and 0.49 to 0.73 for the optic nerves when the raters were presented with automatic contours for editing. The second way DSC relates inter-rater variance is through the spread of these distributions, which decreased as a result of editing such that there was a reduction in both outliers and standard deviation.
Figure 5.
Plots the distribution of volumetric Dice coefficients for the editing of the automatic segmentations (A1) over all structures and raters, providing a sense of inter-rater variance in the de novo study versus the editing study. Columns A1, P1-J4 plot the distributions of non-redundant pairwise DSC from the de novo study. Primed columns (P1′-J4′) denote the editing study results.
Similar to figure 5, nominal volume is plotted in figure 6, and the corresponding means, confidence intervals, and standard deviations are presented in table A3 for all raters as a group and as a function of source: unedited A1 and de novo, as well as the edited A1, self, and peer groups. Editing of A1 resulted in a reduction in the inter-quartile range as well as the coefficient of variation over all raters as a group and across each of the structures.
Figure 6.
Plots the distribution of nominal volume as a function of structure, rater, and segmentation class: unedited automatic (A1), de novo (columns P1-J4), and edited A1 (primed columns).
3.3. Evidence regarding hypothesis: Editing of automatic segmentations (A1) maintains or improves accuracy
To assess accuracy, we compared rater segmentations to ground truth estimations via the Dice coefficient. Figure 7 plots the distributions of pairwise DSC of the STAPLE and Pmapmean ground truth estimations against the automatic and rater segmentation distributions. Each subplot can be divided in two: the five left columns (A/S,…,E/S) plot the segmentations against the STAPLE-derived ground truth, while the right-side columns (A/P,…,E/P) plot the segmentations against the Pmapmean ground truth. Columns A through E represent the DSC distributions for the unedited automatic segmentations, unedited de novo segmentations, edited automatic segmentations, edited self, and edited peers, respectively.
Figure 7.
Plots the distributions of pairwise DSC of the STAPLE and Pmapmean ground truth estimations against the automatic and rater segmentation distributions. Each subplot can be divided in two: the five left columns (A/S,…,E/S) plot the segmentations against the STAPLE-derived ground truth, while the right-side columns (A/P,…,E/P) plot the segmentations against the Pmapmean ground truth. Columns A through E represent the DSC distributions for the unedited automatic segmentations, unedited de novo segmentations, edited automatic segmentations, edited self, and edited peers, respectively.
Figure 7 provides evidence that accuracy, compared to the unedited A1 and de novo segmentations, was at minimum maintained by editing, and this was consistent against both the STAPLE and Pmapmean ground truth estimates. However, the small tubular structures of the chiasm and nerves benefited the most from editing. The mean [95% CI] DSC against the ground truths for the chiasm increased from 0.47 [0.43, 0.50] and 0.45 [0.43, 0.47] for A1 and de novo to 0.55 [0.53, 0.56], 0.53 [0.51, 0.55], and 0.59 [0.57, 0.61] for edited-automatic, -self, and -peer, respectively. These mean DSC and 95% confidence intervals can be found in table 2.
Table 2.
Assessing the accuracy of five classes of segmentations, unedited automatic (A1), de novo, and the editing groups M(A1), M(self), and M(peers), via DSC against the ground truth estimates. Each cell gives the mean DSC with the lower and upper bounds of its 95% confidence interval.
| | Brainstem | | | Chiasm | | | Eyes | | | Nerves | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Source | Mean | CI low | CI high | Mean | CI low | CI high | Mean | CI low | CI high | Mean | CI low | CI high |
| A1 | 0.87 | 0.86 | 0.88 | 0.47 | 0.43 | 0.50 | 0.88 | 0.88 | 0.89 | 0.59 | 0.57 | 0.61 |
| de novo | 0.87 | 0.86 | 0.88 | 0.45 | 0.43 | 0.47 | 0.88 | 0.87 | 0.88 | 0.63 | 0.61 | 0.64 |
| M(A1) | 0.89 | 0.89 | 0.90 | 0.55 | 0.53 | 0.56 | 0.90 | 0.89 | 0.90 | 0.66 | 0.65 | 0.67 |
| M(self) | 0.88 | 0.88 | 0.89 | 0.53 | 0.51 | 0.55 | 0.89 | 0.89 | 0.90 | 0.66 | 0.65 | 0.67 |
| M(peers) | 0.90 | 0.90 | 0.91 | 0.59 | 0.57 | 0.61 | 0.90 | 0.90 | 0.91 | 0.70 | 0.69 | 0.71 |
A complementary gauge of accuracy to DSC is the Euclidean, or surface normal, distance from the ground truth estimate to the test segmentation. Figure 8 plots the signed distances, where positive indicates a point outside and negative a point inside the ground truth, Pmapmean. The columns from left to right represent the distribution of distance error for the unedited A1, the de novo raters (columns P1,…, J4), and the edited A1 for individual raters (columns P1′,…, J4′). Blue dashed lines indicate a distance error of ± 2 mm. These distributions comprise hundreds of thousands of contour points, and there are a number of outliers. To improve clarity we have plotted them over a fixed range from −5 mm to +10 mm, denoted by the lower and upper bounding dashed lines. If outliers occur beyond these bounds, they are shown in lower and upper bands for which the density is proportional to the number of outliers. The number of data points comprising each column of each subfigure (of figure 8) varies as a function of the size of the structure drawn from one rater to the next. Insomuch as the columns comprise the same number of whole-structure distance calculations (20 for each column), relative comparisons of outliers are valid. However, absolute comparisons of outlier density between structures are invalid, as the brainstem, for instance, has vastly more contour points than the chiasm, nerves, or eyes. There were generally only small changes in median distance error between the de novo and edited-automatic segmentations, though the number of outliers was reduced in the edited distributions for most cases.
Figure 8.
Plots the signed distance errors, where positive indicates a point outside and negative a point inside the ground truth. The columns from left to right represent the unedited A1, the de novo raters (columns P1,…, J4), and the edited A1 (columns P1′,…, J4′). Inner dashed lines indicate a distance error of ± 2 mm. To improve clarity the distributions are plotted over a fixed range from −5 mm to +10 mm, denoted by the lower and upper bounding dashed lines. If outliers occur beyond these bounds, they are shown in lower and upper bands for which the density is proportional to the number of outliers.
3.4. Evidence regarding hypothesis: Editing of automatic segmentations (A1) salvages the results of low performing raters
In our previous study we noted that one rater in particular often produced segmentations different from the rest of the group. For this reason the rater in question was not included in the ground truth estimation. We hypothesized that editing of the automatic segmentations would salvage this rater's performance. We use salvage to describe the process of preventing a negative result, as in “radiation was able to salvage the failed surgery”. Looking at figures 5 and 6, we can see situations where P2 had a distribution markedly different from the group's, and these deviations were corrected by editing of A1. The mean volumetric DSC for rater P2 against the other experts increased from 0.69 to 0.91 (brainstem), 0.25 to 0.56 (chiasm), 0.75 to 0.93 (eyes), and 0.40 to 0.72 (nerves) through editing of the automatic segmentations.
We expect all raters on occasion to produce segmentations of low accuracy. These could be entire segmentations, such as mistaking the pituitary for the optic chiasm, or individual areas, such as a single slice or series of slices omitted from the inferior brainstem. To this end we compared areas of low quality (slice DSC < 0.5) in the de novo study to the same areas after editing of the automatic contours. First, we found the frequency of total miss, or omission of a slice, was higher (16%) than the frequency of present but low quality contours (3.4%). The unedited A1 produced fewer total misses (12%) but more low quality slices (8%) than the experts. As a result of editing of the automatic contours, the median DSC of the low quality slices increased from 0, which was skewed heavily by total misses, to a minimum of 0.5 for each of the raters. This reduced the total miss frequency by half, though the overall accuracy in these areas remained challenged. The mean DSC after editing for slices that were total misses (DSC = 0) in the de novo study improved to 0.45. Similarly, for slices that contained contours but of low quality (DSC in the range (0, 0.5)), the post-editing mean DSC increased from 0.34 to 0.52.
While it is clear that areas of poor performance de novo remained a challenge for raters during editing, the situation was improved as can be seen by the increase in both mean and median DSC and the avoidance of approximately half of total misses. Interestingly, P2, the lowest performing rater de novo, saw the most dramatic improvement, from a median DSC of 0 to 0.68, the highest of the rater group, after editing of the automatic segmentations.
3.5. Evidence regarding hypothesis: Contour editing reduces inter-rater variation while maintaining or improving accuracy irrespective of the source segmentation
Thus far the results have focused on the performance of the automatic system in the context of editing compared to the experts’ de novo segmentations and the unedited automatic segmentations. As outlined in section 2.1, we also asked the physicians in a blinded and randomized experiment to modify their own segmentations and those of their peers.
Their performance in these tasks is assessed in figures 9–11 alongside the automatic and de novo results. The following nomenclature is used to distinguish the classes: automatic-unedited (A1), expert-unedited (de novo), automatic-edited (M(A1)), experts modifying their own initial segmentations (M(self)), and experts modifying their peers' initial segmentations (M(peer)).
Figure 9.
Plots the distributions of volumetric DSC across each class of segmentation: unedited automatic (A1), de novo, and the editing groups M(A1), M(self), and M(peers).
Figure 11.
True positive rate is plotted as the fraction of contour points falling within a 2 mm shell of the ground truth across the 5 segmentation classes. The dashed line is drawn at the level of the median for the unedited automatic (A1).
Figure 9 plots the distributions of volumetric Dice coefficient across each class of segmentation, and tables A1 and A2 provide the mean, 95% confidence interval, and standard deviation for the same. Inter-rater variation was reduced for all editing groups as seen by both the increase in mean DSC and reduction in standard deviation. However, the best results came through editing of the automatic segmentations, which was consistent across the different structures. There was also a small but significant (α = 0.05) advantage to modifying peers’ as opposed to one’s own segmentations.
Edits resulted in only small differences in distance error, plotted in figure 10, compared to the unedited automatic and de novo segmentations, both in terms of median error and with regard to the extent and number of outliers. [Note in figure 10, when viewing the outliers shown by dots in red at the extremes of distributions, that the A1 distributions (divided from the other groups by a vertical line) are a result of only 20 segmentations each, whereas the other groups have approximately eight times the number of segmentations (one for each rater) in their distributions. Therefore, a direct comparison of outlier prevalence between A1 and the others is not possible visually.] In fact, in the cases of the optic chiasm and brainstem the unedited automatic segmentations produced a median distance error closer to zero than either the de novo physicians or any of the edited segmentations. These boxplots, however, consider the complete set of all contour points for a given rater or rater group, which skews the results towards patients with larger structures and raters who contoured larger structures. Weighting each rater and case equally, we recalculated the mean (and 95% confidence interval), minimum, and maximum distance errors provided in table A4. Across all classes of segmentations mean distance errors were approximately equal to or less than 1 mm. Interestingly, the unedited automatic segmentations performed well in comparison to the edited classes. This was especially true in terms of maximum (signed positive) distance errors, which were smaller for all structures except the eyes.
Figure 10.
Distance errors are plotted as a function of structure and segmentation class: unedited automatic (A1), de novo, and edited A1, self and peer. The inner dashed lines are drawn at ± 2 mm from the ground truth estimation. The plots are confined to a range −5 mm to + 10 mm as shown by the outer dashed lines. If a distribution has outliers beyond this range, they are plotted in the small bands at the periphery of the distributions. The density of the outliers within the bands is proportional to the number of outliers beyond the plot range.
We also used the signed distance maps to calculate the true positive rate within a 2 mm shell around the ground truth estimation. We found significant differences (α = 0.05) among the unedited (A1 and de novo) and edited segmentations (figure 11) only in the case of the optic chiasm, where both the unedited A1 and all editing classes (A1-, self-, and peer-edited) had a higher true positive rate than the unedited de novo class. This advantage disappeared when we narrowed the shell to 1 mm (not shown), such that there were no differences (α = 0.05) amongst the five groups of unedited and edited segmentations.
Another question that we can begin to answer is whether editing is robust to initial segmentations of varying quality. We can do so by examining the correlation between pre- and post-editing accuracy. Figure 12 plots DSC against the ground truth estimates pre- and post-editing for (a) A1 and (b) P2. We chose to single out P2 to illustrate this effect, as this rater generally produced the segmentations most different from the group in the de novo study. A line with a slope of 1 is plotted through the origin; all points above the line indicate an improvement in accuracy. We see in general that editing improves accuracy, which is supported by figure 7 as well, and editing appears largely robust in areas of initially low quality (low DSC de novo). In figure 12(b) we see that each time a peer edited the contours of P2 the accuracy was improved, usually substantially, though the edits did not generally attain a final accuracy as high as was achieved starting from higher accuracy segmentations.
Figure 12.
Plots of DSC against the ground truth segmentations pre- and post-editing for (a) A1 and (b) P2. Note P2 was not randomized to all peers during editing, hence the difference between the rater legends of (a) and (b).
4. Discussion
We have undertaken a large scale behavioural study to better understand performance of automatic and manual segmentation in radiation therapy and the interaction of the two. Previously (Deeley et al. 2011) we reported on automatic segmentation for brain organs at risk in the presence of large space-occupying lesions, which challenge registration-based methods. That study characterized inter-rater variance and found that the automatic system generally could serve as a surrogate to the physicians with potential gains in efficiency and accuracy within the treatment planning process. The basis for our experimental framework is the observation that segmentation should be evaluated in behavioural studies through 1) multi-dimensional metric analysis, e.g., volumetric and distance-based methods, 2) sufficient numbers of raters and patients chosen prospectively to ensure high power analyses, and 3) clinically realistic design that recognizes end-use of the segmentations. Here we applied this framework using the same physicians and patients in a single-blind editing study to test hypotheses concerning the impact of manual-automatic system interaction (editing of the automatic) on inter-rater variance and accuracy, the impact of manual-manual rater interaction (modifying their own and peers’ segmentations), and whether the automatic system could salvage the performance of low performing raters.
4.1. Comparison to previous studies
Previously, Chao (2007) reported results of an editing study using computer-assisted delineation of head and neck structures. In this study, eight physicians manually contoured two head and neck cases and then edited contours produced from an atlas-based system. They found editing significantly reduced inter-rater variance as measured by nominal volume, Dice coefficient, and Euclidean distance disagreement. The authors proffered that computer-assisted segmentation and contour editing may be useful to educate physicians from different training backgrounds and to improve efficiency in the treatment planning process. We found this work compelling in the design of our study, but it was limited in several ways. While our experience would indicate the number of raters was generally sufficient, the overall analysis is likely of low power, as the statistical analyses were performed on each of the two patients separately. It is unlikely the reported results can be used to make inferences about a larger population. Additionally, there is a fundamental difference from the study we have undertaken. In the Chao study, the participants contoured from scratch and then immediately were presented with automatic contours for editing. Furthermore, the raters viewed the atlas images at all times during editing. This is certainly a valid design, though it has a markedly different emphasis than our own. One could envision a single standard atlas for use by all radiation oncologists for every contouring task, and one could extrapolate that this method may reduce variation within the population. Our focus was in a different direction. At the outset of the study we discussed with the group of experts general guidelines for delineation. We also chose a body site, the brain, where training is more ubiquitous and expected variance lower. We suspect viewing of the atlas during editing would additionally bias the raters in a clinically unrealistic way, as this is not standard practice.
Stapleford and colleagues (2010) also reported results of a segmentation and editing study involving the head and neck. They recruited five physicians to contour bilateral lymph node regions for five patients and to edit automatically generated contours. The data were analysed using five metrics: sensitivity, DSC, percent false positive, mean and maximum surface distance error, and volume. These metrics yield complementary information about the differences in segmentations. The use of percent false positive in place of specificity is particularly useful, as specificity depends on the image size and thus is not readily comparable between studies. They found the automatic contours compared well to the manual ones, and editing led to improved consistency. Interestingly, the experts commented that only 32 percent of contours were acceptable without editing, and the primary complaint was that the automatic segmentations were too large. However, when making edits, the cumulative changes only partially recaptured the mean volume of the manual segmentations, leading one to wonder about the bias introduced by the automatic segmentations. We found the opposite in our study of brain structures. The automatic system consistently produced smaller segmentations than the experts, but expert nominal volume was generally recaptured upon editing. In fact, we found in general that editing produced a trend of increasing segmentation size, though the effect was small (table 2 and figure 6).
There are some limitations to the methodology of the investigation by Stapleford and colleagues. The authors used STAPLE to calculate two ground truth segmentations for all pairwise comparisons: one from the cohort of manual segmentations and one from the cohort of edited segmentations. The use of simulated ground truths to assess accuracy is desirable, but their methodology rests the validity of almost all judgments on the quality of these estimations. First, we find it non-ideal to create separate ground truth estimates to compare groups of segmentations from the same imaging dataset. How can one infer with high confidence differences between segmentations when the two groups are being compared to different ground truths calculated from their respective groups? This requires a seemingly contradictory assumption: both ground truth estimations, while different from each other, are fully accurate ground truths, or at the very least of equal quality. If a systematic difference exists between the groups, it may be missed. We believe a more appropriate assumption, though not ideal, is to choose a single ground truth calculated from the most appropriate cohort for all comparisons. Second, basing variance analysis on intermediary comparisons with simulated ground truths will produce a perception of lower overall variance and increased correlation amongst the group. We posit that the most accurate and transparent way to evaluate inter-rater variance is through standalone metrics such as nominal volume, or through pairwise metrics on the unadulterated segmentations.
We found that editing of pre-generated segmentations both improved efficiency (reducing contouring time by at least 60%) and reduced inter-rater variance across all sources (A1, self, peers), structures, patients, and physicians. Though we found, interestingly, that physicians showed a preference toward their own contours in terms of time and amount of editing, variance was reduced more when they edited the automatic segmentations, regardless of structure. However, as Zou (2004) suggests, the problem can be restated as one in which error is a function of bias and variance, or, put another way, of random and systematic errors. Thus, one must be careful not to overstate the implications of observed variance. In our study, each rater edited the same automatically generated structures, such that the inter-rater variance before editing was zero. When modifying their own or peers' contours, the baseline variance was carried over from the de novo study.
To determine whether pre-generation of contours impacted the raters' accuracy, we employed two ground truth estimates. The STAPLE algorithm and our own approach with p-maps, as well as the rationale for using both estimates, have been discussed previously (Warfield et al. 2004, Deeley et al. 2011, Meyer et al. 2006, Biancardi et al. 2010). In general, ground truth estimation is a difficult problem that is at least in part a function of the size and quality of the input cohort. As discussed in reference to the work by Stapleford and colleagues, the choice of ground truth cohort can be vital to the conclusions drawn from the analyses. We had several distinct classes of segmentations (A1, de novo, A1-edited, self-edited, and peer-edited), each with multiple cases, raters, and structures from which to choose a cohort for ground truth estimation and subsequent accuracy analyses. The following considerations were made. First, since all expert segmentations are valid clinically by virtue of the raters' expertise and none of the automatic segmentations would be deemed acceptable without oversight, we did not include A1 as an input to the ground truth calculations. Second, it is also not valid to use A1-edited for reasons already mentioned: there is no basis to know whether it would bias the raters toward higher or lower accuracy. Third, heuristically, we reasoned that including either all physician segmentations (de novo and edited) or just those edited would be non-ideal, as there could be significant inter-class differences (increased variance) which would presumably lead to lower quality estimates. With this in mind we chose to make all assessments against ground truths calculated from a single class, the peer-edited class. Prospectively we anticipated that this class of edits would be the most likely to have reduced variance and similar or less bias compared to the de novo class (used for ground truth creation in the previous study) and the self-edited class, and this was borne out in the data, as can be seen in tables A1 and A2.
Testing against the ground truth estimates from the peer-edited class, we found that editing either maintained or improved accuracy (figure 7). Accuracy of the edited classes was similar to that of the unedited automatic and de novo classes for the brainstem and eyes, as seen in the 95% confidence intervals of mean DSC in tables A1 and A2, but editing improved accuracy in the more challenging optic chiasm and nerves. The distance data paint a less clear picture. Editing, regardless of source, reduced the number of outliers, but in terms of mean, minimum, and maximum distance error there were only small differences from the A1 or de novo classes. In fact, in the analysis of true positive rate within a 1 mm and 2 mm shell of the ground truth (figure 11), only for the chiasm were the results notable: the unedited automatic segmentations as well as all three edited classes had smaller distance errors than the physicians’ de novo contours.
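As an illustration of how such distance-based measures can be computed, the sketch below derives signed surface distances with a Euclidean distance transform and reports the fraction of a test surface lying within a tolerance shell of the ground-truth surface. The surface extraction, sign convention, and shell definition are simplifying assumptions and are not necessarily identical to the implementation behind figure 11.

```python
# Illustrative sketch: signed surface-distance errors (mm) and the fraction of a
# test surface within a tolerance shell (e.g., 1 mm or 2 mm) of the truth surface.
import numpy as np
from scipy import ndimage

def surface(mask: np.ndarray) -> np.ndarray:
    """Boolean surface voxels: the mask minus its one-voxel erosion."""
    mask = mask.astype(bool)
    return mask & ~ndimage.binary_erosion(mask)

def surface_distances(test: np.ndarray, truth: np.ndarray, spacing_mm) -> np.ndarray:
    """Signed distance (mm) from each test-surface voxel to the truth surface;
    negative inside the truth volume, positive outside (one possible convention)."""
    dist_to_truth = ndimage.distance_transform_edt(~surface(truth), sampling=spacing_mm)
    sign = np.where(truth.astype(bool), -1.0, 1.0)
    return (sign * dist_to_truth)[surface(test)]

def fraction_within(test, truth, spacing_mm, tol_mm=2.0) -> float:
    """Fraction of test-surface voxels lying within tol_mm of the truth surface."""
    d = surface_distances(test, truth, spacing_mm)
    return float(np.mean(np.abs(d) <= tol_mm))
```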
The de novo study previously uncovered that, within the group of eight experts, one was often an outlier and was therefore removed from the ground truth cohort. We also found that the source of these differences, especially in the brainstem, was often a failure to extend the organ as far in the cranial-caudal direction as the group. It would be very useful clinically for automatic systems to correct these errors, which we term total miss errors. Comparing every slice from the de novo study against the ground truth estimates, we isolated low-quality contours, defined as any slice with a DSC < 0.5. The prevalence of total misses over present-but-low-quality slices suggests that the cranial-caudal edges of structures are a challenge for manual raters. This is likely a result both of the lack of a natural boundary (e.g., brainstem and spinal cord) and of partial volume effects (e.g., chiasm, nerves and eyes). Editing of A1 was generally successful at salvaging the total misses. The improvement for present-but-low-quality slices was less remarkable and is likely a result of both the automatic system and the manual raters being challenged in areas of low contrast.
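The per-slice screening behind these counts can be sketched as follows: slices of the ground truth with no corresponding rater voxels are flagged as total misses, and contoured slices with DSC below 0.5 as present but low quality. The axis convention, threshold placement, and bookkeeping here are illustrative and may differ in detail from our analysis.

```python
# Illustrative per-slice screening (axial slices assumed along axis 0).
import numpy as np

def slice_quality(rater: np.ndarray, truth: np.ndarray, low_dsc: float = 0.5):
    """Return indices of total-miss slices and present-but-low-quality slices."""
    total_miss, low_quality = [], []
    for z in range(truth.shape[0]):
        t, r = truth[z].astype(bool), rater[z].astype(bool)
        if not t.any():
            continue                      # slice outside the ground-truth extent
        if not r.any():
            total_miss.append(z)          # structure entirely missed on this slice
            continue
        dsc = 2.0 * np.logical_and(t, r).sum() / (t.sum() + r.sum())
        if dsc < low_dsc:
            low_quality.append(z)         # contoured, but with poor overlap
    return total_miss, low_quality
```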
This study provides strong evidence that editing of pre-generated segmentations, independent of source, reduces inter-rater variance while maintaining or improving accuracy and increasing efficiency. This suggests that, given a starting point, even if the starting points differ, experts tend to converge. We postulate that raters approach the task of segmentation differently when modifying than when starting from a blank slate. The data showed, for instance, that although differences at the edges of contours (distance error) were not dramatically different from the de novo study, raters focused more on capturing the entire extent and correct location of a structure, suggesting that a good starting place for developing delineation standards may be to propose contours to experts in the field for editing. These results also lend evidence to the suggestions made in prior work (Chao et al. 2007, Beyer et al. 2006) that automatic methods can help improve consistency in radiation therapy treatment planning, especially in situations where users are less experienced. Finally, the two studies we have undertaken provide evidence that our unedited automatic segmentations perform quite well, and after editing provide an even more robust alternative to manual segmentation.
4.2. Limitations and future work
There were several limitations in this work. First, we have attempted to extend an experimental framework for segmentation analysis beyond what has been done previously, using a behavioural approach and a statistically robust design. However, our study, though large in comparison to others, is limited to a single institution and may therefore carry systematic bias. Additionally, as this is an ongoing project extending the prior de novo study, we have not evaluated the results of other algorithms or of now commercially available systems. We have focused on only one body site. Most of these choices were a function of resources, since behavioural studies require prolonged time for longitudinal tasks (over 2 years to collect data in our case) and are costly. Many more questions could be investigated with less global uncertainty if a framework such as the one we have proposed were implemented across multiple institutions, body sites, and algorithms. This would also help to capture useful information about how users interact with the contouring system, such as which tools were used for contouring or editing and whether those choices impacted results.
Second, the choice of metrics is important. We believe that multiple complementary and cross-study compatible metrics, such as the Dice coefficient, distance-based measures, and nominal volume, increase the value of the analyses. However, the metrics as used herein can only characterize the data and describe differences in and relationships between groups or classes of segmentations. A valuable analysis would involve understanding the sources of these differences, as has been done in prior work by Meyer (2006) and Zou (2004) using analysis of variance and multiple regression. We did not include that analysis herein as the scope was already extensive; however, given sufficient categorical understanding of the data, it could be done retrospectively, as sketched below. This type of analysis in a targeted study with multiple sources of varying quality would also help to further answer questions about the interaction between source segmentation quality and the editing process.
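A minimal sketch of how such a retrospective factor analysis could be set up (this analysis was not performed in the present study) is an ordinary least squares fit of per-contour DSC on categorical factors followed by an ANOVA table; the data-frame columns and input file named here are hypothetical.

```python
# Hypothetical sketch: attribute variation in per-contour DSC to categorical factors
# (rater, patient, structure, source class) via OLS and a type-II ANOVA table.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

def dsc_anova(df: pd.DataFrame) -> pd.DataFrame:
    """df has one row per contour with columns: dsc, rater, patient, structure, source."""
    model = smf.ols("dsc ~ C(rater) + C(patient) + C(structure) + C(source)", data=df).fit()
    return anova_lm(model, typ=2)

# df = pd.read_csv("per_contour_dsc.csv")   # hypothetical input table
# print(dsc_anova(df))
```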
Third, the end point of the segmentations in our context is the radiation therapy treatment plan, which was not considered herein. The ultimate impact of differences will manifest in dose coverage of target volumes and normal tissues. Others have looked at dosimetric end points (Weiss et al. 2008, Tsuji et al. 2010), but generally not in the context of a large-scale study with multiple raters. Nelms and colleagues (2012) conducted a “Plan Challenge” evaluating the dosimetric impact of differences in normal tissue contouring in the head and neck; however, only a single patient was analysed across 32 raters. Extending studies such as these with more raters, patients, and anatomical sites would provide valuable information about the impact of segmentation variance as well as help guide clinical users.
Lastly, the lack of a known ground truth is an ongoing challenge in segmentation evaluation. We have discussed the choice, importance, and pitfalls of ground truth estimation. The wealth of data generated in the editing study presented the problem of choosing a cohort of segmentations as inputs to the ground truth calculations. We reasoned that the peer-edited group was the most desirable class to use for truth estimation, and it was applied in all analyses for all groups. In post-hoc analysis we also examined the impact of different assumptions, namely using the other classes to compose the truth estimation. We found that these assumptions did produce small differences, most notably when A1-edited was used: composing the ground truth from A1-edited resulted in higher accuracy for the A1-edited class relative to the other classes, though the magnitude of accuracy in the other groups did not change remarkably. This is likely a result of the reduced variance of the A1-edited class compared to the other edited classes. It is also possible that the results we have presented favor accuracy toward the peer-edited class at the expense of the other classes, including the automatic and automatic-edited. However, we judged this a better choice than potentially biasing accuracy toward the automatic system.
Acknowledgments
This work was funded in part through a grant from the National Institute of Biomedical Imaging and Bioengineering (award R01EB006193). All images used in this study were acquired with institutional review board approval through the Vanderbilt-Ingram Department of Radiation Oncology. The manual segmentations in this study were acquired via a research Eclipse treatment planning system through a grant from Varian Medical Systems (Palo Alto, CA). The authors would also like to thank George Ding for his thoughtful advice in the implementation of this project, as well as Jenny Lu for assistance in data entry and extraction from the treatment planning system.
Appendix A
Table A1.
DSC, Brainstem and Chiasm. The mean, 95% confidence interval on the mean, and standard deviation across all raters, sources for editing, and structures. Raters are also considered as senior, junior, and as a single combined group.
| Rater | Source | Brainstem mean | Brainstem 95% CI (lower) | Brainstem 95% CI (upper) | Brainstem std | Chiasm mean | Chiasm 95% CI (lower) | Chiasm 95% CI (upper) | Chiasm std |
|---|---|---|---|---|---|---|---|---|---|
| P1 | A1 | 0.927 | 0.922 | 0.931 | 0.029 | 0.612 | 0.586 | 0.644 | 0.172 |
| | self | 0.863 | 0.851 | 0.872 | 0.066 | 0.468 | 0.437 | 0.498 | 0.190 |
| | peers | 0.873 | 0.863 | 0.881 | 0.047 | 0.488 | 0.457 | 0.519 | 0.169 |
| P2 | A1 | 0.907 | 0.900 | 0.913 | 0.038 | 0.562 | 0.530 | 0.592 | 0.191 |
| | self | 0.726 | 0.716 | 0.737 | 0.064 | 0.251 | 0.220 | 0.285 | 0.194 |
| | peers | 0.867 | 0.859 | 0.874 | 0.038 | 0.402 | 0.364 | 0.439 | 0.198 |
| P3 | A1 | 0.926 | 0.922 | 0.931 | 0.027 | 0.562 | 0.523 | 0.597 | 0.218 |
| | self | 0.864 | 0.855 | 0.874 | 0.059 | 0.396 | 0.364 | 0.427 | 0.186 |
| | peers | 0.884 | 0.877 | 0.891 | 0.038 | 0.491 | 0.452 | 0.528 | 0.191 |
| P4 | A1 | 0.926 | 0.921 | 0.931 | 0.029 | 0.550 | 0.514 | 0.585 | 0.207 |
| | self | 0.867 | 0.856 | 0.878 | 0.066 | 0.443 | 0.415 | 0.474 | 0.170 |
| | peers | 0.886 | 0.879 | 0.892 | 0.036 | 0.506 | 0.470 | 0.537 | 0.180 |
| J1 | A1 | 0.916 | 0.910 | 0.920 | 0.032 | 0.539 | 0.513 | 0.560 | 0.144 |
| | self | 0.855 | 0.842 | 0.866 | 0.075 | 0.444 | 0.417 | 0.472 | 0.172 |
| | peers | 0.877 | 0.871 | 0.884 | 0.037 | 0.508 | 0.479 | 0.537 | 0.154 |
| J2 | A1 | 0.927 | 0.922 | 0.931 | 0.030 | 0.572 | 0.537 | 0.606 | 0.209 |
| | self | 0.861 | 0.851 | 0.870 | 0.060 | 0.486 | 0.453 | 0.518 | 0.197 |
| | peers | 0.882 | 0.875 | 0.889 | 0.036 | 0.484 | 0.450 | 0.521 | 0.182 |
| J3 | A1 | 0.910 | 0.904 | 0.915 | 0.030 | 0.524 | 0.497 | 0.548 | 0.153 |
| | self | 0.842 | 0.830 | 0.852 | 0.066 | 0.471 | 0.443 | 0.502 | 0.176 |
| | peers | 0.870 | 0.862 | 0.875 | 0.033 | 0.486 | 0.452 | 0.517 | 0.166 |
| J4 | A1 | 0.924 | 0.919 | 0.929 | 0.032 | 0.609 | 0.582 | 0.636 | 0.166 |
| | self | 0.841 | 0.831 | 0.850 | 0.061 | 0.490 | 0.459 | 0.518 | 0.177 |
| | peers | 0.872 | 0.863 | 0.880 | 0.044 | 0.518 | 0.483 | 0.550 | 0.168 |
| Senior | A1 | 0.922 | 0.919 | 0.924 | 0.032 | 0.572 | 0.554 | 0.590 | 0.200 |
| | self | 0.830 | 0.822 | 0.837 | 0.088 | 0.389 | 0.371 | 0.406 | 0.204 |
| | peers | 0.877 | 0.874 | 0.881 | 0.041 | 0.472 | 0.454 | 0.490 | 0.190 |
| Junior | A1 | 0.919 | 0.916 | 0.921 | 0.032 | 0.561 | 0.545 | 0.575 | 0.174 |
| | self | 0.850 | 0.844 | 0.856 | 0.066 | 0.473 | 0.459 | 0.488 | 0.182 |
| | peers | 0.875 | 0.872 | 0.879 | 0.038 | 0.499 | 0.483 | 0.515 | 0.169 |
| All Phys | A1 | 0.920 | 0.918 | 0.922 | 0.032 | 0.566 | 0.555 | 0.577 | 0.187 |
| | self | 0.840 | 0.835 | 0.845 | 0.079 | 0.431 | 0.420 | 0.443 | 0.198 |
| | peers | 0.876 | 0.874 | 0.879 | 0.039 | 0.486 | 0.473 | 0.497 | 0.180 |
Table A2.
DSC, Eyes and Optic Nerves. Layout as in table A1.
| Rater | Source | Eyes mean | Eyes 95% CI (lower) | Eyes 95% CI (upper) | Eyes std | Nerves mean | Nerves 95% CI (lower) | Nerves 95% CI (upper) | Nerves std |
|---|---|---|---|---|---|---|---|---|---|
| P1 | A1 | 0.904 | 0.898 | 0.911 | 0.052 | 0.746 | 0.731 | 0.759 | 0.122 |
| | self | 0.860 | 0.853 | 0.867 | 0.058 | 0.539 | 0.524 | 0.555 | 0.134 |
| | peers | 0.881 | 0.873 | 0.887 | 0.051 | 0.603 | 0.585 | 0.621 | 0.128 |
| P2 | A1 | 0.932 | 0.926 | 0.937 | 0.045 | 0.718 | 0.703 | 0.736 | 0.142 |
| | self | 0.810 | 0.805 | 0.816 | 0.046 | 0.469 | 0.456 | 0.483 | 0.123 |
| | peers | 0.880 | 0.869 | 0.887 | 0.065 | 0.599 | 0.582 | 0.617 | 0.130 |
| P3 | A1 | 0.930 | 0.926 | 0.935 | 0.039 | 0.768 | 0.752 | 0.781 | 0.121 |
| | self | 0.877 | 0.869 | 0.883 | 0.058 | 0.595 | 0.580 | 0.608 | 0.121 |
| | peers | 0.887 | 0.876 | 0.894 | 0.065 | 0.609 | 0.583 | 0.632 | 0.167 |
| P4 | A1 | 0.915 | 0.910 | 0.919 | 0.039 | 0.760 | 0.746 | 0.774 | 0.121 |
| | self | 0.873 | 0.866 | 0.879 | 0.056 | 0.610 | 0.595 | 0.624 | 0.121 |
| | peers | 0.878 | 0.867 | 0.887 | 0.073 | 0.618 | 0.601 | 0.635 | 0.124 |
| J1 | A1 | 0.932 | 0.927 | 0.937 | 0.042 | 0.775 | 0.762 | 0.788 | 0.114 |
| | self | 0.880 | 0.874 | 0.887 | 0.059 | 0.607 | 0.593 | 0.621 | 0.121 |
| | peers | 0.878 | 0.869 | 0.885 | 0.058 | 0.611 | 0.590 | 0.632 | 0.158 |
| J2 | A1 | 0.918 | 0.912 | 0.923 | 0.042 | 0.733 | 0.721 | 0.746 | 0.107 |
| | self | 0.874 | 0.866 | 0.880 | 0.057 | 0.607 | 0.592 | 0.618 | 0.113 |
| | peers | 0.888 | 0.877 | 0.896 | 0.064 | 0.648 | 0.631 | 0.663 | 0.121 |
| J3 | A1 | 0.936 | 0.931 | 0.942 | 0.048 | 0.600 | 0.589 | 0.613 | 0.105 |
| | self | 0.832 | 0.824 | 0.839 | 0.063 | 0.525 | 0.511 | 0.538 | 0.117 |
| | peers | 0.823 | 0.805 | 0.839 | 0.121 | 0.574 | 0.559 | 0.591 | 0.117 |
| J4 | A1 | 0.928 | 0.923 | 0.934 | 0.050 | 0.761 | 0.746 | 0.776 | 0.131 |
| | self | 0.868 | 0.862 | 0.874 | 0.052 | 0.515 | 0.499 | 0.529 | 0.122 |
| | peers | 0.879 | 0.869 | 0.887 | 0.069 | 0.612 | 0.597 | 0.630 | 0.132 |
| Senior | A1 | 0.920 | 0.918 | 0.923 | 0.046 | 0.748 | 0.740 | 0.755 | 0.128 |
| | self | 0.855 | 0.851 | 0.858 | 0.061 | 0.553 | 0.545 | 0.561 | 0.137 |
| | peers | 0.881 | 0.877 | 0.885 | 0.064 | 0.608 | 0.597 | 0.616 | 0.138 |
| Junior | A1 | 0.929 | 0.926 | 0.931 | 0.046 | 0.717 | 0.710 | 0.726 | 0.134 |
| | self | 0.864 | 0.860 | 0.867 | 0.061 | 0.563 | 0.556 | 0.571 | 0.126 |
| | peers | 0.867 | 0.862 | 0.873 | 0.086 | 0.611 | 0.603 | 0.621 | 0.136 |
| All Phys | A1 | 0.925 | 0.923 | 0.926 | 0.046 | 0.733 | 0.727 | 0.738 | 0.132 |
| | self | 0.859 | 0.857 | 0.862 | 0.061 | 0.558 | 0.553 | 0.563 | 0.132 |
| | peers | 0.874 | 0.870 | 0.877 | 0.076 | 0.610 | 0.603 | 0.615 | 0.137 |
Table A3.
Volume [cm3]. Mean, 95% confidence interval on the mean, and coefficient of variation of nominal volume for the unedited automatic (A1), de novo, and edited groups mod(A1), mod(self), and mod(peers).
| Source | Brainstem mean | Brainstem 95% CI (lower) | Brainstem 95% CI (upper) | Brainstem CoV | Chiasm mean | Chiasm 95% CI (lower) | Chiasm 95% CI (upper) | Chiasm CoV |
|---|---|---|---|---|---|---|---|---|
| A1 | 23.99 | 22.82 | 24.87 | 11.01 | 0.41 | 0.39 | 0.45 | 16.07 |
| de novo | 25.88 | 25.01 | 26.62 | 19.59 | 0.66 | 0.60 | 0.74 | 67.41 |
| mod(A1) | 25.84 | 25.33 | 26.31 | 12.55 | 0.56 | 0.52 | 0.61 | 48.30 |
| mod(self) | 26.76 | 25.98 | 27.42 | 18.51 | 0.67 | 0.62 | 0.73 | 59.04 |
| mod(peers) | 27.16 | 26.55 | 27.83 | 13.37 | 0.67 | 0.60 | 0.73 | 57.31 |

| Source | Eyes mean | Eyes 95% CI (lower) | Eyes 95% CI (upper) | Eyes CoV | Nerves mean | Nerves 95% CI (lower) | Nerves 95% CI (upper) | Nerves CoV |
|---|---|---|---|---|---|---|---|---|
| A1 | 9.13 | 8.59 | 9.57 | 17.56 | 0.64 | 0.61 | 0.67 | 54.98 |
| de novo | 8.59 | 8.40 | 8.77 | 20.15 | 0.87 | 0.83 | 0.91 | 41.75 |
| mod(A1) | 9.39 | 9.25 | 9.52 | 12.78 | 0.89 | 0.85 | 0.92 | 35.16 |
| mod(self) | 8.88 | 8.71 | 9.03 | 17.13 | 0.95 | 0.91 | 0.98 | 38.22 |
| mod(peers) | 9.27 | 9.10 | 9.44 | 16.06 | 1.01 | 0.97 | 1.04 | 33.90 |
Table A4.
Distance error [mm]. Mean, confidence interval, and minimum and maximum signed distance errors for the five classes of segmentations: unedited automatic (A1), de novo, and the edited A1, self, and peer classes. These distances were determined weighting each rater and patient equally; e.g., the maximum can be thought of as the maximum distance error averaged over the 20 patients.
| Source | Brainstem mean | Brainstem CI (lower) | Brainstem CI (upper) | Brainstem min | Brainstem max | Chiasm mean | Chiasm CI (lower) | Chiasm CI (upper) | Chiasm min | Chiasm max |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.18 | 0.02 | 0.35 | −3.42 | 4.83 | −0.08 | −0.25 | 0.17 | −2.02 | 2.00 |
| de novo | 0.72 | 0.60 | 0.82 | −3.90 | 7.23 | 1.08 | 0.78 | 1.62 | −1.90 | 5.07 |
| mod(A1) | 0.57 | 0.48 | 0.65 | −3.61 | 6.38 | 0.04 | −0.12 | 0.21 | −2.28 | 3.58 |
| mod(self) | 0.85 | 0.73 | 0.96 | −3.21 | 7.40 | 0.51 | 0.28 | 0.79 | −2.16 | 4.54 |
| mod(peers) | 0.88 | 0.80 | 0.97 | −3.01 | 7.37 | 0.31 | 0.09 | 0.54 | −2.14 | 4.00 |

| Source | Eyes mean | Eyes CI (lower) | Eyes CI (upper) | Eyes min | Eyes max | Nerves mean | Nerves CI (lower) | Nerves CI (upper) | Nerves min | Nerves max |
|---|---|---|---|---|---|---|---|---|---|---|
| A1 | 0.63 | 0.51 | 0.75 | −2.59 | 3.75 | −0.39 | −0.50 | −0.27 | −2.59 | 2.38 |
| de novo | 0.32 | 0.17 | 0.46 | −2.78 | 3.15 | 0.31 | 0.16 | 0.45 | −2.89 | 3.21 |
| mod(A1) | 0.74 | 0.66 | 0.81 | −2.31 | 3.54 | 0.79 | 0.73 | 0.85 | −2.33 | 3.49 |
| mod(self) | 0.44 | 0.30 | 0.54 | −2.36 | 3.19 | 0.42 | 0.28 | 0.54 | −2.31 | 3.16 |
| mod(peers) | 0.68 | 0.56 | 0.77 | −2.19 | 3.31 | 0.71 | 0.59 | 0.80 | −2.21 | 3.43 |
References
- Amelio D, Lorentini S, Schwarz M, Amichetti M. ‘Intensity-modulated radiation therapy in newly diagnosed glioblastoma: A systematic review on clinical and technical issues’. Radiother Oncol. 2010;97(3):361–369. doi: 10.1016/j.radonc.2010.08.018. URL: http://www.sciencedirect.com/science/article/pii/S0167814010005232.
- Asman AJ, Landman BA. ‘Non-local STAPLE: an intensity-driven multi-atlas rater model’. In: MICCAI (3). 2012:426–434. doi: 10.1007/978-3-642-33454-2_53.
- Beyer GP, Velthuizen RP, Murtagh FR, Pearlman JL. ‘Technical aspects and evaluation methodology for the application of two automated brain MRI tumor segmentation methods in radiation therapy planning’. Magn Reson Imaging. 2006;24(9):1167–1178. doi: 10.1016/j.mri.2006.07.010.
- Biancardi A, Jirapatnakul A, Reeves A. ‘A comparison of ground truth estimations’. IJCARS. 2010;5:295–305. doi: 10.1007/s11548-009-0401-3.
- Cardoso JS, Corte-Real L. ‘Toward a generic evaluation of image segmentation’. IEEE Trans Image Process. 2005;14(11):1773–1782. doi: 10.1109/tip.2005.854491.
- Chao KS, Bhide S, Chen H, Asper J, Bush S, Franklin G, Kavadi V, Liengswangwong V, Gordon W, Raben A, Strasser J, Koprowski C, Frank S, Chronowski G, Ahamad A, Malyapa R, Zhang L, Dong L. ‘Reduce in variation and improve efficiency of target volume delineation by a computer-assisted system using a deformable image registration approach’. Int J Radiat Oncol Biol Phys. 2007;68:1512–1521. doi: 10.1016/j.ijrobp.2007.04.037.
- Crum W, Camara O, Hill D. ‘Generalized overlap measures for evaluation and validation in medical image analysis’. IEEE Trans Med Imaging. 2006;25(11):1451–1461. doi: 10.1109/TMI.2006.880587.
- Das IJ, Moskvin V, Johnstone PA. ‘Analysis of treatment planning time among systems and planners for intensity-modulated radiation therapy’. JACR. 2009;6(7):514–517. doi: 10.1016/j.jacr.2008.12.013.
- Davison A, Hinkley D. Bootstrap Methods and Their Applications. Cambridge University Press; 1997.
- Dawant B, Hartmann S, Pan S, Gadamsetty S. ‘Brain atlas deformation in the presence of small and large space-occupying tumors’. Comput Aided Surg. 2002;7:1–10. doi: 10.1002/igs.10029.
- Deeley M, Chen A, Datteri R, Noble J, Cmelak A, Donnelly E, Malcolm A, Moretti L, Jaboin J, Niermann K, Yang E, Yu D, Yei F, Koyama T, Ding G, Dawant B. ‘Comparison of manual and automatic segmentation methods for brain structures in the presence of space-occupying lesions: a multi-expert study’. Phys Med Biol. 2011;56:4557–4577. doi: 10.1088/0031-9155/56/14/021.
- Dice L. ‘Measures of the amount of ecologic association between species’. Ecology. 1945;26(3):297–302.
- Isambert A, Dhermain F, Bidault F, Commowick O, Bondiau P, Malandain G, Lefkopoulos D. ‘Evaluation of an atlas-based automatic segmentation software for the delineation of brain organs at risk in a radiation therapy clinical context’. Radiother Oncol. 2008;87(1):93–99. doi: 10.1016/j.radonc.2007.11.030.
- Jaccard P. ‘Nouvelles recherches sur la distribution florale’. Bulletin de la Societe Vaudoise des Sciences Naturelles. 1908;44:223–270.
- Jacobs RA. ‘Methods for combining experts’ probability assessments’. Neural Comput. 1995;7(5):867–888. doi: 10.1162/neco.1995.7.5.867.
- Kittler J, Hatef M, Duin R, Matas J. ‘On combining classifiers’. IEEE Trans Pattern Anal Mach Intell. 1998;20(3):226–239.
- Meyer C, Johnson T, McLennan D, Aberle D, Kazerooni E, MacMahon H, Mullan B, Yankelevitz D, van Beek J, Armato S III, McNitt-Gray M, Reeves A, Gur D, et al. ‘Evaluation of lung MDCT nodule annotation across radiologists and methods’. Acad Radiol. 2006;13(10):1254–1265. doi: 10.1016/j.acra.2006.07.012.
- Nelms BE, Tome WA, Robinson G, Wheeler J. ‘Variations in the contouring of organs at risk: test case from a patient with oropharyngeal cancer’. Int J Radiat Oncol Biol Phys. 2012;82(1):368–378. doi: 10.1016/j.ijrobp.2010.10.019.
- Noble JH, Dawant BM. ‘An atlas-navigated optimal medial axis and deformable model algorithm (NOMAD) for the segmentation of the optic nerves and chiasm in MR and CT images’. Med Image Anal. 2011 (epub ahead of print). doi: 10.1016/j.media.2011.05.001.
- Popovic A, Fuente M, Engelhardt M, Radermacher K. ‘Statistical validation metric for accuracy assessment in medical image segmentation’. IJCARS. 2007;2:169–181. URL: http://dx.doi.org/10.1007/s11548-007-0125-1.
- Rohde G, Aldroubi A, Dawant B. ‘The adaptive bases algorithm for intensity-based nonrigid image registration’. IEEE Trans Med Imaging. 2003;22(11):1470–1479. doi: 10.1109/TMI.2003.819299.
- Stapleford L, Lawson J, Perkins C, Edelman S, Davis L, McDonald M, Waller A, Schreibmann E, Fox T. ‘Evaluation of automatic atlas-based lymph node segmentation for head-and-neck cancer’. Int J Radiat Oncol Biol Phys. 2010;77:959–966. doi: 10.1016/j.ijrobp.2009.09.023.
- Studholme C, Hill D, Hawkes D. ‘An overlap invariant entropy measure of 3D medical image alignment’. Pattern Recognit. 1999;32:71–86.
- Tsuji SY, Hwang A, Weinberg V, Yom SS, Quivey JM, Xia P. ‘Dosimetric evaluation of automatic segmentation for adaptive IMRT for head-and-neck cancer’. Int J Radiat Oncol Biol Phys. 2010;77(3):707–714. doi: 10.1016/j.ijrobp.2009.06.012.
- Tukey JW. Exploratory Data Analysis. Addison-Wesley; 1977.
- Warfield S, Zou K, Wells W. ‘Simultaneous Truth and Performance Level Estimation (STAPLE): an algorithm for the validation of image segmentation’. IEEE Trans Med Imaging. 2004;23(7):903–921. doi: 10.1109/TMI.2004.828354.
- Weiss E, Wijesooriya K, Ramakrishnan V, Keall PJ. ‘Comparison of intensity-modulated radiotherapy planning based on manual and automatically generated contours using deformable image registration in four-dimensional computed tomography of lung cancer patients’. Int J Radiat Oncol Biol Phys. 2008;70(2):572–581. doi: 10.1016/j.ijrobp.2007.09.035.
- Windridge D, Kittler J. ‘A morphologically optimal strategy for classifier combination: multiple expert fusion as a tomographic process’. IEEE Trans Pattern Anal Mach Intell. 2003;25(3):343–353.
- Wu Z. ‘Compactly supported positive definite radial functions’. Adv Comput Math. 1995;4:283–292.
- Zhu Y, Huang X, Wang W, Lopresti D, Long R, Antani S, Xue Z, Thoma G. ‘Balancing the Role of Priors in Multi-Observer Segmentation Evaluation’. J Signal Process Syst. 2008;55(1–3):185–207. doi: 10.1007/s11265-008-0215-5.
- Zou K, Warfield S, Bharatha A, Tempany C, Kaus M, Haker S, Wells W III, Jolesz F, Kikinis R. ‘Statistical validation of image segmentation quality based on a spatial overlap index’. Acad Radiol. 2004;11(2):178–189. doi: 10.1016/S1076-6332(03)00671-8.