Abstract
Purpose:
Contouring inconsistencies are known yet understudied in clinical radiotherapy trials. We applied auto-contouring to the RTOG 0617 dose-escalation trial data. We hypothesized that the trial heart doses were higher than reported due to inconsistent and insufficient heart segmentation, and tested this hypothesis by comparing doses between deep-learning (DL) segmented hearts and trial hearts.
Materials/Methods:
The RTOG 0617 data were downloaded from TCIA; the 442 patients with trial hearts and dose distributions were included. All hearts were re-segmented using our DL pipeline and quality assured to meet the requirements for clinical implementation. Dose (V5%, V30% and mean heart dose (MHD)) was compared between the two sets of hearts (Wilcoxon signed-rank test). Each dose metric was associated with overall survival (OS; Cox Proportional Hazards). Lastly, 18 volume similarity metrics were assessed for the hearts and correlated with |DoseDL-DoseRTOG0617| (linear regression; significance: p≤0.0028; corrected for 18 tests).
Results:
Dose metrics were significantly higher for DL hearts compared to trial hearts (e.g. MHD: 15Gy vs. 12Gy; p=5.8E-16). All three DL heart dose metrics were stronger OS predictors than those of the trial hearts (median: p=2.8E-5 vs. 2.0E-4). Thirteen similarity metrics explained |DoseDL-DoseRTOG0617|; the axial distance between the two centers of mass was the strongest predictor (CENTAxial; median R2=0.47; p=6.1E-62). CENTAxial agreed with the qualitatively identified inconsistencies in the superior direction. The trial’s qualitative heart contouring score was not correlated with |DoseDL-DoseRTOG0617| (median: R2=0.01, p=0.02) or with any of the similarity metrics (median Rs=0.13 (range: −0.22, 0.31)).
Conclusions:
Using a coherent heart definition, as enabled through our open-source DL algorithm, the trial heart doses in RTOG 0617 were found to be significantly higher than previously reported, which may have led to an even more rapid mortality accumulation. Auto-segmentation is likely to reduce contouring and dose inconsistencies and increase the quality of clinical RT trials.
Introduction
The prognosis for unresectable locally-advanced non-small cell lung cancer (LA-NSCLC) remains poor [1, 2]. The RTOG 0617 trial is one of the largest efforts aiming to improve outcomes in these patients; in this trial, the prescription dose was escalated from 60 Gy to 74 Gy [1]. The trial was, however, terminated early due to futility, and the associated multivariate models based on these data have demonstrated that heart/pericardium dose is the key predictor of overall survival (OS) [1–3].
The main source of geometric uncertainty in the radiotherapy workflow is manual delineation. In the clinical target volume (CTV) contouring variability study by Lawton et al [4], considerable variation was observed among the prostate node CTVs from 14 genitourinary radiation oncologists, and the CTV agreement was judged as moderate at most. In the head and neck cancer (HNC) treatment planning study by Hong et al [5], the 20 enrolled centers proposed three different CTVs, which resulted in a total of six prescription doses. In another HNC treatment planning study, focused on normal tissues, by van Rooij et al [6], auto-planned doses to auto-segmented structures were found to be significantly different from those to manually segmented structures (constrictor and esophagus; p=0.005 and 0.002), and the similarity agreement between the two sets of segmentations was further correlated with the observed dose differences (r=−0.24; p=0.002 for all organs). These studies emphasize that manual segmentation leads to contour and dose inconsistencies that should be considered non-negligible. Further, Roach observed discrepancies between the CTVs from three clinical prostate trials and auto-segmented CTVs, which were primarily located adjacent to the nearby bladder and rectum, and these discrepancies were associated with treatment failure (death, local progression or prostate-specific antigen progression 5–10 years after treatment) [7].
In parallel, the process of assessing agreement between sets of segmentations lacks consensus metrics and evaluations [8]. The Dice similarity coefficient (DSC) is one of the most widely used such metrics, and is typically accompanied by the 95th percentile of the Hausdorff distance (Hausdorff95th). In a recent study by Roach et al [9], no correlation was observed between any of their 14 investigated similarity metrics and dose metrics (referred to as ‘simulated outcomes’) for tumors or normal tissues in prostate cancer. Thus, it is currently unclear which similarity metric best reflects the treatment as surrogated, e.g., by dose and/or outcomes.
This study was motivated by the general lack of studies that have thoroughly assessed the impact of contouring inconsistency due to manual delineation on clinical outcomes. We hypothesized that the heart doses in the RTOG 0617 trial were significantly higher than previously reported due to inconsistent and insufficient heart segmentation and that this would, therefore, influence the observed OS. The hypothesis was tested by comparing dose to our deep-learning (DL) auto-segmented hearts with dose to the trial hearts. Secondarily, we explored the correlation between 18 volume similarity metrics and heart dose.
Material and Methods
Data
The RTOG 0617 data, consisting of CT scans, dose distributions, and structure sets including the trial hearts, were downloaded from The Cancer Imaging Archive (TCIA). DICOM data were imported into CERR [10], in which all dose and imaging analyses were performed. Outcome data (overall survival status, censoring, follow-up time and qualitative contouring scoring) as well as indications of which patients received RT, had evaluable RT, and had heart ROIs were extracted from the spreadsheet available under TCIA (‘NCT00533949-D1-Dataset.csv’). The inclusion criteria for this study were that patients should have received RT and had evaluable fractionation and elapsed days, heart contours, and available dose distributions. Of the 490 patients with available dose distributions, a total of 442 patients fulfilled all inclusion criteria. A summary of the associated patient characteristics for these patients is given in Table 1.
Table 1.
Summary of patient characteristics of the 442 included patients. The same variables (except tumor location, lymph node group, and consolidation and concurrent chemotherapy, which were not available through the data deposited under TCIA) are also presented in Table 1 in [3].
| Characteristic | N=442 |
|---|---|
| Age [y] | 64 (37–83) |
| Cetuximab assigned | |
| Yes | 202 (46%) |
| No | 240 (54%) |
| Gender | |
| Male (ref) | 182 (41%) |
| Female | 260 (59%) |
| GTV [cm3]* | 93 (5.9–960) |
| Histology | |
| Adeno | 174 (43%) |
| SCC | 189 (39%) |
| Large cell undifferentiated/NSCLC NOS | 78 (18%) |
| OS | |
| Alive | 171 (39%) |
| Dead | 271 (61%) |
| Median (95%CI) time since randomization [m]** | 25 (21–29) |
| Prescribed dose | |
| 60Gy | 252 (57%) |
| 74Gy | 190 (43%) |
| RT technique | |
| 3DCRT | 235 (53%) |
| IMRT | 207 (47%) |
| Smoking status | |
| Never | 28 (6%) |
| Former | 194 (44%) |
| Current | 199 (45%) |
| Unknown | 22 (5%) |
| Tumor stage | |
| IIIA+N2 (ref) | 294 (67%) |
| IIIB+N3 | 148 (33%) |
| Zubrod Performance | |
| 0 (ref) | 261 (59%) |
| 1 | 181 (41%) |
* GTV available for N=329; ITV used for the remaining 103 patients, and GTV/ITV was missing for ten patients.
** Kaplan-Meier estimated.
Deep-learning segmentation
All segmentations were applied and evaluated in the geometry of the planning CT scan and the associated planned dose. We used our recently developed multi-label DL algorithm to segment the hearts de novo in the RTOG 0617 dataset [11]. Our DL algorithm is based on the DeepLab convolutional neural network and is publicly available through our CERR model library (https://github.com/cerr/CERR/wiki/Auto-Segmentation-models) [12]. In addition to the heart, the algorithm identifies the associated four chambers, the aorta (ascending and descending combined), inferior vena cava, pulmonary artery, and superior vena cava (based on the anatomical guidelines in [13]), as well as atria, ventricles (left and right sides combined) and pericardium (based on the RTOG 1106 guidelines; available under the “Organs at Risk in Thoracic Radiation Therapy” link here: https://www.nrgoncology.org/ciro-lung). While our DL algorithm produces cardio-pulmonary substructure segmentations that overall agree with their highly curated, manually segmented counterparts (median structure DSCs: 0.81–0.96), it performed best for the heart (median DSC=0.96 (range: 0.95–0.97)) [11]. In addition, running the DL segmentation (including all 12 structures) required a few seconds per patient as compared to around one hour per patient using manual segmentation. The anatomical definitions used for the training data in the development of our DL hearts [14] extended from the slice below the pulmonary artery split superiorly to the visibility of the heart inferiorly [13]. Since this is also our standard clinical heart definition, all DL hearts were manually edited to adhere to it for the purpose of this study. No delineation guidelines are currently available for the trial hearts in RTOG 0617.
Volume similarity metrics
A total of 18 volume similarity metrics were assessed between the DL hearts and the trial hearts. Fourteen of these metrics were adopted from [9], i.e. the absolute value of the relative volume difference (a(RVD)), C-factor, distance between the centers of mass, i.e. centroids (Euclidean, and in the axial, coronal, and sagittal planes: CENTEuclidian, CENTAxial, CENTCor, and CENTSag), volumetric Dice similarity coefficient (DSCV), false positive (FP), Hausdorff95th, mean absolute surface distance (MASD), sensitivity, specificity, true positive (TP), and volume similarity (VS; listed as VOLSIM in Table 2). In addition, we calculated the added path length with or without normalization (APL, APLNorm), and surface DSC (DSCS) [15], as well as deviance, which is proposed as a new similarity metric as part of this study. Deviance is the volume of the trial heart missed by the DL heart plus the excess volume of the DL heart over the trial heart, normalized by the trial heart volume.
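As an illustration of how a few of these overlap- and centroid-based metrics can be computed, the following minimal Python sketch treats each heart as a set of voxel indices. It is not the CERR implementation used in this study: the function names, the voxel-set representation, and the reading of CENTAxial as the centroid separation along the slice (superior-inferior) axis are assumptions made for illustration, and deviance is coded from the verbal definition above.

```python
from math import sqrt


def centroid(voxels):
    """Mean (z, y, x) position of a set of voxel indices."""
    n = len(voxels)
    return tuple(sum(v[i] for v in voxels) / n for i in range(3))


def similarity(dl, trial, spacing=(1.0, 1.0, 1.0)):
    """Volumetric DSC, deviance, and centroid distances for two voxel sets.

    dl, trial: sets of (z, y, x) integer voxel indices (hypothetical inputs).
    spacing:   voxel size along (z, y, x); distances inherit its unit.
    """
    overlap = len(dl & trial)
    missed = len(trial - dl)   # trial-heart voxels not covered by the DL heart
    excess = len(dl - trial)   # DL-heart voxels outside the trial heart
    c_dl, c_trial = centroid(dl), centroid(trial)
    delta = [abs(a - b) * s for a, b, s in zip(c_dl, c_trial, spacing)]
    return {
        "dsc_v": 2.0 * overlap / (len(dl) + len(trial)),
        "deviance": (missed + excess) / len(trial),
        "cent_si": delta[0],  # assumption: separation along the slice axis
        "cent_euclidean": sqrt(sum(d * d for d in delta)),
    }


def box(z0, z1, y0, y1, x0, x1):
    """Helper: all voxel indices inside a rectangular block."""
    return {(z, y, x)
            for z in range(z0, z1)
            for y in range(y0, y1)
            for x in range(x0, x1)}


# Toy example: the DL heart extends two slices further than the trial heart.
trial_heart = box(0, 4, 0, 4, 0, 4)   # 64 voxels
dl_heart = box(0, 6, 0, 4, 0, 4)      # 96 voxels
metrics = similarity(dl_heart, trial_heart)
print(metrics)
```

In this toy case the two blocks differ only along the slice axis, so the Euclidean centroid distance equals its slice-axis component, which mirrors the kind of directional disagreement the centroid metrics are meant to capture.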
Within the RTOG 0617 trial, qualitative scoring of the trial hearts had been performed and the scoring was available under TCIA. The scoring had three levels: accepted per protocol, acceptable variation, and unacceptable variation.
Heart dose
To represent heart dose, we analyzed the relative heart volume receiving at least 5 Gy (V5%), which was the strongest predictor in the multivariate model in [1], and V30% (a univariate predictor in [1]). We also extracted the mean heart dose (MHD), which is nowadays a commonly used heart dose metric in the treatment planning of LA-NSCLC. No correction for fractionation effects was performed in order to enable direct comparison with the non-corrected doses in [1] and in [2].
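The three dose metrics can be sketched from the planned dose sampled inside a heart contour. The snippet below is a minimal illustration only, assuming equally sized voxels and a flat list of per-voxel doses in Gy; the function name and the example values are hypothetical and do not come from the trial data or from CERR.

```python
def heart_dose_metrics(voxel_doses_gy):
    """Return V5% and V30% (percent of heart volume) and MHD (Gy).

    voxel_doses_gy: planned dose in Gy for every voxel inside the heart
    contour, assuming all voxels have the same volume.
    """
    n = len(voxel_doses_gy)
    if n == 0:
        raise ValueError("empty heart contour")
    # Relative volume receiving at least 5 Gy and at least 30 Gy.
    v5 = 100.0 * sum(d >= 5.0 for d in voxel_doses_gy) / n
    v30 = 100.0 * sum(d >= 30.0 for d in voxel_doses_gy) / n
    # Mean heart dose, without any fractionation correction.
    mhd = sum(voxel_doses_gy) / n
    return {"V5": v5, "V30": v30, "MHD": mhd}


# Made-up doses for an eight-voxel "heart".
doses = [1.0, 4.0, 6.0, 12.0, 31.0, 2.0, 45.0, 3.0]
m = heart_dose_metrics(doses)
print(m)
```

Because all three metrics are computed over the voxels inside the contour, extending or shrinking the contour into a dose gradient changes both the numerators and the denominator, which is why contour differences translate into dose differences.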
Analyses
The V5%, V30% and MHD were compared between the DL hearts and the trial hearts using a Wilcoxon signed-rank test. For both the DL hearts and the trial hearts, each of the three dose metrics was associated with OS using Cox proportional hazards regression, and Kaplan-Meier curves were stratified with respect to the dose metric and plotted for the low- and high-risk tertiles. The 18 volume similarity metrics were linearly correlated with |DoseDL-DoseRTOG0617|. In all analyses, bootstrap resampling was applied (1000 sample populations), and significance was denoted as p≤0.05 except for the latter analysis, which was corrected for 18 tests (significance: p≤0.0028).
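To make the paired comparison and the corrected threshold concrete, the sketch below implements a two-sided Wilcoxon signed-rank test with the large-sample normal approximation and shows that dividing 0.05 by 18 reproduces the stated threshold of 0.0028. This is an assumption-laden illustration: the source does not name the correction method (a Bonferroni-style division is merely consistent with the number), the bootstrap step is omitted, and the dose values are invented.

```python
from math import erfc, sqrt


def wilcoxon_signed_rank(x, y):
    """Two-sided paired Wilcoxon signed-rank test (normal approximation).

    Zero differences are dropped and tied |differences| share average
    ranks; no tie or continuity correction is applied to the variance.
    Returns (W_plus, p_value).
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2.0 + 1.0  # average 1-based rank of the tied run
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4.0
    sd = sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean) / sd
    return w_plus, erfc(abs(z) / sqrt(2.0))


# Invented mean heart doses (Gy) for ten patients; DL hearts always higher.
trial_mhd = [10.0, 12.0, 8.0, 15.0, 20.0, 5.0, 11.0, 9.0, 14.0, 7.0]
dl_mhd = [10.5, 13.0, 9.5, 17.0, 22.5, 8.0, 14.5, 13.0, 18.5, 12.0]
w_plus, p_value = wilcoxon_signed_rank(dl_mhd, trial_mhd)

# Threshold for the 18 similarity-metric regressions (assumed Bonferroni-style).
alpha_corrected = 0.05 / 18
print(w_plus, p_value, round(alpha_corrected, 4))
```

For small cohorts an exact test would be preferable to the normal approximation; with 442 patients, as in this study, the approximation is the usual choice.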
For comparison, the qualitative scoring of the trial hearts, which had been performed during the trial, was linearly associated with |DoseDL-DoseRTOG0617|. This qualitative scoring was also correlated with each of the 18 volume similarity metrics using Spearman’s rank correlation coefficient (Rs).
Results
All DL hearts were assessed for quality assurance (QA); for a majority of the DL hearts, minor edits were made in the inferior and superior border slices and in the most superior inferior vena cava slice. For only a few DL hearts, distant segmentation islands were removed. That the edits were minor is reflected in the DSCV between the edited and the non-edited DL hearts, which had a median of 0.99 and was > 0.95 in 94% of the patients (415/442 patients). In comparison, the population median DSCV between the edited DL hearts and the trial hearts was 0.86 (range: 0.06–0.98), and DSCV was > 0.95 in only 2% of the patients (7/442 patients). Figure 1 depicts eight patients with a DSCV around the population median (median: DSCV=0.86; DSCV in the eight patients=0.857–0.861) between the edited DL hearts and the trial hearts; the disagreement between the two sets of hearts is primarily located in the superior direction. The DL algorithm was applied in batch mode across all patients and was completed after a few seconds per patient, while the thorough QA added on average up to a minute per patient due to various degrees of inspection and/or edits.
Figure 1.
Comparison between the trial hearts (red) and the edited DL hearts (blue) in a coronal slice (middle of DL heart in axial and sagittal planes) for eight patients with a DSCV around the population median (DSCV=0.86). The planned dose distribution has been overlaid to visualize dose gradient differences between the two sets of hearts on a patient-by-patient basis (Note: the max dose ranges from 66.7Gy to 88.8Gy across these patients). The corresponding DSCV values between these edited DL hearts and the non-edited DL hearts were (from upper left to lower right) 0.97, 1.00, 0.97, 0.96, 1.00, 0.97, 0.99 and 0.98.
Both MHD and V30% were significantly higher in the DL hearts compared to the trial hearts: The cohort medians were 15Gy compared to 12Gy for MHD and 18% compared to 13% for V30% (p=5.8E-16, 3.8E-15; Figure 2). For V5%, a similar pattern was observed (46% vs. 44%; p=1.0E-7). Interestingly, the association between dose (all three metrics) and OS was stronger using the DL hearts compared to the trial hearts (e.g. all p-values were one order of magnitude lower; median p-value: 2.8E-5 (range: 2.5E-6, 5.5E-5) vs. 2.0E-4 (range: 4.3E-5, 4.5E-4); median Hazard Ratio: 2.6 (range: 1.0–3.5) vs. 2.1 (range: 1.0–2.8); Figure 2).
Figure 2.
Upper panel: Scatter plots of the three analyzed heart dose metrics (MHD: left; V30: middle; V5: right) between the trial hearts (x-axis) and the DL hearts (y-axis). The p-values refer to a Wilcoxon signed-rank test between dose to the two sets of contours and the black solid lines are the identity lines. Lower panel: Kaplan-Meier curves between each of the three dose metrics and OS based on the trial hearts (solid) or the DL hearts (dashed) stratified into high-risk and low-risk groups representing the upper and lower tertiles of the concerned dose metric.
Thirteen of the 18 investigated volume similarity metrics were significantly and linearly correlated with the dose differences between the DL hearts and the trial hearts as measured by MHD, V30%, and V5% (median (range): p-value=3E-13 (5E-71, 0.49); R2=0.13 (−1E-3, 0.51); Table 2). The two commonly used metrics DSCV and Hausdorff95th were both moderately strongly correlated with the paired dose differences (R2=0.32–0.51 and R2=0.33–0.37), and CENTAxial was an equally good predictor (R2=0.44–0.51; Figure 3). CENTAxial is the distance between centroids in the axial plane, and this quantitatively derived finding agrees with the qualitative visual inspection of the trial and DL hearts, which identified discrepancies primarily superiorly of the DL hearts. All assessed volume similarity metrics are given in Table S1.
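The per-metric regressions can be illustrated with an ordinary least-squares fit of the absolute dose difference on a single similarity metric. The sketch below returns the intercept (B0), slope (B1), and plain, unadjusted R2; the source does not state whether the reported R2 values were adjusted, and the data points are invented for illustration.

```python
def linear_fit(x, y):
    """Ordinary least squares y = b0 + b1 * x; returns (b0, b1, r2)."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    syy = sum((b - my) ** 2 for b in y)
    b1 = sxy / sxx
    b0 = my - b1 * mx
    r2 = (sxy * sxy) / (sxx * syy)  # squared Pearson correlation
    return b0, b1, r2


# Invented example: centroid distance (cm) vs. |dose difference| (Gy).
cent_dist = [0.0, 1.0, 2.0, 3.0, 4.0]
abs_dose_diff = [1.0, 3.0, 2.0, 5.0, 4.0]
b0, b1, r2 = linear_fit(cent_dist, abs_dose_diff)
print(b0, b1, r2)
```

A metric with a high R2 in such a fit explains much of the spread in dose differences, while a metric with R2 near zero carries little dosimetric information even if its regression p-value is small in a large cohort.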
Table 2.
Linear regression (B0=intercept; B1=regression coefficient) results between each volume similarity metric (ordered alphabetically) and the three analyzed dose metrics (MHD, V5%, V30%). The volume similarity metrics not significantly associated with the heart dose metrics (p>2.8E-3) are grey shaded. For comparison, the results between the qualitative heart scoring and the dose metrics are inserted in the lowest row. The three metrics enclosed in borders presented the overall lowest p-values and highest R2.
| Quantitative metrics | MHD R2 | MHD B0 | MHD B1 | MHD p | V30% R2 | V30% B0 | V30% B1 | V30% p | V5% R2 | V5% B0 | V5% B1 | V5% p |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| APL (cm) | 3E-3 | 3.5 | −3E-3 | 0.12 | 4E-5 | 0.06 | 1E-5 | 0.31 | 0.03 | 0.05 | −6E-5 | 0.02 |
| APLNorm | 0.01 | 2.5 | 1.7 | 7E-3 | 0.02 | 0.04 | 0.03 | 4E-3 | 0.01 | 0.05 | 0.04 | 0.02 |
| a(RVD) | 0.26 | 0.89 | 0.11 | 4E-31 | 0.23 | 0.02 | 2E-3 | 2E-27 | 0.30 | 0.01 | 3E-3 | 1E-35 |
| CENTEuclidian | 0.43 | 0.40 | 4.2 | 2E-56 | 0.39 | 0.01 | 0.07 | 2E-49 | 0.42 | 1E-3 | 0.10 | 8E-20 |
| CENTAxial | 0.51 | 0.33 | 6.3 | 5E-71 | 0.44 | 0.01 | 0.10 | 3E-58 | 0.46 | 5E-4 | 0.14 | 3E-14 |
| CENTSag | 0.23 | 2.0 | 3.8 | 2E-27 | 0.22 | 0.04 | 0.06 | 1E-25 | 0.22 | 0.04 | 0.09 | 7E-22 |
| CENTCor | 0.02 | 2.6 | 2.1 | 3E-3 | 0.01 | 0.05 | 0.03 | 0.01 | 0.03 | 0.05 | 0.06 | 0.22 |
| C-Factor | 0.21 | 3.7 | 23 | 2E-24 | 0.21 | 0.06 | 0.39 | 5E-24 | 0.12 | 0.06 | 0.38 | 8E-14 |
| Deviance | 0.03 | 2.9 | 0.33 | 4E-4 | 0.01 | 0.05 | 3E-3 | 6E-2 | 0.18 | 0.06 | 0.02 | 1.8E-21 |
| DSCS | 0.07 | 5.3 | −6.8 | 8E-9 | 0.07 | 0.09 | −0.12 | 8E-9 | 0.07 | 0.11 | −0.16 | 6E-9 |
| DSCV | 0.38 | 19 | −19 | 2E-47 | 0.32 | 0.31 | −0.31 | 1E-38 | 0.51 | 0.49 | −0.52 | 1E-70 |
| FP | 0.09 | 1.8 | 2E-5 | 3E-11 | 0.08 | 0.03 | 3E-7 | 7E-10 | 0.15 | 0.02 | 6E-7 | 1E-17 |
| Hausdorff95th (cm) | 0.37 | −0.06 | 1.9 | 1E-45 | 0.33 | 3E-3 | 0.03 | 6E-40 | 0.36 | −0.01 | 0.04 | 6E-45 |
| MASD | 0.21 | 0.71 | 5.1 | 4E-24 | 0.18 | 0.02 | 0.08 | 9E-21 | 0.28 | −3E-3 | 0.14 | 5E-33 |
| Sensitivity | 2E-3 | 6.8 | −3.9 | 0.19 | 1E-3 | 0.11 | −0.06 | 0.21 | −1E-3 | 0.06 | 2E-3 | 0.48 |
| Specificity | 0.11 | 849 | −848 | 7E-13 | 0.09 | 14 | −14 | 2E-11 | 0.14 | 23 | −23 | 1E-16 |
| TP | 0.11 | 5.2 | −1E-5 | 6E-13 | 0.10 | 0.09 | −2E-7 | 2E-12 | 0.09 | 0.11 | −2E-7 | 2E-10 |
| VOLSIM | 0.33 | 9.6 | −9.6 | 4E-40 | 0.28 | 0.16 | −0.15 | 4E-33 | 0.41 | 0.23 | −0.25 | 3E-52 |
| Qualitative metric | 9E-3 | 2.2 | 0.77 | 0.03 | 0.01 | 0.04 | 0.01 | 0.02 | 0.01 | 0.04 | 0.02 | 9E-3 |
Abbreviations: APL: Added path length; a(RVD): The absolute value of the relative volume difference; CENT: Centroid; DSC: Dice similarity coefficient (S: Surface; V: Volume); FP: False Positive; Hausdorff95th: The 95th percentile of the Hausdorff distance; MASD: Mean absolute surface distance; TP: True positive; VOLSIM: Volume similarity.
Figure 3.
Scatter plots and linear regression between the differences in heart dose for the three investigated metrics between the two sets of contours (ΔDose=|DoseDL-DoseRTOG0617|) and the three volume similarity metrics most strongly associated with them (left: volumetric DSC; middle: the 95th percentile of the Hausdorff distance; right: the distance between centroids in the axial plane). Note: the red lines are the linear regression fits (dotted lines: 95% CI); the R2 and the associated linear regression p-value have been inserted in each subfigure.
Lastly, the linear correlation between the qualitative scoring of the trial hearts and |DoseDL-DoseRTOG0617| was weak (median: R2=0.01; p=0.02), and in addition this qualitative scoring was not correlated with any of the 18 quantitative volume similarity metrics (median Rs=0.13 (range: −0.22, 0.31); Figure S1).
Discussion
Based on curated auto-segmented hearts using our DL pipeline applied to the large randomized controlled RTOG 0617 data, this study suggests that heart doses in this trial are significantly higher than previously reported. In addition, a stronger association was found between heart dose and OS based on the DL hearts compared to using the trial hearts likely due to the significantly higher doses observed for the DL hearts, and consequently, the OS curves better discriminated between low- and high-risk using DL heart dose as opposed to trial heart dose. While this further supports heart/pericardium dose as the key OS predictor in RTOG 0617 ([1–3]), this importantly points to an even broader delivery of lethal heart doses on this trial.
In the original publication disclosing predictors for OS in RTOG 0617 [1], heart V5% was found to be the strongest predictor and the only heart dose predictor in the multivariate model. In their updated analysis, replacing heart V5% with heart V30% produced similar results [2]. Interestingly in the current study, MHD was an order of magnitude stronger OS predictor than both V5% and V30% regardless of using the DL hearts or the trial hearts (median p=2.3E-5 vs. 1.3E-4). To the best of our knowledge, MHD was not analyzed in either [1] or in [2].
Our MHD association follows that of another recent publication by Thor et al [3], which was also based on the RTOG 0617 data. In that study, dose to post-trial and manually segmented atria, pericardium and ventricles as well as the trial lungs was analyzed and, among all predictors, the mean of the hottest 55% pericardium dose (MOH55%[Gy]) was the overall strongest predictor (p=8.2E-8 vs. p=1.2E-5 to 1.0E-6 for the other three dose metrics included in their multivariate ensemble model). A potentially important anatomical difference is that the pericardium in [3], in comparison to the trial hearts in RTOG 0617 [1, 2], extends superiorly and also includes the pulmonary artery (PA), superior vena cava (SVC) and the ascending aorta. However, when comparing the pericardium MOH55%[Gy] used in [3] with the corresponding dose metric for the pericardium segmented using our DL pipeline, the dose to the latter was, similarly to the trial hearts vs. DL hearts comparison, significantly higher (population median: 31Gy vs. 25Gy; p=1.2E-44). Of note, none of the DL pericardia were edited, which may have influenced this comparison; however, until the post-trial manually segmented atria, pericardium and ventricles are released (currently unavailable under TCIA), a further analysis of the observed difference is not possible. In two previous studies, low SVC dose has been found to correlate with OS in early stage NSCLC [16] and also with treatment-induced cardiac toxicity in LA-NSCLC [14]. Dose to the ‘base of the heart’ (combined dose to ascending aorta, PA and SVC according to Figure 1 in [17]) has also been found to predict OS in LA-NSCLC [17]. These findings and the stronger association between pericardium MOH55%[Gy] and OS compared to that between the mean dose to the DL hearts and OS in the current study (p=8.2E-8 in [3] vs. p=2.5E-6) may indicate that a higher degree of blood irradiation using the pericardium compared to the heart leads to immunosuppression, which causes death in LA-NSCLC [18, 19]. Interestingly, but in a much smaller LA-NSCLC cohort (n=94), a similar result was previously reported: pericardium dose but not heart dose was associated with OS (p=0.01 for pericardium V30 and V55 in the multivariate OS model; cf. Table 3 in [20]).
The most pronounced volumetric differences between the trial and the DL hearts as per visual inspection were identified in the superior direction. No delineation guidelines have been disclosed for the 0617 trial hearts, but assuming that such guidelines were used, the superior border of the heart was not sufficiently communicated across the participating institutions. Further, there was a significant linear correlation between the majority of the volume similarity metrics and the heart dose differences between the DL hearts and the trial hearts (median p=3E-13), but the magnitude of the correlations was weak (median R2=0.13), and none of the volume similarity metrics significantly predicted OS (p=0.15–0.56). DSCV and Hausdorff95th were, across all metrics, most strongly associated with |DoseDL-DoseRTOG0617|, and the distance between the centers of mass of the two sets of contours in the axial plane performed as well (R2=0.32–0.51, R2=0.33–0.37, and R2=0.44–0.51, respectively). This metric captures the qualitatively observed inconsistencies in the superior direction of the trial hearts and could be a preferred similarity metric in scenarios in which systematic directional (inferior-superior) inconsistencies are noticed or suspected. The corresponding metrics in the sagittal and coronal planes were less strongly associated with the dose differences (R2=0.22–0.23 and R2=0.01–0.03). The APL, APLNorm, and DSCS, which are associated with the time saved during contouring compared to manual delineation [15], were most weakly associated with the heart dose differences between the two sets of contours. This could inform further development of volume similarity metrics that better account for both time savings and outcomes.
Interestingly, the linear correlation between the trial qualitative heart scoring and the corresponding dose differences was even weaker (median R2=0.01; p=0.02), and the qualitative scoring of the trial hearts was only weakly correlated with the quantitative volume similarity metrics (median Rs=0.13). Of note, only three trial hearts had been scored as ‘unacceptable deviations’ and 75% were scored as ‘per protocol’, i.e. no observed deviations. The quantitative scoring, in contrast, indicated considerable discrepancies between the DL and the trial hearts: e.g. the population median DSCV was 0.86 (range: 0.06–0.98), only 27% of the 442 patients had a DSCV>0.90, and 56% had a DSCV>0.85, meaning that the DSCV was 0.85 or lower in close to every other patient (44%).
Exploring our institutional heart dose-volume constraint V30<50% based on either the trial hearts or the DL hearts disclosed that the majority of the patients who died in the RTOG 0617 trial (241/271 patients; 89%) fulfilled this constraint, yet their median survival time was 15 months. This finding suggests the use of more conservative heart dose/volume constraints. Given the stronger association with OS using dose to the pericardium rather than to the heart in the RTOG 0617 data (p=8.2E-8 in [3] vs. p=2.5E-6), constraints should probably be derived from the pericardium rather than from the heart.
To mitigate the largest source of errors in RT due to manual contouring, we recommend the use of clear anatomical guidelines in addition to systematic and quality assured segmentations of the concerned organ. For heart segmentation, our DL algorithm (which includes heart in addition to other cardiac substructures) is open-source and available through the CERR model library (https://github.com/cerr/CERR/wiki/Auto-Segmentation-models) [12]. The edited DL hearts used in this study and the non-edited DL hearts agreed better than did the edited DL hearts and the trial hearts, e.g. the median DSCV was 0.99 compared to 0.86, and the edits led to no statistically significant difference in the investigated heart DVH metrics (p=0.61–0.74). However, all generated DL segmentations should go through QA, similarly to contours from any other source such as manual segmentation, prior to use in clinical trials or before clinical implementation. Applying our DL algorithm and a QA process anchoring the resulting segmentations with the anatomical definitions given in the original publications should be straightforward (download the container from https://github.com/cerr/CERR/wiki/Auto-Segmentation-models, and incorporate it with local systems) and fast (a few seconds to apply the algorithm followed by qualitative/quantitative QA of the generated segmentations), and would importantly produce segmentations that are more consistent than manual contouring de novo and in a considerably shorter amount of time. In addition to consistent heart segmentation and in order to improve outcomes for these patients, our findings encourage more conservative pericardium/heart dose-volume constraints.
This study suggests the use of auto-segmentation to reduce contour and dose inconsistencies in order to improve the outcomes of clinical trials. Using auto-segmentation, and exemplified for the RTOG 0617 trial, this study demonstrates that inconsistent heart segmentations led to an inability to avoid dose to the heart, which is likely to have had a negative impact on OS in this trial.
Supplementary Material
Acknowledgments
Funding: This study was funded in part through the NIH/NCI Cancer Center Support Grant P30 CA008748, and through NIH/NCI R01 CA198121
Footnotes
Clinical trial information and data sharing: This study used the randomized phase III locally-advanced non-small cell lung cancer RTOG 0617 trial data, which are publicly available under The Cancer Imaging Archive.
Conflict of interest: No related conflict of interest reported
References
- [1] Bradley JD, Paulus R, Komaki R, et al. Standard-dose versus high-dose conformal radiotherapy with concurrent and consolidation carboplatin plus paclitaxel with or without cetuximab for patients with stage IIIA or IIIB non-small-cell lung cancer (RTOG 0617): a randomized, two-by-two factorial phase 3 study. Lancet Oncol 2015;16:187–99.
- [2] Bradley JD, Hu C, Komaki RR, et al. Long-term results of NRG Oncology RTOG 0617: Standard-versus high-dose chemoradiotherapy with or without cetuximab for unresectable stage III non-small cell lung cancer. J Clin Oncol 2020;38:706–14.
- [3] Thor M, Deasy JO, Hu C, et al. Modeling the impact of cardio-pulmonary irradiation on overall survival in NRG Oncology trial RTOG 0617. Clin Can Res 2020 [online ahead of print].
- [4] Lawton CAF, Michalski J, El-Naqa I, et al. Variation in the definition of clinical target volumes for pelvic nodal conformal radiation therapy for prostate cancer. Int J Radiat Oncol Biol Phys 2009;74:377–82.
- [5] Hong TS, Tomé WA, and Harari PM. Heterogeneity in head and neck IMRT target design and clinical practice. Radiother Oncol 2012;103:92–8.
- [6] van Rooij W, Dahele M, Ribeiro Brandao H, et al. Deep learning-based delineation of head and neck organs at risk: geometric and dosimetric evaluation. Int J Radiat Oncol Biol Phys 2019;104:677–84.
- [7] Roach D. Chapter 5. Clinical impact of contouring variability on patient outcome in “Looking back to move forward: retrospective automated analysis of prostate radiotherapy trials data”. PhD thesis. University of New South Wales, Sydney. Available at: http://unsworks.unsw.edu.au/fapi/datastream/unsworks:71654/SOURCE02
- [8] Jameson MG, Holloway LC, Vial PJ, Vinod SK, and Metcalfe PE. A review of methods of analysis in contouring studies for radiation oncology. J Med Imaging Radiat Oncol 2010;54:401–10.
- [9] Roach D, Jameson MG, Dowling JA, et al. Correlations between contouring similarity metrics and simulated treatment outcome for prostate radiotherapy. Phys Med Biol 2018;63:1–14.
- [10] Deasy JO, Blanco AI, Clark VH. CERR: a computational environment for radiotherapy research. Med Phys 2003;30:979–85.
- [11] Haq R, Hotca A, Apte A, Rimner A, Deasy JO, Thor M. Cardiopulmonary substructure segmentation of radiotherapy computed tomography images using convolutional neural networks for clinical outcomes analysis. Phys Imaging Radiat Oncol 2020 [accepted for publication].
- [12] Apte A, Iyer I, Thor M, et al. Library of deep-learning image segmentation and outcomes model-implementations. Phys Med 2020;73:190–196.
- [13] Feng M, Moran JM, Koelling R, et al. Development and validation of a heart atlas to study cardiac exposure to radiation following treatment for breast cancer. Int J Radiat Oncol Biol Phys 2011;79:10–8.
- [14] Hotca A, Thor M, Deasy JO, and Rimner A. Dose to the cardio-pulmonary system and treatment-induced electrocardiogram abnormalities in locally advanced non-small cell lung cancer. Clin Transl Radiat Oncol 2019;19:96–102.
- [15] Vaassen F, Hazelaar C, Vaniqui A, et al. Evaluation of measures for assessing time-saving of automatic organ-at-risk segmentation in radiotherapy. Phys Imaging Radiat Oncol 2020;13:1–6.
- [16] Stam B, Peulen H, Guckenberger M, et al. Dose to heart substructures is associated with non-cancer death after SBRT in stage I-II NSCLC patients. Radiother Oncol 2017;123:370–5.
- [17] McWilliam A, Kennedy J, Hodgson C, et al. Radiation dose to heart base linked with poorer survival in lung cancer patients. Eur J Cancer 2017;85:106–13.
- [18] Contreras JA, Lin AJ, Weiner A, et al. Cardiac dose is associated with immunosuppression and poor survival in locally advanced non-small cell lung cancer. Radiother Oncol 2018;128:498–504.
- [19] Thor M, Montovano M, Hotca A, et al. Are unsatisfactory outcomes after concurrent chemoradiotherapy for locally advanced non-small cell lung cancer due to treatment-related immunosuppression? Radiother Oncol 2020;143:51–7.
- [20] Xue J, Han C, Jackson A, et al. Doses of radiation to the pericardium, instead of heart, are significant for survival in patients with non-small cell lung cancer. Radiother Oncol 2019;133:213–9.