Abstract
Background
Patients with early‐stage non‐small cell lung cancer (NSCLC) typically receive surgery as their primary form of treatment. However, studies have shown that a high proportion of these patients will experience a recurrence after their resection, leading to an increased risk of death. Cancer staging is currently the gold standard for establishing a patient's prognosis and can help clinicians determine patients who may benefit from additional therapy. However, medical images, which are used to help determine the cancer stage, have been shown to hold unutilized prognostic information that can augment clinical data and better identify high‐risk NSCLC patients. There remains an unmet need for models to incorporate clinical, pathological, surgical, and imaging information, and extend beyond the current staging system to assist clinicians in identifying patients who could benefit from additional therapy immediately after surgery.
Purpose
We aimed to determine whether a deep learning model (DLM) integrating FDG PET and CT imaging from the thoracic cavity along with clinical, surgical, and pathological information can predict NSCLC recurrence‐free survival (RFS) and stratify patients into risk groups better than conventional staging.
Materials and methods
Surgically resected NSCLC patients enrolled between 2009 and 2018 were retrospectively analyzed from two academic institutions (local institution: 305 patients; external validation: 195 patients). The thoracic cavity (including the lungs, mediastinum, pleural interfaces, and thoracic vertebrae) was delineated on the preoperative FDG PET and CT images and combined with each patient's clinical, surgical, and pathological information. Using the local cohort of patients, a multi‐modal DLM using these features was built in a training cohort (n = 225), tuned on a validation cohort (n = 45), and evaluated on testing (n = 35) and external validation (n = 195) cohorts to predict RFS and stratify patients into risk groups. The area under the curve (AUC), Kaplan–Meier curves, and log‐rank test were used to assess the prognostic value of the model. The DLM's stratification performance was compared to the conventional staging stratification.
Results
The multi‐modal DLM incorporating imaging, pathological, surgical, and clinical data predicted RFS in the testing cohort (AUC = 0.78 [95% CI:0.63–0.94]) and external validation cohort (AUC = 0.66 [95% CI:0.58–0.73]). The DLM significantly stratified patients into high, medium, and low‐risk groups of RFS in both the testing and external validation cohorts (multivariable log‐rank p < 0.001) and outperformed conventional staging. Conventional staging was unable to stratify patients into three distinct risk groups of RFS (testing: p = 0.94; external validation: p = 0.38). Lastly, the DLM displayed the ability to further stratify patients significantly into sub‐risk groups within each stage in the testing (stage I: p = 0.02, stage II: p = 0.03) and external validation (stage I: p = 0.05, stage II: p = 0.03) cohorts.
Conclusion
This is the first study to use multi‐modality imaging along with clinical, surgical, and pathological data to predict RFS of NSCLC patients after surgery. The multi‐modal DLM better stratified patients into risk groups of poor outcomes when compared to conventional staging and further stratified patients within each staging classification. This model has the potential to assist clinicians in better identifying patients that may benefit from additional therapy.
Keywords: computed tomography, deep learning, lung cancer, multi‐modality, positron emission tomography, quantitative imaging
1. INTRODUCTION
Surgery is currently the primary form of treatment for patients with early‐stage non‐small cell lung cancer (NSCLC). However, studies have shown that 30% to 55% of patients will develop a recurrence despite curative resection. Additionally, outcomes can vary based on the tumor biology, extent of the disease, and surgery type. 1 , 2 The current gold standard for patient prognostication is the tumor‐node‐metastasis (TNM) staging system. Yet, patient outcomes in response to complete resection can vary drastically within the same TNM stage group. 1 , 3 Due to this trend, the American Joint Committee on Cancer (AJCC) has identified the need for more personalized probabilistic predictions for patient outcomes that extend beyond the current staging system. 4 This endeavor for patient‐specific treatment and precision oncology has led to predictive models using different sources of data to estimate a patient's prognosis and potentially guide treatment. 5
Models created using quantitative features obtained from the tumor volumes on computed tomography (CT) and [18F] fluorodeoxyglucose (FDG) positron emission tomography (PET) images can predict outcomes in NSCLC patients and in some cases outperform the TNM stage. 6 , 7 , 8 , 9 , 10 , 11 Additionally, regions of interest (ROIs) in the peri‐tumoral region and vertebral bone marrow have been shown to be associated with patient outcomes, suggesting that areas outside of the tumor have prognostic value. 6 , 12
Deep learning (DL) techniques have been shown to augment or outperform traditional machine learning methods using radiomic features when predicting oncologic outcomes. 13 , 14 , 15 , 16 , 17 , 18 , 19 Most DL studies have utilized convolutional neural networks (CNNs) using one imaging modality for prediction. 20 Additionally, past studies tended to use 2D slices or 3D volumes of the tumor alone as the ROI for model development. No study to date has explored an integrated multi‐modality model of CT and PET imaging features using the entire thorax to examine all possible regions in the lung and thoracic vertebrae for prognosis in lung cancer at the time of surgery. These additional ROIs will build on prior work in the literature, which indicates that areas outside of the tumor are associated with patient outcomes. Moreover, a multi‐modality model of CT and PET imaging features integrated with clinical, surgical, and pathological variables for resected NSCLC prognostication does not exist.
The objective of this study was the development and external validation of a novel multi‐modality deep learning model (DLM) using the entire thoracic cavity on multi‐modal FDG PET‐CT imaging with clinical, surgical, and pathological data to predict the recurrence‐free survival (RFS) of NSCLC patients following surgery. We also investigated the impact of eliminating one of the modality inputs on model performance to determine the relative importance of each modality. Additionally, we aimed to compare the DLM's performance to conventional AJCC TNM staging to determine its added prognostic value.
2. MATERIALS AND METHODS
2.1. Patient dataset
Local institutional review board (IRB) approval (#115587) was obtained to conduct this retrospective study and patient informed consent was waived. This study assessed 667 NSCLC patients referred for upfront surgical resection between 2010 and 2018 from the local institution. Patients were excluded based on pre‐, peri‐, and post‐operative inclusion and exclusion criteria illustrated in Figure 1, which resulted in 305 patients for analysis. The external validation dataset was acquired from the external institution, with prior IRB approval. This dataset included 195 patients referred for upfront surgical resection between 2009 and 2017 based on the same inclusion criteria used in the local dataset. Patients at both centers received adjuvant therapies based on institutional clinical guidelines. Patient data and images were anonymized at both institutions.
FIGURE 1.

Exclusion criteria used in the study. Study flowchart showing all NSCLC patients considered for study from the local cohort and all pre‐, peri‐, and post‐operative exclusions used to determine the patients used in the study. M1—Metastatic disease as defined by the AJCC. AJCC, American Joint Committee on Cancer; NSCLC, non‐small cell lung cancer.
Pre‐operative FDG PET‐CT scans were acquired according to standardized imaging protocols at each institution. Images were acquired using a GE Discovery STE PET/CT scanner (GE Healthcare, Waukesha, Wisconsin) at both institutions. The PET images were converted to standardized uptake value (SUV) units and normalized by the patient's body weight. Routine coverage from base‐of‐skull to mid‐thigh was included in the image acquisition, with additional spot views where necessary. The CT images were acquired using the following acquisition protocols: 100–140 kVp (median: 140 kVp), 31–398 mA (median: 83 mA), in‐plane voxel size of 0.72–1.37 mm (median: 0.98 mm) and slice thickness of 2.5–5 mm (median: 3.75 mm). CT‐based attenuation correction was performed using a measured approach at both centers. Additionally, images were reconstructed using three‐dimensional (3D) iterative reconstruction at the local institution and ordered subset expectation maximization at the external validation institution. The PET images were acquired with an in‐plane voxel size of 3.65–5.47 mm (median: 5.47 mm) and a slice thickness of 3–5 mm (median: 3.27 mm).
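The body‐weight SUV conversion mentioned above can be sketched as follows. This is a minimal illustration of the standard body‐weight SUV definition, not the scanner or study implementation; it assumes the injected dose has already been decay‐corrected to scan time and that tissue density is 1 g/mL. The function name and the example values are hypothetical.

```python
def suv_body_weight(activity_bq_per_ml, injected_dose_bq, body_weight_kg):
    """Body-weight SUV: tissue activity divided by injected dose per gram of body weight.

    Assumes the dose is already decay-corrected and tissue density is 1 g/mL.
    """
    if injected_dose_bq <= 0 or body_weight_kg <= 0:
        raise ValueError("dose and weight must be positive")
    body_weight_g = body_weight_kg * 1000.0
    return activity_bq_per_ml * body_weight_g / injected_dose_bq


# Hypothetical voxel: 5 kBq/mL, 350 MBq decay-corrected dose, 70 kg patient.
example_suv = suv_body_weight(5000.0, 350e6, 70.0)
```

With these illustrative numbers the voxel has an SUV of 1, meaning its uptake equals the value expected if the tracer were distributed uniformly through the body.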
Clinical, pathological, and surgical variables were curated from medical records and existing research databases. Clinical variables included age, sex, histology, smoking status, pathological stage (based on the AJCC 6th, 7th, or 8th edition), surgery type, tumor site, margin status, lymphovascular invasion (LVI), and visceral pleural invasion (VPI). Absolute RFS was the outcome of interest and was defined as the time from surgical resection to evidence of disease recurrence or death by any cause or last known follow‐up. We chose this outcome since it has been extensively studied in the literature as a clinical endpoint for surgical NSCLC patient studies. 21 , 22 , 23 , 24 Recently, Rajaram et al. described the importance of predicting RFS for surgically resected NSCLC patients, the lack of studies reporting this clinical endpoint, and its potential as a surrogate of overall survival (OS). 25 Recurrence was diagnosed radiologically or pathologically, and was defined as local, regional, or distant disease present after surgical resection and was verified through manual review of electronic medical records.
Patients from the local dataset were split using a random number generator into training (n = 225), validation (n = 45), and testing (n = 35) cohorts using an approximate 75%/15%/10% split, while balancing the event rate across all three cohorts. The training cohort was used for model building, the validation cohort for model tuning, and the testing and external validation (n = 195) cohorts were held out for model evaluation. The differences in the clinical variables between the local and external cohorts were assessed using the chi‐square (χ²) test for categorical variables and an unpaired t‐test for continuous variables. This was performed to compare the baseline characteristics between the two patient cohorts.
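One way to obtain an event‐balanced random split like the one described above is to shuffle events and non‐events separately and cut each stratum with the same fractions. The sketch below is an assumption about how such a split could be done, not the authors' procedure; the function name, seed, and rounding rule are hypothetical, so the resulting cohort sizes only approximate the reported 225/45/35.

```python
import random


def stratified_split(events, fractions=(0.75, 0.15, 0.10), seed=0):
    """Split patient indices into cohorts while keeping the event rate similar in each."""
    rng = random.Random(seed)
    cohorts = [[] for _ in fractions]
    for label in sorted(set(events)):
        idx = [i for i, e in enumerate(events) if e == label]
        rng.shuffle(idx)
        start = 0
        for k, frac in enumerate(fractions):
            # The last cohort takes whatever remains so every patient is assigned once.
            if k == len(fractions) - 1:
                stop = len(idx)
            else:
                stop = start + round(frac * len(idx))
            cohorts[k].extend(idx[start:stop])
            start = stop
    return cohorts


# Local cohort outcome counts from Table 1: 169 patients with an RFS event, 136 without.
events = [1] * 169 + [0] * 136
train_idx, val_idx, test_idx = stratified_split(events)
```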
2.2. Data preprocessing for deep learning
Since CT and FDG PET images were acquired using different voxel spacing, all images were resampled to a voxel size of 1 mm × 1 mm × 2.5 mm using nearest‐neighbor interpolation. The final image volume of each scan was cropped to 300 × 300 × 150 voxels. The thorax volumes of interest (VOIs) on the CT and PET images were acquired using a custom‐built automated thorax segmentation algorithm. The mask included the lungs, mediastinum, pleural surfaces, and thoracic vertebrae as seen in Figure 2. Details describing the segmentation process are available in Supplementary Material A1.
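The resampling and cropping step can be illustrated one axis at a time. The sketch below maps each output voxel to its nearest input voxel and computes a centred crop window; it is a simplified illustration only. The helper names are hypothetical, the 512‐voxel matrix size is an assumed example, and the text does not state where the 300 × 300 × 150 crop was positioned, so the centred window is an assumption.

```python
import math


def nearest_source_indices(n_in, spacing_in, spacing_out):
    """For one axis, map each output voxel to the nearest input voxel (nearest-neighbor)."""
    n_out = max(1, round(n_in * spacing_in / spacing_out))
    indices = []
    for j in range(n_out):
        # Centre of output voxel j expressed in input-voxel coordinates.
        pos = (j + 0.5) * spacing_out / spacing_in - 0.5
        nearest = math.floor(pos + 0.5)
        indices.append(min(n_in - 1, max(0, nearest)))
    return indices


def center_crop_bounds(n, target):
    """Start/stop of a centred window of `target` voxels, clipped to the axis length."""
    start = max(0, (n - target) // 2)
    return start, min(n, start + target)


# Assumed example: a 512-voxel in-plane axis at the median CT spacing of 0.98 mm,
# resampled to 1 mm and then cropped to 300 voxels.
in_plane = nearest_source_indices(512, 0.98, 1.0)
start, stop = center_crop_bounds(len(in_plane), 300)
```

Nearest‐neighbor interpolation copies existing voxel values rather than blending them, so no new intensity values are introduced by resampling.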
FIGURE 2.

Result of the fully automatic thorax segmentation algorithm. (a) 3D representation of binary thorax mask with example (b) axial, (c) coronal, and (d) sagittal slices of the CT imaging with the binary mask applied. 3D, three‐dimensional; CT, computed tomography.
The pathological stage of each patient was converted to a 5‐year survival probability based on the AJCC staging manuals. 26 , 27 For example, Stage IIIA patients from the 6th, 7th, and 8th edition AJCC staging system were represented as 5‐year probabilities of 36%, 36%, and 41%, respectively. All continuous clinical variables were standardized using the z‐score transformation on the training cohort prior to model building and this transformation was applied to the validation, testing, and external validation cohorts. All categorical clinical variables were converted to one‐hot‐encoded format.
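The clinical preprocessing described above can be sketched with the standard library. The key point is that the mean and standard deviation are estimated on the training cohort only and then reused unchanged for the other cohorts. The helper names and example ages are hypothetical, the use of the population standard deviation is an assumption, and only the Stage IIIA probabilities quoted in the text are included in the lookup.

```python
from statistics import mean, pstdev

# 5-year survival probabilities for Stage IIIA by AJCC edition, as quoted in the text.
STAGE_IIIA_5YR = {"6th": 0.36, "7th": 0.36, "8th": 0.41}


def fit_zscore(train_values):
    """Estimate the z-score parameters on the training cohort only."""
    return mean(train_values), pstdev(train_values)


def apply_zscore(values, mu, sigma):
    """Apply training-cohort parameters to any cohort (validation, testing, external)."""
    return [(v - mu) / sigma for v in values]


def one_hot(value, categories):
    """Encode a categorical value as a 0/1 vector over a fixed category order."""
    return [1 if value == c else 0 for c in categories]


# Hypothetical ages for illustration.
mu, sigma = fit_zscore([60, 70, 80])
test_ages_z = apply_zscore([70, 78], mu, sigma)
surgery = one_hot("Lobectomy", ["Sub-lobar resection", "Lobectomy", "Pneumonectomy"])
```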
2.3. Deep learning model
PyTorch Version 2.01 was used to develop, train, tune, and evaluate our model. The model was made up of three separate networks: a 3D CNN for CT images, a 3D CNN for FDG PET images, and a single‐layer perceptron (SLP) network for clinical variables (Figure 3). The model produced a risk score between 0 and 1 from the resulting logistic activation function to predict RFS, indicating the RFS probability of the patient after their resection. The trained model was locked and evaluated on the testing and external validation cohorts. All experiments were conducted using an NVIDIA RTX 3090 GPU.
FIGURE 3.

The multi‐modal DLM for RFS prediction. This model was comprised of three networks: two CNNs for the CT and PET imaging respectively and a single layer perceptron network for the clinical variables. Each network produced an input‐attributed risk score (XCT, XPET, XClin) which was concatenated in the penultimate layer and resulted in a normalized risk score (Ro) between 0 and 1 for each patient. CT, computed tomography; CNNs, convolutional neural networks; DLM, deep learning model; RFS, recurrence‐free survival.
2.4. Deep learning architecture
The CNN networks for the imaging inputs were comprised of six convolutional layers of 8, 16, 32, 64, 128, and 256 filters, respectively. This was followed by three fully connected (FC) layers. The kernel sizes of the convolution operations were 7 × 7 × 5 for layers 1 and 2, and 5 × 5 × 3 for layers 3 through 6. A stride of 2, padding of 1, and a dropout rate of 25% were used across all convolution layers. The FC layers were comprised of 6912, 256, and 64 units, respectively. The SLP network for the clinical data input consisted of one FC layer mimicking a logistic regression model. The outputs from the three networks (CT CNN, PET CNN, and clinical SLP) were then flattened, concatenated, and fed into one FC layer comprised of 3 units. All CNN FC layers used a dropout rate of 25% and the SLP used a dropout rate of 10%. Subsequently, a sigmoid layer was employed to determine the prediction probabilities. Instance normalization was used for layers 1 through 5 and batch normalization was used for the last layer. The leaky rectified linear unit (leaky ReLU) activation function was applied across all convolutional and FC layers.
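As a consistency check on the layer description above, the standard convolution output‐size formula, floor((n + 2p − k) / s) + 1, can be applied to the 300 × 300 × 150 input. The sketch assumes no pooling or dilation and that the kernel dimensions are listed in the same axis order as the volume; under those assumptions the flattened feature count equals the 6912 units of the first FC layer.

```python
def conv_out(size, kernel, stride=2, padding=1):
    """Output length of one axis after a convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


volume = (300, 300, 150)                      # cropped input volume in voxels
kernels = [(7, 7, 5)] * 2 + [(5, 5, 3)] * 4   # layers 1-2, then layers 3-6
filters = [8, 16, 32, 64, 128, 256]

shape = volume
for kernel in kernels:
    shape = tuple(conv_out(s, k) for s, k in zip(shape, kernel))

# Number of features entering the first fully connected layer.
flattened = filters[-1] * shape[0] * shape[1] * shape[2]
```

The spatial size shrinks from 300 × 300 × 150 to 3 × 3 × 3 over the six layers, and 256 × 27 = 6912.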
During the training process, the Adam optimizer was used with a learning rate of 1 × 10⁻³ and L2 regularization of 1 × 10⁻⁶. Images were fed into the model in batches of 5 and the training epoch limit was 50 epochs. The training and validation losses were calculated using the binary cross entropy loss function. The validation loss was monitored, and early stopping was employed to select the model with the best performance on the validation cohort. This method was done to avoid overfitting on the training cohort.
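The loss and model‐selection logic described above can be sketched independently of any deep learning framework. The example below computes the binary cross entropy and picks the epoch with the lowest validation loss; the `patience` parameter and the loss values are hypothetical, since the text does not report the stopping criterion used.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross entropy; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def best_epoch(val_losses, patience=5):
    """Return (epoch, loss) of the best validation loss, stopping after `patience` stale epochs."""
    best_idx, best_loss = 0, float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_idx, best_loss = epoch, loss
        elif epoch - best_idx >= patience:
            break  # validation loss has not improved for `patience` epochs
    return best_idx, best_loss


# Hypothetical validation-loss trace for illustration.
selected = best_epoch([0.70, 0.62, 0.58, 0.60, 0.61, 0.63, 0.50], patience=3)
```

In this trace the search stops before reaching the final value, so the weights from epoch index 2 would be kept.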
Three additional models were created to compare the importance of each modality to model performance. These models were: (1) CT and clinical variables, (2) PET and clinical variables, and (3) PET and CT. Each model utilized the same architecture and training process as the primary model with all three inputs. The performance of each model was compared to the primary model with all three modalities.
2.5. Model assessment and statistical analysis
The model's ability to predict RFS was evaluated using the area under the receiver operating characteristic curve (AUC) and the hazard ratio (HR). Kaplan–Meier curves were generated to analyze the ability of the model to stratify patients into high, medium, and low‐risk groups of RFS. Upper and lower quartile risk scores in the training cohort were used to stratify patients into the three risk groups. Additionally, the ability to stratify patients within the same stage group was analyzed. The log‐rank test was used to determine the statistical significance between each determined group, and significance was denoted by a p‐value less than 0.05. DLM stratification performance was compared to the conventional AJCC stage of the patient to determine the model's added value in risk stratification and RFS prediction. Statistical analysis was conducted in Python 3.10.2 using the Lifelines package. 28
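The stratification and survival estimation steps can be sketched as follows. The study used the Lifelines package; the pure‐Python version below is only an illustration of the same ideas. It assumes that a higher score means higher risk and that scores on a threshold fall into the outer groups, neither of which is stated in the text. The function names and example values are hypothetical.

```python
from statistics import quantiles


def quartile_thresholds(train_scores):
    """Lower and upper quartile of the training-cohort risk scores."""
    q1, _, q3 = quantiles(train_scores, n=4, method="inclusive")
    return q1, q3


def risk_group(score, q1, q3):
    """Assign a risk group using thresholds fixed on the training cohort."""
    if score >= q3:
        return "high"
    if score <= q1:
        return "low"
    return "medium"


def kaplan_meier(times, events):
    """Product-limit estimate: list of (time, survival) at each distinct event time."""
    order = sorted(zip(times, events))
    at_risk = len(order)
    survival = 1.0
    curve = []
    i = 0
    while i < len(order):
        t = order[i][0]
        n_events = n_removed = 0
        while i < len(order) and order[i][0] == t:
            n_events += order[i][1]   # 1 = recurrence or death, 0 = censored
            n_removed += 1
            i += 1
        if n_events:
            survival *= 1.0 - n_events / at_risk
            curve.append((t, survival))
        at_risk -= n_removed
    return curve


# Hypothetical training scores and follow-up data for illustration.
q1, q3 = quartile_thresholds([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9])
groups = [risk_group(s, q1, q3) for s in (0.85, 0.5, 0.1)]
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
```

Fixing the thresholds on the training cohort means the testing and external validation patients are grouped without any information from their own outcomes.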
3. RESULTS
3.1. Patient demographics
Table 1 summarizes the demographics of the local and external validation cohorts. The external validation cohort was younger (p = 0.002), had a different surgery type distribution (p = 0.007), and different LVI distributions (p = 0.02) when compared to the local cohort. However, there was no difference in the proportion of patients who recurred (p = 0.82) or died (p = 0.12) between the cohorts. The median follow‐up time in the local and external validation cohorts was 57 months and 60 months, respectively. The demographics of the training, validation, and testing cohorts can be found in Table S1.
TABLE 1.
Clinical variables for the local and external validation cohorts.
| Clinical variables | Local cohort (n = 305) | External validation cohort (n = 195) | p value |
|---|---|---|---|
| Mean age [range] | 70 [27–91] | 67 [29–93] | 0.002 |
| Sex | | | 0.18 |
| Men | 138 (42%) | 101 (52%) | |
| Histology | | | 0.10 |
| Adenocarcinoma | 205 (67%) | 136 (70%) | |
| Squamous cell carcinoma | 89 (29%) | 54 (28%) | |
| Large cell carcinoma | 4 (1%) | 3 (1%) | |
| Adenosquamous carcinoma | 7 (2%) | 0 (0%) | |
| NSCLC NOS | 0 (0%) | 2 (1%) | |
| Smoking status | | | 0.15 |
| Former | 150 (49%) | 111 (57%) | |
| Current | 113 (37%) | 56 (29%) | |
| Non‐smoker | 42 (14%) | 28 (14%) | |
| Outcome | | | 0.53 |
| Alive | 136 (45%) | 89 (46%) | |
| Recurrence alone | 21 (7%) | 20 (10%) | |
| Death alone | 66 (21%) | 39 (20%) | |
| Recurrence and death | 82 (27%) | 47 (24%) | |
| Pathological stage a | | | 0.18 |
| I | 173 (57%) | 118 (60%) | |
| II | 84 (27%) | 40 (21%) | |
| III | 48 (16%) | 37 (19%) | |
| Post‐op therapy | | | 0.05 |
| Yes | 79 (26%) | 35 (18%) | |
| Surgery type | | | 0.007 |
| Sub‐lobar resection | 44 (14%) | 50 (26%) | |
| Lobectomy | 246 (81%) | 138 (71%) | |
| Pneumonectomy | 15 (5%) | 7 (3%) | |
| Primary site | | | 0.14 |
| RUL | 113 (37%) | 79 (41%) | |
| RML | 20 (7%) | 7 (3%) | |
| RLL | 56 (18%) | 31 (16%) | |
| LUL | 67 (22%) | 56 (29%) | |
| LLL | 49 (16%) | 22 (11%) | |
| Margin status | | | 0.09 |
| Negative (> 5 mm) | 216 (71%) | 155 (80%) | |
| Close (1–5 mm) | 77 (25%) | 34 (17%) | |
| Positive (< 1 mm) | 12 (4%) | 6 (3%) | |
| Lympho‐vascular invasion | | | 0.02 |
| Not identified | 225 (74%) | 141 (72%) | |
| Indeterminate | 11 (4%) | 0 (0%) | |
| Present | 69 (23%) | 54 (28%) | |
| Visceral pleural invasion | | | 0.18 |
| Not identified | 195 (64%) | 131 (67%) | |
| Indeterminate | 5 (2%) | 0 (0%) | |
| Present | 105 (34%) | 64 (33%) | |
Note: Unless otherwise indicated, data are presented as number of patients, with percentages in parentheses.
a Based on American Joint Committee on Cancer (AJCC) 6th, 7th, and 8th Editions.
3.2. Deep learning model performance for RFS prognostication and stratification
The DLM was found to be a significant predictor of RFS in all cohorts. In the training cohort, the DLM achieved an AUC = 0.75 [95% CI: 0.67–0.81]. In the testing cohort, the DLM achieved an AUC = 0.78 [95% CI: 0.63–0.94]. Additionally, the model was able to significantly stratify patients into risk groups of RFS in the testing cohort (HR = 4.81 [95% CI: 2.05–11.29]; multivariable log‐rank: p < 0.001) as shown by the RFS Kaplan–Meier curves in Figure 4a. The median, 2, 5, and 10‐year RFS and relative HRs of the high‐, medium‐, and low‐risk groups in the testing cohort can be seen in Table 2. The performance in the training and validation cohorts can be found in the Supplementary Material A2.
FIGURE 4.

KM curves for the multi‐modal DLM and conventional AJCC stage groups in the testing cohort (n = 35). KM curves separating resected NSCLC patients into high, medium, and low‐risk groups using the (a) DLM and (b) conventional AJCC stage. Patients were stratified using the upper and lower quartile risk scores in the training cohort. + indicates censored data. AJCC, American Joint Committee on Cancer; DLM, deep learning model; KM, Kaplan–Meier; NSCLC, non‐small cell lung cancer.
TABLE 2.
RFS information for the DLM and conventional stage in the testing and external validation cohorts.
| | DLM risk group | | | | AJCC staging | | | |
|---|---|---|---|---|---|---|---|---|
| Testing cohort | Low (n = 6) | Medium (n = 20) | High (n = 9) | LR p‐value | I (n = 20) | II (n = 9) | III (n = 6) | LR p‐value |
| HR (95% CI) | 1.00 (Ref) | 2.90 (0.37–22.94) | 15.79 (1.93–129.01) | ** | 1.00 (Ref) | 3.10 (1.04–9.25) | 2.91 (0.94–8.97) | * |
| Median RFS (years) | 5.76 | 4.98 | 1.33 | – | 5.69 | 1.80 | 2.01 | – |
| 2‐year RFS | 83% | 85% | 22% | NS | 85% | 85% | 50% | NS |
| 5‐year RFS | 83% | 70% | 11% | * | 74% | 33% | 33% | NS |
| 10‐year RFS | 83% | 44% | 0% | N/A | 52% | 33% | 17% | N/A |
| External validation cohort | Low (n = 54) | Medium (n = 101) | High (n = 40) | LR p‐value | I (n = 118) | II (n = 40) | III (n = 37) | LR p‐value |
| HR (95% CI) | 1.00 (Ref) | 1.82 (1.09–3.04) | 3.30 (1.87–5.84) | ** | 1.00 (Ref) | 2.06 (1.29–3.30) | 2.60 (1.63–4.14) | ** |
| Median RFS (years) | 5.15 | 3.42 | 1.65 | – | 4.94 | 1.67 | 1.31 | – |
| 2‐year RFS | 83% | 69% | 48% | ** | 83% | 51% | 41% | ** |
| 5‐year RFS | 66% | 50% | 31% | * | 62% | 35% | 31% | * |
| 10‐year RFS | 60% | 33% | 13% | ** | 42% | 31% | 23% | NS |
Abbreviations: AJCC, American Joint Committee on Cancer; DLM, deep learning model; HR, hazard ratio; LR, log‐rank; NS, non‐significant; N/A, not available (due to absence of data at time point); RFS, recurrence‐free survival.
*p < 0.05.
**p < 0.001.
3.3. Stage group performance for RFS prognostication and stratification
The patients were grouped into their conventional AJCC‐defined pathological stages (I, II, III). Stage stratified patients into risk groups of RFS in the testing cohort; however, this stratification was only borderline significant (HR = 1.77 [95% CI: 1.05–2.97]; multivariable log‐rank: p = 0.049), as shown in Figure 4b. This was a result of the difficulty in separating stage I from stage II patients (log‐rank: p = 0.05) and stage II from stage III patients (log‐rank: p = 0.94). The median, 2, 5, and 10‐year RFS and relative HRs for each stage can be seen in Table 2. The DLM outperformed stage in the testing cohort when stratifying patients (multivariable log‐rank: p < 0.001 vs. p = 0.049), demonstrating that the DLM augmented the prognostic power of stage.
3.4. External validation
In the external validation cohort, the DLM was found to be a significant predictor of RFS (AUC = 0.66 [95% CI: 0.58–0.73]). The DLM and stage both significantly stratified patients into risk groups (DLM: HR = 1.82 [95% CI: 1.37–2.41]; multivariable log‐rank: p < 0.001 versus stage: HR = 1.64 [95% CI: 1.31–2.05]; multivariable log‐rank: p < 0.001) as shown by the KM curves in Figure 5a,b. Additionally, the relative HRs shown in Table 2 highlight the improvement in the performance of the DLM over conventional staging. The DLM outperformed stage due to its ability to significantly stratify medium and high‐risk patients, while stage was unable to do so for stage II and III patients (log‐rank; DLM: p = 0.008, stage: p = 0.38). While the median 2‐year and 5‐year RFS were similar between stage and the DLM in the external validation cohort, the DLM was able to identify more patients that experienced an event at 10 years when compared to the stage III group (DLM 10‐year RFS: 13% vs. stage 10‐year RFS: 23%) as seen in Table 2. The DLM was also able to identify more low‐risk patients when compared to the stage I group (DLM 10‐year RFS: 60% vs. stage I 10‐year RFS: 42%) as seen in Table 2. This is further reflected in the 10‐year log‐rank test where the DLM achieved a significant stratification (p < 0.001), and stage did not (p = 0.06).
FIGURE 5.

KM curves for the multi‐modal DLM and conventional AJCC stage groups in the external validation cohort (n = 195). KM curves separating resected NSCLC patients into high, medium, and low‐risk groups using the (a) DLM and (b) conventional AJCC stage. Patients were stratified using the upper and lower quartile risk scores in the training cohort. + indicates censored data. AJCC, American Joint Committee on Cancer; DLM, deep learning model; KM, Kaplan–Meier; NSCLC, non‐small cell lung cancer.
3.5. Risk‐stratification within each stage group
Within each stage group, in both the testing and external validation cohorts, the DLM was able to significantly stratify patients into risk groups (Figure 6). In the testing cohort, high‐risk stage I (n = 3) patients (multivariable log‐rank: p = 0.02) identified by the DLM displayed a median RFS of 1.78 years while medium (n = 12) and low‐risk (n = 5) stage I patients had a median RFS of 5.29 and 5.78 years, respectively. Stage II and III patients in the testing cohort were deemed medium and high‐risk patients and were significantly stratified into those two groups (log‐rank; stage II: p = 0.03, stage III: p = 0.049).
FIGURE 6.

KM curves in the (left) testing (n = 35) and (right) external validation (n = 195) cohorts determined by the multi‐modal DLM within each stage. KM curves separating resected NSCLC patients into high, medium, and low‐risk groups in (a) stage I, (b) stage II, and (c) stage III. Patients were stratified using the upper and lower quartile risk scores in the training cohort. + indicates censored data. DLM, deep learning model; KM, Kaplan–Meier; NSCLC, non‐small cell lung cancer.
In the external validation cohort, high‐risk stage I (n = 15) patients identified by the DLM displayed a median RFS of 3.38 years, while medium (n = 59) and low‐risk (n = 44) stage I patients had a median RFS of 4.92 and 4.98 years, respectively. The stratification displayed borderline significance when analyzed using the multivariable log‐rank test (p = 0.05) but displayed statistical significance when only considering the high and low‐risk groups (log‐rank: p = 0.02). The main difference between the groups was observed after 5 years post‐resection. Unlike the testing cohort, the external validation cohort identified some patients as being low risk who were conventionally staged as II and III. While the multivariable log‐rank test did not display significance for stage II and III stratification, similar to the stage I results, the log‐rank test between the high‐ and low‐risk groups for stage II patients demonstrated a significant stratification (log‐rank: p = 0.03). The median RFS for the high (n = 10), medium (n = 23), and low‐risk (n = 7) groups identified by the DLM for the stage II patients were 1.11, 1.34, and 7.90 years, respectively. The stage III patient results did not display significant stratification; however, a qualitative separation was observed at 5 years post‐resection. A lack of significance could be due to the low number of patients in the low‐risk group.
3.6. Modality importance
The three separate models created, where one modality was removed in each, demonstrated the relative importance of each modality to model performance. The PET and CT model achieved AUCs of 0.55 [95% CI: 0.36–0.75] and 0.57 [95% CI: 0.49–0.65] in the testing and external validation cohorts, respectively. This imaging‐only model did not generalize well from the training to the evaluation datasets. The model with PET imaging and clinical variables achieved AUCs of 0.72 [95% CI: 0.55–0.88] and 0.62 [95% CI: 0.54–0.70] in the testing and external validation cohorts, respectively. Lastly, the model with CT imaging and clinical variables achieved AUCs of 0.74 [95% CI: 0.57–0.90] and 0.66 [95% CI: 0.59–0.74] in the testing and external validation cohorts, respectively. Additionally, the final weights from the model using all three modalities agreed with the results from the three models, where the CT, PET, and clinical variables inputs had weights of 0.3062, −0.0037, and 0.5706, respectively. However, only the model with PET, CT, and clinical information was able to significantly stratify patients into risk groups in all cohorts.
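To illustrate what the reported final weights imply, the sketch below treats the last layer as a weighted sum of the three branch outputs followed by a sigmoid. This is a simplification under stated assumptions: the bias term is not reported and is set to zero here, and the branch outputs used as inputs are hypothetical values rather than model outputs.

```python
import math

# Final-layer weights reported in the text for the CT, PET, and clinical branches.
FUSION_WEIGHTS = {"ct": 0.3062, "pet": -0.0037, "clin": 0.5706}


def fused_risk(x_ct, x_pet, x_clin, bias=0.0):
    """Late fusion: weighted sum of branch scores passed through a sigmoid.

    The bias is an assumption (not reported in the text).
    """
    z = (FUSION_WEIGHTS["ct"] * x_ct
         + FUSION_WEIGHTS["pet"] * x_pet
         + FUSION_WEIGHTS["clin"] * x_clin
         + bias)
    return 1.0 / (1.0 + math.exp(-z))


# Effect of moving one hypothetical branch score from 0 to 1, others held at 0.5.
pet_effect = fused_risk(0.5, 1.0, 0.5) - fused_risk(0.5, 0.0, 0.5)
ct_effect = fused_risk(1.0, 0.5, 0.5) - fused_risk(0.0, 0.5, 0.5)
clin_effect = fused_risk(0.5, 0.5, 1.0) - fused_risk(0.5, 0.5, 0.0)
```

Under these assumptions, a full‐range change in the PET branch shifts the fused score by less than 0.001, whereas the same change in the clinical branch shifts it by more than 0.1, which matches the ordering seen in the ablation models.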
4. DISCUSSION
Quantitative imaging information within ROIs inside the thoracic cavity on FDG‐PET and CT has been associated with outcomes in NSCLC patients. However, the combination of multi‐modal imaging and surgical information has never been investigated for its prognostication ability. In this study, we developed and evaluated the performance of a DLM for RFS stratification of resected NSCLC patients using the entire thorax on FDG PET and CT imaging, along with clinical, surgical, and pathological data. The DLM was able to predict RFS and significantly stratify patients into high, medium, and low‐risk groups. Although there was some crossing of the survival curves at early time points, we demonstrated good separation of all three risk groups for most time points. The developed model was able to provide significantly improved prognostic stratification when compared to conventional staging (p < 0.001 vs. p = 0.049). Although the internal testing dataset was relatively small (n = 35), the model was externally validated on a large dataset and outperformed conventional staging. Additionally, the model was able to significantly further stratify patients within the same stage group.
These results suggest that this multi‐modal DLM can augment clinical information and identify patients that may be at a high risk of treatment failure contrary to their given pathological stage. This model could be utilized to better identify the high‐risk population of patients that would benefit from additional therapies or more aggressive surveillance after treatment. Additionally, this model has the potential to rightfully select patients who may not benefit from additional therapies or those who are less likely to experience an event. This would be advantageous to patients as adjuvant therapies carry some risks for adverse events. However, the clinical implementation and utility of the model in affecting treatment decisions need to be investigated in a prospective clinical trial.
The importance of each modality to the model's performance was investigated and highlighted that CT imaging and clinical variables were integral to the stratification results. The PET imaging was shown to have a negligible effect; however, only the model with all three sources of information was able to significantly stratify all patient cohorts into risk groups of RFS. The poor performance of PET imaging could be the result of its lower resolution and higher noise. This could have led to the model determining that PET information was redundant when aiming to minimize model complexity and avoid overfitting. Future studies should aim to investigate if PET imaging from the entire thoracic cavity or whole‐body PET can augment PET imaging local to the tumor.
This work builds on the field of traditional radiomics and outcome prediction where studies have found prognostic imaging features. The advantage of the developed model is the use of information from areas other than the tumor and from multiple sources of information, which have been shown to hold predictive value. 6 , 12 Recent work has demonstrated that models combining both CT and PET imaging can outperform those using a single modality when predicting NSCLC outcomes. 9 , 10 While DL studies can identify higher‐order patterns within imaging, results have remained relatively consistent with single modality studies. Hosny et al. achieved an AUC of 0.71 in their surgery cohort when predicting 2‐year OS using CT images. 13 Sasaki et al. achieved AUCs above 0.80 when predicting post‐operative recurrence using CT images; however, these results were found using cross‐validation, which could potentially overestimate the performance of the model when applied to unseen data. 19 , 29 Additionally, these studies only investigated the tumor and its immediate surroundings for prognostic information. Oh et al. expanded their ROI by investigating the ability of whole‐body PET images for predicting NSCLC OS and were able to achieve a cross‐validation concordance index of 0.76. 17 Furthermore, many of the developed models in the literature are not compared to baseline information like the stage. Our work improves upon the current literature by incorporating clinical information with the entire thorax on both CT and PET imaging and comparing the developed model to conventional staging. Additionally, our work evaluates the performance of an operating point that may be used clinically rather than relying on the AUC alone to assess model performance.
This study has several limitations that should be noted. To ascertain the clinical utility of the DLM, a prospective study would be required to determine whether improved risk stratification would change physician behavior. Prior to that, further validation studies in external cohorts and DLM integration into existing imaging workstations are required. Additionally, the difference in the AJCC staging edition of the collected patients could confound the model during both training and evaluation because some patients may have been reclassified between editions. However, the large proportion of stage I patients in this study minimizes this limitation. Another limitation is the lack of genetic biomarkers among our clinical variables, owing to the time frame from which our dataset was drawn. Markers such as programmed death‐ligand 1 (PD‐L1) and epidermal growth factor receptor (EGFR) are currently used to determine patient prognosis and treatment response. 30 Future studies should examine these markers to investigate whether they can augment DLMs that use clinical and imaging information.
Additionally, while RFS is a well‐studied clinical endpoint and has been shown to closely mirror how a treatment impacts OS, alternate endpoints such as OS and recurrence‐free interval could provide a clearer description of the clinical benefit of the proposed model. Furthermore, the distribution of some clinical variables differed significantly between the local and external cohorts. While these patient population differences may have impacted model performance, our model was still able to outperform TNM stage, suggesting that it may be able to account for demographic differences and generalize to external data. Future studies should investigate updating model weights using a small percentage of external data. Similarly, additional refinements to the model architecture could be investigated, such as early fusion rather than the late fusion used in this study. Lastly, while one strength of the developed model was its use of imaging acquired from a single scan, future studies should investigate whether using diagnostic CT scans provides improved performance.
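The difference between early and late fusion can be sketched conceptually as follows. This is an illustration under stated assumptions, not the architecture of the study's model: the `encode_mean_std` function is a hypothetical stand‐in for a learned encoder, and the inputs are toy vectors rather than image volumes. In early fusion the modalities are combined before any encoding; in late fusion each imaging modality is encoded separately and the resulting features are combined, here together with the clinical variables, just before the prediction head.

```python
from typing import List

Vector = List[float]


def encode_mean_std(x: Vector) -> Vector:
    """Hypothetical stand-in for a learned encoder: summarizes an input as [mean, std]."""
    mean = sum(x) / len(x)
    variance = sum((v - mean) ** 2 for v in x) / len(x)
    return [mean, variance ** 0.5]


def early_features(ct: Vector, pet: Vector, clinical: Vector) -> Vector:
    """Early fusion: concatenate the raw inputs first, then apply one shared encoder."""
    return encode_mean_std(ct + pet + clinical)


def late_features(ct: Vector, pet: Vector, clinical: Vector) -> Vector:
    """Late fusion: encode each imaging modality separately, then concatenate
    the embeddings with the clinical variables."""
    return encode_mean_std(ct) + encode_mean_std(pet) + clinical


def linear_head(features: Vector, weights: Vector, bias: float = 0.0) -> float:
    """Toy prediction head: a weighted sum of the fused features."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return sum(f * w for f, w in zip(features, weights)) + bias


# Hypothetical toy inputs standing in for CT intensities, PET uptake, and one clinical variable.
ct = [1.0, 3.0]
pet = [2.0, 2.0]
clinical = [1.0]

early = early_features(ct, pet, clinical)
late = late_features(ct, pet, clinical)
print("early-fusion features:", early)
print("late-fusion features:", late)
print("late-fusion risk score:", linear_head(late, [0.2, 0.1, 0.3, 0.1, 0.5]))
```

The late‐fusion feature vector keeps a separate block per modality, which makes it straightforward to ablate one modality at a time, whereas the early‐fusion encoder sees all inputs jointly and can in principle learn cross‐modality interactions from the first layer.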
In summary, we developed a DLM integrating PET, CT, clinical, surgical, and pathological data to predict RFS in resected NSCLC patients. We demonstrated that our model was able to significantly stratify patients into high‐, medium‐, and low‐risk groups of RFS and to outperform conventional staging alone. The developed model was then externally validated and again provided improved risk stratification when compared to the TNM staging system. This model may help clinicians better select patients who would benefit from adjuvant therapies without causing undue harm.
CONFLICT OF INTEREST STATEMENT
The authors declare no conflicts of interest.
ACKNOWLEDGMENTS
The authors would like to acknowledge Drs. Aneesh Dhar and Pencilla Lang for their assistance and contributions to the conception and execution of this study, as well as their clinical expertise and guidance. We acknowledge funding support from the Gerald C. Baines Foundation, donor support through the London Health Sciences Foundation, the Cancer Research Society (Grant 152218), the Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grants program (RGPIN‐2020‐06498), the Lawson Health Research Institute's Internal Research Fund (IRF), and the National Cancer Institute (NCI) (T32CA009515).
Christie JR, Romine P, Eddy K, et al. Thorax‐encompassing multi‐modality PET/CT deep learning model for resected lung cancer prognostication: A retrospective, multicenter study. Med Phys. 2025;52:4390–4402. 10.1002/mp.17862
Originating Institution: Department of Medical Biophysics, Western University, 1151 Richmond Street, London, Ontario N6A 3K7, Canada.
REFERENCES
- 1. Uramoto H, Tanaka F. Recurrence after surgery in patients with NSCLC. Transl Lung Cancer Res. 2014;3(4):242‐249. doi: 10.3978/j.issn.2218-6751.2013.12.05
- 2. Mahvi DA, Liu R, Grinstaff MW, Colson YL, Raut CP. Local cancer recurrence: the realities, challenges, and opportunities for new therapies. CA Cancer J Clin. 2018;68(6):488‐505. doi: 10.3322/caac.21498
- 3. Soares M, Antunes L, Redondo P, et al. Treatment and outcomes for early non‐small‐cell lung cancer: a retrospective analysis of a Portuguese hospital database. Lung Cancer Manag. 2021;10(2):LMT46. doi: 10.2217/lmt-2020-0028
- 4. Kattan MW, Hess KR, Amin M, et al. American Joint Committee on Cancer acceptance criteria for inclusion of risk models for individualized prognosis in the practice of precision medicine. CA Cancer J Clin. 2016;66(5):370‐374. doi: 10.3322/caac.21339
- 5. Christie JR, Lang P, Zelko LM, Palma DA, Abdelrazek M, Mattonen SA. Artificial intelligence in lung cancer: bridging the gap between computational power and clinical decision‐making. Can Assoc Radiol J. 2021;72(1):86‐97. doi: 10.1177/0846537120941434
- 6. Akinci D'Antonoli T, Farchione A, Lenkowicz J, et al. CT radiomics signature of tumor and peritumoral lung parenchyma to predict nonsmall cell lung cancer postsurgical recurrence risk. Acad Radiol. 2020;27(4):497‐507. doi: 10.1016/j.acra.2019.05.019
- 7. Huang L, Fan M, Chen J, et al. Radiomics based method to predict overall survival of inoperable NSCLC patients. Radiother Oncol. 2018;127:S1093‐S1094. doi: 10.1016/S0167-8140(18)32316-8
- 8. Huang Y, Liu Z, He L, et al. Radiomics signature: a potential biomarker for the prediction of disease‐free survival in early‐stage (I or II) non‐small cell lung cancer. Radiology. 2016;281(3):947‐957. doi: 10.1148/radiol.2016152234
- 9. Mattonen SA, Davidzon GA, Bakr S, et al. [18F] FDG positron emission tomography (PET) tumor and penumbra imaging features predict recurrence in non–small cell lung cancer. Tomography. 2019;5(1):145‐153. doi: 10.18383/j.tom.2018.00026
- 10. Christie JR, Daher O, Abdelrazek M, et al. Predicting recurrence risks in lung cancer patients using multimodal radiomics and random survival forests. J Med Imaging. 2022;9(6):066001. doi: 10.1117/1.JMI.9.6.066001
- 11. Li Q, Kim J, Balagurunathan Y, et al. CT imaging features associated with recurrence in non‐small cell lung cancer patients after stereotactic body radiotherapy. Radiat Oncol. 2017;12(1):158. doi: 10.1186/s13014-017-0892-y
- 12. Mattonen SA, Davidzon GA, Benson J, et al. Bone marrow and tumor radiomics at 18F‐FDG PET/CT: impact on outcome prediction in non‐small cell lung cancer. Radiology. 2019;293(2):451‐459. doi: 10.1148/radiol.2019190357
- 13. Hosny A, Parmar C, Coroller TP, et al. Deep learning for lung cancer prognostication: a retrospective multi‐cohort radiomics study. PLoS Med. 2018;15(11):e1002711. doi: 10.1371/journal.pmed.1002711
- 14. Paul R, Hawkins SH, Balagurunathan Y, et al. Deep feature transfer learning in combination with traditional features predicts survival among patients with lung adenocarcinoma. Tomography. 2016;2(4):388‐395. doi: 10.18383/j.tom.2016.00211
- 15. Afshar P, Mohammadi A, Tyrrell PN, et al. DRTOP: deep learning‐based radiomics for the time‐to‐event outcome prediction in lung cancer. Sci Rep. 2020;10(1):12366. doi: 10.1038/s41598-020-69106-8
- 16. Xu Y, Hosny A, Zeleznik R, et al. Deep learning predicts lung cancer treatment response from serial medical imaging. Clin Cancer Res. 2019;25(11):3266‐3275. doi: 10.1158/1078-0432.CCR-18-2495
- 17. Oh S, Kang SR, Oh IJ, Kim MS. Deep learning model integrating positron emission tomography and clinical data for prognosis prediction in non‐small cell lung cancer patients. BMC Bioinformatics. 2023;24(1):39. doi: 10.1186/s12859-023-05160-z
- 18. Mukherjee P, Zhou M, Lee E, et al. A shallow convolutional neural network predicts prognosis of lung cancer patients in multi‐institutional CT‐image data. Nat Mach Intell. 2020;2(5):274‐282. doi: 10.1038/s42256-020-0173-6
- 19. Sasaki Y, Kondo Y, Aoki T, Koizumi N, Ozaki T, Seki H. Use of deep learning to predict postoperative recurrence of lung adenocarcinoma from preoperative CT. Int J CARS. 2022;17(9):1651‐1661. doi: 10.1007/s11548-022-02694-0
- 20. Bera K, Braman N, Gupta A, Velcheti V, Madabhushi A. Predicting cancer outcomes with radiomics and artificial intelligence in radiology. Nat Rev Clin Oncol. 2022;19(2):132‐146. doi: 10.1038/s41571-021-00560-7
- 21. Zhang Y, Sun Y, Xiang J, Zhang Y, Hu H, Chen H. A clinicopathologic prediction model for postoperative recurrence in stage Ia non–small cell lung cancer. J Thorac Cardiovasc Surg. 2014;148(4):1193‐1199. doi: 10.1016/j.jtcvs.2014.02.064
- 22. Merritt RE, Abdel‐Rasoul M, Fitzgerald M, D'Souza DM, Kneuertz PJ. Nomograms for predicting overall and recurrence‐free survival from pathologic stage ia and ib lung cancer after lobectomy. Clin Lung Cancer. 2021;22(4):e574‐e583. doi: 10.1016/j.cllc.2020.10.009
- 23. Chen H, Sui X, Yang F, Liu J, Wang J. Nomograms for predicting recurrence and survival of invasive pathological stage IA non‐small cell lung cancer treated by video assisted thoracoscopic surgery lobectomy. J Thorac Dis. 2017;9(4):1046‐1053. doi: 10.21037/jtd.2017.03.130
- 24. Mauguen A, Pignon JP, Burdett S, et al. Surrogate endpoints for overall survival in chemotherapy and radiotherapy trials in operable and locally advanced lung cancer: a re‐analysis of meta‐analyses of individual patients’ data. Lancet Oncol. 2013;14(7):619‐626. doi: 10.1016/S1470-2045(13)70158-X
- 25. Rajaram R, Huang Q, Li RZ, et al. Recurrence‐free survival in patients with surgically resected non‐small cell lung cancer: a systematic literature review and meta‐analysis. Chest. 2024;165(5):1260‐1270. doi: 10.1016/j.chest.2023.11.042
- 26. Amin MB, Gress DM, Vega LRM, Edge SB. In: Brookland RK, Washington MK, Compton CC, eds. AJCC Cancer Staging Manual, Eighth Edition. 8th ed. American College of Surgeons; 2018.
- 27. Edge S, Byrd DR, Compton CC, Fritz AG, Greene FL, Trotti A. AJCC Cancer Staging Manual, Seventh Edition. 7th ed. Springer; 2011.
- 28. Davidson‐Pilon C. Lifelines: survival analysis in Python. J Open Source Softw. 2019;4(40):1317. doi: 10.21105/joss.01317
- 29. Tougui I, Jilbab A, Mhamdi JE. Impact of the choice of cross‐validation techniques on the results of machine learning‐based diagnostic applications. Healthc Inform Res. 2021;27(3):189‐199. doi: 10.4258/hir.2021.27.3.189
- 30. Garinet S, Wang P, Mansuet‐Lupo A, Fournel L, Wislez M, Blons H. Updated prognostic factors in localized NSCLC. Cancers. 2022;14(6):1400. doi: 10.3390/cancers14061400
- 31. Schaefferkoetter J. Medical Image Reader and Viewer. Published online March 3, 2024. Accessed March 3, 2024. https://www.mathworks.com/matlabcentral/fileexchange/53745-medical-image-reader-and-viewer
- 32. Christie JR, Daher O, van Dongen H, Gilliland R, Abdelrazek M, Mattonen SA. A semi‐automatic threshold‐based segmentation algorithm for lung cancer delineation. Medical Imaging 2022: Biomedical Applications in Molecular, Structural, and Functional Imaging. SPIE; 2022:495‐500. doi: 10.1117/12.2611501
