Radiology Advances. 2026 Feb 3;3(1):umag003. doi: 10.1093/radadv/umag003

Deep learning-based pulmonary nodule risk assessment outperforms established malignancy risk scores in lung cancer screening

Eduardo J Mortani Barbosa Jr 1,, Yohan Kim 2, Yanbo Zhang 3, Arnaud A A Setio 4, Francois Mellot 5, Philippe A Grenier 6, Mathis Zimmermann 7, Bogdan Georgescu 8, Sasa Grbic 9, Warren B Gefter 10
PMCID: PMC12944827  PMID: 41768120

Abstract

Background

Pulmonary nodules are commonly encountered in lung cancer screening. The risk of malignancy varies widely and is generally estimated using expert consensus guidelines (Lung CT Imaging Reporting and Data Systems [Lung-RADS]).

Purpose

To assess the performance of a deep learning algorithm (Deep Pulmonary Nodule Profiler [DeepPNP]) for pulmonary nodule malignancy risk estimation in a lung cancer screening dataset and the effect of data enrichment in model training.

Materials and Methods

A retrospective analysis was conducted using 3 datasets. DeepPNP is a 3D convolutional network (EfficientNet-B0-based) operating on nodule-centered 3D patches. For model training and validation, the National Lung Screening Trial (NLST) dataset was combined with 2 independent malignant-nodule-only datasets, yielding a merged dataset of 28 057 nodules, including 2362 malignant nodules. An ablation model (DeepPNP-NLST) was trained on NLST data only. Testing was conducted on a held-out subset of the NLST dataset. Performance metrics, including sensitivity, specificity, precision, F1 score, and accuracy, were analyzed at 3 operating thresholds corresponding to specificities of 0.80, 0.85, and 0.90 (selected on the validation set). Benchmarks included Lung-RADS v2022 and the PanCan model.

Results

On the NLST test set (2597 nodules from 1243 CT scans), DeepPNP achieved an area under the receiver operating characteristic curve (ROC AUC) of 0.96 (95% confidence interval [CI], 0.95-0.97), outperforming Lung-RADS (AUC = 0.91; 95% CI, 0.89-0.94; P < .001) and the PanCan model (AUC = 0.93; 95% CI, 0.91-0.95; P < .001). DeepPNP-NLST had an AUC of 0.95 (95% CI, 0.93-0.97; P = .045 vs DeepPNP), indicating a modest gain from malignant-only data enrichment. Subgroup analyses showed consistent outperformance across nodule sizes and types. Operating-point metrics at specificity targets of 0.80/0.85/0.90 are reported; at the 0.80 target, DeepPNP achieved sensitivity of 0.94 (100/107; 95% CI, 0.88-0.98) and specificity of 0.88 (2196/2490; 95% CI, 0.87-0.90).

Conclusion

DeepPNP outperformed established malignancy risk models in lung cancer screening. The inclusion of biopsy-confirmed malignant nodules from 2 external datasets provided a measurable performance gain, underscoring the importance of data enrichment during model training.

Keywords: CT, pulmonary nodule, malignancy classification, deep learning, artificial intelligence, lung cancer screening


Summary

A deep learning model trained on a malignant nodule–enriched dataset outperformed established malignancy risk models when evaluated on a lung cancer screening dataset.


Key Results

  • The Deep Pulmonary Nodule Profiler (DeepPNP) for nodule malignancy risk assessment achieved an area under the receiver operating characteristic curve (AUC) of 0.96 (95% CI, 0.95-0.97) on the National Lung Screening Trial internal test set.

  • DeepPNP outperformed the Pan-Canadian Early Detection of Lung Cancer model and the Lung CT Imaging Reporting and Data Systems (v2022) (both P < .001).

  • Adding biopsy-confirmed malignant nodules from 2 external datasets modestly improved NLST test performance in ablations.

Introduction

Lung cancer is the leading cause of cancer-related mortality worldwide,1 and early detection is essential for improving survival rates. Low-dose computed tomography (LDCT) screening has been shown to reduce lung cancer mortality by 20% compared to chest radiography.2,3 However, LDCT screening presents challenges, including high false-positive rates and overdiagnosis.4–6 Traditional radiological assessment can be subjective and variable, affecting diagnostic accuracy.7–10 Furthermore, the ever-increasing volume of imaging studies necessitates tools that enhance efficiency, accuracy, and consistency.11

Artificial intelligence (AI), particularly convolutional neural networks (CNNs), has achieved state-of-the-art performance in image recognition tasks, including pulmonary nodule classification.12–19 Studies have applied AI models in lung cancer screening populations, reporting high sensitivity and specificity in distinguishing benign from malignant nodules.16–19 One study developed a 3D deep learning (DL) algorithm achieving performance comparable to radiologists in predicting lung cancer risk.18

The objective of this study was 2-fold: (1) to develop a DL model for malignancy risk estimation in screening-detected pulmonary nodules and to compare its performance with established malignancy risk scores, and (2) to assess the effect of malignant nodule enrichment during model training. The latter involved augmenting the lung cancer screening dataset with additional biopsy-confirmed malignant nodules.

Materials and methods

Study design and ethical considerations

Institutional Review Board (IRB) approvals were obtained from all participating institutions. Given the retrospective nature of the study, the requirement for individual patient consent was waived. The study was conducted in compliance with the Health Insurance Portability and Accountability Act (HIPAA) for U.S.-based data handling. This retrospective study trained, optimized, and validated a DL-based algorithm (DeepPNP) for pulmonary nodule malignancy risk estimation.

Datasets for model development and testing

  1. National Lung Screening Trial (NLST) dataset2: The NLST dataset comprised low-dose CT scans from 26 722 participants collected between 2002 and 2007 in a multicenter randomized trial. Eligibility included adults aged 55-74 years with at least 30 pack-years of smoking and no prior lung cancer. The screening protocol consisted of a baseline low-dose CT and 2 annual screens (T0-T2); in this study, we included scans from all screening rounds. Malignant nodules were biopsy-confirmed within 1 year, and benign nodules were from participants who remained free of lung cancer during follow-up. All nodules ≥3 mm were included, yielding 26 488 nodules (793 malignant). Multiple nodules could originate from the same participant; nodules were analyzed as independent observations, with no adjustment for within-participant clustering.

  2. Dataset A: This dataset comprised 1132 retrospectively and consecutively collected chest CT scans from patients with biopsy-confirmed lung cancer, acquired at a tertiary referral center in Europe between 2010 and 2020. Scans included both low-dose and standard-dose protocols and were performed without intravenous contrast. Radiologists identified and labeled malignant nodules ≥3 mm. This dataset contributed 998 biopsy-confirmed malignant nodules to the training set.

  3. Dataset B20: This dataset comprised CT scans from 608 patients with biopsy-confirmed lung cancer, sourced via a data broker across multiple U.S. centers during 2014-2022. The data broker aggregates de-identified imaging from a multi-site provider network, and the exact number of contributing centers for this cohort was not available. Cases were identified from institutional pathology/biopsy reports. CT studies included both low-dose and standard-dose scans. Radiologists identified and labeled malignant nodules ≥3 mm. This dataset contributed 571 malignant nodules to the training set.

These 3 datasets are independent. NLST participants were enrolled as part of a prospective screening trial, while the other 2 datasets consisted of biopsy-confirmed malignant nodules retrospectively collected for model enrichment. The participant characteristics and CT acquisition parameters of these 3 datasets are provided in the Supplementary Material. Table 1 summarizes the characteristics of pulmonary nodules from the 3 datasets. For each nodule, size was measured as the longest axial diameter in millimeters on thin-section CT images. Nodule type (solid, part-solid, non-solid, or calcified) and the presence of spiculation were recorded based on radiologist annotations during the truthing process. Information on lung cancer histology and stage was not consistently available across datasets and therefore was not included in the analysis. Figure 1 summarizes the inclusion and exclusion criteria for the 3 datasets and the training, validation, and testing data split.

Table 1.

Characteristics of pulmonary nodules from the 3 datasets.

Characteristic | NLST | Dataset A | Dataset B
Number of participants | 15 000 | 1064 | 608
Number of scans with nodules | 7431 | 998 | 571
Number of nodules | 26 488 | 998 | 571
Nodule size—mm
 3-10 | 23 409 (88.4%) | 44 (4.4%) | 72 (12.6%)
 10-30 | 2902 (11.0%) | 576 (57.7%) | 340 (59.5%)
 30-200 | 177 (0.6%) | 378 (37.9%) | 159 (27.9%)
Nodule type—number
 Solid | 19 221 (72.5%) | – | 509 (89.1%)
 Part-solid | 610 (2.3%) | – | 28 (4.9%)
 Non-solid | 1867 (7.1%) | – | 23 (4.0%)
 Calcified | 4790 (18.1%) | – | 11 (2.0%)
Spiculated—number | 244 (0.9%) | – | 223 (39.1%)
Cancer—number | 793 (3.0%) | 998 (100%) | 571 (100%)

A dash (“–”) in this table indicates that specific data were not available for that dataset.

Abbreviation: NLST = National Lung Screening Trial.

Figure 1.


Data inclusion and exclusion criteria for pulmonary nodules in the National Lung Screening Trial (NLST) dataset and 2 additional malignant datasets for training.

Data preparation and splitting

Participants from the NLST dataset were randomly divided into training (75%), validation (10%), and test (15%) sets, ensuring a representative distribution of the study sample across subsets. To alleviate the class imbalance between benign and malignant nodules in the NLST dataset, additional malignant nodules from Datasets A and B were incorporated into the training set. These datasets, containing only malignant nodules, strengthened the model’s capacity to identify and learn features associated with malignancy. The NLST validation and test sets retained the original population distribution, making them representative of a screening scenario. Table 2 presents the characteristics of the training, validation, and test sets in terms of nodule size, type, spiculation, and malignancy.
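The participant-level split described above can be made concrete with a minimal sketch in plain Python (the function name, seed, and integer participant IDs are illustrative, not from the original pipeline); splitting by participant rather than by nodule keeps all nodules from one participant in the same subset:

```python
import random

def participant_level_split(participant_ids, fractions=(0.75, 0.10, 0.15), seed=42):
    """Randomly assign participants (not nodules) to train/validation/test,
    following the 75%/10%/15% proportions described above (sketch)."""
    ids = sorted(set(participant_ids))
    random.Random(seed).shuffle(ids)  # deterministic shuffle for reproducibility
    n_train = int(fractions[0] * len(ids))
    n_val = int(fractions[1] * len(ids))
    train = set(ids[:n_train])
    val = set(ids[n_train:n_train + n_val])
    test = set(ids[n_train + n_val:])
    return train, val, test

train, val, test = participant_level_split(range(1000))
print(len(train), len(val), len(test))  # → 750 100 150
```

Nodules would then be routed to subsets by looking up their participant ID, which prevents within-participant leakage between training and evaluation.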

Table 2.

Characteristics of pulmonary nodules data split for the DL model development and testing.

The training set combines NLST, Dataset A, and Dataset B; the validation and test sets are NLST only.

Characteristic | Train benign (n = 20 789) | Train malignant (n = 2203) | Validation benign (n = 2416) | Validation malignant (n = 52) | Test benign (n = 2490) | Test malignant (n = 107)
Nodule size—mm
 Mean ± SD | 6.1 ± 4.8 | 25.0 ± 17.9 | 6.2 ± 5.3 | 15.6 ± 10.3 | 5.1 ± 2.8 | 17.1 ± 11.7
 Median | 5.0 | 19.9 | 5.0 | 12.2 | 4.3 | 14.7
 Range | 3.0-195.0 | 3.1-127.9 | 3.0-123.8 | 4.0-51.6 | 3.0-35.3 | 3.8-73.9
Nodule type—number
 Solid | 15 240 (73.3%) | 1070 (48.6%) | 1768 (73.1%) | 43 (82.7%) | 1512 (60.7%) | 97 (90.7%)
 Part-solid | 486 (2.3%) | 61 (2.8%) | 58 (2.4%) | 3 (5.8%) | 24 (1.0%) | 6 (5.6%)
 Non-solid | 1522 (7.3%) | 63 (2.9%) | 169 (7.0%) | 6 (11.5%) | 126 (5.1%) | 4 (3.7%)
 Calcified | 3541 (17.0%) | 11 (0.5%) | 421 (17.4%) | 0 (0.0%) | 828 (33.3%) | 0 (0.0%)
 Not specified | 0 (0.0%) | 998 (45.3%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%) | 0 (0.0%)
Spiculated—number | 89 (0.4%) | 334 (15.2%) | 9 (0.4%) | 12 (23.1%) | 3 (0.1%) | 20 (18.7%)

Abbreviation: NLST = National Lung Screening Trial.

To evaluate the effectiveness of the data enrichment strategy, we trained an additional model, DeepPNP-NLST, using only the screening data (NLST) without incorporating the additional biopsy-confirmed malignant cases.

Nodule patch pre-processing

Across the 3 datasets, 14 radiologists reviewed the CT scans and labeled pulmonary nodules using an internally developed tool; each radiologist annotated a subset of cases. Each nodule was assigned a centroid, and size was measured as the longest axial diameter. A secondary review was performed by a thoracic radiologist with 8 years of experience in thoracic imaging. For the NLST dataset, all nodules ≥3 mm were annotated. In the 2 malignant datasets, only biopsy-proven malignant nodules were annotated, based on biopsy reports or images used for procedure guidance. For pulmonary nodules smaller than 50 mm in diameter, we cropped a 50 × 50 × 50 mm³ 3D region of interest (ROI) patch centered on the nodule and resampled it to a 64 × 64 × 64 voxel patch; this patch size was chosen to cover most pulmonary nodules and provide sufficient context around the nodule. For nodules larger than 50 mm in diameter, a larger 3D ROI patch, defined as a bounding box tightly fitting the nodule, was extracted to ensure the entire nodule was included, and likewise resampled to a 64 × 64 × 64 voxel patch. Patch intensities were normalized using a window level of −300 Hounsfield units (HU) and a window width of 1400 HU to the range 0 to 1, with values outside this range clipped. The resulting preprocessed 3D patches served as model inputs during both training and inference.
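The windowing step can be sketched as follows (pure Python with a hypothetical helper name; the actual pipeline operates on 64 × 64 × 64 voxel arrays rather than flat lists). With a window level of −300 HU and width of 1400 HU, the mapped window runs from −1000 HU to +400 HU:

```python
def normalize_hu(hu_values, level=-300.0, width=1400.0):
    """Map HU values inside the window [level - width/2, level + width/2]
    linearly to [0, 1], clipping values outside the window (sketch)."""
    lo = level - width / 2.0  # window floor: -1000 HU
    out = []
    for hu in hu_values:
        v = (hu - lo) / width
        out.append(min(max(v, 0.0), 1.0))  # clip to [0, 1]
    return out

# Air (-1000 HU) maps to 0.0, the window center (-300 HU) to 0.5,
# and the window ceiling (+400 HU) to 1.0; out-of-window values are clipped.
print(normalize_hu([-1200.0, -1000.0, -300.0, 400.0, 600.0]))
# → [0.0, 0.0, 0.5, 1.0, 1.0]
```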

Study outcome

The primary outcome of this study was the classification of pulmonary nodules detected through lung cancer screening as either benign or malignant. Malignant nodules were defined as those that were biopsy-confirmed within 1 year of detection in the NLST dataset, or nodules identified as malignant in the 2 external datasets (Dataset A and Dataset B), which consisted entirely of biopsy-proven lung cancers. Benign nodules were defined as nodules detected in the NLST dataset for which patients remained free of lung cancer throughout the 7-year follow-up period.

DL model

Model architecture and training

We implemented a 3D convolutional neural network (DeepPNP) based on the EfficientNet-B0 architecture, adapting it to process volumetric CT imaging data.21 The model architecture was designed to capture complex spatial features associated with pulmonary nodules by replacing the standard 2D convolutional kernels (3 × 3 and 5 × 5) with 3D kernels (3 × 3 × 3 and 5 × 5 × 5). We optimized the model’s hyperparameters using a randomized search, testing 32 configurations and selecting the hyperparameters that performed best on the validation dataset. The final configuration included a batch size of 16, a learning rate of 2.092e-05, and a dropout rate of 0.223, which helped balance generalization and overfitting risk. To manage class imbalance, we used focal loss with the gamma parameter set to 2 and the alpha parameter set to 0.25.22 The model was optimized using the Adam optimizer,23 and training was conducted over 50 epochs, each consisting of 500 iteration steps.

To enhance the robustness and generalizability of the model, extensive 3D data augmentation was employed during training, incorporating both spatial and intensity-based transformations. Spatial augmentations included random shifts along each axis (−1.0 mm to 1.0 mm), rotations (−30° to 30°), zooming (scaling factors of 0.95 to 1.05), and flipping along axial, coronal, or sagittal planes. Intensity augmentations consisted of adding Gaussian noise (10% of the window width), Gaussian blurring (standard deviation of 0.5 to 1.5), brightness scaling (0.7 to 1.3), and contrast adjustments (0.65 to 1.5). Each augmentation was applied probabilistically: flipping with a 50% probability, Gaussian blurring with a 10% probability, and all other augmentations with a 15% probability. These augmentations were composed sequentially to diversify the training data and improve model resilience to variations in nodule size, shape, intensity, and position, while maintaining clinical relevance.
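Per nodule, the focal loss used above reduces to the Lin et al. formulation FL(p_t) = −α_t (1 − p_t)^γ log(p_t). A minimal pure-Python sketch with the stated γ = 2 and α = 0.25 (the actual model applies this to batched logits in PyTorch):

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss for one predicted malignancy probability p in (0, 1)
    with label y in {0, 1}: FL(p_t) = -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    if y == 1:
        p_t, alpha_t = p, alpha
    else:
        p_t, alpha_t = 1.0 - p, 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

# The (1 - p_t)**gamma term down-weights easy, well-classified examples,
# so rare hard positives dominate the gradient despite class imbalance.
easy = binary_focal_loss(0.95, 1)  # confidently correct malignant prediction
hard = binary_focal_loss(0.30, 1)  # misclassified malignant nodule
print(easy < hard)  # → True
```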
The system was implemented using the PyTorch framework (version 1.10.1) and Python (version 3.9.2). Model development and testing were conducted on a high-performance supercomputing cluster comprising 200 compute nodes, each equipped with 8 GPUs, including Titan X, V100, A100, and H100 models. The training code will not be made publicly available at this time, as DeepPNP is currently a prototype.

Heatmap

We generated Gradient-weighted Class Activation Mapping (Grad-CAM) saliency maps for DeepPNP predictions.24 For each 3D nodule patch, Grad-CAM was computed and min–max normalized. Visualization used the axial, coronal, and sagittal central slices of the patch, shown both as raw images and as overlays with the corresponding heatmap to highlight image regions most influencing the prediction. Heatmaps were used solely for qualitative interpretation and were not used for training, model selection, or thresholding.

Statistical analysis

We compared DeepPNP with DeepPNP-NLST, the Pan-Canadian Early Detection of Lung Cancer (PanCan) model, and Lung CT Imaging Reporting and Data Systems (Lung-RADS).25,26 PanCan malignancy probabilities were calculated from the 9 model variables available in the NLST dataset: age, sex, family history of lung cancer, emphysema, and nodule size, type, location, count, and spiculation. Lung-RADS categories were assigned according to the v2022 criteria based on nodule size and attenuation type.

The performance of the models was evaluated using the area under the receiver operating characteristic curve (ROC AUC), and the DeLong test was used to calculate P values for pairwise ROC AUC comparisons.27 P values were 2-sided, unadjusted for multiple comparisons, with a nominal significance level of 0.05. Subgroup analyses by nodule size range and type were performed, and ROC AUCs are reported for each subgroup. The 3 specificity thresholds (0.80, 0.85, 0.90) were selected from operating points on the DeepPNP ROC curve on the NLST validation set to represent clinically relevant trade-offs. Model performance was further analyzed at these 3 operating thresholds using sensitivity, specificity, precision, F1 score, and accuracy, with 95% confidence intervals (CIs) calculated via bootstrapping. Metrics were calculated using the Scikit-learn (version 1.5.0) Python library.28–30 All analyses were performed on the held-out NLST internal test set.
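The operating-point metrics follow directly from the test-set confusion counts, and ROC AUC has an equivalent rank-statistic form. The sketch below uses plain Python rather than Scikit-learn; the confusion counts shown are the reported DeepPNP results at the 0.80 specificity target (100/107 malignant and 2196/2490 benign nodules correct), while the score lists in the AUC example are purely illustrative:

```python
def operating_point_metrics(tp, fp, tn, fn):
    """Threshold metrics from confusion counts (definitional sketch)."""
    sens = tp / (tp + fn)                 # recall / true-positive rate
    spec = tn / (tn + fp)                 # true-negative rate
    prec = tp / (tp + fp)                 # positive predictive value
    f1 = 2 * prec * sens / (prec + sens)  # harmonic mean of precision/recall
    acc = (tp + tn) / (tp + fp + tn + fn)
    return sens, spec, prec, f1, acc

def roc_auc(benign_scores, malignant_scores):
    """ROC AUC as the probability that a randomly chosen malignant nodule
    scores higher than a randomly chosen benign one (ties count one-half)."""
    wins = sum(
        1.0 if m > b else 0.5 if m == b else 0.0
        for m in malignant_scores
        for b in benign_scores
    )
    return wins / (len(benign_scores) * len(malignant_scores))

sens, spec, prec, f1, acc = operating_point_metrics(tp=100, fp=294, tn=2196, fn=7)
print(round(f1, 2), round(acc, 2))  # → 0.4 0.88
```

Bootstrapped CIs, as used in the paper, would resample nodules with replacement and recompute these metrics on each resample.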

Results

Figure 2 illustrates Grad-CAM heatmaps for 4 representative lung nodule patches from the NLST test set. The first 2 rows depict malignant nodules; the last 2 depict benign nodules. In the spiculated malignant example (row 1), activations are strong and well localized to the nodule, consistent with visually apparent malignant features. In the part-solid malignant example (row 2), activations are present but more diffuse. In the benign examples (rows 3-4), activations are weak and non-focal.

Figure 2.


Grad-CAM heatmap visualizations of DeepPNP malignancy predictions. Four representative NLST test-set nodules are shown: solid (row 1), part-solid (row 2), nonsolid (row 3), and calcified (row 4). For each case, axial, coronal, and sagittal central slices are displayed as raw patches in lung window (window level = −600 HU, window width = 1500 HU) and as overlays with the corresponding heatmap. The first 2 nodules are malignant; the last 2 are benign. Heatmaps highlight image regions that most influenced the prediction, with stronger, well-localized activation in the spiculated malignant example and weak, non-focal activation in benign examples.

The DeepPNP model demonstrated superior performance in distinguishing malignant from benign pulmonary nodules on the NLST test set. DeepPNP achieved an AUC of 0.96 (95% CI, 0.95-0.97), outperforming Lung-RADS v2022 (AUC = 0.91 [95% CI, 0.89-0.94]) and the Pan-Canadian Early Detection of Lung Cancer (PanCan) model (AUC = 0.93 [95% CI, 0.91-0.95]). The DeLong test demonstrated that DeepPNP achieved statistically significantly better ROC AUC than both Lung-RADS (P < .001) and PanCan (P < .001). DeepPNP also outperformed DeepPNP-NLST (AUC = 0.95 [95% CI, 0.93-0.97]; P = .045), indicating that incorporating additional biopsy-confirmed malignant nodules modestly improved overall discrimination. Figure 3 illustrates the ROC curves of the models on the NLST test set.

Figure 3.


ROC curves of the DeepPNP model, DeepPNP-NLST model, PanCan model, and the Lung-RADS for discrimination of malignant nodules from benign nodules in the NLST test set. (A) The results on all the nodules and the 3 operating points selected for DeepPNP are shown. (B) and (C) Comparison of the performance on the nodules within the size ranges 3-10 mm and 10-200 mm, respectively. (D) and (E) Comparison of the performance on the solid nodules and sub-solid nodules, respectively. The AUC and corresponding 95% confidence interval are reported for each method.

In subgroup analyses, DeepPNP’s performance was higher for small nodules than for larger ones—AUC 0.93 (95% CI, 0.90-0.96) for 3-10 mm (Figure 3B) vs 0.85 (95% CI, 0.79-0.89) for 10-200 mm (Figure 3C). In both size ranges, DeepPNP significantly outperformed PanCan and Lung-RADS (both P < .001). By nodule type, DeepPNP performed slightly better on sub-solid than solid nodules—AUC 0.98 (95% CI, 0.94-1.00; Figure 3E) vs 0.96 (95% CI, 0.94-0.97; Figure 3D). On solid nodules, DeepPNP outperformed PanCan and Lung-RADS (both P < .001) and DeepPNP-NLST (P = .016). On sub-solid nodules, DeepPNP outperformed Lung-RADS (P = .006) and was comparable with PanCan (P = .74) and DeepPNP-NLST (P = .46).

We evaluated 3 DeepPNP configurations, each selected at a specific operating point on the validation dataset: specificity of 0.80 (DeepPNP-S80), 0.85 (DeepPNP-S85), and 0.90 (DeepPNP-S90), as illustrated in Figure 3A. Table 3 summarizes the sensitivity, specificity, precision, F1 score, and accuracy of these DeepPNP configurations and, for comparison, reports the corresponding operating points for all competing methods tuned to the same specificity targets. Because Lung-RADS thresholds are discrete, the S80 and S85 targets coincide at Lung-RADS 3, yielding identical Lung-RADS results at these 2 operating points. At operating point S80, DeepPNP performed significantly better than DeepPNP-NLST, PanCan, and Lung-RADS on specificity, precision, F1 score, and accuracy (all P < .001); sensitivity was also higher than PanCan (P = .034). At S85, DeepPNP showed significantly higher specificity than DeepPNP-NLST (P = .002) and PanCan (P = .006), and higher accuracy than DeepPNP-NLST (P = .008) and PanCan (P = .002). At S90, DeepPNP achieved higher sensitivity, specificity, and accuracy than Lung-RADS (all P < .001). There were no statistically significant differences between DeepPNP-S90 and DeepPNP-NLST-S90 on any metric: sensitivity (P = .92), specificity (P = .30), precision (P = .44), F1 score (P = .56), and accuracy (P = .40).

Table 3.

Malignant nodule classification results on the NLST test set at 3 operating points targeting specificities of 0.80, 0.85, and 0.90 (S80, S85, S90).

Model (operating point) | Sensitivity | Specificity | Precision | F1 score | Accuracy
Lung-RADS (S80) | 0.89 (95/107) [0.83-0.94] | 0.83 (2062/2490) [0.81-0.85] | 0.18 (95/523) [0.14-0.22] | 0.30 [0.24-0.35] | 0.83 (2157/2597) [0.81-0.85]
Lung-RADS (S85) | 0.89 (95/107) [0.83-0.94] | 0.83 (2062/2490) [0.81-0.85] | 0.18 (95/523) [0.14-0.22] | 0.30 [0.24-0.35] | 0.83 (2157/2597) [0.81-0.85]
Lung-RADS (S90) | 0.75 (80/107) [0.67-0.83] | 0.93 (2316/2490) [0.92-0.94] | 0.31 (80/254) [0.25-0.38] | 0.44 [0.37-0.50] | 0.92 (2396/2597) [0.91-0.93]
PanCan (S80) | 0.88 (94/107) [0.81-0.94] | 0.84 (2089/2490) [0.82-0.86] | 0.19 (94/495) [0.15-0.23] | 0.31 [0.25-0.36] | 0.84 (2183/2597) [0.82-0.86]
PanCan (S85) | 0.82 (88/107) [0.74-0.89] | 0.89 (2209/2490) [0.87-0.90] | 0.24 (88/369) [0.19-0.29] | 0.37 [0.31-0.42] | 0.88 (2297/2597) [0.87-0.90]
PanCan (S90) | 0.78 (83/107) [0.70-0.85] | 0.93 (2323/2490) [0.92-0.94] | 0.33 (83/250) [0.27-0.40] | 0.46 [0.39-0.53] | 0.93 (2406/2597) [0.91-0.94]
DeepPNP-NLST (S80) | 0.91 (97/107) [0.85-0.96] | 0.86 (2144/2490) [0.85-0.88] | 0.22 (97/443) [0.18-0.26] | 0.35 [0.30-0.40] | 0.86 (2241/2597) [0.85-0.88]
DeepPNP-NLST (S85) | 0.87 (93/107) [0.80-0.93] | 0.90 (2244/2490) [0.89-0.91] | 0.27 (93/339) [0.23-0.32] | 0.42 [0.36-0.47] | 0.90 (2337/2597) [0.89-0.91]
DeepPNP-NLST (S90) | 0.81 (87/107) [0.74-0.89] | 0.94 (2353/2490) [0.94-0.95] | 0.39 (87/224) [0.33-0.45] | 0.53 [0.46-0.59] | 0.94 (2440/2597) [0.93-0.95]
DeepPNP (S80) | 0.94 (100/107) [0.88-0.98] | 0.88 (2196/2490) [0.87-0.90] | 0.25 (100/394) [0.21-0.30] | 0.40 [0.35-0.45] | 0.88 (2296/2597) [0.87-0.90]
DeepPNP (S85) | 0.86 (92/107) [0.79-0.92] | 0.92 (2283/2490) [0.91-0.93] | 0.31 (92/299) [0.26-0.36] | 0.45 [0.39-0.51] | 0.91 (2375/2597) [0.90-0.93]
DeepPNP (S90) | 0.82 (88/107) [0.74-0.89] | 0.94 (2342/2490) [0.93-0.95] | 0.37 (88/236) [0.32-0.44] | 0.51 [0.45-0.58] | 0.94 (2430/2597) [0.93-0.95]
The operating points are selected based on the NLST validation set. Values in parentheses are nodule counts; values in brackets are 95% confidence intervals.

Abbreviations: DeepPNP = Deep Pulmonary Nodule Profiler; Lung-RADS = Lung CT Imaging Reporting and Data Systems; NLST = National Lung Screening Trial.

Figure 4 presents 2 examples for each decision category (true positives, true negatives, false positives, false negatives) for DeepPNP-S85. True positives were correctly identified by DeepPNP-S85 and the competing methods. True negatives were correctly classified by DeepPNP-S85; the PanCan model misclassified the first case as malignant. False positives illustrate common overcall patterns: borderline PanCan risk and small nodules in smokers whose morphology can mimic malignancy. False negatives were small and subtle, with low PanCan scores. Notably, the first false negative would be correctly classified as malignant at the more sensitive DeepPNP-S80 operating point.

Figure 4.


The example cases of the DeepPNP-S85 model’s true positives, true negatives, false positives, and false negatives from the test set. Each category includes 2 cases. (A) True positives: 63F, Lung-RADS 4A, 13.0 mm nodule, DeepPNP’s output score is 0.36; 66M with family history, Lung-RADS 4B, 48.5 mm spiculated nodule, output score is 0.64. (B) True negatives: 69F, 7.0 mm nodule, Lung-RADS 3, output score is 0.15; 56M smoker, non-solid 5.1 mm nodule, Lung-RADS 2, output score is 0.14. (C) False positives: 63M with family history and smoking, Lung-RADS 4A, 14.0 mm nodule, output score is 0.41; 68F smoker, Lung-RADS 3, 7.9 mm nodule, output score is 0.35. (D) False negatives: 67F, Lung-RADS 4A, 9.6 mm nodule, output score is 0.16; 63M with family history and smoking, Lung-RADS 2, 5.4 mm nodule, output score is 0.15. The arrows indicate the location of nodules. All images are axial views displayed with lung window (window level = −600 HU, window width = 1500 HU).

Discussion

We demonstrated that the DeepPNP model outperformed traditional risk models, such as the PanCan logistic regression model and an expert consensus guideline (Lung-RADS v2022), in classifying benign versus malignant pulmonary nodules within a lung cancer screening population. The DeepPNP model achieved the highest ROC AUC on the NLST test set, indicating excellent discriminative ability. Data enrichment by incorporating biopsy-confirmed malignant nodules from 2 additional external datasets during training resulted in measurably improved model performance.

The superior performance of the DeepPNP model on the NLST test set suggests that DL algorithms can effectively capture the relevant imaging features associated with malignancy in a screening population. Moreover, the fact that the DL models can identify high-risk nodules solely based on imaging features suggests that the factors driving AI performance in detecting high-risk nodules are likely correlated with the imaging characteristics that expert physicians use to determine the need for biopsy.

In clinical practice, the DeepPNP algorithm could serve as a decision-support tool to assist radiologists during CT interpretation. By providing automated and quantitative malignancy risk estimates for detected nodules, the model may help prioritize high-risk findings for biopsy or closer follow-up, enhancing consistency and efficiency in lung cancer screening workflows.

Our findings align with prior work showcasing the promise of DL for lung cancer risk assessment of pulmonary nodules on chest CT imaging.10,18,19,31 Ardila et al18 reported high accuracy in lung cancer risk prediction using a DL system applied to low-dose CT screening, achieving results comparable to expert radiologists. Similar conclusions were reached by Chung et al19 and Setio et al,17 who showed that AI models can match human observers in malignancy risk estimation and nodule characterization. Hendrix et al32 demonstrated that a DL model could accurately distinguish benign from malignant nodules in non-screening chest CTs. Collectively, these studies highlight the potential of DL-based risk estimation. In this context, our results confirm the excellent discriminative ability of DeepPNP in the intended screening setting.

Several limitations should be acknowledged. First, the current model assumes prior nodule localization and therefore cannot function as a fully automated screening pipeline. Second, testing was limited to the NLST set (ie, internal test set), and further validation on external datasets will be important to confirm generalizability. Finally, the study focused on nodule-level analysis without incorporating clinical or demographic variables beyond CT imaging features, which could further improve performance and robustness.

Aside from validation on external datasets, future research should focus on integrating clinical and sociodemographic variables into AI models to further enhance their performance and robustness. Population-specific model adaptation that incorporates comprehensive patient data beyond imaging features will be essential to ensure that AI tools are optimized for the characteristics of the target population and exhibit improved generalizability. Moreover, future investigations should evaluate outcomes by histologic subtype, disease stage, and survival to provide deeper insights into the clinical utility of malignancy risk estimation.

Our study demonstrates that a DL model (DeepPNP) can be effectively developed for malignancy risk estimation in screening-detected pulmonary nodules, achieving superior performance compared to traditional risk assessment tools. By incorporating biopsy-confirmed malignant nodules from 2 external datasets into training, we further assessed the benefit of data enrichment, which improved model performance. These results demonstrate that DL models can improve the accuracy of pulmonary nodule malignancy risk estimation compared with established guideline-based approaches such as Lung-RADS, indicating their potential to enhance lung cancer screening workflows.

Supplementary Material

umag003_Supplementary_Data

Acknowledgments

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request and subject to relevant data-sharing agreements. We welcome collaborations and are willing to validate external models using our data. However, the code used for training/testing the models relies heavily on internal packages and infrastructure, making its public release infeasible.

The concepts and information presented in this paper are based on research results that are not commercially available. Future commercial availability cannot be guaranteed.

Glossary

Abbreviations

AI: artificial intelligence
DeepPNP: Deep Pulmonary Nodule Profiler
DL: deep learning
Grad-CAM: Gradient-weighted Class Activation Mapping
LDCT: low-dose computed tomography
Lung-RADS: Lung CT Imaging Reporting and Data Systems
NLST: National Lung Screening Trial
ROC AUC: area under the receiver operating characteristic curve

Contributor Information

Eduardo J Mortani Barbosa, Jr, Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States.

Yohan Kim, Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States.

Yanbo Zhang, Department of Digital Technology and Innovation, Siemens Healthineers, Princeton, NJ 08540, United States.

Arnaud A A Setio, Department of Digital Technology and Innovation, Siemens Healthineers, Erlangen, Erlangen 91052, Germany.

Francois Mellot, Department of Radiology, Foch Hospital, Suresnes 92150, France.

Philippe A Grenier, Department of Radiology, Foch Hospital, Suresnes 92150, France.

Mathis Zimmermann, Department of D&A, Siemens Healthineers, Malvern, PA 19355, United States.

Bogdan Georgescu, Department of Digital Technology and Innovation, Siemens Healthineers, Princeton, NJ 08540, United States.

Sasa Grbic, Department of Digital Technology and Innovation, Siemens Healthineers, Princeton, NJ 08540, United States.

Warren B Gefter, Department of Radiology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, United States.

Author contributions

Eduardo J. Mortani Barbosa (Conceptualization, Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Supervision, Validation, Writing—original draft, Writing—review & editing), Yohan Kim (Writing—original draft), Yanbo Zhang (Formal analysis, Investigation, Methodology, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), Arnaud A.A. Setio (Formal analysis, Writing—original draft), François Mellot (Investigation, Resources), Philippe A. Grenier (Investigation, Resources), Mathis Zimmermann (Funding acquisition, Investigation, Resources), Bogdan Georgescu (Formal analysis, Methodology, Resources, Software, Validation, Visualization), Sasa Grbic (Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing—original draft, Writing—review & editing), and Warren B. Gefter (Investigation, Resources, Writing—original draft, Writing—review & editing).

Supplementary material

Supplementary material is available at Radiology Advances online.

Funding

This study was supported by a research grant from Siemens Healthineers.

Conflicts of interest

E.J.M.B.: receives research funding support, Siemens Healthineers. Y.K.: no relevant relationships. Y.Z.: employed by Siemens Healthineers; stock/stock options in Siemens Healthineers. A.A.A.S.: employed by Siemens Healthineers; stock/stock options in Siemens Healthineers. F.M.: no relevant relationships. P.G.: no relevant relationships. M.Z.: employed by Siemens Healthineers; stock/stock options in Siemens Healthineers. B.G.: employed by Siemens Healthineers; stock/stock options in Siemens Healthineers. S.G.: employed by Siemens Healthineers; stock/stock options in Siemens Healthineers. W.B.G.: receives research funding support and consulting fees, Siemens Healthineers.

References

  • 1. Siegel RL, Miller KD, Jemal A.  Cancer statistics, 2022. CA Cancer J Clin. 2022;72(1):7-33. 10.3322/caac.21708 [DOI] [PubMed] [Google Scholar]
  • 2. National Lung Screening Trial Research Team. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med. 2011;365(5):395-409. 10.1056/NEJMoa1102873 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. de Koning HJ, van der Aalst CM, de Jong PA, et al.  Reduced lung-cancer mortality with volume CT screening in a randomized trial. N Engl J Med. 2020;382(6):503-513. 10.1056/NEJMoa1911793 [DOI] [PubMed] [Google Scholar]
  • 4. Bach PB, Mirkin JN, Oliver TK, et al.  Benefits and harms of CT screening for lung cancer: a systematic review. JAMA. 2012;307(22):2418-2429. 10.1001/jama.2012.5521 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5. Humphrey LL, Deffebach M, Pappas M, et al.  Screening for lung cancer with low-dose computed tomography: a systematic review. Ann Intern Med. 2013;159(6):411-420. 10.7326/0003-4819-159-6-201309170-00690 [DOI] [PubMed] [Google Scholar]
  • 6. Gould MK, Donington J, Lynch WR, et al.  Evaluation of individuals with pulmonary nodules: when is it lung cancer?  Chest. 2013;143(5 Suppl):e93S-e120S. 10.1378/chest.12-2351 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Wormanns D, Diederich S.  Characterization of small pulmonary nodules: potential of computed tomography and magnetic resonance imaging. Lung Cancer. 2004;45(Suppl 2):S79-S88. 10.1016/j.lungcan.2004.07.976 [DOI] [PubMed] [Google Scholar]
  • 8. Wood DE, Kazerooni EA, Baum SL, et al.  Lung cancer screening, version 3.2018. J Natl Compr Canc Netw. 2018;16(4):412-441. 10.6004/jnccn.2018.0020 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9. van Griethuysen JJM, Fedorov A, Parmar C, et al.  Computational radiomics system to decode the radiographic phenotype. Cancer Res. 2017;77(21):e104-e107. 10.1158/0008-5472.CAN-17-0339 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Aerts HJWL, Velazquez ER, Leijenaar RTH, et al.  Decoding tumour phenotype by noninvasive imaging using a quantitative radiomics approach. Nat Commun. 2014;5:4006. 10.1038/ncomms5006 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Esteva A, Robicquet A, Ramsundar B, et al.  A guide to deep learning in healthcare. Nat Med. 2019;25(1):24-29. 10.1038/s41591-018-0316-z [DOI] [PubMed] [Google Scholar]
  • 12. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016:770-778. 10.1109/CVPR.2016.90 [DOI]
  • 13. Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint, arXiv: 1409.1556, 2014, preprint: not peer reviewed. https://arxiv.org/abs/1409.1556
  • 14. Huang G, Liu Z, van der Maaten L, Weinberger KQ. Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017:4700-4708. 10.1109/CVPR.2017.243 [DOI]
  • 15. Sudre CH, Li W, Vercauteren T, Ourselin S, Cardoso MJ. Generalised dice overlap as a deep learning loss function for highly unbalanced segmentations. In: Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Lecture Notes in Computer Science, vol 10553. Springer; 2017:240-248. 10.1007/978-3-319-67558-9_28 [DOI]
  • 16. Shen W, Zhou M, Yang F, et al.  Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognit. 2017;61:663-673. 10.1016/j.patcog.2016.05.029 [DOI] [Google Scholar]
  • 17. Setio AAA, Traverso A, de Bel T, et al.  Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the LUNA16 challenge. Med Image Anal. 2017;42:1-13. 10.1016/j.media.2017.06.015 [DOI] [PubMed] [Google Scholar]
  • 18. Ardila D, Kiraly AP, Bharadwaj S, et al.  End-to-end lung cancer screening with three-dimensional deep learning on low-dose chest computed tomography. Nat Med. 2019;25(6):954-961. 10.1038/s41591-019-0447-x [DOI] [PubMed] [Google Scholar]
  • 19. Chung K, Jacobs C, Scholten ET, et al.  Malignancy risk estimation of pulmonary nodules in screening CTs: comparison between a computer model and human observers. PLoS One. 2017;12(11):e0185032. 10.1371/journal.pone.0185032 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Multi-Center Dataset. Accessed March 16, 2022. https://www.onemednet.com/
  • 21. Tan M, Le Q. EfficientNet: rethinking model scaling for convolutional neural networks. In: Proceedings of the 36th International Conference on Machine Learning, 2019:6105-6114. Accessed November 7, 2025. http://proceedings.mlr.press/v97/tan19a.html
  • 22. Lin T-Y, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, 2017:2980-2988. 10.1109/ICCV.2017.324 [DOI]
  • 23. Kingma DP, Ba J. Adam: a method for stochastic optimization. arXiv preprint, arXiv: 1412.6980, 2014, preprint: not peer reviewed. https://arxiv.org/abs/1412.6980
  • 24. Selvaraju RR, Cogswell M, Das A, et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, 2017:618-626. 10.1109/ICCV.2017.74 [DOI]
  • 25. McWilliams A, Tammemagi MC, Mayo JR, et al.  Probability of cancer in pulmonary nodules detected on first screening CT. N Engl J Med. 2013;369(10):910-919. 10.1056/NEJMoa1214726 [DOI] [PubMed] [Google Scholar]
  • 26. American College of Radiology. Lung CT Screening Reporting and Data System (Lung-RADS) Version 2022. Accessed November 7, 2025. https://www.acr.org/Clinical-Resources/Clinical-Tools-and-Reference/Reporting-and-Data-Systems/Lung-RADS.
  • 27. DeLong ER, DeLong DM, Clarke-Pearson DL.  Comparing the areas under two or more correlated ROC curves: a nonparametric approach. Biometrics. 1988;44(3):837-845. 10.2307/2531595 [DOI] [PubMed] [Google Scholar]
  • 28. Virtanen P, Gommers R, Oliphant TE, et al.  SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat Methods. 2020;17(3):261-272. 10.1038/s41592-019-0686-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29. Pedregosa F, Varoquaux G, Gramfort A, et al.  Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825-2830. http://jmlr.org/papers/v12/pedregosa11a.html [Google Scholar]
  • 30. McKinney W. Data structures for statistical computing in Python. In: Proceedings of the 9th Python in Science Conference, 2010:51-56. 10.25080/Majora-92bf1922-00a [DOI]
  • 31. Geppert J, et al.  Software using artificial intelligence for nodule and cancer detection in CT lung cancer screening: systematic review of test accuracy studies. Thorax. 2024;79(11):1040-1049. 10.1136/thorax-2024-221768 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32. Hendrix W, et al.  Deep learning for the detection of benign and malignant pulmonary nodules in non-screening chest CT scans. Commun Med. 2023;3(1):156. 10.1038/s43856-023-00388-5 [DOI] [PMC free article] [PubMed] [Google Scholar]

Articles from Radiology Advances are provided here courtesy of Oxford University Press
