Abstract
Background
Mathematical prediction models (MPMs) based on clinical and radiologist-assessed features have been developed to assist with lung cancer risk assessment for imaging-detected lung nodules. However, MPMs were developed using different datasets, thresholds, and feature sets, making it difficult to cross-compare the published performance metrics and determine prospective performance stability. The aim of this study is to utilize a large lung cancer screening cohort with identified pulmonary nodules to compare the performance of four MPMs, at a standardized sensitivity value, to reduce the false positive rate for lung cancer screening exams.
Methods
This retrospective study utilized low-dose computed tomography (LDCT) identified lung nodules from the National Lung Screening Trial (NLST) to evaluate four MPMs [Mayo Clinic (MC), Veterans Affairs (VA), Peking University (PU), and Brock University (BU)]. For cross-comparison, a small NLST sub-cohort (n=270) was used to determine a calibrated decision threshold for each model, targeting a sensitivity for detecting lung cancer of 95%. Performance was evaluated using area under the receiver-operating-characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), sensitivity, and specificity. The calibrated threshold applied to the remaining NLST cohort (n=1,083) was used to demonstrate the stability of performance metrics.
Results
A total of 1,353 patients [mean ± standard deviation (SD) age, 62.3±5.2 years; 746 male] were included, of whom 122 (9.0%) had a malignant nodule. At the target sensitivity of 95%, the highest testing specificity (correctly identified benign nodules) was seen in the BU and MC models (55% and 52%, respectively), compared to the VA (45%) and PU (16%) models. The AUC-ROCs for BU (83%), MC (83%), PU (76%), and VA (77%) suggest moderate-to-high performance, while the AUC-PR more accurately reflects that all the models have sub-optimal precision (27–33%).
Conclusions
Tuning the calibration thresholds of existing MPMs aids performance comparison and stability for application in the lung cancer screening setting. However, when targeting high sensitivity (95%), the achievable specificity of the MPMs is low (16–55%), which may limit clinical utility.
Keywords: Prediction models, computed tomography (CT), screening, lung cancer, malignancy
Highlight box
Key findings
• Use of prediction models is recommended by medical imaging societies, yet applying these models to lung cancer screening-detected nodules was ineffective in reducing false positives.
What is known and what is new?
• Routine low-dose chest computed tomography for at-risk patients reduces lung cancer mortality.
• This study demonstrates that logistic regression models for lung cancer prediction do not improve clinical management of lung nodules in the context of lung cancer screening.
What is the implication, and what should change now?
• These findings suggest that more complex models are required to be able to predict lung nodule malignancy from medical imaging.
Introduction
Lung cancer is the leading cause of cancer-related deaths in the United States (1). Annual screening with low-dose computed tomography (LDCT) in at-risk individuals reduced lung cancer mortality by 20–24% in the American National Lung Screening Trial (NLST) (2) and the European NELSON study (3). Based on the NLST results, the Centers for Medicare & Medicaid Services (CMS) approved coverage of lung cancer screening preventive services (2,4) and recently expanded the inclusion criteria (50–80 years of age with at least a 20 pack-year smoking history), almost doubling the number of Americans eligible for lung cancer screening (5-7). Despite the established benefits of lung cancer screening, significant challenges remain with broad clinical implementation. One notable challenge is the large proportion of non-cancerous nodules detected. In the NLST, approximately 39% of participants had at least one suspected malignant nodule during three rounds of screening; 70% of positive tests led to diagnostic evaluations, and 96% were determined to be false positives (benign) (2). NLST also underestimates the harms from diagnostic workup relative to the general population; the rate of complications from invasive diagnostic procedures was more than double in community settings compared to the original NLST cohort (8). In addition to the physical risks associated with invasive procedures, 83–87% of patients reported anxiety related to benign, screening-detected nodules that require imaging follow-up (9-12). Anxiety can significantly impact patients’ overall well-being, and for some, it becomes a barrier to completing follow-up care (10,13,14). Multiple studies have demonstrated worse anxiety in the 6 months immediately following an indeterminate lung cancer screening result, compared to pre-screening levels and to levels beyond 6 months (15-17). The short-term decline in health-related quality of life and elevated distress were found to subside after confirmation of benign status (18). Hence, more accurate ways of estimating lung cancer risk in LDCT screen-detected nodules are needed to improve clinical management.
The American College of Radiology developed the Lung Imaging Reporting and Data System (Lung-RADS) as a standardized reporting and classification system for pulmonary nodules detected with LDCT screening (19). Lung-RADS version 2022 comprises six numeric categories, where a higher number means imaging follow-up and associated management recommendations become increasingly frequent and invasive. For example, Lung-RADS 1 (no nodules) results in a 1-year follow-up, whereas 4B/4X nodules may result in a 1-month follow-up and/or tissue sampling. Though validated in some contexts, Lung-RADS does not incorporate other established clinical risk factors for lung cancer, which may limit its performance and is one reason it is not universally endorsed.
By comparison, equation-based post-imaging mathematical prediction models (MPMs) use multivariate logistic regression on various combinations of clinical risk factors and imaging features to predict whether a nodule is malignant. Previously published MPMs include the Mayo Clinic (MC) model (20), the Veterans Affairs (VA) model (21), the Peking University (PU) model (22), and the Brock University (BU) model (23). Of these models, the British Thoracic Society Nodule Guidance documents support incorporating the BU MPM, while the American College of Chest Physicians endorses the MC MPM (24,25). Yet, performance stability in prospective cohorts is poorly understood, creating challenges for clinical use. Prior studies have explored model performance on new patient cohorts, some of which involved adjusting the risk assessment decision threshold (26-29). However, three of these studies focused on small cohorts (86–317 nodules) and included a higher proportion of lung cancer cases than is found in the clinical setting (33–69% malignant) (26,27,29). Nair et al. explored MPMs in the NLST cohort but included neither the PU model nor any association with Lung-RADS (28). In our study, we build upon this prior work by incorporating a large NLST cohort and four MPMs, making associations with Lung-RADS, providing a calibrated online tool, and focusing on model assessment appropriate for unbalanced cohorts. Numerous investigations have utilized machine and deep learning models for lung cancer risk prediction (30-33); however, these models are neither supported/recommended by clinical organizations nor widely implemented in the clinical space, and are therefore not analyzed in this study.
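To make the model form concrete, the sketch below implements the MC model as a standalone Python function using the coefficients published by Swensen et al. (20). This is an illustrative sketch only; the coefficients and predictor definitions should be verified against the original publication, and the function is not a substitute for the calibrated web application described in Methods.

```python
import math

def mayo_clinic_risk(age_years: float, diameter_mm: float, smoker: bool,
                     extrathoracic_cancer_hx: bool, spiculation: bool,
                     upper_lobe: bool) -> float:
    """Mayo Clinic post-imaging malignancy risk (Swensen et al., 1997).

    Coefficients are quoted from the original publication (20); verify
    before any use. Returns a risk score between 0 and 1.
    """
    x = (-6.8272
         + 0.0391 * age_years
         + 0.7917 * int(smoker)                   # current or former smoker
         + 1.3388 * int(extrathoracic_cancer_hx)  # extrathoracic cancer >=5 years prior
         + 0.1274 * diameter_mm
         + 1.0407 * int(spiculation)
         + 0.7838 * int(upper_lobe))
    return math.exp(x) / (1.0 + math.exp(x))      # logistic link

# Example: a 62-year-old smoker with a 9 mm spiculated upper-lobe nodule
print(f"MC risk: {mayo_clinic_risk(62, 9.0, True, False, True, True):.3f}")
```

Each MPM follows this same logistic form with different predictors and coefficients, which is why a single decision threshold can be recalibrated for all four models in the same way.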
In the decade since CMS established coverage of lung cancer screening services, uptake has remained low (16–18% in the most recent updates) (34,35), but is anticipated to increase in the coming years. To improve efficiency, there is an immediate need to reduce false positives, which can harm patients and raise costs in an already strained healthcare system. There is also a need to make transparent the performance stability of the MPMs currently endorsed by the British Thoracic Society and the American College of Chest Physicians, with comparison to Lung-RADS. Understanding these performance metrics is vital for responsible application to prospective lung cancer screening cases in the clinical environment. The aim of this study is to utilize a large portion of the NLST population with LDCT-identified pulmonary nodules to compare the performance of the four MPMs, at a standardized sensitivity value, to reduce the false positive rate for lung cancer screening exams. We also evaluate the relationship between the Lung-RADS criteria and nodule subtype (solid, part-solid, non-solid) and the MPM risk scores. We present this article in accordance with the STROBE reporting checklist (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-439/rc).
Methods
Subjects
This retrospective study utilized the sub-cohort from NLST that received LDCT and had a lung nodule ≥4 mm for which the clinical (e.g., age, smoking history) and imaging features (e.g., nodule size, lobe location) required by the MPMs were identified. This study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Subjects were consented as part of the original NLST study, and only deidentified data were transferred for this study via the Cancer Data Access System. The University of Iowa Institutional Review Board provided a non-human subjects research determination. For subjects with multiple nodules and a resultant cancer diagnosis, the imaging features of the nodule associated with the cancer diagnosis were used. For subjects with multiple benign nodules, the imaging features of the largest nodule were used. NLST categorized all calcified nodules as benign and did not record any other features for those nodules; therefore, subjects with only calcified nodules were removed from the cohort. Also, because the NLST data included a family lung cancer history (FamilyLungCancerHx) variable but did not specifically flag family history of cancer at other sites (FamilyCancerHx), the FamilyLungCancerHx variable was used in place of the FamilyCancerHx variable. For exploring the optimization of the MPMs, 20% of the subjects were used to calibrate the models (i.e., determine the threshold that produces a target sensitivity for detecting lung cancer). The remaining 80% were used to test the stability of the performance using the calibrated thresholds and to compare it to the published decision threshold values. The calibration and testing cohorts maintained the same class balance as the full cohort.
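For readers reproducing this design on a local cohort, a class-stratified 20/80 split can be obtained as in the following sketch; the file and column names are hypothetical placeholders for a table with one row per subject and a binary malignancy label.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical input: one row per subject with MPM features and a binary
# 'malignant' label; the file and column names are placeholders.
nodules = pd.read_csv("nodule_cohort.csv")

# 20% calibration / 80% testing, preserving the cohort's ~9% malignancy
# rate in both subsets (stratification maintains class balance).
calibration, testing = train_test_split(
    nodules, train_size=0.20, stratify=nodules["malignant"], random_state=0)
```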
MPMs
All four MPMs assessed are post-imaging models that calculate a risk score (ranging from 0 to 1), and for each MPM, a risk decision threshold must be determined to translate the continuous risk score into a binary prediction of benign/malignant. More information on the MPMs and the differences between their development cohorts can be found in Appendix 1. For this study, in a common lung cancer screening cohort, the performance of the MPMs was calculated via the MPM Calibration & Analysis online application found at https://www.i-clic.uihc.uiowa.edu/resources/sieren/mpm/ (29). This web-based application allows calibration of the MPMs’ risk assessment decision threshold values using Youden’s statistic or a set target sensitivity or specificity. In this study, the calibration sub-cohort was uploaded to obtain the decision thresholds when adjusted to a target sensitivity of 95% (and 90% as supplementary data). Each threshold was then applied to the testing set to evaluate the resulting performance statistics and the stability of the risk prediction for prospective cases.
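Conceptually, the calibration step reduces to scanning the calibration cohort’s ROC operating points for the highest threshold that still meets the target sensitivity; a minimal sketch, assuming binary labels and continuous MPM risk scores, is shown below.

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_target_sensitivity(y_true, risk_scores, target=0.95):
    """Return the highest decision threshold whose sensitivity meets target.

    y_true: binary calibration labels (1 = malignant).
    risk_scores: continuous MPM outputs in [0, 1].
    """
    fpr, tpr, thresholds = roc_curve(y_true, risk_scores)
    eligible = thresholds[tpr >= target]   # operating points at/above target
    # Fall back to the most permissive cut-point if no threshold qualifies.
    return eligible.max() if eligible.size else thresholds.min()

# The calibrated threshold is then applied unchanged to the testing cohort:
# y_pred = risk_scores_test >= calibrated_threshold
```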
Applying Lung-RADS
Lung-RADS categories have different criteria depending on whether a nodule is seen at baseline or at a subsequent screening, and part-solid nodules are categorized using two measurements: the solid component diameter and the total diameter. However, the MPMs do not incorporate any changes over time, so all nodules were treated as baseline for this study, and part-solid nodules were assigned a Lung-RADS category based only on the total diameter, similar to the post-hoc Lung-RADS application by Pinsky et al. (36). The highest risk Lung-RADS category (4X) relies on an individual radiologist’s interpretation and is applied when LDCT imaging has additional concerning features, such as spiculation, lymphadenopathy, or a ground-glass nodule that has doubled in size in the past year. Since data for these additional features are not provided in the NLST dataset, none of the nodules were categorized as 4X. Scatter plots comparing predictive risk values from the MPMs to Lung-RADS were created in RStudio using the ggplot2 library (R version 4.3.0).
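A simplified sketch of this post-hoc assignment is given below; the size cut-points paraphrase the Lung-RADS 2022 baseline criteria (19) and encode the study-specific simplifications described above, so it should be checked against the official document rather than treated as a faithful Lung-RADS implementation.

```python
def lung_rads_2022_posthoc(diameter_mm: float, nodule_type: str) -> str:
    """Post-hoc Lung-RADS 2022 category from size and nodule type only.

    Study-specific simplifications: baseline criteria for all nodules, no
    4X upgrades (qualitative features unavailable in NLST), and part-solid
    nodules graded on total diameter because NLST did not record the solid
    component (all part-solid nodules >=6 mm map to category 3).
    """
    if nodule_type == "non-solid":                 # ground-glass nodules
        return "2" if diameter_mm < 30 else "3"
    if nodule_type == "part-solid":
        return "2" if diameter_mm < 6 else "3"
    # solid nodules: baseline size cut-points
    if diameter_mm < 6:
        return "2"
    if diameter_mm < 8:
        return "3"
    if diameter_mm < 15:
        return "4A"
    return "4B"
```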
Statistical analysis
To assess potential bias between the calibration and testing sub-cohorts, several statistical analyses were performed: normality of continuous features was assessed using the Shapiro-Wilk test, and P values were calculated using the Wilcoxon rank-sum test. Models and corresponding calibrated model thresholds were calculated using the MPM Calibration & Analysis web application (29), and their performances were assessed using area under the receiver-operating-characteristic curve (AUC-ROC), area under the precision-recall curve (AUC-PR), sensitivity, specificity, accuracy, and precision. Confidence intervals were calculated using the epiR package in R (version 4.3.0). As this cohort is not class balanced (i.e., it does not contain equal numbers of malignant and benign nodules), the AUC-PR analysis must be utilized. AUC-PR excludes true negatives to highlight true and false positives, which can be ‘hidden’ in AUC-ROC because the majority of nodules are benign. The AUC-ROC and AUC-PR curves were created in Python using the sklearn and matplotlib libraries (Python version 3.9.13). Modifications to the MPM Calibration & Analysis web application were made to allow clinicians to enter single prospective patient data and receive the prediction outputs from the four MPMs using the calibration thresholds determined from the NLST cohort, as reported in this paper.
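The two area metrics can be computed from the same labels and risk scores, as in the toy sketch below (illustrative values only); with roughly 10% prevalence, a single mis-ranked malignant case barely moves AUC-ROC but sharply lowers AUC-PR, whose chance level is the prevalence rather than 0.5.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Illustrative toy data at ~10% prevalence: one malignant case (label 1)
# ranked third-highest among ten risk scores.
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])
y_score = np.array([.05, .10, .02, .30, .08, .15, .04, .22, .09, .20])

auc_roc = roc_auc_score(y_true, y_score)             # ~0.78 here
auc_pr = average_precision_score(y_true, y_score)    # ~0.33 here

# The PR chance level is the prevalence P/(P + N), not 0.5.
print(f"AUC-ROC {auc_roc:.2f}, AUC-PR {auc_pr:.2f}, "
      f"PR baseline {y_true.mean():.2f}")
```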
Results
The original dataset received from NLST contained 10,127 subjects. Inclusion criteria were participation in the computed tomography (CT) screening arm of the trial and a non-calcified, CT-detected lung nodule ≥4 mm (Figure 1), resulting in a total of 1,353 subjects, who were then split 20/80 into calibration and testing cohorts. Despite random assignment into the calibration and testing cohorts, the proportion of participants with a FamilyCancerHx was significantly lower in the calibration versus the testing set (Table 1). No other features were statistically different.
Figure 1.
Flow diagram of subject inclusion/exclusion criteria. LDCT, low-dose computed tomography.
Table 1. Demographic and radiological features of study cohorts.
| Features | Full dataset (n=1,353) | Calibration (n=270) | Testing (n=1,083) | P value |
|---|---|---|---|---|
| Age (years) | 62.3±5.2 | 62.8±5.4 | 62.2±5.1 | 0.15 |
| Male | 746 (55.1) | 146 (54.1) | 600 (55.4) | 0.69 |
| Currently smoking | 383 (28.3) | 74 (27.4) | 309 (28.5) | 0.43 |
| Cessation years | 3.5±5.0 | 3.3±5.0 | 3.5±4.9 | 0.17 |
| CancerHx | 46 (3.4) | 11 (4.1) | 35 (3.2) | 0.49 |
| FamilyCancerHx | 345 (25.5) | 56 (20.7) | 289 (26.7) | 0.045 |
| Emphysema | 750 (55.4) | 148 (54.8) | 602 (55.6) | 0.82 |
| Upper lobe nodule | 556 (41.1) | 120 (44.4) | 436 (40.3) | 0.21 |
| Diameter (mm) | 8.9±7.0 | 9.2±7.6 | 9.0±6.2 | 0.25 |
| Spiculation | 217 (16.0) | 46 (17.0) | 171 (15.8) | 0.62 |
| Smooth border | 756 (55.9) | 143 (53.0) | 613 (56.6) | 0.28 |
| Nodule type | | | | 0.54 |
| Solid | 1,038 (76.7) | 205 (75.9) | 833 (76.9) | |
| Part-solid | 94 (6.9) | 12 (4.4) | 82 (7.6) | |
| Non-solid | 221 (16.3) | 53 (19.6) | 168 (15.5) | |
| Nodule count | 1.5±1.0 | 1.5±1.1 | 1.5±1.0 | 0.12 |
| Malignant | 122 (9.0) | 23 (8.5) | 99 (9.1) | 0.75 |
Data are presented as mean ± SD or n (%). CancerHx, cancer history; FamilyCancerHx, family cancer history; SD, standard deviation.
Table 2 presents the performance metrics and 95% confidence intervals for the four models applied to the NLST testing cohort, including at the uncalibrated, original study-published thresholds. Under each model, the first and second columns present performance at the calibrated threshold on the calibration and testing cohorts, respectively, while the third column presents performance on the testing cohort at the original study-published MPM-associated threshold (indicated by *). To cross-compare the performance of these four models, a decision threshold targeting a sensitivity of 95% was used; a high sensitivity minimizes the occurrence of a missed cancer case. Table S1 shows the uncalibrated, original study-published thresholds for the four models applied to the full 1,353-subject NLST cohort.
Table 2. Model results from calibration and testing sets at selected 95% sensitivity thresholds.
| Performance | MC calibration | MC testing | MC MPM-associated | VA calibration | VA testing | VA MPM-associated | BU calibration | BU testing | BU MPM-associated | PU calibration | PU testing | PU MPM-associated |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AUC-ROC (%) | 85 | 83 | 83 | 85 | 77 | 77 | 87 | 83 | 83 | 78 | 76 | 76 |
| AUC-PR (%) | 39 | 31 | 31 | 38 | 28 | 28 | 36 | 33 | 33 | 31 | 27 | 27 |
| Threshold (%) | 9 | 9 | 3* | 27 | 27 | 20* | 2 | 2 | 12* | 21 | 21 | 46* |
| Sensitivity (%) | 91 [72, 99] | 93 [86, 97] | 100 [96, 100] | 91 [72, 99] | 90 [82, 95] | 97 [91, 99] | 91 [72, 99] | 92 [85, 96] | 63 [52, 72] | 96 [78, 100] | 99 [95, 100] | 85 [76, 91] |
| Specificity (%) | 44 [37, 50] | 52 [49, 55] | 0 [0, 0] | 40 [34, 46] | 45 [42, 48] | 20 [18, 23] | 49 [43, 55] | 55 [51, 58] | 84 [82, 87] | 13 [9, 18] | 16 [14, 18] | 50 [47, 53] |
| Accuracy (%) | 48 [42, 54] | 56 [53, 59] | 9 [7, 11] | 44 [38, 51] | 49 [46, 52] | 27 [25, 30] | 53 [46, 59] | 58 [55, 61] | 82 [80, 85] | 20 [15, 25] | 24 [21, 26] | 53 [50, 56] |
| Precision (%) | 13 [8, 19] | 16 [13, 20] | 9 [7, 11] | 12 [8, 18] | 14 [11, 17] | 11 [9, 13] | 14 [9, 21] | 17 [14, 20] | 29 [23, 35] | 9 [6, 14] | 11 [9, 13] | 15 [12, 18] |
Data are presented as value [95% confidence interval] at the selected 95% sensitivity threshold, unless otherwise indicated. The third column under each model (MPM-associated) indicates the results for the original study-published thresholds (*) on the testing set. AUC-PR, area under the precision-recall curve; AUC-ROC, area under the receiver-operating-characteristic curve; BU, Brock University; MC, Mayo Clinic; MPM, mathematical prediction model; PU, Peking University; VA, Veterans Affairs.
The PU model achieved the highest calibrated sensitivity for lung cancer risk prediction in this NLST cohort, with both the calibration sensitivity (96%) and testing sensitivity (99%) exceeding the 95% target (Table 2). The MC, VA, and BU models fell short of the target sensitivity in both calibration and testing (all between 90% and 93%). Table S2 contains the sensitivity, specificity, accuracy, and precision of the models when setting the target sensitivity to 90%. No model achieved the target sensitivity of 90% in testing (MC 79%, VA 89%, BU 70%, and PU 87%). As expected with the reduction in target sensitivity to 90%, all models had increased specificity; the BU model showed the largest difference in testing specificity between the 90% and 95% sensitivity targets (81% and 55%, respectively).
For the target sensitivity of 95%, the BU and MC models achieved the highest calibration specificity, i.e., correctly identified cases without cancer (49% and 44%, respectively). Both models achieved higher specificity in testing (55% and 52%, respectively). While these models predict low risk scores for most of the benign cases, the scores for malignant cases are spread between 0 and 1; this trend can also be seen in the AUC-ROC and AUC-PR (Figure 2). BU and MC have moderately high testing AUC-ROC performance (83% for both models), and PU and VA have moderate performance (76% and 77%, respectively); however, AUC-PR, which does not incorporate true negatives, shows that the MPMs have sub-optimal ability to correctly identify malignant cases (27–33%).
Figure 2.
AUC-ROC and AUC-PR curves of the calibration and testing sets for the four MPMs. (A) ROC curves for the 270 calibration subjects and (B) ROC curves for the 1,083 testing subjects. The black solid line in (A) and (B) is the ‘baseline’ AUC-ROC of 0.5. (C) PR curves for the calibration subjects and (D) PR curves for the testing subjects. The plots in (C) and (D) are truncated at the baseline, defined as the ratio of positives to total cases: y = P/(P + N). AUC-PR, area under the precision-recall curve; AUC-ROC, area under the receiver-operating-characteristic curve; BU, Brock University; MC, Mayo Clinic; MPM, mathematical prediction model; N, negative; P, positive; PPV, positive predictive value; PR, precision recall; PU, Peking University; ROC, receiver operating characteristic; TPR, true positive rate; VA, Veterans Affairs.
In Table 3, the nodules are retrospectively categorized into Lung-RADS 2022 categories based on diameter and nodule subtype (solid, part-solid, non-solid). There were 260 solid benign nodules categorized as Lung-RADS 4A and 70 categorized as Lung-RADS 4B. Among malignant cases, all 13 non-solid nodules were categorized as Lung-RADS 2 (benign), and all 19 part-solid nodules were categorized as Lung-RADS 3.
Table 3. Categorizing nodules into Lung-RADS 2022 based on size and nodule type.
| Lung-RADS 2022 | Malignant, solid | Malignant, part-solid | Malignant, non-solid | Benign, solid | Benign, part-solid | Benign, non-solid |
|---|---|---|---|---|---|---|
| 2 (benign) | 6 | 0 | 13 | 365 | 25 | 208 |
| 3 (probably benign) | 8 | 19 | 0 | 253 | 50 | 0 |
| 4A (suspicious) | 34 | 0* | N/A | 260 | 0* | N/A |
| 4B (very suspicious) | 42 | 0* | N/A | 70 | 0* | N/A |
*, NLST did not incorporate measurement of the solid component for part-solid nodules; hence, all part-solid nodules ≥6 mm were classified as Lung-RADS 3 for our analysis. Lung-RADS, Lung Imaging Reporting and Data System; N/A, not available; NLST, National Lung Screening Trial.
Figure 3 illustrates how the prediction values from the four risk models vary with Lung-RADS 2022 category and nodule type. The BU, PU, and MC models assessed all Lung-RADS category 4B nodules as high risk, even though 70 of 112 such cases were benign. The prediction values for the BU and MC models have a positive skew (most prediction values <0.5), while conversely, the VA and PU models have a negative skew (a high number of prediction values >0.5). For the PU model, the skew in prediction values is reflected in the resulting high sensitivity and low specificity (Table 2). As shown in Figure 3, PU and VA both show a changing range of prediction values across Lung-RADS criteria. This suggests a potential benefit of different calibration thresholds for each Lung-RADS category (i.e., for the PU model, the selected calibration threshold of 0.21 works well for Lung-RADS categories 3 and 4A, but increased performance could be achievable with a threshold of 0.19 for category 2 and 0.31 for category 4B).
Figure 3.
Scatter plot of the MPM risk prediction values for the NLST testing cohort. (A) MC model, (B) VA model, (C) BU model, and (D) PU model. The nodules are split into the Lung-RADS categories (2–4B) based on nodule diameter and type (red: non-solid, green: part-solid, blue: solid). Columns are further split by benign (left) and malignant (right). Horizontal black lines indicate the threshold calculated from the selected 95% sensitivity. BU, Brock University; Lung-RADS, Lung Imaging Reporting and Data System; MC, Mayo Clinic; MPM, mathematical prediction model; NLST, National Lung Screening Trial; PU, Peking University; VA, Veterans Affairs.
Discussion
The expanded eligibility criteria for lung cancer screening extend the potential benefit of early detection to younger and historically underserved populations; yet, they also increase eligibility among lower-risk persons who are less likely to benefit from screening (37). An ongoing clinical implementation challenge is minimizing the emotional stress and potential harm from invasive diagnostic procedures for patients with suspicious nodules while efficiently identifying the small minority with lung cancer. Multivariate logistic regression MPMs are attractive support tools because they are easy to understand, explain, and use. However, there is a lack of transparency in the performance metrics of models endorsed by various subspecialty societies and their stability when applied to new data. This study evaluated four post-imaging MPMs on a large cohort of LDCT-detected nodules from NLST. These results can be used to inform clinicians and provide the ability to perform similar threshold optimization studies on local clinical cohorts.
To compare the MPMs, the sensitivity target was set to 95%. The BU model had the best testing performance (AUC-PR: 33%; sensitivity: 92%; specificity: 55%), likely because it was developed on a lung cancer screening cohort with similar trends in malignancy rate and nodule type, rather than on incidentally detected nodules as with the other three models. While sensitivity was targeted to increase true positives, all models’ abilities to reduce false positives were low (specificity ranged from 16% to 55%). Similar results were observed in a study by Hammer et al. in a cohort with only larger nodules (8–26 mm), where approximately half were misclassified as benign (27).
The cohort used in this study included ≥4 mm solid, part-solid, and non-solid nodules. The Lung-RADS 2022 guidelines categorize solid 8–15 mm nodules as suspicious. Hence, 52% of this cohort’s solid benign cases would have been labeled ‘suspicious’ and 14% ‘very suspicious’. This study indicated that the only model for which the Lung-RADS criteria could improve MPM performance is the PU model. This is illustrated in Figure 3D, where there is a differential in the lowest prediction values between malignant and benign cases across all Lung-RADS criteria. Figure 3 also shows that ≥6 mm part-solid nodules are not accurately predicted by Lung-RADS 3 or the MPMs; more research is needed to understand the prevalence and associated risk of part-solid nodules (38). Thoracic radiologist assessments have been shown to be as accurate as the MPMs (39,40) and more accurate for Lung-RADS 4 nodules (41). This is complicated in the clinical setting by diversity in the expertise and experience of the reviewing radiologist. In NLST, all cases were reviewed by experienced thoracic radiologists, while the recent CMS screening guidelines do not require readers to have a thoracic specialty and also removed the minimum required number of cases (formerly 300 chest CTs within 3 years) to qualify as an LDCT screening reader (4,7).
When comparing models, it is important to note that the AUC-ROC values published for the four models evaluated in this paper can be misleading: (I) AUC-ROC curves are used to discriminate between binary classes, but the MC and VA models incorporated two thresholds; and (II) AUC-ROC alone does not fully capture a model’s ability to correctly identify malignant nodules. Because this NLST cohort is unbalanced, with about 10% malignancy (a similar imbalance is found in the clinical setting of lung cancer screening), the AUC-PR metric must be utilized, as it removes the bias introduced by the large number of true negatives.
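As a concrete illustration with the testing cohort in this study: the PR baseline is the malignancy prevalence, P/(P + N) = 99/1,083 ≈ 0.09, so the observed AUC-PR values of 27–33% sit roughly three times above chance, even though the corresponding AUC-ROC values of 76–83% are measured against a chance level of 0.5 that is unaffected by class imbalance.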
Our study aids cross-comparison of MPM performance by applying the published thresholds to the same retrospective testing sub-cohort that meets lung cancer screening criteria. We also present additional analyses to cross-compare performance with respect to nodule composition and Lung-RADS criteria. Further, we have made publicly available the MPM Calibration & Analysis online application, found at https://www.i-clic.uihc.uiowa.edu/resources/sieren/mpm/ (29), allowing others to perform similar threshold optimization analyses using their own local clinical cohorts or to apply the NLST-calibrated thresholds from this study. The results from this study indicate that while MPMs are intuitive and easy for clinicians to access in calculator form as supportive decision aids, caution should be exercised in applying these results in the context of lung cancer screening. The results also support the need for better performing decision support tools (higher sensitivity and specificity), which may only be achievable with more complex approaches such as machine learning. While the included MPMs were developed with statistical logistic regression, machine learning approaches such as gradient boosting-based methods [light gradient boosting machine (LightGBM) and extreme gradient boosting (XGBoost)] or neural network-based methods [tabular prior-data fitted network (TabPFN)] may present performance advantages for applications to structured data from the electronic health record (i.e., demographics, clinical features, and radiologist reports). These types of models have been constructed with electronic health record data and used for outcome prediction in colorectal cancer (42), gliomas (43), and pre-lung cancer screening risk assessment (44).
Machine and deep learning methods have been investigated to improve the early diagnosis of lung nodules detected with CT by utilizing information captured in the image data itself (30,31,33,45,46). These approaches can be computationally expensive to develop and train and are less transparent and intuitive for clinicians to understand; however, they can achieve higher performance than the MPMs, which would represent increased clinical benefit. One well-explored approach is the Lung Cancer Prediction convolutional neural network (LCP-CNN) (47), for which a recent study related important model features to those incorporated in the Brock model, increasing the explainability of this advanced modeling approach (48). Future work is needed to cross-compare these advanced prediction models in common data cohorts that adequately represent the patient diversity found in the clinical setting (49). Further, there are increased barriers for these types of models that must be overcome for them to be accessible to clinicians as decision support tools (commercialization, regulatory approval, workflow integration, and user skepticism) (50).
Limitations to this study include: (I) the FamilyCancerHx feature was significantly different between the calibration and testing sets; notably, the two models that incorporate the feature (PU and BU) maintained stable performance, while the other two models, which lack the feature, lost stability. (II) Lung-RADS assessment was not incorporated into NLST and was applied post-hoc based on size criteria, without incorporating the Lung-RADS 4X classification. (III) This is a retrospective study utilizing a sub-cohort from NLST, for which the enrollment age (55–74 years) and long smoking history (30 pack years) do not match the current clinical guidelines (50–80 years, 20 pack years). (IV) NLST, and consequently our sub-cohort, has limited racial diversity; the recently expanded CMS screening eligibility criteria will better address racial disparities, and ongoing evaluation will be needed (51,52). (V) While we explored recalibration of the decision threshold for the four MPMs included in this study, we did not recalibrate the study-published formulas (through logistic recalibration or isotonic regression).
Conclusions
The four MPMs explored in this study, in the context of calibration for application in lung cancer screening cohorts, offer minimal improvement to clinical management. The best calibrated MPM performance was achieved with the BU model: 92% sensitivity, 55% specificity, 58% accuracy, and 17% precision. The high-to-moderate AUC-ROC results commonly reported can mislead clinicians when applying the models to their local populations, where there is significant class imbalance; the AUC-PR better reflects performance in this situation. A web application has been made available that allows clinicians to calculate the MPMs’ risk predictions on any new lung cancer screening case using the calibrated settings reported in this study; however, the low specificity of all models may limit clinical utility.
Supplementary
The article’s supplementary files are available online.
Acknowledgments
The authors thank the National Cancer Institute for access to the data collected by the National Lung Screening Trial. We also thank Sarah Bell for statistical advice for this manuscript.
Ethical Statement: The authors are accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved. The study was conducted in accordance with the Declaration of Helsinki and its subsequent amendments. Subjects were consented as a part of the original NLST study and only deidentified data was transferred for this study via the Cancer Data Access System. The University of Iowa Institutional Review Board provided a non-human subjects research determination.
Footnotes
Reporting Checklist: The authors have completed the STROBE reporting checklist. Available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-439/rc
Funding: This research was supported by the National Institutes of Health [No. R01CA267820 (to K.E.S., K.K., R.M.H., and J.C.S.), No. T32HL144461 (to K.K.), and No. P30CA086862 (to R.M.H. and J.C.S.)].
Conflicts of Interest: All authors have completed the ICMJE uniform disclosure form (available at https://tlcr.amegroups.com/article/view/10.21037/tlcr-2025-439/coif). K.E.S. reports support for this work from NIH grant R01CA267820. K.E.S. also received a best paper award from the University of Iowa Holden Comprehensive Cancer Center, which included support for attendance and travel to a meeting. K.K. reports support for this work from NIH grants T32HL144461 and R01CA267820. R.M.H. reports support for this work from NIH grant R01CA267820. R.M.H. also reports support from NIH grant P30CA086862 for co-leadership of the Cancer Epidemiology and Population Science Program. J.C.S. reports support for this work from NIH grant R01CA267820 and statistical support funded in part by P30CA086862. Unrelated to this work, J.C.S. also reports that a spouse is a paid consultant and has stock options for VIDA Diagnostics, honorarium for serving on an NIH study section, and pilot funding from Siemens Healthineers. The other authors have no conflicts of interest to declare.
References
1. Siegel RL, Miller KD, Wagle NS, et al. Cancer statistics, 2023. CA Cancer J Clin 2023;73:17-48. doi: 10.3322/caac.21763.
2. National Lung Screening Trial Research Team; Aberle DR, Adams AM, et al. Reduced lung-cancer mortality with low-dose computed tomographic screening. N Engl J Med 2011;365:395-409. doi: 10.1056/NEJMoa1102873.
3. de Koning HJ, van der Aalst CM, de Jong PA, et al. Reduced Lung-Cancer Mortality with Volume CT Screening in a Randomized Trial. N Engl J Med 2020;382:503-13. doi: 10.1056/NEJMoa1911793.
4. Centers for Medicare and Medicaid Services. Screening for Lung Cancer with Low Dose Computed Tomography (LDCT). CAG-00439N. 2015. Available online: https://www.cms.gov/medicare-coverage-database/details/nca-decision-memo.aspx?NCAId=274&bc=AAAAAAAAAgAAAA
5. Wolf AMD, Oeffinger KC, Shih TY, et al. Screening for lung cancer: 2023 guideline update from the American Cancer Society. CA Cancer J Clin 2024;74:50-81. doi: 10.3322/caac.21811.
6. Rivera MP, Katki HA, Tanner NT, et al. Addressing Disparities in Lung Cancer Screening Eligibility and Healthcare Access. An Official American Thoracic Society Statement. Am J Respir Crit Care Med 2020;202:e95-e112. doi: 10.1164/rccm.202008-3053ST.
7. Centers for Medicare and Medicaid Services. Screening for Lung Cancer with Low Dose Computed Tomography (LDCT). CAG-00439R. 2022. Available online: https://www.cms.gov/medicare-coverage-database/view/ncacal-decision-memo.aspx?proposed=N&ncaid=304
8. Zhao H, Xu Y, Huo J, et al. Updated Analysis of Complication Rates Associated With Invasive Diagnostic Procedures After Lung Cancer Screening. JAMA Netw Open 2020;3:e2029874. doi: 10.1001/jamanetworkopen.2020.29874.
9. Derry-Vick HM, Heathcote LC, Glesby N, et al. Scanxiety among Adults with Cancer: A Scoping Review to Guide Research and Interventions. Cancers (Basel) 2023;15:1381. doi: 10.3390/cancers15051381.
10. Bauml JM, Troxel A, Epperson CN, et al. Scan-associated distress in lung cancer: Quantifying the impact of "scanxiety". Lung Cancer 2016;100:110-3. doi: 10.1016/j.lungcan.2016.08.002.
11. Wu GX, Raz DJ, Brown L, et al. Psychological Burden Associated With Lung Cancer Screening: A Systematic Review. Clin Lung Cancer 2016;17:315-24. doi: 10.1016/j.cllc.2016.03.007.
12. Rasmussen JF, Siersma V, Malmqvist J, et al. Psychosocial consequences of false positives in the Danish Lung Cancer CT Screening Trial: a nested matched cohort study. BMJ Open 2020;10:e034682. doi: 10.1136/bmjopen-2019-034682.
13. Ford ME, Havstad SL, Flickinger L, et al. Examining the effects of false positive lung cancer screening results on subsequent lung cancer screening adherence. Cancer Epidemiol Biomarkers Prev 2003;12:28-33.
14. Erkmen CP, Dako F, Moore R, et al. Adherence to annual lung cancer screening with low-dose CT scan in a diverse population. Cancer Causes Control 2021;32:291-8. doi: 10.1007/s10552-020-01383-0.
15. van den Bergh KA, Essink-Bot ML, Borsboom GJ, et al. Short-term health-related quality of life consequences in a lung cancer CT screening trial (NELSON). Br J Cancer 2010;102:27-34. doi: 10.1038/sj.bjc.6605459.
16. van den Bergh KA, Essink-Bot ML, Bunge EM, et al. Impact of computed tomography screening for lung cancer on participants in a randomized controlled trial (NELSON trial). Cancer 2008;113:396-404. doi: 10.1002/cncr.23590.
17. Bui KT, Kiely BE, Dhillon HM, et al. Prevalence and severity of scanxiety in people with advanced cancers: a multicentre survey. Support Care Cancer 2022;30:511-9. doi: 10.1007/s00520-021-06454-9.
18. McGovern PM, Gross CR, Krueger RA, et al. False-positive cancer screens and health-related quality of life. Cancer Nurs 2004;27:347-52. doi: 10.1097/00002820-200409000-00003.
19. American College of Radiology. Lung-RADS Assessment Categories. 2022. Available online: https://www.acr.org/-/media/ACR/Files/RADS/Lung-RADS/Lung-RADS-2022.pdf
20. Swensen SJ, Silverstein MD, Ilstrup DM, et al. The probability of malignancy in solitary pulmonary nodules. Application to small radiologically indeterminate nodules. Arch Intern Med 1997;157:849-55.
21. Gould MK, Ananth L, Barnett PG, et al. A clinical model to estimate the pretest probability of lung cancer in patients with solitary pulmonary nodules. Chest 2007;131:383-8. doi: 10.1378/chest.06-1261.
22. Li Y, Wang J. A mathematical model for predicting malignancy of solitary pulmonary nodules. World J Surg 2012;36:830-5. doi: 10.1007/s00268-012-1449-8.
23. McWilliams A, Tammemagi MC, Mayo JR, et al. Probability of cancer in pulmonary nodules detected on first screening CT. N Engl J Med 2013;369:910-9. doi: 10.1056/NEJMoa1214726.
24. Gould MK, Donington J, Lynch WR, et al. Evaluation of individuals with pulmonary nodules: when is it lung cancer? Diagnosis and management of lung cancer, 3rd ed: American College of Chest Physicians evidence-based clinical practice guidelines. Chest 2013;143:e93S-e120S.
25. Callister ME, Baldwin DR, Akram AR, et al. British Thoracic Society guidelines for the investigation and management of pulmonary nodules. Thorax 2015;70 Suppl 2:ii1-ii54. doi: 10.1136/thoraxjnl-2015-207168.
26. Al-Ameri A, Malhotra P, Thygesen H, et al. Risk of malignancy in pulmonary nodules: A validation study of four prediction models. Lung Cancer 2015;89:27-30. doi: 10.1016/j.lungcan.2015.03.018.
27. Hammer MM, Nachiappan AC, Barbosa EJM Jr. Limited Utility of Pulmonary Nodule Risk Calculators for Managing Large Nodules. Curr Probl Diagn Radiol 2018;47:23-7. doi: 10.1067/j.cpradiol.2017.04.003.
28. Nair VS, Sundaram V, Desai M, et al. Accuracy of Models to Identify Lung Nodule Cancer Risk in the National Lung Screening Trial. Am J Respir Crit Care Med 2018;197:1220-3. doi: 10.1164/rccm.201708-1632LE.
29. Uthoff J, Koehn N, Larson J, et al. Post-imaging pulmonary nodule mathematical prediction models: are they clinically relevant? Eur Radiol 2019;29:5367-77. doi: 10.1007/s00330-019-06168-x.
30. Yang S, Lim SH, Hong JH, et al. Deep learning-based lung cancer risk assessment using chest computed tomography images without pulmonary nodules ≥8 mm. Transl Lung Cancer Res 2025;14:150-62. doi: 10.21037/tlcr-24-882.
31. Massion PP, Antic S, Ather S, et al. Assessing the Accuracy of a Deep Learning Method to Risk Stratify Indeterminate Pulmonary Nodules. Am J Respir Crit Care Med 2020;202:241-9. doi: 10.1164/rccm.201903-0505OC.
32. Mikhael PG, Wohlwend J, Yala A, et al. Sybil: A Validated Deep Learning Model to Predict Future Lung Cancer Risk From a Single Low-Dose Chest Computed Tomography. J Clin Oncol 2023;41:2191-200. doi: 10.1200/JCO.22.01345.
33. Uthoff J, Stephens MJ, Newell JD Jr, et al. Machine learning approach for distinguishing malignant and benign lung nodules utilizing standardized perinodular parenchymal features from CT. Med Phys 2019;46:3207-16. doi: 10.1002/mp.13592.
34. Bandi P, Star J, Ashad-Bishop K, et al. Lung Cancer Screening in the US, 2022. JAMA Intern Med 2024;184:882-91. doi: 10.1001/jamainternmed.2024.1655.
35. American Lung Association. State of Lung Cancer. Lung Cancer Key Findings. 2024. Available online: https://www.lung.org/research/state-of-lung-cancer/key-findings
36. Pinsky PF, Gierada DS, Black W, et al. Performance of Lung-RADS in the National Lung Screening Trial: a retrospective assessment. Ann Intern Med 2015;162:485-91. doi: 10.7326/M14-2086.
37. Kovalchik SA, Tammemagi M, Berg CD, et al. Targeting of low-dose CT screening according to the risk of lung-cancer death. N Engl J Med 2013;369:245-54. doi: 10.1056/NEJMoa1301851.
38. Mendoza DP, Petranovic M, Som A, et al. Lung-RADS Category 3 and 4 Nodules on Lung Cancer Screening in Clinical Practice. AJR Am J Roentgenol 2022;219:55-65. doi: 10.2214/AJR.21.27180.
39. Tanner NT, Porter A, Gould MK, et al. Physician Assessment of Pretest Probability of Malignancy and Adherence With Guidelines for Pulmonary Nodule Evaluation. Chest 2017;152:263-70. doi: 10.1016/j.chest.2017.01.018.
40. Cui X, Heuvelmans MA, Han D, et al. Comparison of Veterans Affairs, Mayo, Brock classification models and radiologist diagnosis for classifying the malignancy of pulmonary nodules in Chinese clinical population. Transl Lung Cancer Res 2019;8:605-13. doi: 10.21037/tlcr.2019.09.17.
41. Gupta S, Jacobson FL, Kong CY, et al. Performance of Lung Nodule Management Algorithms for Lung-RADS Category 4 Lesions. Acad Radiol 2021;28:1037-42. doi: 10.1016/j.acra.2020.04.041.
42. Zhang Y, Zhang Z, Wei L, et al. Construction and validation of nomograms combined with novel machine learning algorithms to predict early death of patients with metastatic colorectal cancer. Front Public Health 2022;10:1008137. doi: 10.3389/fpubh.2022.1008137.
43. Karabacak M, Schupper AJ, Carr MT, et al. Development and internal validation of machine learning models for personalized survival predictions in spinal cord glioma patients. Spine J 2024;24:1065-76. doi: 10.1016/j.spinee.2024.02.002.
44. Chen A, Wu E, Huang R, et al. Development of Lung Cancer Risk Prediction Machine Learning Models for Equitable Learning Health System: Retrospective Study. JMIR AI 2024;3:e56590. doi: 10.2196/56590.
45. Knoernschild K, Schroeder KE, Kitzmann J, et al. Ensemble artificial neural network lung nodule classification utilizing nodular and peri-nodular radiomics. In: Medical Imaging 2025: Computer-Aided Diagnosis. SPIE; 2025.
46. Uthoff J, Nagpal P, Sanchez R, et al. Differentiation of non-small cell lung cancer and histoplasmosis pulmonary nodules: insights from radiomics model performance compared with clinician observers. Transl Lung Cancer Res 2019;8:979-88. doi: 10.21037/tlcr.2019.12.19.
47. Heuvelmans MA, van Ooijen PMA, Ather S, et al. Lung cancer prediction by Deep Learning to identify benign lung nodules. Lung Cancer 2021;154:1-4. doi: 10.1016/j.lungcan.2021.01.027.
48. Chetan MR, Dowson N, Price NW, et al. Developing an understanding of artificial intelligence lung nodule risk prediction using insights from the Brock model. Eur Radiol 2022;32:5330-8. doi: 10.1007/s00330-022-08635-4.
49. Barta JA, Farjah F, Thomson CC, et al. The American Cancer Society National Lung Cancer Roundtable strategic plan: Optimizing strategies for lung nodule evaluation and management. Cancer 2024;130:4177-87. doi: 10.1002/cncr.35181.
50. Liu JA, Yang IY, Tsai EB. Artificial Intelligence (AI) for Lung Nodules, From the AJR Special Series on AI Applications. AJR Am J Roentgenol 2022;219:703-12. doi: 10.2214/AJR.22.27487.
51. Narayan AK, Chowdhry DN, Fintelmann FJ, et al. Racial and Ethnic Disparities in Lung Cancer Screening Eligibility. Radiology 2021;301:712-20. doi: 10.1148/radiol.2021204691.
52. Choi E, Ding VY, Luo SJ, et al. Risk Model-Based Lung Cancer Screening and Racial and Ethnic Disparities in the US. JAMA Oncol 2023;9:1640-8. doi: 10.1001/jamaoncol.2023.4447.



