Author manuscript; available in PMC: 2024 Jan 1.
Published in final edited form as: Am J Gastroenterol. 2022 Oct 13;118(1):157–167. doi: 10.14309/ajg.0000000000002050

Derivation and External Validation of Machine Learning-based Model for Detection of Pancreatic Cancer

Wansu Chen 1, Yichen Zhou 1, Fagen Xie 1, Rebecca K Butler 1, Christie Y Jeon 2, Tiffany Q Luong 1, Botao Zhou 1, Yu-Chen Lin 2, Eva Lustigova 1, Joseph R Pisegna 3, Sungjin Kim 2, Bechien U Wu 4
PMCID: PMC9822857  NIHMSID: NIHMS1840818  PMID: 36227806

Abstract

OBJECTIVES:

There is currently no widely accepted approach to screening for pancreatic cancer (PC). We aimed to develop and validate a risk prediction model for pancreatic ductal adenocarcinoma (PDAC), the most common form of PC, across two health systems using electronic health records (EHR).

METHODS:

This retrospective cohort study consisted of patients 50–84 years of age having at least one clinic-based visit over a 10-year study period at Kaiser Permanente Southern California (KPSC, model training, internal validation) and the Veterans Affairs (VA, external testing). ‘Random survival forests’ models were built to identify the most relevant predictors from >500 variables and to predict risk of PDAC within 18 months of cohort entry.

RESULTS:

The KPSC cohort consisted of 1.8 million patients (mean age 61.6) with 1,792 PDAC cases. The 18-month incidence rate of PDAC was 0.77 (95% CI 0.73–0.80)/1,000 person-years. The final main model contained age, abdominal pain, weight change, HbA1c and ALT change (c-index: mean=0.77, SD=0.02; calibration test: p-value 0.4, SD 0.3). The final early detection model comprised the same features as those selected by the main model except for abdominal pain (c-index: 0.77 and SD 0.4; calibration test: p-value 0.3 and SD 0.3). The VA testing cohort consisted of 2.7 million patients (mean age 66.1) with an 18-month incidence rate of 1.27 (1.23–1.30)/1,000 person-years. The recalibrated main and early detection models based on VA testing datasets achieved mean c-index of 0.71 (SD 0.002) and 0.68 (SD 0.003), respectively.

CONCLUSIONS:

Using widely available parameters in EHR, we developed and externally validated parsimonious machine learning-based models for detection of pancreatic cancer. These models may be suitable for real-time clinical application.

Keywords: risk prediction, pancreatic cancer, machine learning, general population, glycated hemoglobin, alanine transaminase, weight loss

INTRODUCTION

Pancreatic cancer is the third leading cause of cancer deaths in the United States, with 49,830 estimated deaths in 2022.1 Pancreatic cancer is often diagnosed at an advanced stage and as such has very poor survival, with the overall 5-year survival reaching only 11.5%.1 Due to the low incidence of pancreatic cancer in the general population (13.2 per 100,000 person-years),1 widespread population-based screening is not currently recommended by the United States Preventive Services Task Force.4 Therefore, a targeted approach to screening among higher-risk populations represents a key opportunity to alter the natural history of this disease.

Although several studies have shown evidence of improved outcomes for high-risk individuals undergoing screening based on genetic or familial predisposition, these patients constitute a small proportion of the patients who develop pancreatic cancer. An alternative approach is needed to identify patients in the broader population at risk for pancreatic cancer for whom targeted screening may also be beneficial.

The emergence of comprehensive EHR and maturation of machine learning offer an unprecedented opportunity to enhance efforts in early detection of pancreatic cancer. Specifically, the coupling of machine learning with robust EHR allows a comprehensive, unbiased, data-driven approach to selection of candidate variables, facilitating development of prediction models suitable for clinical application. To date, efforts to develop clinical prediction models in pancreatic cancer have focused on specific populations such as those with new-onset diabetes5–7 or within the confines of a case-control study.8,9 Efforts to identify high-risk patients in the general population are sparse.10 There is a critical need for novel risk stratification tools which are both sensitive and specific for rapid identification of patients at increased risk of developing pancreatic cancer.

The aim of the present study was to develop and validate a clinical prediction model for risk of PDAC across two large health systems. Specifically, we sought to apply machine learning combined with a comprehensive approach to data in EHR to predict the risk of sporadic PDAC in a broad population-based setting.

METHODS

Source of data

We developed risk prediction models using a retrospective cohort study utilizing health plan enrollees of Kaiser Permanente Southern California (KPSC), a large integrated healthcare system that provides comprehensive healthcare services for >4.7 million enrollees from diverse racial/ethnic backgrounds across 15 medical centers and 235 medical offices. Model training and internal validation were conducted based on EHR data. The demographics and socioeconomic status of KPSC health plan enrollees are comparable to those of residents in the Southern California region.11 The internally validated models were externally tested using EHR of Veterans Affairs (VA),12 the largest integrated health care system in the US providing care to >9 million Veterans at its 1,298 health care facilities including 171 VA Medical Centers and 1,113 outpatient sites. The study protocol was approved by the Institutional Review Board of KPSC and VA.

Study participants

Model training and internal validation:

Patients 50–84 years of age who had ≥1 clinic-based visit (index visit) within a KPSC facility in 2008–2017 were identified. Patients who had a history of pancreatic cancer or were not continuously enrolled in the KPSC health plan in the prior 12 months (gaps of 45 days or less were allowed) were excluded. The requirement of continuous enrollment allowed adequate data to define study variables. For patients with multiple qualifying index visits, we selected one randomly as the index visit. The corresponding visit date was referred to as the index date (t0). Follow-up started on t0 and ended with the earliest of the following events: disenrollment from the health plan, end of the study (December 31, 2018), reaching the maximum length of follow-up (18 months), non-PDAC related death, or PDAC diagnosis or death (outcome). A minimum of 30 days of follow-up was required.

Model external testing:

Veterans 50–84 years of age who had ≥1 outpatient visit (index visit) within a VA facility in 2008–2017 and another clinic-based visit within the 12 months prior to the index date were identified. Patients who had a history of pancreatic cancer were excluded. The same follow-up rules mentioned above were applied to the VA cohort except for “disenrollment from the health plan”.

Early detection cohort:

To facilitate earlier detection of PDAC, we also established a cohort which included patients identified in the main cohort who had ≥90 days of cancer-free follow up for both KPSC and VA patients. The same analyses described below applied to the main cohort and the early detection cohort.

Outcome

The study outcome was PDAC diagnosis or death with pancreatic cancer in the 18 months after the index date. For the KPSC cohort, PDAC was identified from the Cancer Registry by using the Tenth Revision of International Classification of Diseases, Clinical Modification (ICD-10-CM) code C25.x and histology codes (eTable 1). Pancreatic cancer deaths were derived from the linkage with the California State Death Master files and identified using ICD-10-CM codes C25.x.13 For the VA cohort, cases of PDAC were similarly identified through an internal VA Central Cancer Registry, and PDAC deaths identified through the VA Mortality Data Repository, which integrates vital status data from the National Death Index (NDI), VA, and DoD administrative records.

Predictors

A complete list of extracted and derived features for the KPSC cohort is shown in eTable 2. A total of 500+ features, including patient demographics and lifestyle variables (e.g., smoking status), medical conditions (coded by ICD-9 or ICD-10 codes), lab test values, medication dispensing, medical procedures (coded by CPT, ICD-9/ICD-10 or KPSC internal procedure codes), symptoms (e.g., abdominal pain), health care utilization, and other features (e.g., year of index visit), were added into the feature candidate pool. Except for demographic variables, values within each time interval (0–6 months, 7–12 months, 1–2 years and >2 years) were generated. Definitions of the derived variables are described in eTable 3. Since the VA dataset was solely used for testing purposes, only a limited number of features were extracted (Tables 1a and 1b).

Table 1a.

Characteristics of study subjects at baseline, n (%) unless otherwise stated.

Demographics and Lifestyle Characteristics Kaiser Permanente Southern California (KPSC)
N=1,801,931
Veterans Affairs (VA)
N=2,690,895

Age, mean (SD) 61.6 (9.4) 66.1 (9.1)

Female 960266 (53.3) 153630 (5.7)

Race/Ethnicity
 Non-Hispanic White 815773 (45.3) 1828095 (67.9)
 Non-Hispanic Black 171424 (9.5) 451523 (16.8)
 Hispanic 536079 (29.8) 148690 (5.5)
 Asian and Pacific Islander 192179 (10.7) 38536 (1.4)
 Multiple/Other/Unknown 86476 (4.8) 224051 (8.3)

Medical Insurance (one or more)
 Commercial 1094182 (60.7)
 Medicare 572452 (31.8)
 Medi-CAL/Other State Programs 64150 (3.6)
 Private Pay 483042 (26.8)

Years Since First enrollment, mean (SD) 18.9 (13.5)

Family History of Pancreatic Cancer 25386 (1.4)

Tobacco Use
 Ever 700429 (38.9) 1914180 (71.1)
 Never 1101502 (61.1) 776715 (28.9)

Weight Defined by BMI (kg/m2)
 Underweight (<18.5) 19095 (1.1)
 Normal Weight (18.5–24.9) 429010 (23.8)
 Overweight (25–29.9) 640644 (35.6)
 Obese (30+) 642905 (35.7)
 Unknown 70277 (3.9)

Lab Tests in Prior 6 months N Median (IQR) N Median (IQR)

ALT, IU/L 920993 22.0 (17.0, 30.0) 908520 24 (18, 34)

HbA1c, % 744601 6.2 (5.8, 7.1) 1429886 6.2 (5.7, 7.2)

ALP, IU/L 309460 70.0 (57.0, 87.0)

Total Bilirubin, mg/dL 303803 0.7 (0.5, 0.9)

HGB for Males, g/dL 437416 14.5 (13.4, 15.4)

HGB for Females, g/dL 515110 13.2 (12.4, 14.0)

HCT, L/L 952548 41.0 (38.0, 43.7)

RBC, million/mm3 934454 4.5 (4.2, 4.9)

Sodium, mEq/L 952253 139.0 (137.0, 141.0)

Total Cholesterol, mg/dL 964899 184.0 (156.0, 215.0)

Platelets, count/L 934239 232.0 (194.0, 277.0)

Medical Conditions 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior

Gallstone Disorders 20085 (1.1) 13184 (0.7) 20183 (1.1) 104529 (5.8)

Acute Pancreatitis 3256 (0.2) 2065 (0.1) 3199 (0.2) 17975 (1.0) 9510 (0.4) 5646 (0.2) 8859 (0.3) 18209 (0.7)

Chronic Pancreatitis 1550 (0.1) 1182 (0.1) 1475 (0.1) 3738 (0.2) 7749 (0.3) 5125 (0.2) 6707 (0.2) 9719 (0.4)

Benign Pancreatic Disease 2019 (0.1) 1398 (0.1) 1711 (0.1) 2780 (0.2)

Biliary Tract Disease 26199 (1.5) 20974 (1.2) 25512 (1.4) 38538 (2.1)

Depression 189508 (10.5) 150514 (8.4) 190868 (10.6) 317686 (17.6)

Diabetes 376022 (20.9) 325789 (18.1) 335965 (18.6) 327408 (18.2) 842311 (31.3) 775964 (28.8) 831566 (30.9) 735251 (27.3)

Medical Procedures 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior

Abdominal/Chest CT 121427 (6.7) 82841 (4.6) 119308 (6.6) 304085 (16.9)

Abdominal/Chest MRI 1861 (0.1) 1366 (0.1) 2226 (0.1) 7057 (0.4)

Abdominal/Chest Ultrasound 70721 (3.9) 55004 (3.1) 93179 (5.2) 333051 (18.5)

Any Abdominal Surgery 11447 (0.6) 7360 (0.4) 11559 (0.6) 77283 (4.3)

Surgical Procedures on esophagus 18492 (1.0) 13226 (0.7) 22164 (1.2) 125701 (7.0)

Upper GI Endoscopy 37108 (2.1) 27232 (1.5) 43386 (2.4) 160899 (8.9)

Colonoscopy 95156 (5.3) 74199 (4.1) 125372 (7.0) 417049 (23.1)

Medications 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior

Pancreatic Enzyme 1246 (0.07) 1105 (0.06) 1314 (0.07) 2437 (0.1)

Antidiabetic Medications – Insulin 88277 (4.9) 79163 (4.4) 78660 (4.4) 78323 (4.4)

Antidiabetic Medications – Non-Insulin 158667 (8.8) 148703 (8.3) 155764 (8.6) 182396 (10.1)

GI-Related Signs/Symptoms 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior 0–6 m Prior 7–12 m Prior 13–24 m Prior 24+ m Prior

Abdominal Pain 133380 (7.4) 92422 (5.1) 142470 (7.9) 483652 (26.8) 107586 (4.0) 68717 (2.6) 108170 (4.0) 237763 (8.8)

Chest Pain 114152 (6.3) 79399 (4.4) 128549 (7.1) 483953 (26.7)

Constipation 60032 (3.3) 41935 (2.3) 64439 (3.6) 193331 (10.7)

Diarrhea 44025 (2.4) 32187 (1.8) 52487 (2.9) 202067 (11.2)

Itching 46314 (2.6) 33730 (1.9) 57062 (3.2) 207679 (11.5)

Malaise or Fatigue 104298 (5.8) 71136 (3.9) 111646 (6.2) 362901 (20.1)

Melena 11089 (0.6) 7433 (0.4) 13235 (0.7) 66321 (3.7)

Nausea or Vomiting 51787 (2.9) 36348 (2.0) 57767 (3.2) 215982 (12.0)

Abbreviations: ALP, alkaline phosphatase; ALT, alanine transaminase; BMI, body mass index; CT, computerized tomography; GI, gastrointestinal; HCT, hematocrit; HbA1c: glycated hemoglobin; HGB, hemoglobin; IQR, interquartile range; MRI, magnetic resonance imaging; RBC, red blood cell; SD, standard deviation.

Table 1b.

Changes of patient characteristics in one year, median (IQR).

Kaiser Permanente Southern California (KPSC) Veterans Affairs (VA)
N Median (IQR) N Median (IQR)
Weight Change in lb.a 1488802 −0.2 (−4.8, 3.5) 2597505 −0.3 (−5.8, 4.8)
ALT, IU/L 479881 0.0 (−5.0, 4.0) 650675 0 (−6, 4)
HbA1c, % 409975 0.0 (−0.3, 0.3) 898378 0.0 (−0.3, 0.3)
ALP, IU/L 96562 0.0 (−9.0, 10.0)
Total Bilirubin, mg/dL 94098 0.0 (−0.2, 0.2)
HGB for Males, g/dL 202538 −0.1 (−0.8, 0.5)
HGB for Females, g/dL 264229 −0.1 (−0.6, 0.5)
HCT, L/L 466785 −0.3 (−2.1, 1.5)
RBC, million/mm3 449246 0.0 (−0.2, 0.2)
Sodium, mEq/L 496334 0.0 (−2.0, 2.0)
Total Cholesterol, mg/dL 506608 −4.0 (−22.0, 13.0)
Platelets, count/L 449030 −3.0 (−25.0, 19.0)
a 1 lb = 0.45 kg

Sample size

The number and size of the KPSC training, KPSC internal validation and VA external testing datasets are shown in eTable 4.

Missing data

Missing values were imputed14 if the frequency of missingness was <60%. We used the predictive mean matching method15 with k=5. Laboratory measures with ≥60% missingness or change/change rate measures with ≥80% missingness were not included in the model development process. Nine imputed datasets were generated at KPSC and 10 were created at VA (eFigure 1).
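The idea behind predictive mean matching can be illustrated with a minimal sketch. It is a simplified, single-predictor version written for this illustration, not the authors' implementation: it omits the random draw of regression parameters that full multiple-imputation procedures use, and the variable names and toy values (ages, hba1c) are hypothetical. For each missing value, the k=5 observed cases with the closest predicted values serve as donors, and one donor's observed value is copied.

```python
import random
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b


def pmm_impute(xs, ys, k=5, seed=0):
    """Fill None entries of ys by (simplified) predictive mean matching on xs."""
    rng = random.Random(seed)
    observed = [(x, y) for x, y in zip(xs, ys) if y is not None]
    a, b = fit_line([x for x, _ in observed], [y for _, y in observed])
    # Each donor is stored as (predicted value, actually observed value).
    donors = [(a + b * x, y) for x, y in observed]
    imputed = []
    for x, y in zip(xs, ys):
        if y is not None:
            imputed.append(y)  # observed values are left untouched
            continue
        target = a + b * x
        nearest = sorted(donors, key=lambda d: abs(d[0] - target))[:k]
        imputed.append(rng.choice(nearest)[1])  # copy one donor's observed value
    return imputed


# Hypothetical toy data: age as predictor, HbA1c (%) with missing entries.
ages = [52, 55, 60, 63, 67, 70, 74, 78, 81, 84]
hba1c = [5.6, 5.8, None, 6.1, 6.3, None, 6.6, 6.9, 7.0, None]
filled = pmm_impute(ages, hba1c, k=5, seed=1)
```

Because imputed values are always copied from observed cases, they stay within the range of plausible laboratory values. Repeating the procedure with different seeds gives multiple imputed datasets.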

Statistical Analysis

Dataset preparation:

First, we split each of the nine imputation datasets at KPSC into five subsets, each containing 20% of the original imputed dataset (eFigure 1). A total of 45 subsets were prepared for the training and internal validation process described below (eTable 5).

Modeling approach:

To overcome the limitations of regression-based models that are traditionally used for analysis of time-to-event data, we applied ‘random survival forests’ (RSF), a nonparametric machine learning method,16–18 to pre-select features and train/validate models. First, we iteratively preselected features based on the average minimum depth (eTable 6) and used those features to develop and validate risk prediction models based on 5-fold cross validation.19 Age was forced into the model. Preselected features were then added incrementally, each step adding the feature that yielded the maximum improvement in c-index. This process continued until the increase in c-index was <0.005. Of the 45 models derived from the 45 training/internal validation datasets (9 imputed datasets × 5-fold cross validation), the one that appeared most often was selected as the final model.
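The incremental selection and the choice of the most frequent model can be sketched as follows. This is an illustration under stated assumptions rather than the study code: the score callable stands in for fitting an RSF and returning its cross-validated c-index, and toy_score with its TOY_GAIN values is a hypothetical additive stand-in used only to make the example runnable.

```python
from collections import Counter


def forward_select(candidates, score, forced=("age",), min_gain=0.005):
    """Greedy forward selection: repeatedly add the feature with the largest
    c-index gain and stop once the best available gain is below min_gain."""
    selected = list(forced)
    best = score(selected)
    remaining = [f for f in candidates if f not in selected]
    while remaining:
        gains = {f: score(selected + [f]) - best for f in remaining}
        top = max(gains, key=gains.get)
        if gains[top] < min_gain:
            break
        selected.append(top)
        best += gains[top]
        remaining.remove(top)
    return selected


def most_frequent_model(models):
    """Return the feature set that occurs most often across fitted models."""
    counts = Counter(frozenset(m) for m in models)
    return set(counts.most_common(1)[0][0])


# Hypothetical additive scores; a real score would train and evaluate an RSF.
TOY_GAIN = {"age": 0.65, "weight_change": 0.05, "hba1c": 0.04,
            "abdominal_pain": 0.02, "alt_change": 0.01, "sodium": 0.001}


def toy_score(features):
    return sum(TOY_GAIN[f] for f in features)


chosen = forward_select(list(TOY_GAIN), toy_score)
final = most_frequent_model([chosen, chosen, ["age", "hba1c"]])
```

In the study design, one such selection would be run on each of the 45 training datasets and the majority vote taken across the 45 resulting feature sets.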

Internal validation and external testing:

Algorithms of the final models were applied to the corresponding KPSC validation datasets. By design, the KPSC validation datasets did not include any observations of the KPSC training datasets from which the final model was developed. The final model was first directly applied to VA imputed testing datasets, and subsequently recalibrated to achieve better performance.

Performance measures:

The discriminative power for each of the final models (main model and early detection model) was evaluated by c-index, a concordance measure, averaged across all the relevant validation or testing datasets for cohort members. Calibration was assessed by calibration plots with five risk groups (<50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles).20 Greenwood-Nam-D’Agostino (GND) calibration test was also performed to assess goodness-of-fit.
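As an illustration of the concordance measure, the sketch below computes a c-index for right-censored data, assuming Harrell's definition; the exact variant computed by the RSF software used in the study may differ in detail. The follow-up times, event indicators and predicted risks are hypothetical.

```python
def c_index(times, events, risks):
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when subject i had the event and a shorter
    follow-up time than subject j. The pair is concordant when subject i also
    has the higher predicted risk; ties in predicted risk count as one half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Hypothetical example: days of follow-up, event flag (1 = PDAC), predicted risk.
times = [100, 200, 300, 400, 547]
events = [1, 0, 1, 0, 0]
risks = [0.9, 0.4, 0.1, 0.2, 0.05]
c = c_index(times, events, risks)  # 5 of 6 comparable pairs are concordant
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering of patients by risk.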

We estimated sensitivity, specificity, positive predictive value (PPV), and relative increase in risk in comparison to that of the entire cohort at various levels of risk thresholds. For this analysis we restricted the patients to those who had complete follow-up or developed PDAC within 18 months. The results were averaged across the validation datasets for each model. We also estimated the theoretical number (and the 95% confidence interval) needed to be evaluated to identify a single case of PDAC for the main and early detection models based on KPSC internal validation datasets and VA external testing datasets, assuming 100% ability to identify pancreatic cancer through diagnostic testing.
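A minimal sketch of these threshold-based measures is given below. It assumes that patients are flagged by ranking predicted risk and taking a top fraction, and that the number needed to evaluate equals 1/PPV under perfect diagnostic testing; the function name and toy data are hypothetical, and the confidence intervals reported in the study are not reproduced here.

```python
def threshold_metrics(risks, outcomes, top_fraction):
    """Flag the top `top_fraction` of predicted risks and score that flag
    against the observed outcomes (1 = developed PDAC, 0 = did not)."""
    n = len(risks)
    n_flagged = max(1, round(n * top_fraction))
    order = sorted(range(n), key=lambda i: risks[i], reverse=True)
    flagged = set(order[:n_flagged])
    tp = sum(1 for i in flagged if outcomes[i])
    fp = n_flagged - tp
    fn = sum(outcomes) - tp
    tn = n - n_flagged - fn
    ppv = tp / n_flagged
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        # With perfect diagnostic testing, 1/PPV flagged patients are
        # evaluated per case found.
        "number_needed_to_evaluate": 1 / ppv if ppv > 0 else float("inf"),
    }


# Hypothetical toy cohort of 10 patients, already sorted by predicted risk.
risks = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
metrics = threshold_metrics(risks, outcomes, top_fraction=0.2)
```

In this toy cohort the top 20% contains one of the two cases, giving a sensitivity of 0.5, a specificity of 0.875, a PPV of 0.5 and two patients evaluated per case detected.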

Sensitivity and post-hoc analyses:

To understand the impact of missing data imputation on model performance, we conducted a sensitivity analysis in which only observations with known predictors were included in the validation process based on one of the internal validation datasets (DS3D, eFigure 1). C-index and calibration plot are reported. To understand whether traditional risk factors of pancreatic cancer can further improve the performance, we performed a post-hoc analysis by forcing family history of pancreatic cancer, smoking status, and BMI into the main model with the final predictors. The process was repeated on all the KPSC training and validation datasets used to generate the final model. The average c-index was reported for the model with and without family history of pancreatic cancer, smoking status, and BMI.

Exploratory analysis:

To understand whether our models were weighted towards detection of advanced cancer, we examined the average predicted risks of PDAC in 18 months in patients who developed PDAC, stratified by cancer stage at the time of diagnosis.

All statistical analyses were performed using SAS (Version 9.4 for Unix; SAS Institute, Cary, NC) or R Version 3.6.0 (R Foundation, Vienna, Austria).

RESULTS

Characteristics of the study participants

A total of 1.8 million KPSC patients were eligible (Figure 1), of whom 53.3% were women, 45.3% were white, 29.8% were Hispanic, 9.5% were African American and 10.7% were Asian and Pacific Islanders (Table 1a). The majority (60.7%) used commercial insurance, and slightly under one-third (31.8%) were on Medicare. On average, the KPSC patients were 61.6 (SD 9.4) years of age, with an average membership length of 18.9 (SD 13.5) years. Overall, 35.7% of the patients were obese and an additional 35.6% were overweight. In the 6 months prior to the index date, 20.9% had a diagnosis of diabetes, 10.5% depression, 1.5% biliary tract disease, 1.1% gallstone disorders, 0.2% acute pancreatitis and 0.1% chronic pancreatitis (Table 1a).

Figure 1 –

Consort Diagram (KPSC and VA main cohorts)

The 2.7 million eligible veterans were predominantly men (94.3%); 67.9% were white and 16.8% were African American, and they were older (mean 66.1 years of age) than the KPSC cohort (Table 1a). Smoking, diabetes, and acute and chronic pancreatitis were more prevalent in the VA cohort than in the KPSC cohort. ALT and HbA1c at baseline appeared comparable between the two cohorts (Table 1a). The changes in weight and laboratory parameters during the year prior to t0 can be found in Table 1b.

Incidence of PDAC

Table 2 displays the follow-up time in years, number, incidence rate (IR) of PDAC, and time to PDAC for all patients and for subgroups of patients defined by age, sex, race/ethnicity, weight change, abdominal pain and parameters related to ALT and HbA1c. A total of 1,792 KPSC patients developed PDAC within 18 months of follow-up (IR 0.77, 95% CI 0.73–0.80, per 1,000 person-years [PY]) (Table 2). A total of 4,582 patients in the VA cohort developed PDAC (IR 1.27, 95% CI 1.23–1.30). In the VA cohort, patients with abdominal pain in the 6 months prior to t0 had an IR of 4.35 (4.01–4.70). The observed time to PDAC was longer for the VA cohort (median 233 days, IQR 116–370 days) compared to that of the KPSC cohort (median 205 days, IQR 91–358 days). The distributions of cancer stage (I–IV) were comparable between the KPSC (stages I/II 31.7%, stages III/IV 53.6%, unknown/missing 14.7%) and VA (stages I/II 19.1%, stages III/IV 32.6%, unknown/missing 48.4%) cohorts (eTable 7), although the VA cohort included a higher frequency of records with unknown/missing cancer stage.

Table 2.

Total follow-up (f/u) time, number, and incidence rate of PDAC per 1,000 person-years (PY) and 95% CI.

Kaiser Permanente Southern California (KPSC) Veterans Affairs (VA)

Total f/u Time (years) No. of PDAC Incidence Rate of PDAC/ 1000 PY (95% CI) Days to PDAC (median, IQR) Total f/u Time (years) No. of PDAC Incidence Rate of PDAC/ 1000 PY (95% CI) Days to PDAC (median, IQR)

All 2331767 1792 0.77 (0.73, 0.80) 205 (91, 358) 3614215 4582 1.27 (1.23, 1.30) 233 (116, 370)
Age Group in years
 50–59 1133128 350 0.31 (0.28, 0.34) 198 (84, 357) 989814 679 0.69 (0.64, 0.74) 212 (98, 352)
 60–69 711410 624 0.88 (0.81, 0.95) 197 (91, 355) 1630532 2026 1.24 (1.19, 1.30) 239 (122, 374)
 70–79 369643 604 1.63 (1.51, 1.77) 219 (91, 362) 677576 1304 1.92 (1.82, 2.03) 237 (113, 369)
 80–84 117585 214 1.82 (1.59, 2.08) 213 (112, 353) 316293 573 1.81 (1.67, 1.96) 235 (116, 367)
Sex
 Female 1246235 864 0.69 (0.65, 0.74) 219 (97, 370) 213181 101 0.47 (0.39, 0.57) 248 (104, 398)
 Male 1085520 928 0.85 (0.80, 0.91) 220 (88, 350) 3401034 4481 1.32 (1.28, 1.36) 233 (116, 369)
Race/ Ethnicity
 Non-Hispanic White 1065194 934 0.88 (0.82, 0.93) 206 (90, 362) 2456059 2986 1.22 (1.17, 1.26) 238 (119, 370)
 Non-Hispanic Black 226546 262 1.16 (1.02, 1.30) 208 (100, 356) 612641 692 1.13 (1.05, 1.22) 220.5 (100.5, 377)
 Asian/Pacific Islander 255168 147 0.58 (0.49, 0.67) 216 (114, 367) 52570 48 0.91 (0.68, 1.20) 192 (118.5, 373.5)
 Hispanic 692443 410 0.59 (0.53, 0.65) 201 (87, 345) 202263 211 1.04 (0.91, 1.19) 239 (121, 369)
 Unknowna 290682 645 2.22 (2.05, 2.40) 220 (115, 357)
ALT change in 1 year in IU/L
 ≤ −5 156757 174 1.11 (0.95, 1.28) 216 (92, 360) 248360 359 1.45 (1.30, 1.60) 247 (123, 367)
 (−5, 5] 340273 330 0.97 (0.87, 1.08) 232 (107, 392) 434050 523 1.20 (1.10, 1.31) 246.5 (132.5, 399.5)
 > 5 123457 165 1.34 (1.14, 1.55) 149 (63, 303) 187824 322 1.71 (1.53, 1.91) 184 (86, 302)
 Unknown 1711281 1123 0.66 (0.61, 0.70) 203 (94, 355) 2743981 3378 1.23 (1.19, 1.27) 234 (116, 373)
Rate of ALT Change in 1 Year
 ≤ −0.01 180995 205 1.13 (0.99, 1.30) 210 (91, 358) 281222 402 1.43 (1.29, 1.57) 247 (126, 373)
 (−0.01, 0.01] 272305 255 0.94 (0.83, 1.06) 234 (114, 384) 343020 420 1.22 (1.11, 1.35) 244.5 (132.5, 391.5)
 > 0.01 167186 209 1.25 (1.09, 1.43) 166 (65, 330) 245991 382 1.55 (1.40, 1.71) 191 (86, 310)
 Unknown 1711281 1123 0.66 (0.62, 0.70) 203 (94, 355) 2743981 3378 1.23 (1.19, 1.27) 234 (116, 373)
HbA1c Value Prior to Index date
 <6.5% 586772 326 0.55 (0.50, 0.62) 220 (90, 369) 1132107 1094 0.97 (0.91, 1.02) 241 (118, 373)
 6.5–6.9% 124249 139 1.12 (0.94, 1.32) 190 (74, 360) 228430 366 1.60 (1.44, 1.77) 234 (118, 367)
 7.0–7.4% 80113 122 1.52 (1.27, 1.81) 159 (88, 310) 157876 289 1.83 (1.63, 2.05) 221 (112, 366)
 ≥7.5% 179700 305 1.70 (1.51, 1.89) 200 (90, 341) 399057 969 2.43 (2.28, 2.58) 216 (102, 354)
 Unknown 1360933 900 0.66 (0.62, 0.71) 212 (96, 365) 1696745 1864 1.10 (1.05, 1.15) 239 (121, 378)
HbA1c Change in 1 Year in %
 < −0.3 118111 143 1.21 (1.02, 1.42) 185 (88, 343) 287080 479 1.67 (1.52, 1.82) 227 (115.5, 371)
 [−0.3, 0.3] 298587 204 0.68 (0.59, 0.78) 238 (100, 372) 625990 637 1.02 (0.94, 1.10) 251 (130, 388)
 > 0.3 120115 211 1.76 (1.53, 2.01) 190 (90, 329) 291676 660 2.26 (2.10, 2.44) 220 (103, 357)
 Unknown 1794953 1234 0.69 (0.65, 0.73) 205 (90, 358) 2409469 2806 1.16 (1.12, 1.21) 233 (117, 370)
Rate of HbA1c Change in 1 Year
 < −0.0008 126922 146 1.15 (0.97, 1.35) 184 (88, 343) 322572 519 1.61 (1.47, 1.75) 227 (115, 371)
 [−0.0008, 0.0008] 280155 181 0.64 (0.56, 0.75) 224 (98, 371) 554791 535 0.96 (0.89, 1.05) 249 (129, 378)
 > 0.0008 129736 231 1.78 (1.56, 2.02) 200 (92, 343) 327480 722 2.21 (2.05, 2.37) 225.5 (105, 359)
 Unknown 1794953 1234 0.69 (0.65, 0.73) 205 (90, 358) 2409469 2806 1.16 (1.12, 1.21) 233 (117, 370)
Rate of Weight Change in 1 Year
 < −0.02 503012 683 1.35 (1.26, 1.46) 160 (68, 322) 869902 2049 2.36 (2.26, 2.46) 185 (86, 313)
 [−0.02, 0.02] 1033671 703 0.68 (0.63, 0.73) 238 (124, 387) 1437632 1403 0.98 (0.93, 1.03) 286 (156, 409)
 > 0.02 393142 209 0.53 (0.46, 0.61) 238 (114, 358) 750864 621 0.83 (0.76, 0.89) 281 (145, 408)
 Unknown 401941 197 0.49 (0.43, 0.56) 202 (90, 364) 556817 509 0.91 (0.84, 0.996) 244 (133, 387)
Abdominal Pain in Prior 6 mosb
 Yes 168579 322 1.91 (1.71, 2.13) 116 (54, 256) 139459 606 4.35 (4.01, 4.70) 136 (67, 252)
 No 134525 139 1.03 (0.87, 1.22) 199 (110, 349) 3474756 3976 1.14 (1.11, 1.18) 250.5 (128, 382.5)
 Unknown 2028662 1331 0.66 (0.62, 0.69) 230 (110, 377)

Abbreviations: ALT, alanine transaminase; CI, confidence interval; F/U, follow-up; HbA1c, glycated hemoglobin A1c; IQR, interquartile range; PDAC, pancreatic ductal adenocarcinoma; PY, person-years.

a Not estimated for KPSC dataset due to the small number of events (39) in this group.

b For VA data, “Unknown” was interpreted as “No” due to the inability to distinguish between the two.
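The overall KPSC incidence rate in Table 2 can be reproduced from the reported counts with a short calculation. The confidence interval method is not stated in the text; the sketch below assumes a log-scale normal approximation for a Poisson count, which happens to match the reported KPSC interval after rounding. The function name is hypothetical.

```python
import math


def incidence_rate_per_1000(cases, person_years, z=1.96):
    """Incidence rate per 1,000 person-years with an approximate 95% CI.

    Assumes the case count is Poisson and builds the interval on the log
    scale, where the standard error of log(rate) is 1/sqrt(cases).
    """
    rate = cases / person_years * 1000
    half_width = z / math.sqrt(cases)
    return rate, rate * math.exp(-half_width), rate * math.exp(half_width)


# KPSC "All" row of Table 2: 1,792 PDAC cases over 2,331,767 person-years.
rate, low, high = incidence_rate_per_1000(1792, 2331767)
print(f"{rate:.2f} ({low:.2f}, {high:.2f})")  # prints 0.77 (0.73, 0.80)
```

Other rows may differ from this approximation in the last reported digit, since the interval method used for the table is an assumption here.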

Model Development, Internal Validation and External Testing

For the main cohort, the preselection process identified 29 potential predictors (eTable 6). The final main model (M1) contained age, abdominal pain, weight change, HbA1c and alanine transaminase (ALT) change (c-index: mean 0.77 and SD 0.02; calibration test: χ2 mean 5.9 and SD 4.1, p-value mean 0.4 and SD 0.3). When M1 was directly applied to VA testing datasets, the mean c-index was 0.69 (SD 0.003) (data not shown); however, after the algorithm was recalibrated based on VA’s datasets, the mean c-index based on 10 testing datasets was 0.71 (SD 0.002).

For the early detection cohort, the preselection process identified 32 potential predictors (eTable 6). The final early detection model (E1) comprised the same features as those selected by M1 except that abdominal pain was not chosen (c-index: mean 0.77 and SD 0.4; calibration test: χ2 mean 7.3 and SD 4.9, p-value mean 0.3 and SD 0.3). The recalibrated E1 model based on the VA testing datasets achieved a mean c-index of 0.68 (SD 0.003). The hyperparameters used and the features selected by ≥5 out of 45 models for each cohort can be found in eTable 8 and eTable 9, respectively.

Figure 2 displays the calibration plots for both models at KPSC and VA. All models appeared to fit well for the four lower risk groups (i.e., risk <95th percentile). However, for the highest risk group (risk ≥95th percentile), M1 properly estimated the risks at KPSC, while E1 slightly overestimated the risks for KPSC patients, and both M1 and E1 slightly underestimated the risks for VA patients.

Figure 2 –

Calibration plots of final models. x-axis: predicted; y-axis: observed. The five clusters represent the five risk groups defined by the ranges of predicted risks: <50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles. Within each cluster, there are multiple dots representing the pairs of predicted and observed risks, calculated based on the corresponding internal validation or external testing datasets.

Sensitivity, specificity, PPV and fold increase in risk are reported in eTable 10 for the final main model (M1) and the final early detection model (E1) for both KPSC and VA patients. The top 2.5% of the KPSC internal validation sample (n=6,249) based on M1 experienced a 1.1% risk of PDAC over 18 months, which was 7- to 8-fold higher than the baseline risk of PDAC in the KPSC cohort. A total of 349 PDAC cases (19.5% of the 1,792 PDAC cases occurring within 18 months) were identified in this top 2.5% model-predicted high-risk group. Patients within the top 20% of predicted risk of PDAC (n=50,952) experienced a 0.4% risk of PDAC over 18 months. This risk threshold identified 1,014 (or 56.6% of) PDAC cases occurring within 18 months with a specificity of 79.6%. While sensitivities and fold increases in PDAC incidence rate were lower in the VA population, PPVs were higher for VA testing data compared to those of KPSC (eTable 10).

Compared to the models developed based on the main cohort (M1), the models developed based on early detection cohort (E1) had slightly reduced sensitivity, PPV and fold increase in risk.

The number of patients needed to undergo evaluation to detect a single case of PDAC is shown in eTable 11. In theory, if patients in the top 2.5% of model-predicted risk based on M1 were screened, 90 patients (95% CI 73–117) at KPSC and 61 patients (95% CI 57–65) at VA would need to be evaluated to detect a single case of PDAC (eTable 11). If model E1 were used to define the screening population at the same 2.5% threshold, more patients would need to be evaluated (164 [95% CI 125–241] at KPSC and 91 [95% CI 84–99] at VA).

Sensitivity and post-hoc analyses:

When we applied model M1 to the subset of one of the KPSC internal validation datasets with complete values of predictors (no imputation), the c-index reached 0.77. The corresponding c-index was 0.76 when all the observations of the same validation dataset were used. The calibration plot of the complete case analysis is shown in eFigure 2. As expected, the average predicted and observed risks of PDAC in patients with complete predictors were higher compared to those of all patients.

When the traditional risk factors of pancreatic cancer (family history of pancreatic cancer, smoking status, BMI) were forced into model M1, the average c-index was 0.76 (SD 0.003), compared to 0.77 (SD 0.02) for the model without the three traditional risk factors. Adding the traditional risk factors did not improve performance of M1.

Exploratory analysis:

The average predicted risks of PDAC were 0.54% and 0.25% at the time of risk assessment in patients who were later diagnosed with stages I&II and III&IV cancer, respectively. This finding indicates that our model has the ability to detect both early stage and advanced cancer.

For demonstration, decision rules based on one of the trees built for M1 are displayed in Figure 3 (for the right side of the decision tree) and eFigure 3 (for the left side of the decision tree). For example, for patients >68 years of age who had abdominal pain and a change of ALT >50, the observed and the predicted risks of PDAC in 18 months were 4.77% and 5.05%, respectively (Figure 3).

Figure 3 –

One of the decision trees based on KPSC M1 (right side). Each end node contains two numbers. The first is the observed risk and the second is the predicted risk of PDAC in 18 months.

Risk prediction tool

To facilitate external application of the RSF-based prediction model (M1), we have developed a publicly available web-based tool (https://pcrisk.kp-scalresearch.org/). A hypothetical 70-year-old male patient with an HbA1c value of 7.5%, weight loss of 4 lbs. and an ALT increase of 4 IU/L in one year has an estimated 18-month risk of PDAC of 0.30%. Sample code to train, update and validate the models was posted on GitHub (https://github.com/kpsc-informatics/PROTECT-RSF) and is also available in eTable 12.

DISCUSSION

We applied machine learning methods to EHR data to derive and validate clinical prediction models for pancreatic cancer across two large integrated healthcare systems. Despite inclusion of >500 potential features in the candidate pool, the machine learning models incorporated traditional parameters including age, glycated hemoglobin (HbA1c), alanine aminotransferase, weight, and abdominal pain. The final models were both parsimonious (with only 4–5 predictors) and reasonably accurate in both internal validation and external testing.

While there has been progress in developing approaches to early detection in high-risk patients based on either family history or genetic susceptibility21 as well as those with specific conditions such as late-onset diabetes,6 limited data exist on identification of patients at risk for sporadic forms of pancreatic cancer. This study presents a novel approach to risk stratification at the population-level based on dynamic parameters contained within structured data from EHR.

Although parameters included in the model are well-established parameters for PDAC,5,6,10 their selection using an unbiased, comprehensive data-driven approach helped ensure inclusion of the most relevant combination of parameters. A recently developed model to predict risk of pancreatic cancer among patients with late or new onset diabetes at age 50 or later similarly identified increasing age, weight loss and change in blood glucose as key parameters for determining risk of pancreatic cancer in this patient population.5 The current model extends the concept of model-based risk prediction to a much broader population while maintaining reasonably high levels of discriminative accuracy for prediction of pancreatic cancer.

External testing is key to assessing model performance. In a review of 127 prediction models, Siontis et al. found that only 32 (25%) had at least one external testing.22 Compared to the original studies, AUC estimates derived from the corresponding external testing studies significantly decreased by 0.05 (p<0.001).22 In the current study, the c-index declined by 0.08 (or 10%) when model M1 was directly transported and about 0.06 (or 8%) and 0.09 (or 12%) after models M1 and E1 were recalibrated, respectively. Although the comparison should be interpreted cautiously, the larger reduction observed in the current study could be attributable to multiple factors. First, a higher proportion of PDAC cases in the VA cohort than in the KPSC cohort were pancreatic cancer deaths identified through mortality records. Second, given the differences in age and sex between the KPSC and VA populations, a higher incidence rate of PDAC in the VA dataset than in the KPSC dataset was observed, as expected. This could have impacted model accuracy, especially for the models without recalibration.
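The absolute and relative declines quoted above can be checked against the mean c-index values reported in the abstract (0.77 for both M1 and E1 at KPSC; 0.71 and 0.68 for the recalibrated M1 and E1 in the VA data). The short sketch below reproduces that arithmetic; the helper name `c_index_drop` is hypothetical.

```python
def c_index_drop(internal, external):
    """Return (absolute drop, relative drop) between two c-index values."""
    absolute = internal - external
    return absolute, absolute / internal


KPSC_C_INDEX = 0.77  # mean c-index for M1 and E1 in KPSC validation (abstract)
VA_RECALIBRATED = {"M1": 0.71, "E1": 0.68}  # recalibrated models, VA (abstract)

for model, va_c_index in VA_RECALIBRATED.items():
    absolute, relative = c_index_drop(KPSC_C_INDEX, va_c_index)
    print(f"{model}: drop {absolute:.2f} ({relative:.0%})")

# Direct transport of M1: the text reports an absolute drop of 0.08.
direct_relative = 0.08 / KPSC_C_INDEX
print(f"M1 direct transport: drop 0.08 ({direct_relative:.0%})")
```

The relative figures (about 8%, 12% and 10%) match the percentages given in the paragraph above when rounded to whole numbers.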

Strengths of the current study included a comprehensive, data-driven approach to model development in a racially/ethnically diverse patient population, use of structured data elements and external validation in a separate healthcare system with a distinct patient population. The present study had several limitations. First, several parameters identified in the prediction models (abdominal pain, abnormal ALT) are often associated with advanced stage pancreatic cancer. The 30-day cancer-free period used in the present model is likely insufficient to provide a reasonable window of opportunity for intervention to impact the disease course. To address this concern, we also developed an early detection model that restricted the study population to patients with ≥90 days of cancer-free follow-up from the index date. Despite the reasonably high performance in terms of discriminative ability, the absolute risk in the highest risk category (top 2.5%) approached 1% over 18 months. This level of risk is likely below the threshold for cost-effective screening based on currently available testing.23 Second, of the 1,479 and 4,582 events identified in the KPSC and VA cohorts, respectively, 300 and 2,564 events were captured by data sources other than the Cancer Registry. An evaluation based on the KPSC Cancer Registry of the same time window showed that about 90% of pancreatic cancer cases were PDAC. Third, to estimate sensitivity, specificity, PPV and fold of risk increase, we relied on a subset of patients (~70% and ~80% of the total patients in the KPSC and VA cohorts, respectively) with complete follow-up unless they died of pancreatic cancer. This restriction over-estimated the risk of PDAC, because the patients who were excluded from this analysis were at risk for some periods of time. Finally, some important predictors had high percentages of unknown values (i.e., missing data). Although multiple imputation was performed, bias may occur if the missing at random (MAR) assumption is violated. Nevertheless, our sensitivity analysis showed that model performance remained the same when the validation was limited to records with complete predictors.
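The sensitivity, specificity, PPV and fold of risk increase mentioned in the third limitation can be illustrated with a small sketch. It assumes patients with complete follow-up, a binary 18-month outcome, and a single risk threshold; "fold of risk increase" is taken here to mean the observed risk among flagged patients divided by the overall observed risk, which is an assumption rather than the paper's stated definition. The function name and the toy data are hypothetical.

```python
def screening_metrics(risks, outcomes, threshold):
    """Compute screening metrics for patients flagged at or above a threshold.

    risks    : predicted risks, one per patient
    outcomes : 1 if the patient developed the outcome, 0 otherwise
    """
    tp = fp = fn = tn = 0
    for risk, outcome in zip(risks, outcomes):
        flagged = risk >= threshold
        if flagged and outcome:
            tp += 1
        elif flagged:
            fp += 1
        elif outcome:
            fn += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0 or tp + fp == 0:
        raise ValueError("metrics undefined: an empty group at this threshold")
    prevalence = (tp + fn) / (tp + fp + fn + tn)
    ppv = tp / (tp + fp)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        # Assumed definition: risk among flagged patients vs. overall risk.
        "fold_increase": ppv / prevalence,
    }


# Toy data, invented for illustration only (10 patients, 2 events).
toy_outcomes = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]
toy_pred = [0.9, 0.8, 0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.1, 0.1]
toy_metrics = screening_metrics(toy_pred, toy_outcomes, threshold=0.5)
print(toy_metrics)
```

Because the overall risk appears in the denominator of the fold increase and also drives the PPV, excluding patients with incomplete follow-up changes these estimates, which is the concern raised in the limitation above.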

In conclusion, we developed a parsimonious clinical risk prediction model for sporadic pancreatic cancer in a large, diverse integrated health system and subsequently applied the model in a separate health system. The model identified five key factors in determining risk of pancreatic cancer. Findings from the present study provide a potential framework for a systematic approach to targeted screening for pancreatic cancer based on automated analysis of data in EHR.

Supplementary Material

Supplementary File_1
Supplementary File_2

eFigure 2 – Sensitivity analysis: Calibration plot for complete case analysis (based on model M1). x-axis: predicted; y-axis: observed. The five blue and orange dots represent the five risk groups defined by the ranges of predicted risks: <50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles for complete case analysis and all patient analysis, respectively.

Supplementary File_3

eFigure 3 – One of the decision trees based on KPSC M1 (left side). Each end node contains two numbers. The first one is the observed risk and the second one is the predicted risk of PDAC in 18 months.

Figure e1

eFigure 1 – Preparation of KPSC training and internal validation datasets, and VA external testing datasets.

Study Highlights.

What Is Known

  • Patients with pancreatic cancer are often diagnosed at late stages.

  • Early detection is needed to impact the natural history of disease progression and improve patient survival.

What Is New Here

  • Machine learning was used to develop a population-based model for early detection of pancreatic cancer. The model was internally and externally tested in cohorts of 1.8 million and 2.6 million individuals, respectively.

  • The final model included age, abdominal pain, weight change, glycated hemoglobin (HbA1c) and alanine transaminase (ALT) change. A publicly accessible web-based calculator is available online (https://pcrisk.kp-scalresearch.org/).

Financial support:

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA230442. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.

Abbreviations

ALT: Alanine transaminase
AUC: area under the curve
CI: confidence interval
DoD: Department of Defense
EHR: electronic health record
GND: Greenwood-Nam-D’Agostino
HbA1C: glycated hemoglobin
ICD-9-CM: Ninth Revision of International Classification of Diseases, Clinical Modification
ICD-10-CM: Tenth Revision of International Classification of Diseases, Clinical Modification
IR: Incidence rate
KPSC: Kaiser Permanente Southern California
NDI: National Death Index
NOD: new onset diabetes
PC: pancreatic cancer
PDAC: pancreatic ductal adenocarcinoma
PPV: positive predictive value
RSF: Random Survival Forest
SEER: Surveillance, Epidemiology, and End Results
VA: Veterans Affairs

Footnotes

Potential competing interests: The authors declare they have no conflict of interest for this study.

Guarantor of the article: Dr. Wansu Chen accepts full responsibility for the conduct of the study. She has had access to the data and has control of the decision to publish.

References

  1. National Cancer Institute Surveillance, Epidemiology and End Results Program. Cancer Stat Facts: Pancreatic Cancer. https://seer.cancer.gov/statfacts/html/pancreas.html. Last accessed: July 26, 2022.
  2. Stathis A, Moore MJ. Advanced pancreatic carcinoma: current treatment and future challenges. Nat Rev Clin Oncol. 2010;7(3):163–172.
  3. Stokes JB, Nolan NJ, Stelow EB, et al. Preoperative capecitabine and concurrent radiation for borderline resectable pancreatic cancer. Ann Surg Oncol. 2011;18(3):619–627.
  4. US Preventive Services Task Force, Owens DK, Davidson KW, et al. Screening for Pancreatic Cancer: US Preventive Services Task Force Reaffirmation Recommendation Statement. JAMA. 2019;322(5):438–444.
  5. Sharma A, Kandlakunta H, Nagpal SJS, et al. Model to Determine Risk of Pancreatic Cancer in Patients With New-Onset Diabetes. Gastroenterology. 2018;155(3):730–739 e733.
  6. Boursi B, Finkelman B, Giantonio BJ, et al. A Clinical Prediction Model to Assess Risk for Pancreatic Cancer Among Patients With New-Onset Diabetes. Gastroenterology. 2017;152(4):840–850 e843.
  7. Chen W, Butler RK, Lustigova E, Chari ST, Wu BU. Validation of the Enriching New-Onset Diabetes for Pancreatic Cancer Model in a Diverse and Integrated Healthcare Setting. Dig Dis Sci. 2021;66(1):78–87.
  8. Kim J, Yuan C, Babic A, et al. Genetic and Circulating Biomarker Data Improve Risk Prediction for Pancreatic Cancer in the General Population. Cancer Epidemiol Biomarkers Prev. 2020;29(5):999–1008.
  9. Klein AP, Lindstrom S, Mendelsohn JB, et al. An absolute risk model to identify individuals at elevated risk for pancreatic cancer in the general population. PLoS One. 2013;8(9):e72311.
  10. Yu A, Woo SM, Joo J, et al. Development and Validation of a Prediction Model to Estimate Individual Risk of Pancreatic Cancer. PLoS One. 2016;11(1):e0146473.
  11. Koebnick C, Langer-Gould AM, Gould MK, et al. Sociodemographic characteristics of members of a large, integrated health care system: comparison with US Census Bureau data. Perm J. 2012;16(3):37–41.
  12. Fihn SD, Francis J, Clancy C, et al. Insights from advanced analytics at the Veterans Health Administration. Health Aff (Millwood). 2014;33(7):1203–1211.
  13. Chen W, Yao J, Liang Z, et al. Temporal Trends in Mortality Rates among Kaiser Permanente Southern California Health Plan Enrollees, 2001–2016. Perm J. 2019;23.
  14. Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Softw. 2017;77(1).
  15. Little RJA. Missing-Data Adjustments in Large Surveys. J Bus Econ Stat. 1988;6(3):287–296.
  16. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841–860.
  17. Dietrich S, Floegel A, Troll M, et al. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol. 2016;45(5):1406–1420.
  18. Ishwaran H, Kogalur UB. randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). http://web.ccs.miami.edu/~hishwaran/. Accessed July 8, 2021.
  19. Stone M. Cross-Validatory Choice and Assessment of Statistical Predictions. J R Stat Soc Series B Stat Methodol. 1974;36(2):111–147.
  20. Demler OV, Paynter NP, Cook NR. Tests of calibration and goodness-of-fit in the survival setting. Stat Med. 2015;34(10):1659–1680.
  21. Goggins M, Overbeek KA, Brand R, et al. Management of patients with increased risk for familial pancreatic cancer: updated recommendations from the International Cancer of the Pancreas Screening (CAPS) Consortium. Gut. 2020;69(1):7–17.
  22. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68(1):25–34.
  23. Schwartz NRM, Matrisian LM, Shrader EE, Feng Z, Chari S, Roth JA. Potential Cost-Effectiveness of Risk-Based Pancreatic Cancer Screening in Patients With New-Onset Diabetes. J Natl Compr Canc Netw. 2021:1–9.
