Abstract
OBJECTIVES:
There is currently no widely accepted approach to screening for pancreatic cancer (PC). We aimed to develop and validate a risk prediction model for pancreatic ductal adenocarcinoma (PDAC), the most common form of PC, across two health systems using electronic health records (EHR).
METHODS:
This retrospective cohort study consisted of patients 50–84 years of age having at least one clinic-based visit over a 10-year study period at Kaiser Permanente Southern California (KPSC, model training, internal validation) and the Veterans Affairs (VA, external testing). ‘Random survival forests’ models were built to identify the most relevant predictors from >500 variables and to predict risk of PDAC within 18 months of cohort entry.
RESULTS:
The KPSC cohort consisted of 1.8 million patients (mean age 61.6) with 1,792 PDAC cases. The 18-month incidence rate of PDAC was 0.77 (95% CI 0.73–0.80)/1,000 person-years. The final main model contained age, abdominal pain, weight change, HbA1c and ALT change (c-index: mean=0.77, SD=0.02; calibration test: p-value 0.4, SD 0.3). The final early detection model comprised the same features as those selected by the main model except for abdominal pain (c-index: 0.77 and SD 0.4; calibration test: p-value 0.3 and SD 0.3). The VA testing cohort consisted of 2.7 million patients (mean age 66.1) with an 18-month incidence rate of 1.27 (1.23–1.30)/1,000 person-years. The recalibrated main and early detection models based on VA testing datasets achieved mean c-index of 0.71 (SD 0.002) and 0.68 (SD 0.003), respectively.
CONCLUSIONS:
Using widely available parameters in EHR, we developed and externally validated parsimonious machine learning-based models for detection of pancreatic cancer. These models may be suitable for real-time clinical application.
Keywords: risk prediction, pancreatic cancer, machine learning, general population, glycated hemoglobin, alanine transaminase, weight loss
INTRODUCTION
Pancreatic cancer is the third leading cause of cancer deaths in the United States, with 49,830 estimated deaths in 2022.1 Pancreatic cancer is often diagnosed at an advanced stage and as such has very poor survival, with the overall 5-year survival reaching only 11.5%.1 Due to the low incidence of pancreatic cancer in the general population (13.2 per 100,000 person-years),1 widespread population-based screening is not currently recommended by the United States Preventive Services Task Force.4 Therefore, a targeted approach to screening among higher-risk populations represents a key opportunity to alter the natural history of this disease.
Although several studies have shown evidence of improved outcomes for high-risk individuals undergoing screening based on genetic or familial predisposition, these patients constitute a small proportion of the patients who develop pancreatic cancer. An alternative approach is needed to identify patients in the broader population at risk for pancreatic cancer for whom targeted screening may also be beneficial.
The emergence of comprehensive EHR and the maturation of machine learning offer an unprecedented opportunity to enhance efforts in early detection of pancreatic cancer. Specifically, the coupling of machine learning with robust EHR allows a comprehensive, unbiased, data-driven approach to selection of candidate variables, facilitating development of prediction models suitable for clinical application. To date, efforts to develop clinical prediction models in pancreatic cancer have focused on specific populations such as those with new-onset diabetes5–7 or have been conducted within the confines of a case-control study.8,9 Efforts to identify high-risk patients in the general population are sparse.10 There is a critical need for novel risk stratification tools which are both sensitive and specific for rapid identification of patients at increased risk of developing pancreatic cancer.
The aim of the present study was to develop and validate a clinical prediction model for risk of PDAC across two large health systems. Specifically, we sought to apply machine learning combined with a comprehensive approach to data in EHR to predict the risk of sporadic PDAC in a broad population-based setting.
METHODS
Source of data
We developed risk prediction models using a retrospective cohort study utilizing health plan enrollees of Kaiser Permanente Southern California (KPSC), a large integrated healthcare system that provides comprehensive healthcare services for >4.7 million enrollees from diverse racial/ethnic backgrounds across 15 medical centers and 235 medical offices. Model training and internal validation were conducted based on EHR data. The demographics and socioeconomic status of KPSC health plan enrollees are comparable to those of residents in the Southern California region.11 The internally validated models were externally tested using EHR of Veterans Affairs (VA),12 the largest integrated health care system in the US providing care to >9 million Veterans at its 1,298 health care facilities including 171 VA Medical Centers and 1,113 outpatient sites. The study protocol was approved by the Institutional Review Board of KPSC and VA.
Study participants
Model training and internal validation:
Patients 50–84 years of age who had ≥1 clinic-based visit (index visit) within a KPSC facility in 2008–2017 were identified. Patients who had a history of pancreatic cancer or were not continuously enrolled in the KPSC health plan in the prior 12 months (gaps of 45 days or less were allowed) were excluded. The requirement of continuous enrollment ensured adequate data to define study variables. For patients with multiple qualifying index visits, we selected one randomly as the index visit. The corresponding visit date was referred to as the index date (t0). Follow-up started on t0 and ended with the earliest of the following events: disenrollment from the health plan, end of the study (December 31, 2018), reaching the maximum length of follow-up (18 months), non-PDAC related death, or PDAC diagnosis or death (outcome). A minimum of 30 days of follow-up was required.
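The follow-up rule above amounts to taking the earliest of several candidate end dates and recording whether that end date was the outcome. The following is a minimal sketch of that logic, not the study code: the function and argument names are hypothetical, and the 548-day value used for "18 months" is an assumption, since the exact day count is not stated. The 30-day minimum follow-up requirement is not enforced here.

```python
from datetime import date, timedelta

STUDY_END = date(2018, 12, 31)   # end of study, as stated in the text
MAX_FOLLOW_UP_DAYS = 548         # assumption: 18 months taken as about 548 days


def follow_up(index_date, disenrollment=None, non_pdac_death=None, pdac_event=None):
    """Return (days_followed, is_event) for one patient.

    Follow-up ends at the earliest of: disenrollment, study end, the
    18-month cap, non-PDAC death, or the PDAC outcome.
    """
    candidates = [
        (STUDY_END, False),
        (index_date + timedelta(days=MAX_FOLLOW_UP_DAYS), False),
    ]
    if disenrollment is not None:
        candidates.append((disenrollment, False))
    if non_pdac_death is not None:
        candidates.append((non_pdac_death, False))
    if pdac_event is not None:
        candidates.append((pdac_event, True))
    # Earliest date wins; if dates tie, the outcome takes precedence over censoring.
    end, is_event = min(candidates, key=lambda c: (c[0], not c[1]))
    return (end - index_date).days, is_event
```

For the VA cohort, the same sketch would apply with `disenrollment` left as `None`, because that rule was not used there.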
Model external testing:
Veterans 50–84 years of age who had >1 outpatient visit (index visit) within a VA facility in 2008–2017 and another clinic-based visit within the 12 months prior to the index date were identified. Patients who had history of pancreatic cancer were excluded. The same follow-up rules mentioned above were applied to the VA cohort except for “disenrollment from the health plan”.
Early detection cohort:
To facilitate earlier detection of PDAC, we also established a cohort which included patients identified in the main cohort who had ≥90 days of cancer-free follow up for both KPSC and VA patients. The same analyses described below applied to the main cohort and the early detection cohort.
Outcome
The study outcome was PDAC diagnosis or death with pancreatic cancer in the 18 months after the index date. For the KPSC cohort, PDAC was identified from the Cancer Registry by using the Tenth Revision of International Classification of Diseases, Clinical Modification (ICD-10-CM) code C25.x and histology codes (eTable 1). Pancreatic cancer deaths were derived from the linkage with the California State Death Master files and identified using ICD-10-CM codes C25.x.13 For the VA cohort, cases of PDAC were similarly identified through an internal VA Central Cancer Registry, and PDAC deaths identified through the VA Mortality Data Repository, which integrates vital status data from the National Death Index (NDI), VA, and DoD administrative records.
Predictors
A complete list of extracted and derived features for the KPSC cohort is shown in eTable 2. A total of 500+ features, which included patient demographics and lifestyle variables (e.g., smoking status), medical conditions (coded by ICD-9 or ICD-10 codes), lab test values, medication dispensing, medical procedures (coded by CPT, ICD-9/ICD-10 or KPSC internal procedure codes), symptoms (e.g., abdominal pain), health care utilization, and other features (e.g., year of index visit), were added to the feature candidate pool. Except for demographic variables, values within each time interval (0–6 months, 7–12 months, 1–2 years and >2 years) were generated. Definitions of the derived variables are described in eTable 3. Since the VA dataset was used solely for testing purposes, only a limited number of features were extracted (Tables 1a and 1b).
Table 1a.
Demographics and Lifestyle Characteristics | Kaiser Permanente Southern California (KPSC) N=1,801,931 | Veterans Affairs (VA) N=2,690,895 | ||||||
---|---|---|---|---|---|---|---|---|
| ||||||||
Age, mean (SD) | 61.6 (9.4) | 66.1 (9.1) | ||||||
| ||||||||
Female | 960266 (53.3) | 153630 (5.7) | ||||||
| ||||||||
Race/Ethnicity | ||||||||
Non-Hispanic White | 815773 (45.3) | 1828095 (67.9) | ||||||
Non-Hispanic Black | 171424 (9.5) | 451523 (16.8) | ||||||
Hispanic | 536079 (29.8) | 148690 (5.5) | ||||||
Asian and Pacific Islander | 192179 (10.7) | 38536 (1.4) | ||||||
Multiple/Other/Unknown | 86476 (4.8) | 224051 (8.3) | ||||||
| ||||||||
Medical Insurance (one or more) | ||||||||
Commercial | 1094182 (60.7) | |||||||
Medicare | 572452 (31.8) | |||||||
Medi-CAL/Other State Programs | 64150 (3.6) | |||||||
Private Pay | 483042 (26.8) | |||||||
| ||||||||
Years Since First enrollment, mean (SD) | 18.9 (13.5) | |||||||
| ||||||||
Family History of Pancreatic Cancer | 25386 (1.4) | |||||||
| ||||||||
Tobacco Use | ||||||||
Ever | 700429 (38.9) | 1914180 (71.1) | ||||||
Never | 1101502 (61.1) | 776715 (28.9) | ||||||
| ||||||||
Weight Defined by BMI (kg/m2) | ||||||||
Underweight (<18.5) | 19095 (1.1) | |||||||
Normal Weight (18.5–24.9) | 429010 (23.8) | |||||||
Overweight (25–29.9) | 640644 (35.6) | |||||||
Obese (30+) | 642905 (35.7) | |||||||
Unknown | 70277 (3.9) | |||||||
| ||||||||
Lab Tests in Prior 6 months | N | Median (IQR) | N | Median (IQR) | ||||
| ||||||||
ALT, IU/L | 920993 | 22.0 (17.0, 30.0) | 908520 | 24 (18, 34) | ||||
| ||||||||
HbA1c, % | 744601 | 6.2 (5.8, 7.1) | 1429886 | 6.2 (5.7, 7.2) | ||||
| ||||||||
ALP, IU/L | 309460 | 70.0 (57.0, 87.0) | ||||||
| ||||||||
Total Bilirubin, mg/dL | 303803 | 0.7 (0.5, 0.9) | ||||||
| ||||||||
HGB for Males, g/dL | 437416 | 14.5 (13.4, 15.4) | ||||||
| ||||||||
HGB for Females, g/dL | 515110 | 13.2 (12.4, 14.0) | ||||||
| ||||||||
HCT, L/L | 952548 | 41.0 (38.0, 43.7) | ||||||
| ||||||||
RBC, million/mm3 | 934454 | 4.5 (4.2, 4.9) | ||||||
| ||||||||
Sodium, mEq/L | 952253 | 139.0 (137.0, 141.0) | ||||||
| ||||||||
Total Cholesterol, mg/dL | 964899 | 184.0 (156.0, 215.0) | ||||||
| ||||||||
Platelets, count/L | 934239 | 232.0 (194.0, 277.0) | ||||||
| ||||||||
Medical Conditions | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior |
| ||||||||
Gallstone Disorders | 20085 (1.1) | 13184 (0.7) | 20183 (1.1) | 104529 (5.8) | ||||
| ||||||||
Acute Pancreatitis | 3256 (0.2) | 2065 (0.1) | 3199 (0.2) | 17975 (1.0) | 9510 (0.4) | 5646 (0.2) | 8859 (0.3) | 18209 (0.7) |
| ||||||||
Chronic Pancreatitis | 1550 (0.1) | 1182 (0.1) | 1475 (0.1) | 3738 (0.2) | 7749 (0.3) | 5125 (0.2) | 6707 (0.2) | 9719 (0.4) |
| ||||||||
Benign Pancreatic Disease | 2019 (0.1) | 1398 (0.1) | 1711 (0.1) | 2780 (0.2) | ||||
| ||||||||
Biliary Tract Disease | 26199 (1.5) | 20974 (1.2) | 25512 (1.4) | 38538 (2.1) | ||||
| ||||||||
Depression | 189508 (10.5) | 150514 (8.4) | 190868 (10.6) | 317686 (17.6) | ||||
| ||||||||
Diabetes | 376022 (20.9) | 325789 (18.1) | 335965 (18.6) | 327408 (18.2) | 842311 (31.3) | 775964 (28.8) | 831566 (30.9) | 735251 (27.3) |
| ||||||||
Medical Procedures | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior |
| ||||||||
Abdominal/Chest CT | 121427 (6.7) | 82841 (4.6) | 119308 (6.6) | 304085 (16.9) | ||||
| ||||||||
Abdominal/Chest MRI | 1861 (0.1) | 1366 (0.1) | 2226 (0.1) | 7057 (0.4) | ||||
| ||||||||
Abdominal/Chest Ultrasound | 70721 (3.9) | 55004 (3.1) | 93179 (5.2) | 333051 (18.5) | ||||
| ||||||||
Any Abdominal Surgery | 11447 (0.6) | 7360 (0.4) | 11559 (0.6) | 77283 (4.3) | ||||
| ||||||||
Surgical Procedures on esophagus | 18492 (1.0) | 13226 (0.7) | 22164 (1.2) | 125701 (7.0) | ||||
| ||||||||
Upper GI Endoscopy | 37108 (2.1) | 27232 (1.5) | 43386 (2.4) | 160899 (8.9) | ||||
| ||||||||
Colonoscopy | 95156 (5.3) | 74199 (4.1) | 125372 (7.0) | 417049 (23.1) | ||||
| ||||||||
Medications | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior |
| ||||||||
Pancreatic Enzyme | 1246 (0.07) | 1105 (0.06) | 1314 (0.07) | 2437 (0.1) | ||||
| ||||||||
Antidiabetic Medications – Insulin | 88277 (4.9) | 79163 (4.4) | 78660 (4.4) | 78323 (4.4) | ||||
| ||||||||
Antidiabetic Medications – Non-Insulin | 158667 (8.8) | 148703 (8.3) | 155764 (8.6) | 182396 (10.1) | ||||
| ||||||||
GI-Related Signs/Symptoms | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior | 0–6 m Prior | 7–12 m Prior | 13–24 m Prior | 24+ m Prior |
| ||||||||
Abdominal Pain | 133380 (7.4) | 92422 (5.1) | 142470 (7.9) | 483652 (26.8) | 107586 (4.0) | 68717 (2.6) | 108170 (4.0) | 237763 (8.8) |
| ||||||||
Chest Pain | 114152 (6.3) | 79399 (4.4) | 128549 (7.1) | 483953 (26.7) | ||||
| ||||||||
Constipation | 60032 (3.3) | 41935 (2.3) | 64439 (3.6) | 193331 (10.7) | ||||
| ||||||||
Diarrhea | 44025 (2.4) | 32187 (1.8) | 52487 (2.9) | 202067 (11.2) | ||||
| ||||||||
Itching | 46314 (2.6) | 33730 (1.9) | 57062 (3.2) | 207679 (11.5) | ||||
| ||||||||
Malaise or Fatigue | 104298 (5.8) | 71136 (3.9) | 111646 (6.2) | 362901 (20.1) | ||||
| ||||||||
Melena | 11089 (0.6) | 7433 (0.4) | 13235 (0.7) | 66321 (3.7) | ||||
| ||||||||
Nausea or Vomiting | 51787 (2.9) | 36348 (2.0) | 57767 (3.2) | 215982 (12.0) |
Abbreviations: ALP, alkaline phosphatase; ALT, alanine transaminase; BMI, body mass index; CT, computerized tomography; GI, gastrointestinal; HCT, hematocrit; HbA1c: glycated hemoglobin; HGB, hemoglobin; IQR, interquartile range; MRI, magnetic resonance imaging; RBC, red blood cell; SD, standard deviation.
Table 1b.
 | Kaiser Permanente Southern California (KPSC) | | Veterans Affairs (VA) | |
---|---|---|---|---|
 | N | Median (IQR) | N | Median (IQR) |
Weight Change in lb (a) | 1488802 | −0.2 (−4.8, 3.5) | 2597505 | −0.3 (−5.8, 4.8) |
ALT, IU/L | 479881 | 0.0 (−5.0, 4.0) | 650675 | 0 (−6, 4) |
HbA1c, % | 409975 | 0.0 (−0.3, 0.3) | 898378 | 0.0 (−0.3, 0.3) |
ALP, IU/L | 96562 | 0.0 (−9.0, 10.0) | ||
Total Bilirubin, mg/dL | 94098 | 0.0 (−0.2, 0.2) | ||
HGB for Males, g/dL | 202538 | −0.1 (−0.8, 0.5) | ||
HGB for Females, g/dL | 264229 | −0.1 (−0.6, 0.5) | ||
HCT, L/L | 466785 | −0.3 (−2.1, 1.5) | ||
RBC, million/mm3 | 449246 | 0.0 (−0.2, 0.2) | ||
Sodium, mEq/L | 496334 | 0.0 (−2.0, 2.0) | ||
Total Cholesterol, mg/dL | 506608 | −4.0 (−22.0, 13.0) | ||
Platelets, count/L | 449030 | −3.0 (−25.0, 19.0) |
a. 1 lb = 0.45 kg
Sample size
The number and size of the KPSC training, KPSC internal validation and VA external testing datasets are shown in eTable 4.
Missing data
Missing values were imputed14 if the frequency of missing was <60%. We used predictive mean matching method15 with k=5. Laboratory measures with ≥60% missingness or change/change rate measures with ≥80% missingness were not included in the model development process. Nine imputed datasets were generated at KPSC and 10 were created at VA (eFigure 1).
Statistical Analysis
Dataset preparation:
First, we split each of the nine imputation datasets at KPSC into five subsets, each containing 20% of the original imputed dataset (eFigure 1). A total of 45 subsets were prepared for the training and internal validation process described below (eTable 5).
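The count of 45 follows from 9 imputed datasets × 5 folds. The sketch below illustrates one way such training/validation pairs could be enumerated; it is an illustration under assumptions, not the procedure in eFigure 1. The function names and the use of a seeded shuffle are hypothetical.

```python
import random

N_IMPUTATIONS = 9   # imputed datasets at KPSC, per the text
N_FOLDS = 5         # each subset holds 20% of an imputed dataset


def split_into_folds(record_ids, n_folds=N_FOLDS, seed=0):
    """Shuffle record ids and deal them into n_folds roughly equal subsets."""
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::n_folds] for i in range(n_folds)]


def training_validation_pairs(record_ids):
    """Enumerate (imputation, fold, training_ids, validation_ids) tuples."""
    pairs = []
    for imputation in range(1, N_IMPUTATIONS + 1):
        folds = split_into_folds(record_ids, seed=imputation)
        for k, validation in enumerate(folds):
            # Training data are the four folds not held out for validation.
            training = [r for j, fold in enumerate(folds) if j != k for r in fold]
            pairs.append((imputation, k, training, validation))
    return pairs
```

Each validation subset is disjoint from its training subset, which matches the statement that validation datasets contained no observations from the corresponding training datasets.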
Modeling approach:
To overcome the limitations of regression-based models that are traditionally used for analysis of time-to-event data, we applied ‘random survival forests’ (RSF), a nonparametric machine learning method,16–18 to pre-select features and train/validate models. First, we iteratively preselected features based on the average minimum depth (eTable 6) and used those features to develop and validate risk prediction models based on 5-fold cross validation.19 Age was forced into the model. Preselected features were added incrementally to identify the feature that yielded the maximum improvement of c-index. This process continued until the c-index increased <0.005. Of the 45 models derived from the 45 training/internal validation datasets (9 imputation datasets × 5-fold cross validation), the one that appeared the most often was selected as the final model.
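The incremental selection step can be read as a greedy forward search with a stopping rule on c-index gain. The sketch below shows that logic under assumptions: `score` is a hypothetical callback standing in for "train an RSF on these features and return the validation c-index", and the feature names are illustrative. It does not reproduce the minimum-depth preselection.

```python
from collections import Counter


def forward_select(candidates, score, forced=("age",), min_gain=0.005):
    """Greedy forward selection.

    Starts from the forced features, repeatedly adds the candidate giving
    the largest c-index improvement, and stops once the best available
    improvement is below min_gain.
    """
    selected = list(forced)
    best = score(selected)
    remaining = [c for c in candidates if c not in selected]
    while remaining:
        trial_scores = {c: score(selected + [c]) for c in remaining}
        top = max(trial_scores, key=trial_scores.get)
        if trial_scores[top] - best < min_gain:
            break
        selected.append(top)
        best = trial_scores[top]
        remaining.remove(top)
    return selected, best


def most_frequent_model(models):
    """Return the feature set that occurs most often across fitted models."""
    counts = Counter(frozenset(m) for m in models)
    return set(counts.most_common(1)[0][0])
```

In the study setting, `forward_select` would be run once per training/validation pair and `most_frequent_model` applied to the 45 resulting feature sets.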
Internal validation and external testing:
Algorithms of the final models were applied to the corresponding KPSC validation datasets. By design, the KPSC validation datasets did not include any observations of the KPSC training datasets from which the final model was developed. The final model was first directly applied to VA imputed testing datasets, and subsequently recalibrated to achieve better performance.
Performance measures:
The discriminative power for each of the final models (main model and early detection model) was evaluated by c-index, a concordance measure, averaged across all the relevant validation or testing datasets for cohort members. Calibration was assessed by calibration plots with five risk groups (<50th, 50–74th, 75–89th, 90–94th, 95–100th percentiles).20 Greenwood-Nam-D’Agostino (GND) calibration test was also performed to assess goodness-of-fit.
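For readers unfamiliar with the concordance measure, the following is a minimal sketch of a Harrell-type c-index for right-censored data, together with the five percentile-based risk groups used for calibration. It is a simplified illustration (pairs with tied times are skipped, and the O(n²) loop is not suited to cohorts of this size), not the implementation used in the study; the function names are hypothetical.

```python
from bisect import bisect_right


def harrell_c_index(times, events, risks):
    """Fraction of comparable pairs in which the higher predicted risk
    belongs to the patient with the earlier observed event.

    A pair (i, j) is comparable when patient i had the event and
    times[i] < times[j]. Tied risks count as half-concordant.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


RISK_GROUP_CUTS = [50, 75, 90, 95]  # percentile boundaries from the text


def risk_group(percentile):
    """Map a predicted-risk percentile to group 0..4
    (<50th, 50-74th, 75-89th, 90-94th, 95-100th)."""
    return bisect_right(RISK_GROUP_CUTS, percentile)
```

A c-index of 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking.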
We estimated sensitivity, specificity, positive predictive value (PPV), and relative increase in risk in comparison to that of the entire cohort at various levels of risk thresholds. For this analysis we restricted the patients to those who had complete follow-up or who developed PDAC within 18 months. The results were averaged across the validation datasets for each model. We also estimated the theoretical number (and the 95% confidence interval) of patients needed to be evaluated to identify a single case of PDAC for the main and early detection models based on KPSC internal validation datasets and VA external testing datasets, assuming 100% ability to identify pancreatic cancer through diagnostic testing.
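The threshold-based measures above can be illustrated as follows. This is a sketch under assumptions: patients are flagged by taking the top fraction of predicted risk, outcomes are coded 0/1, and the number needed to evaluate is taken as 1/PPV, which corresponds to the stated assumption of perfect diagnostic testing. Confidence intervals are omitted, and the function name is hypothetical.

```python
def threshold_metrics(risks, outcomes, top_fraction):
    """Compute screening metrics when the top_fraction of patients by
    predicted risk is flagged as high risk. outcomes must be 0/1."""
    n = len(risks)
    n_flagged = max(1, round(n * top_fraction))
    order = sorted(range(n), key=lambda i: risks[i], reverse=True)
    flagged = set(order[:n_flagged])

    tp = sum(outcomes[i] for i in flagged)   # flagged patients who developed PDAC
    fp = n_flagged - tp
    fn = sum(outcomes) - tp                  # cases missed by the threshold
    tn = n - n_flagged - fn

    ppv = tp / n_flagged
    baseline_risk = sum(outcomes) / n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "fold_increase": ppv / baseline_risk,
        "number_needed_to_evaluate": 1 / ppv if ppv > 0 else float("inf"),
    }
```

As a toy example, with ten patients, two cases, and the top 20% flagged such that one case is captured, the sketch gives sensitivity 0.5, specificity 0.875, PPV 0.5, a 2.5-fold increase over baseline, and 2 patients evaluated per case found.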
Sensitivity and post-hoc analyses:
To understand the impact of missing data imputation on model performance, we conducted a sensitivity analysis in which only observations with known predictors were included in the validation process based on one of the internal validation datasets (DS3D, eFigure 1). C-index and calibration plot are reported. To understand whether traditional risk factors of pancreatic cancer can further improve the performance, we performed a post-hoc analysis by forcing family history of pancreatic cancer, smoking status, and BMI into the main model with the final predictors. The process was repeated on all the KPSC training and validation datasets used to generate the final model. The average c-index was reported for the model with and without family history of pancreatic cancer, smoking status, and BMI.
Exploratory analysis:
To understand whether our models were weighted towards detection of advanced cancer, we examined the average predicted risks of PDAC in 18 months in patients who developed PDAC, stratified by cancer stage at the time of diagnosis.
All statistical analyses were performed using SAS (Version 9.4 for Unix; SAS Institute, Cary, NC) or R Version 3.6.0 (R Foundation, Vienna, Austria).
RESULTS
Characteristics of the study participants
A total of 1.8 million KPSC patients were eligible (Figure 1), of whom 53.3% were women, 45.3% were white, 29.8% were Hispanic, 9.5% were African American and 10.7% were Asian and Pacific Islanders (Table 1a). The majority (60.7%) used commercial insurance, and slightly under one-third (31.8%) were on Medicare. On average, the KPSC patients were 61.6 (SD 9.4) years of age, with an average membership length of 18.9 (SD 13.5) years. Of the patients, 35.7% were obese and an additional 35.6% were overweight. In the 6 months prior to the index date, 20.9% had a diagnosis of diabetes, 10.5% depression, 1.5% biliary tract disease, 1.1% gallstone disorders, 0.2% acute pancreatitis and 0.1% chronic pancreatitis (Table 1a).
The 2.7 million eligible veterans were predominantly men (94.3%), white (67.9%) and African American (16.8%) and were older (66.1 years of age) than the KPSC cohort (Table 1a). Smoking, diabetes, acute and chronic pancreatitis were more prevalent in the VA cohort compared to those of the KPSC cohort. ALT and HbA1c at baseline appeared comparable between the two cohorts (Table 1a). The changes in weight and laboratory parameters during the year prior to t0 can be found in Table 1b.
Incidence of PDAC
Table 2 displays the follow-up time in years, number, incidence rate (IR) of PDAC, and time to PDAC for all patients and for subgroups of patients defined by age, sex, race/ethnicity, weight change, abdominal pain and parameters related to ALT and HbA1c. A total of 1,792 KPSC patients developed PDAC within 18 months of follow-up (IR 0.77, 95% CI 0.73–0.80 per 1,000 person-years [PY]) (Table 2). A total of 4,582 patients in the VA cohort developed PDAC (IR 1.27, 95% CI 1.23–1.30). In the VA cohort, abdominal pain in the 6 months prior to t0 increased the IR to 4.35 (4.01–4.70). The observed time to PDAC was longer for the VA cohort (median 233 days, IQR 116–370 days) compared to that of the KPSC cohort (median 205 days, IQR 91–358 days). The distributions of cancer stage (I-IV) were comparable between the KPSC (stages I/II 31.7%, stages III/IV 53.6%, unknown/missing 14.7%) and VA (stages I/II 19.1%, stages III/IV 32.6%, unknown/missing 48.4%) cohorts (eTable 7), although the VA cohort included a higher frequency of records with unknown/missing cancer stage.
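The incidence rates quoted above can be reproduced from the event counts and person-time in Table 2. The sketch below uses a normal approximation to the Poisson count for the 95% confidence interval; that choice is an assumption, since the interval method is not stated in the text, although it reproduces the reported values to two decimals. The function name is hypothetical.

```python
import math


def incidence_rate_per_1000(events, person_years, z=1.96):
    """Return (rate, lower, upper) per 1,000 person-years.

    The interval assumes a Poisson event count with standard error
    sqrt(events), using a normal approximation.
    """
    rate = events / person_years * 1000
    half_width = z * math.sqrt(events) / person_years * 1000
    return rate, rate - half_width, rate + half_width


# Overall counts and follow-up time taken from the "All" row of Table 2.
kpsc = incidence_rate_per_1000(1792, 2331767)
va = incidence_rate_per_1000(4582, 3614215)
print([round(x, 2) for x in kpsc])  # [0.77, 0.73, 0.8]
print([round(x, 2) for x in va])    # [1.27, 1.23, 1.3]
```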
Table 2.
 | Kaiser Permanente Southern California (KPSC) | | | | Veterans Affairs (VA) | | | |
---|---|---|---|---|---|---|---|---|
 | Total f/u Time (years) | No. of PDAC | Incidence Rate of PDAC/1000 PY (95% CI) | Days to PDAC (median, IQR) | Total f/u Time (years) | No. of PDAC | Incidence Rate of PDAC/1000 PY (95% CI) | Days to PDAC (median, IQR) |
All | 2331767 | 1792 | 0.77 (0.73, 0.80) | 205 (91, 358) | 3614215 | 4582 | 1.27 (1.23, 1.30) | 233 (116, 370) |
Age Group in years | ||||||||
50–59 | 1133128 | 350 | 0.31 (0.28, 0.34) | 198 (84, 357) | 989814 | 679 | 0.69 (0.64, 0.74) | 212 (98, 352) |
60–69 | 711410 | 624 | 0.88 (0.81, 0.95) | 197 (91, 355) | 1630532 | 2026 | 1.24 (1.19, 1.30) | 239 (122, 374) |
70–79 | 369643 | 604 | 1.63 (1.51, 1.77) | 219 (91, 362) | 677576 | 1304 | 1.92 (1.82, 2.03) | 237 (113, 369) |
80–84 | 117585 | 214 | 1.82 (1.59, 2.08) | 213 (112, 353) | 316293 | 573 | 1.81 (1.67, 1.96) | 235 (116, 367) |
Sex | ||||||||
Female | 1246235 | 864 | 0.69 (0.65, 0.74) | 219 (97, 370) | 213181 | 101 | 0.47 (0.39, 0.57) | 248 (104, 398) |
Male | 1085520 | 928 | 0.85 (0.80, 0.91) | 220 (88, 350) | 3401034 | 4481 | 1.32 (1.28, 1.36) | 233 (116, 369) |
Race/ Ethnicity | ||||||||
Non-Hispanic White | 1065194 | 934 | 0.88 (0.82, 0.93) | 206 (90, 362) | 2456059 | 2986 | 1.22 (1.17, 1.26) | 238 (119, 370) |
Non-Hispanic Black | 226546 | 262 | 1.16 (1.02, 1.30) | 208 (100, 356) | 612641 | 692 | 1.13 (1.05, 1.22) | 220.5 (100.5, 377) |
Asian/Pacific Islander | 255168 | 147 | 0.58 (0.49, 0.67) | 216 (114, 367) | 52570 | 48 | 0.91 (0.68, 1.20) | 192 (118.5, 373.5) |
Hispanic | 692443 | 410 | 0.59 (0.53, 0.65) | 201 (87, 345) | 202263 | 211 | 1.04 (0.91, 1.19) | 239 (121, 369) |
Unknown (a) | | | | | 290682 | 645 | 2.22 (2.05, 2.40) | 220 (115, 357) |
ALT change in 1 year in IU/L | ||||||||
≤ −5 | 156757 | 174 | 1.11 (0.95, 1.28) | 216 (92, 360) | 248360 | 359 | 1.45 (1.30, 1.60) | 247 (123, 367) |
(−5, 5] | 340273 | 330 | 0.97 (0.87, 1.08) | 232 (107, 392) | 434050 | 523 | 1.20 (1.10, 1.31) | 246.5 (132.5, 399.5) |
> 5 | 123457 | 165 | 1.34 (1.14, 1.55) | 149 (63, 303) | 187824 | 322 | 1.71 (1.53, 1.91) | 184 (86, 302) |
Unknown | 1711281 | 1123 | 0.66 (0.61, 0.70) | 203 (94, 355) | 2743981 | 3378 | 1.23 (1.19, 1.27) | 234 (116, 373) |
Rate of ALT Change in 1 Year | ||||||||
≤ −0.01 | 180995 | 205 | 1.13 (0.99, 1.30) | 210 (91, 358) | 281222 | 402 | 1.43 (1.29, 1.57) | 247 (126, 373) |
(−0.01, 0.01] | 272305 | 255 | 0.94 (0.83, 1.06) | 234 (114, 384) | 343020 | 420 | 1.22 (1.11, 1.35) | 244.5 (132.5, 391.5) |
> 0.01 | 167186 | 209 | 1.25 (1.09, 1.43) | 166 (65, 330) | 245991 | 382 | 1.55 (1.40, 1.71) | 191 (86, 310) |
Unknown | 1711281 | 1123 | 0.66 (0.62, 0.70) | 203 (94, 355) | 2743981 | 3378 | 1.23 (1.19, 1.27) | 234 (116, 373) |
HbA1c Value Prior to Index Date | ||||||||
<6.5% | 586772 | 326 | 0.55 (0.50, 0.62) | 220 (90, 369) | 1132107 | 1094 | 0.97 (0.91, 1.02) | 241 (118, 373) |
6.5–6.9% | 124249 | 139 | 1.12 (0.94, 1.32) | 190 (74, 360) | 228430 | 366 | 1.60 (1.44, 1.77) | 234 (118, 367) |
7.0–7.4% | 80113 | 122 | 1.52 (1.27, 1.81) | 159 (88, 310) | 157876 | 289 | 1.83 (1.63, 2.05) | 221 (112, 366) |
≥7.5% | 179700 | 305 | 1.70 (1.51, 1.89) | 200 (90, 341) | 399057 | 969 | 2.43 (2.28, 2.58) | 216 (102, 354) |
Unknown | 1360933 | 900 | 0.66 (0.62, 0.71) | 212 (96, 365) | 1696745 | 1864 | 1.10 (1.05, 1.15) | 239 (121, 378) |
HbA1c Change in 1 Year in % | ||||||||
< −0.3 | 118111 | 143 | 1.21 (1.02, 1.42) | 185 (88, 343) | 287080 | 479 | 1.67 (1.52, 1.82) | 227 (115.5, 371) |
[−0.3, 0.3] | 298587 | 204 | 0.68 (0.59, 0.78) | 238 (100, 372) | 625990 | 637 | 1.02 (0.94, 1.10) | 251 (130, 388) |
> 0.3 | 120115 | 211 | 1.76 (1.53, 2.01) | 190 (90, 329) | 291676 | 660 | 2.26 (2.10, 2.44) | 220 (103, 357) |
Unknown | 1794953 | 1234 | 0.69 (0.65, 0.73) | 205 (90, 358) | 2409469 | 2806 | 1.16 (1.12, 1.21) | 233 (117, 370) |
Rate of HbA1c Change in 1 Year | ||||||||
< −0.0008 | 126922 | 146 | 1.15 (0.97, 1.35) | 184 (88, 343) | 322572 | 519 | 1.61 (1.47, 1.75) | 227 (115, 371) |
[−0.0008, 0.0008] | 280155 | 181 | 0.64 (0.56, 0.75) | 224 (98, 371) | 554791 | 535 | 0.96 (0.89, 1.05) | 249 (129, 378) |
> 0.0008 | 129736 | 231 | 1.78 (1.56, 2.02) | 200 (92, 343) | 327480 | 722 | 2.21 (2.05, 2.37) | 225.5 (105, 359) |
Unknown | 1794953 | 1234 | 0.69 (0.65, 0.73) | 205 (90, 358) | 2409469 | 2806 | 1.16 (1.12, 1.21) | 233 (117, 370) |
Rate of Weight Change in 1 Year | ||||||||
< −0.02 | 503012 | 683 | 1.35 (1.26, 1.46) | 160 (68, 322) | 869902 | 2049 | 2.36 (2.26, 2.46) | 185 (86, 313) |
[−0.02, 0.02] | 1033671 | 703 | 0.68 (0.63, 0.73) | 238 (124, 387) | 1437632 | 1403 | 0.98 (0.93, 1.03) | 286 (156, 409) |
> 0.02 | 393142 | 209 | 0.53 (0.46, 0.61) | 238 (114, 358) | 750864 | 621 | 0.83 (0.76, 0.89) | 281 (145, 408) |
Unknown | 401941 | 197 | 0.49 (0.43, 0.56) | 202 (90, 364) | 556817 | 509 | 0.91 (0.84, 0.996) | 244 (133, 387) |
Abdominal Pain in Prior 6 mos (b) | ||||||||
Yes | 168579 | 322 | 1.91 (1.71, 2.13) | 116 (54, 256) | 139459 | 606 | 4.35 (4.01, 4.70) | 136 (67, 252) |
No | 134525 | 139 | 1.03 (0.87, 1.22) | 199 (110, 349) | 3474756 | 3976 | 1.14 (1.11, 1.18) | 250.5 (128, 382.5) |
Unknown | 2028662 | 1331 | 0.66 (0.62, 0.69) | 230 (110, 377) |
Abbreviations: ALT, alanine transaminase; CI, confidence interval; F/U, follow-up; HbA1c, glycated hemoglobin A1c; IQR, interquartile range; PDAC, pancreatic ductal adenocarcinoma; PY, person-years.
a. Not estimated for KPSC dataset due to the small number of events (39) in this group.
b. For VA data, “Unknown” was interpreted as “No” due to the inability to distinguish between the two.
Model Development, Internal Validation and External Testing
For the main cohort, the preselection process identified 29 potential predictors (eTable 6). The final main model (M1) contained age, abdominal pain, weight change, HbA1c and alanine transaminase (ALT) change (c-index: mean 0.77 and SD 0.02; calibration test: χ2 mean 5.9 and SD 4.1, p-value mean 0.4 and SD 0.3). When M1 was directly applied to VA testing datasets, the mean c-index was 0.69 (SD 0.003) (data not shown); however, after the algorithm was recalibrated based on VA’s datasets, the mean c-index based on 10 testing datasets was 0.71 (SD 0.002).
For the early detection cohort, the preselection process identified 32 potential predictors (eTable 6). The final early detection model (E1) comprised the same features as those selected by M1 except that abdominal pain was not chosen (c-index: mean 0.77 and SD 0.4; calibration test: χ2 mean 7.3 and SD 4.9, p-value 0.3 and SD 0.3). The recalibrated E1 model based on the VA testing datasets achieved a mean c-index of 0.68 (SD 0.003). The hyperparameters used and the features selected by ≥5 out of 45 models for each cohort can be found in eTable 8 and eTable 9, respectively.
Figure 2 displays the calibration plots for both models at KPSC and VA. Both models appear to fit well for the four lower risk groups (i.e., risk <95th percentile). However, for the highest risk group (risk ≥95th percentile), M1 properly estimated the risks at KPSC, while E1 slightly overestimated the risks for KPSC patients, and both M1 and E1 slightly underestimated the risks for VA patients.
Sensitivity, specificity, PPV and fold increase in risk are reported in eTable 10 for the final main model (M1) and the final early detection model (E1) for both KPSC and VA patients. The top 2.5% of the KPSC internal validation sample (n=6,249) based on M1 experienced a 1.1% risk of PDAC over 18 months, which was 7- to 8-fold higher than the baseline risk of PDAC in the KPSC cohort. A total of 349 cases (19.5% of the 1,792 PDAC cases occurring within 18 months) were identified in this top 2.5% model-predicted high-risk group. Patients within the top 20% predicted risk of PDAC (n=50,952) experienced a 0.4% risk of PDAC over 18 months. This risk threshold identified 1,014 (or 56.6% of) PDAC cases occurring within 18 months with a specificity of 79.6%. While sensitivities and fold increases in PDAC incidence rate were lower in the VA population, PPVs were higher for VA testing data compared to those of KPSC (eTable 10).
Compared to the models developed based on the main cohort (M1), the models developed based on early detection cohort (E1) had slightly reduced sensitivity, PPV and fold increase in risk.
The number of patients needed to undergo evaluation to detect a single case of PDAC is shown in eTable 11. In theory, if we screen patients at the top 2.5% model predicted risk identified based on model M1, we need to evaluate 90 patients (95% CI 73–117) at KPSC and 61 patients (95% CI 57–65) at VA to detect a single case of PDAC (eTable 11). If Model E1 were used to determine the screening population threshold of 2.5% predicted risk, more patients would need to be evaluated (164 (125–241) at KPSC and 91 (84–99) at VA).
Sensitivity and post-hoc analyses:
When we applied model M1 to the subset of one of the KPSC internal validation datasets with complete values of predictors (no imputation), the c-index reached 0.77. The corresponding c-index was 0.76 when all the observations of the same validation dataset were used. The calibration plot of the complete case analysis is shown in eFigure 2. As expected, the average predicted and observed risks of PDAC in patients with complete predictors were higher compared to those of all patients.
When the traditional risk factors of pancreatic cancer (family history of pancreatic cancer, smoking status, BMI) were forced into model M1, the average c-index was 0.76 (SD 0.003), compared to 0.77 (SD 0.02) for the model without the three traditional risk factors. Adding the traditional risk factors did not improve performance of M1.
Exploratory analysis:
The average predicted risks of PDAC were 0.54% and 0.25% at the time of risk assessment in patients who were later diagnosed with stages I&II and III&IV cancer, respectively. This finding indicates that our model has the ability to detect both early stage and advanced cancer.
For demonstration, decision rules based on one of the trees built for M1 are displayed in Figure 3 (for the right side of the decision tree) and eFigure 3 (for the left side of the decision tree). For example, for patients >68 years of age who had abdominal pain and an ALT change >50, the observed and the predicted risks of PDAC in 18 months were 4.77% and 5.05%, respectively (Figure 3).
Risk prediction tool
To facilitate external application of the RSF-based prediction model (M1), we have developed a publicly available web-based tool (https://pcrisk.kp-scalresearch.org/). A hypothetical 70-year-old male patient with an HbA1c value of 7.5%, weight loss of 4 lbs., and an ALT increase of 4 IU/L in one year has an estimated 18-month risk of PDAC of 0.30%. Sample code to train, update, and validate the models is posted on GitHub (https://github.com/kpsc-informatics/PROTECT-RSF) and is also available in eTable 12.
DISCUSSION
We applied machine learning methods to EHR data to derive and validate clinical prediction models for pancreatic cancer across two large integrated healthcare systems. Despite inclusion of >500 potential features in the candidate pool, the machine learning models incorporated traditional parameters including age, glycated hemoglobin (HbA1c), alanine aminotransferase, weight, and abdominal pain. The final models were both parsimonious (with only 4–5 predictors) and reasonably accurate in both internal validation and external testing.
While there has been progress in developing approaches to early detection in high-risk patients based on either family history or genetic susceptibility21 as well as those with specific conditions such as late-onset diabetes,6 limited data exist on identification of patients at risk for sporadic forms of pancreatic cancer. This study presents a novel approach to risk stratification at the population level based on dynamic parameters contained within structured EHR data.
Although the parameters included in the model are well-established predictors of PDAC,5,6,10 their selection using an unbiased, comprehensive data-driven approach helped ensure inclusion of the most relevant combination of parameters. A recently developed model to predict risk of pancreatic cancer among patients with late or new onset diabetes at age 50 or later similarly identified increasing age, weight loss and change in blood glucose as key parameters for determining risk of pancreatic cancer in this patient population.5 The current model extends the concept of model-based risk prediction to a much broader population while maintaining reasonably high levels of discriminative accuracy for prediction of pancreatic cancer.
External testing is key to assessing model performance. In a review of 127 prediction models, Siontis et al. found that only 32 (25%) had undergone external testing at least once.22 Compared to the original studies, AUC estimates derived from the corresponding external testing studies significantly decreased by 0.05 (p<0.001).22 In the current study, the c-index declined by 0.08 (or 10%) when model M1 was directly transported and by about 0.06 (or 8%) and 0.09 (or 12%) after models M1 and E1 were recalibrated, respectively. Although the comparison should be interpreted cautiously, the larger reduction observed in the current study could be attributable to multiple factors. First, a higher proportion of PDAC cases in the VA cohort than in the KPSC cohort were pancreatic cancer deaths identified through mortality records. Second, given the differences in age and sex between the KPSC and VA populations, a higher incidence rate of PDAC was observed in the VA dataset than in the KPSC dataset, as expected. This could have affected model accuracy, especially for the models without recalibration.
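The relative declines quoted above follow from dividing each absolute drop in c-index by the internal-validation value. The short Python sketch below reproduces that arithmetic, assuming a baseline c-index of 0.77 for both M1 and E1 as reported for internal validation; the function name is hypothetical.

```python
def relative_decline_percent(baseline_c: float, absolute_drop: float) -> float:
    """Relative decline in c-index, in percent of the baseline value."""
    if baseline_c <= 0:
        raise ValueError("baseline c-index must be positive")
    return 100.0 * absolute_drop / baseline_c


BASELINE_C = 0.77  # internal-validation c-index reported for M1 and E1

for label, drop in [("M1, directly transported", 0.08),
                    ("M1, recalibrated", 0.06),
                    ("E1, recalibrated", 0.09)]:
    print(f"{label}: {relative_decline_percent(BASELINE_C, drop):.0f}%")
```

The three lines print 10%, 8%, and 12%, matching the percentages given in the text.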
Strengths of the current study include a comprehensive, data-driven approach to model development in a racially/ethnically diverse patient population, use of structured data elements, and external validation in a separate healthcare system with a distinct patient population. The present study has several limitations. First, several parameters identified in the prediction models (abdominal pain, abnormal ALT) are often associated with advanced-stage pancreatic cancer. The 30-day cancer-free period used in the present model is likely insufficient to provide a reasonable window of opportunity for intervention to impact the disease course. To address this concern, we also developed an early detection model that restricted the study population to patients with ≥90 days of cancer-free follow-up from the index date. Despite the reasonably high discriminative ability, the absolute risk in the highest risk category (top 2.5%) only approached 1% over 18 months. This level of risk is likely below the threshold for cost-effective screening based on currently available testing.23 Second, of the 1,479 and 4,582 events identified in the KPSC and VA cohorts, respectively, 300 and 2,564 events were captured by data sources other than the Cancer Registry. An evaluation based on the KPSC Cancer Registry over the same time window showed that about 90% of pancreatic cancer cases were PDAC. Third, to estimate sensitivity, specificity, PPV, and fold increase in risk, we relied on a subset of patients (~70% and ~80% of the total patients in the KPSC and VA cohorts, respectively) who had complete follow-up unless they died of pancreatic cancer. This restriction overestimated the risk of PDAC, because the patients who were excluded from this analysis were at risk for some periods of time. Finally, some important predictors had high percentages of unknown values (i.e., missing data). Although multiple imputation was performed, bias may occur if the missing at random (MAR) assumption is violated. Nevertheless, our sensitivity analysis showed that model performance remained the same when the validation was limited to records with complete predictors.
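For readers who wish to reproduce threshold-based measures such as sensitivity, specificity, PPV, and fold increase in risk, the Python sketch below shows one way to compute them after flagging the top 2.5% of predicted risk. It assumes complete follow-up (a binary outcome per patient) and defines fold increase as PPV divided by the overall proportion of cases; both are assumptions, and the function name and toy data are hypothetical.

```python
from typing import Dict, Sequence


def screening_metrics(risks: Sequence[float],
                      outcomes: Sequence[int],
                      top_fraction: float = 0.025) -> Dict[str, float]:
    """Metrics for flagging the top `top_fraction` of patients by predicted risk.

    `outcomes` holds 1 for a PDAC case and 0 otherwise. The data are assumed
    to contain at least one case and at least one non-case.
    """
    n = len(risks)
    n_flagged = max(1, round(n * top_fraction))
    ranked = sorted(range(n), key=lambda i: risks[i], reverse=True)
    flagged = set(ranked[:n_flagged])

    tp = sum(outcomes[i] for i in flagged)  # cases among flagged patients
    fp = n_flagged - tp                     # non-cases among flagged patients
    fn = sum(outcomes) - tp                 # cases that were not flagged
    tn = n - n_flagged - fn                 # non-cases that were not flagged

    ppv = tp / n_flagged
    overall = sum(outcomes) / n
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": ppv,
        "fold_increase": ppv / overall,
    }


# Toy data (hypothetical): 200 patients, 4 cases, 2 of them in the top 2.5%.
risks = [i / 200 for i in range(200)]
outcomes = [0] * 200
for case_index in (199, 198, 100, 50):
    outcomes[case_index] = 1
print(screening_metrics(risks, outcomes))
```

Because this calculation ignores censoring, restricting it to patients with complete follow-up, as described above, is what makes the binary outcome well defined.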
In conclusion, we developed a parsimonious clinical risk prediction model for sporadic pancreatic cancer in a large, diverse integrated health system and subsequently applied the model in a separate health system. The model identified five key factors in determining risk of pancreatic cancer. Findings from the present study provide a potential framework for a systematic approach to targeted screening for pancreatic cancer based on automated analysis of data in EHR.
Supplementary Material
Study Highlights.
What Is Known
Patients with pancreatic cancer are often diagnosed at late stages.
Early detection is needed to impact the natural history of disease progression and improve patient survival.
What Is New Here
Machine learning was used to develop a population-based model for early detection of pancreatic cancer. The model was internally and externally tested in cohorts of 1.8 million and 2.6 million individuals, respectively.
The final model included age, abdominal pain, weight change, glycated hemoglobin (HbA1c) and alanine transaminase (ALT) change. A publicly accessible web-based calculator is available online (https://pcrisk.kp-scalresearch.org/).
• Financial support:
Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA230442. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
Abbreviations
- ALT: alanine transaminase
- AUC: area under the curve
- CI: confidence interval
- DoD: Department of Defense
- EHR: electronic health record
- GND: Greenwood-Nam-D’Agostino
- HbA1c: glycated hemoglobin
- ICD-9-CM: Ninth Revision of International Classification of Diseases, Clinical Modification
- ICD-10-CM: Tenth Revision of International Classification of Diseases, Clinical Modification
- IR: incidence rate
- KPSC: Kaiser Permanente Southern California
- NDI: National Death Index
- NOD: new onset diabetes
- PC: pancreatic cancer
- PDAC: pancreatic ductal adenocarcinoma
- PPV: positive predictive value
- RSF: random survival forest
- SEER: Surveillance, Epidemiology, and End Results
- VA: Veterans Affairs
Footnotes
• Potential competing interests: The authors declare they have no conflict of interest for this study.
• Guarantor of the article: Dr. Wansu Chen accepts full responsibility for the conduct of the study. She has had access to the data and has control of the decision to publish.
References
- 1. National Cancer Institute Surveillance, Epidemiology, and End Results Program. Cancer Stat Facts: Pancreatic Cancer. https://seer.cancer.gov/statfacts/html/pancreas.html. Last accessed: July 26, 2022.
- 2. Stathis A, Moore MJ. Advanced pancreatic carcinoma: current treatment and future challenges. Nat Rev Clin Oncol. 2010;7(3):163–172.
- 3. Stokes JB, Nolan NJ, Stelow EB, et al. Preoperative capecitabine and concurrent radiation for borderline resectable pancreatic cancer. Ann Surg Oncol. 2011;18(3):619–627.
- 4. US Preventive Services Task Force, Owens DK, Davidson KW, et al. Screening for Pancreatic Cancer: US Preventive Services Task Force Reaffirmation Recommendation Statement. JAMA. 2019;322(5):438–444.
- 5. Sharma A, Kandlakunta H, Nagpal SJS, et al. Model to Determine Risk of Pancreatic Cancer in Patients With New-Onset Diabetes. Gastroenterology. 2018;155(3):730–739 e733.
- 6. Boursi B, Finkelman B, Giantonio BJ, et al. A Clinical Prediction Model to Assess Risk for Pancreatic Cancer Among Patients With New-Onset Diabetes. Gastroenterology. 2017;152(4):840–850 e843.
- 7. Chen W, Butler RK, Lustigova E, Chari ST, Wu BU. Validation of the Enriching New-Onset Diabetes for Pancreatic Cancer Model in a Diverse and Integrated Healthcare Setting. Dig Dis Sci. 2021;66(1):78–87.
- 8. Kim J, Yuan C, Babic A, et al. Genetic and Circulating Biomarker Data Improve Risk Prediction for Pancreatic Cancer in the General Population. Cancer Epidemiol Biomarkers Prev. 2020;29(5):999–1008.
- 9. Klein AP, Lindstrom S, Mendelsohn JB, et al. An absolute risk model to identify individuals at elevated risk for pancreatic cancer in the general population. PLoS One. 2013;8(9):e72311.
- 10. Yu A, Woo SM, Joo J, et al. Development and Validation of a Prediction Model to Estimate Individual Risk of Pancreatic Cancer. PLoS One. 2016;11(1):e0146473.
- 11. Koebnick C, Langer-Gould AM, Gould MK, et al. Sociodemographic characteristics of members of a large, integrated health care system: comparison with US Census Bureau data. Perm J. 2012;16(3):37–41.
- 12. Fihn SD, Francis J, Clancy C, et al. Insights from advanced analytics at the Veterans Health Administration. Health Aff (Millwood). 2014;33(7):1203–1211.
- 13. Chen W, Yao J, Liang Z, et al. Temporal Trends in Mortality Rates among Kaiser Permanente Southern California Health Plan Enrollees, 2001–2016. Perm J. 2019;23.
- 14. Wright MN, Ziegler A. ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R. J Stat Softw. 2017;77(1).
- 15. Little RJA. Missing-Data Adjustments in Large Surveys. J Bus Econ Stat. 1988;6(3):287–296.
- 16. Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2(3):841–860.
- 17. Dietrich S, Floegel A, Troll M, et al. Random Survival Forest in practice: a method for modelling complex metabolomics data in time to event analysis. Int J Epidemiol. 2016;45(5):1406–1420.
- 18. Ishwaran H, Kogalur UB. randomForestSRC: Fast Unified Random Forests for Survival, Regression, and Classification (RF-SRC). http://web.ccs.miami.edu/~hishwaran/. Accessed July 8, 2021.
- 19. Stone M. Cross-Validatory Choice and Assessment of Statistical Predictions. J R Stat Soc Series B Stat Methodol. 1974;36(2):111–147.
- 20. Demler OV, Paynter NP, Cook NR. Tests of calibration and goodness-of-fit in the survival setting. Stat Med. 2015;34(10):1659–1680.
- 21. Goggins M, Overbeek KA, Brand R, et al. Management of patients with increased risk for familial pancreatic cancer: updated recommendations from the International Cancer of the Pancreas Screening (CAPS) Consortium. Gut. 2020;69(1):7–17.
- 22. Siontis GC, Tzoulaki I, Castaldi PJ, Ioannidis JP. External validation of new risk prediction models is infrequent and reveals worse prognostic discrimination. J Clin Epidemiol. 2015;68(1):25–34.
- 23. Schwartz NRM, Matrisian LM, Shrader EE, Feng Z, Chari S, Roth JA. Potential Cost-Effectiveness of Risk-Based Pancreatic Cancer Screening in Patients With New-Onset Diabetes. J Natl Compr Canc Netw. 2021:1–9.