Abstract
The multi-modal and unstructured nature of observational data in Electronic Health Records (EHR) is currently a significant obstacle to the application of machine learning for risk stratification. In this study, we develop a deep learning framework that incorporates longitudinal clinical data from EHR to infer risk for pancreatic cancer (PC). This framework includes a novel training protocol, which enforces an emphasis on early detection by applying an independent Poisson-random mask on proximal-time measurements for each variable. Data fusion for irregular multivariate time-series features is enabled by a “grouped” neural network (GrpNN) architecture, which uses representation learning to generate a dimensionally reduced vector for each measurement set before making a final prediction. These models were evaluated using EHR data from Columbia University Irving Medical Center-New York Presbyterian Hospital. Our framework demonstrated better performance on early detection (AUROC 0.671, 95% CI 0.667–0.675, p < 0.001) at 12 months prior to diagnosis compared to logistic regression, XGBoost, and feedforward neural network baselines. We demonstrate that our masking strategy yields greater improvements at distal times prior to diagnosis, and that our GrpNN model improves generalizability by reducing overfitting relative to the feedforward baseline. The results were consistent across reported race. Our proposed algorithm is potentially generalizable to other diseases, including but not limited to cancers where early detection can improve survival.
Keywords: Electronic Health Records, Pancreatic cancer, Early detection of cancer, Machine learning
1. Introduction
Pancreatic cancer (PC) will be the second most common cause of cancer-related death in the United States (US) by 2030 [1]. The majority of pancreatic cancers (93%) are pancreatic ductal adenocarcinomas (PDAC), and an estimated 60,430 new cases and 48,220 deaths were projected for 2021. Compared to the incidence rates of other major US cancer types (prostate, breast, lung and bronchus, colorectal) from 1975 to 2016, PC is the only malignancy with a constant and steady rise in incidence and mortality, and most cases present at late stages with regional spread (29%) or distant metastasis (52%) [2]. This is largely because there is no effective approach for accurately identifying individuals who are at high risk for developing PC. Despite the potential benefits of early detection with advances in effective screening modalities, screening for PC is currently limited to the small subset of the population with known familial or genetic risk [3–5], which accounts for only 10% of patients. However, a much larger pool of individuals who are at elevated risk due to other potential risk factors, such as increasing age, obesity, tobacco use, and race/ethnicity [6–10], may benefit from screening and early detection interventions. Thus, improving our ability to reliably detect individuals at high risk for PC with nonhereditary risk factors is crucial to identifying the patients who will benefit from screening for early detection.
Electronic Health Records (EHR) contain the most comprehensive and detailed information regarding the clinical trajectories of patients, including laboratory test results, demographic information, and providers’ notes. Despite being a rich source of data for exploring potential risk factors associated with the development of PC, their unstructured and multi-modal nature presents a major obstacle to applying standard deep learning-based prediction frameworks. In this study, we used EHR data from Columbia University Irving Medical Center-New York Presbyterian Hospital (CUIMC-NYP). Our aim was to develop a novel deep learning framework for incorporating longitudinal clinical data from EHR to enhance risk prediction for PC. The central challenges associated with the development of these strategies are [11,12]:
Mitigation of confounding biases that arise from the non-randomized selection of control groups. In addition to biases stemming from baseline characteristics such as demographics and risk factors, it is well known that factors related to family history and pre-existing conditions such as chronic pancreatitis can confer a relatively high risk ratio for the development of PC [13,14]. Discrepancies in the joint distributions of these observables can thus introduce significant confounding effects on clinical events, affecting both the measured values of laboratory tests and the patterns of missingness for these measurements (e.g., missing baseline treatment, missing clinical information, missing laboratory measurements).
Robust strategies for feature selection and data fusion that specifically address the complex structure of laboratory test data, which consists of sparse and unevenly distributed clinical records over time arising from differences in clinical care for individual patients. In our analysis we found that the performance of black-box models is fairly strong when applied to a straightforward concatenation of the lab test data. However, the lack of interpretability of this approach undermines trust to a degree that presents a prohibitive obstacle to adoption.
The deep learning architecture considered here borrows ideas from representation learning and applies them to the task of data fusion over multiple irregular time-series measurements. The resulting “grouped” neural network (GrpNN) design enforces the distillation of the information from each class of measurements independently before merging them to make a final prediction, and may be interpreted as a form of automated feature learning. This separation of variables allows the algorithm to maintain some level of human-interpretability as information flows through the network while producing measurable improvements in predictive performance. Furthermore, our data augmentation strategy of randomly varying the number of lab test measurements per patient produces significant improvements in both predictive performance, particularly for early detection, and robustness against overfitting.
The rest of the paper is organized as follows. Section 2 discusses related work; Section 3 details our study design and the proposed deep learning framework, whose main components are the GrpNN architecture and the random masking strategy; Section 4 presents our key findings; Section 5 discusses the significance of the study and findings, additional rationale for our selected group, and limitations of our study; finally, we conclude with implications of this research and suggestions for further investigation.
2. Related work
Recently, the application of deep learning techniques to the problem of cancer screening and risk stratification has seen a flurry of activity. However, thus far with respect to PC, studies in the literature have largely been limited to the analysis of well-structured data types such as genomic and image-based data. For example, a graph-based machine learning approach has been used in [15] to infer the structure of gene regulation networks distinctly associated with different progressive stages of PC. In [16], the authors leverage a combination of biomarkers and RNA-based variants obtained from peripheral blood samples, and use deep learning-based sequencing techniques to achieve strong separation on the problem of distinguishing PC from chronic pancreatitis. Image-based analyses in the literature include methods for performing PC classification using endosonographic images [17] as well as automated segmentation of pancreatic features from CT scans using techniques from computer vision [18]. Similarly, for structured time-series variables such as EEG readings, data mining techniques have been demonstrated to be useful for knowledge discovery tasks [19].
Until recently, the data-processing issues plaguing unstructured data types (missing values, misalignment, high dimensionality, etc.) have made them prohibitively difficult to analyze with standard deep-learning pipelines. Traditional approaches have thus typically relied on dimensionally reduced features [13,14], for example through the use of single-type homogeneous features extracted from complex heterogeneous longitudinal data [20,21]. These simplifications of the input data typically require a significant amount of manual pre-processing and result in an irreducible loss of potentially predictive information. Furthermore, the lack of interpretability of standard “black-box” prediction algorithms undermines trust and consequently their utility for clinical applications. Recently, a number of AI-based modeling strategies based on representation learning have been developed to address the multi-modal complexity of EHR data [22]. For example, in [23] the authors use a combination of long short-term memory and convolutional networks to develop high-performance models for cervical cancer prediction from semi-structured time series of smear images. However, to our knowledge these more sophisticated approaches have not yet been applied to the task of PC risk prediction.
Due to the richness of information contained in unstructured multi-modal biomedical data, the general problem of data fusion in this domain has recently become an area of active research, and a review of recent advances can be found in [24]. One of the most significant challenges in this area relates to the difficulty of constructing models that generalize to unseen data containing significant structural differences relative to the model training data, which may arise, e.g., due to differences in reporting standards across institutions. To address these issues the authors in [25] discuss a taxonomy for different levels of reproducibility and analyze the criteria that lead to errors at each level. In another related work [26], the authors develop an ontological approach for weighting the dependence of model predictions on high-level attributes with more semantic meaning, and demonstrate improvements in model generalizability on predictions related to student interactions with learning management systems. Another significant issue relates to the fidelity of information preserved under multi-modal fusion in the presence of imperfect or inaccurate sources, which has been reviewed in [27] in the context of information fusion.
3. Research approach
3.1. Study design
The study design includes the selection of lab variables, selection of negative controls via propensity score matching (Fig. 1 and Fig. 2) and configuring pre-diagnosis data to test our GrpNN deep learning algorithm for PC classification. These tests include an evaluation of the efficacy of our prediction model on early detection of PC via our random masking strategy (Fig. 3). Comparative analysis of our GrpNN model was performed against a logistic regression model (LR) and simple black-box models (i.e., GrpNN model without embedding layer, Fig. 3A).
Fig. 1.
Data preprocessing flow chart. We obtained 458,252 patient samples with 30,195 lab variables from CUIMC-NYP EHR data using our data retrieval criteria (Table 1) and processed them into the final dataset: a cohort from which PC patients who received a pancreatitis diagnosis before PC and nonPC patients with pancreatitis were eliminated (9,057 patients in total, of whom 834 are PC patients).
Fig. 2.
The evaluation of data structure discrepancies between PC and nonPC. (A) Among the selected PC (1,200) and nonPC (161,849) patients, we investigated data structure in terms of data completeness and the average number of measurements for each variable, and the discrepancies between PC and nonPC, which can introduce bias into the prediction model. We sorted the 418 lab variables by data completeness in the PC group; panel A shows the data structure of the top 50 lab variables. (B) We selected the first 33 lab variables (panel A, gray dotted line), where the measurement system changes from the Cerner system to NYP. The discrepancies in data completeness between PC and nonPC decreased when we restricted the cohort to patients who have at least one of the selected 33 lab variables. We then assigned random diagnosis dates to nonPC patients and configured the nonPC dataset into pre-diagnosis data based on the average percentage reduction calculated when configuring the PC dataset into pre-diagnosis data. This brought the average number of measurements in the nonPC group close to that of the PC group, as shown in panel B.
Fig. 3.
A grouped deep neural network (GrpNN) incorporating a random masking strategy. (A) GrpNN, where the time-series measurements of thirty-three lab variables individually pass through the embedding network, producing a new 4-dimensional representation of each variable, which is then merged and passed through the prediction network to make a prediction. (B) The random masking strategy applied to each variable, incorporated during GrpNN training. The random samples are generated every training epoch and applied to the batch data before it is fed into the GrpNN model. (C) The resultant masked histograms for two example lab variables (‘Creatinine’ and ‘Magnesium level’) show reduced discrepancies in the number of lab measurements between PC and nonPC.
3.1.1. Selecting lab variables
Lab variables containing missing values for more than 99% of patients were eliminated, resulting in 6,392 unique variables, from which 418 of the most clinically relevant variables were identified by human experts based on common standards [28]. When selecting lab variables, we evaluated differences in average missingness and the number of measurements of each lab variable between PC and nonPC patients (Fig. 2). We then sorted the 418 lab variables by data completeness in the PC group and selected the first 33 (Fig. 2A, gray dotted line), where the measurement system changes from the Cerner system to NYP.
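As a concrete illustration of this filtering step, a minimal pandas sketch is shown below; the long-format column names (`patient_id`, `variable`, `date`, `value`) are our assumptions, not the authors' schema.

```python
# Minimal sketch of the 99%-missingness filter described above.
import pandas as pd

def filter_rare_variables(labs: pd.DataFrame, n_patients: int) -> pd.DataFrame:
    """labs: long-format table with columns [patient_id, variable, date, value]."""
    # Fraction of patients with at least one measurement of each lab variable.
    completeness = labs.groupby("variable")["patient_id"].nunique() / n_patients
    # Drop variables that are missing for more than 99% of patients.
    kept = completeness[completeness > 0.01].index
    return labs[labs["variable"].isin(kept)]
```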
3.1.2. Filtering negative controls from homogeneous sub-populations
The identified patients were dichotomized into two groups: (1) the risk factor (RF) group, which has at least one of the four risk factors (i.e., smoking, obesity, diabetes, or chronic pancreatitis) documented, and (2) the nonRF group, which does not have any of those risk factors documented. Our analysis focused on the RF group due to the potential for additional confounding effects stemming from the undocumented information in the nonRF group. Within the RF group we found that the sub-population of patients who received either an imaging diagnosis or a biopsy (red box in Fig. 1) contains significantly higher rates of measurements in their lab test results, which is consistent with the fact that imaging procedures and biopsies are typically associated with extended hospital stays and more frequent monitoring of lab test variables. We further constrained our dataset to the RF group who received either an imaging diagnosis or a biopsy to enable future prospective studies on these subpopulations.
3.1.3. Selecting negative controls with propensity score matching
In order to eliminate confounding biases in lab measurements due to baseline characteristics (e.g., race, ethnicity, sex, zip code, patient language, age, smoking, obesity, diabetes), our final negative control group was selected on the basis of matching the full joint probability distributions of these observables. This was done in a systematic way with propensity score matching using the Pymatch package for Python (version 3.9). We performed 100 iterations of logistic regression fits, given the imbalance of the data (i.e., 124,387 nonPC vs 834 PC), and measured the average accuracy (Fig. 4), stopping at an accuracy close to 50% (implying inseparability of the two populations in the data).
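A minimal sketch of this matching step with the Pymatch package named above is given below; the DataFrame layout, the `pc_case` column name, and the match threshold are our assumptions rather than the authors' exact settings.

```python
# Sketch of propensity score matching on baseline characteristics with Pymatch.
from pymatch.Matcher import Matcher

# `pc` and `nonpc`: DataFrames of baseline characteristics (race, ethnicity,
# sex, zip code, language, age, smoking, obesity, diabetes) with a binary
# indicator column `pc_case` (1 for PC, 0 for nonPC).
m = Matcher(pc, nonpc, yvar="pc_case", exclude=["patient_id"])

# Fit 100 logistic regressions on balanced subsamples of the imbalanced data
# and report the average accuracy; values near 50% imply that the two
# populations are inseparable on the baseline characteristics.
m.fit_scores(balance=True, nmodels=100)
m.predict_scores()
m.match(method="min", nmatches=1, threshold=0.001)
matched = m.matched_data
```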
Fig. 4.
The propensity score matching. The baseline characteristics are potential confounders as they can be reflected in lab measurements. In the dataset composed of 126,655 nonPC and 835 PC patients (Fig. 1), before the propensity score matching procedure, the separability resulting from the baseline characteristics (i.e., race, ethnicity, sex, zip code, patient language, age, smoking, obesity, diabetes) was 72.9%. By applying propensity score matching, we reduced the separability of our final dataset to 54.6%.
3.1.4. Configuring pre-diagnosis data
We configured the PC dataset into pre-diagnosis data by eliminating the lab measurements obtained after each patient's first PC diagnosis date. Based on the average percentage reduction of the total number of measurements in this process for each variable, we assigned random diagnosis dates to nonPC patients and configured the nonPC dataset into pre-diagnosis data in the same way.
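A minimal pandas sketch of this pre-diagnosis configuration, under the same assumed long-format schema as above:

```python
# Keep only measurements recorded strictly before each patient's (actual or
# randomly assigned) diagnosis date.
import pandas as pd

def to_prediagnosis(labs: pd.DataFrame, dx_dates: pd.Series) -> pd.DataFrame:
    """labs: long-format table [patient_id, variable, date, value];
    dx_dates: first PC diagnosis date per patient (actual for PC patients,
    randomly assigned for nonPC patients), indexed by patient_id."""
    cutoff = labs["patient_id"].map(dx_dates)
    return labs[labs["date"] < cutoff]
```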
3.2. Data collection and preparation
Radiology notes, pathology notes, and International Classification of Diseases (ICD) codes were used to create the data retrieval criteria (Table 1). Our searches of the CUIMC-NYP EHR from 2004 to 2021, with over 1.1 M patient encounters, yielded a total of 458,252 patient samples including 30,195 individual lab variables, with 7,124 PC patients confirmed by ICD codes 157.* or C25.*. Pathologic diagnoses (i.e., histology codes) confirmed ~94% of our PC patients as having PDAC. In addition to the ICD code for PC, we included ICD codes for four well-known non-genetic risk factors associated with PC [13]: (1) smoking; (2) obesity and overweight; (3) diabetes; and (4) chronic pancreatitis. The individuals who met any one of the criteria in Table 1 but did not have the ICD code for PC formed the negative controls (nonPC). Considering that patients with pancreatitis already undergo screening and surveillance, and thus would not benefit from our model, we excluded PC patients who received a pancreatitis diagnosis before their PC diagnosis and nonPC patients with pancreatitis from our analysis. The flow chart depicting the data processing pathway is shown in Fig. 1, and brief statistical descriptions of the baseline characteristics of our final dataset are shown in Table 2.
Table 1.
Electronic health records data retrieval criteria. Patients can belong to more than one criterion. Thus, overlaps exist among the patients identified from each criterion.
| # | EHR data retrieval criteria | Number of patients identified |
|---|---|---|
| #1 | Any patient who received Magnetic Resonance (MR), Computerized Tomography (CT) imaging of abdomen/pancreas or Magnetic resonance cholangiopancreatography (MRCP) | 226,615 |
| #2 | Any patient who received a single pathology report containing both the term “pancrea” and any one of the terms (malignan, carcinoma, cancer, neoplas) | 279,540 |
| #3 | Any patient who received an ICD diagnosis code for pancreatic cancer (157.* or C25.*) | 7,124 |
| #4 | Any patient who received an ICD diagnosis code for smoking (305.1 or Z72.0 or F17.*) | 62,175 |
| #5 | Any patient who received an ICD diagnosis code for obesity (278.0* or E66.*) | 135,055 |
| #6 | Any patient who received an ICD diagnosis code for diabetes (250.* or E10.* or E11.*) | 158,372 |
| #7 | Any patient who received an ICD diagnosis code for chronic pancreatitis (577.1 or K86.0 or K86.1) | 3,791 |
Table 2.
Baseline characteristics. This table shows brief baseline characteristics for the final dataset used in the analysis. The full demographics include 8 categories of race, 8 categories of ethnicity, 66 categories of language, and 103 categories of zip code, which are not shown in this table.
| Characteristic | | | PC/nonPC |
|---|---|---|---|
| Total | | | 834 (9%)/8,223 (91%) |
| Risk factors | Smoking | Yes | 162 (19%)/1,575 (19%) |
| | | Not documented | 672 (81%)/6,648 (81%) |
| | Obesity | Yes | 153 (18%)/1,525 (19%) |
| | | Not documented | 681 (82%)/6,698 (81%) |
| | Diabetes | Yes | 628 (75%)/6,346 (77%) |
| | | Not documented | 206 (25%)/1,877 (23%) |
| Demographics | Race | White | 406 (49%)/3,990 (49%) |
| | | Asian | 22 (3%)/206 (3%) |
| | | African American | 108 (13%)/1,161 (14%) |
| | | Other combinations not described | 58 (7%)/594 (7%) |
| | | Unknown | 236 (28%)/2,208 (27%) |
| | Ethnicity | Caucasian | 11 (1%)/120 (1%) |
| | | Hispanic | 6 (1%)/59 (1%) |
| | | Not Hispanic | 180 (22%)/1,453 (18%) |
| | | African American | 89 (11%)/791 (10%) |
| | | Unknown | 546 (65%)/5,778 (70%) |
| | Sex | Male | 461 (55%)/4,544 (55%) |
| | | Female | 373 (45%)/3,679 (45%) |
| | Zip code | Starts with 0 (MA, NH, ME, VT, CT, NJ) | 131 (16%)/1,154 (14%) |
| | | Starts with 1 (NY, PA) | 681 (82%)/6,805 (83%) |
| | | Starts with 3 (GA, FL, AL, TN, MS) | 14 (2%)/130 (2%) |
| | Language | English | 463 (56%)/4,433 (54%) |
| | | Spanish | 68 (8%)/794 (10%) |
| | | Other | 250 (30%)/2,611 (32%) |
| | | Unknown | 53 (6%)/385 (5%) |
| | Age | | 73.4 (CI 95% = 72.6–74.2)/73.9 (CI 95% = 73.6–74.2) |
3.3. Deep learning framework
3.3.1. GrpNN architecture
The irregular multivariate nature of EHR laboratory test data requires a non-trivial architecture for feature extraction and data fusion for use with AI-based predictive modeling. The GrpNN architecture we employ to address this issue consists of two sequential components, as shown in Fig. 3A: an embedding layer followed by a prediction layer. In the embedding layer, an independent set of trainable weights is used to learn a dimensionally reduced representation of each time-series variable, thus producing one simplified feature vector for each sequence of measurements. These learned feature vectors are then concatenated and passed through the prediction layer, which uses a simple non-linear transformation to project the data to a binary prediction space using the standard log softmax function.
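A minimal PyTorch sketch of this two-component design is given below. The 33 variables and 4-dimensional embeddings come from the text; the hidden widths, the use of plain fully connected layers over fixed-length zero-padded sequences, and all names are illustrative assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

class GrpNN(nn.Module):
    def __init__(self, n_vars: int = 33, seq_len: int = 300,
                 emb_dim: int = 4, hidden: int = 64):
        super().__init__()
        # Embedding layer: one independent network per lab variable, each
        # distilling that variable's time series into a small feature vector.
        self.embedders = nn.ModuleList([
            nn.Sequential(nn.Linear(seq_len, hidden), nn.ReLU(),
                          nn.Linear(hidden, emb_dim))
            for _ in range(n_vars)
        ])
        # Prediction layer: merge the per-variable vectors and project them
        # to a binary prediction space with a log softmax.
        self.predictor = nn.Sequential(
            nn.Linear(n_vars * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2), nn.LogSoftmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_vars, seq_len), time-ordered and zero-padded.
        feats = [emb(x[:, i, :]) for i, emb in enumerate(self.embedders)]
        return self.predictor(torch.cat(feats, dim=-1))
```

By construction, each 4-dimensional feature vector can be inspected in isolation, which is what preserves the human-interpretability discussed below.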
This representation-based approach has several advantages: (1) these models generalize naturally to multi-task frameworks, since the learned feature vectors are often useful for a wide variety of potential downstream tasks; (2) the architectures retain a level of human-interpretability, since the graphical structure that determines how the features are segmented in the embedding layer is specified by human experts; (3) these models are associated with overall improvements in performance, likely due to the injection of human knowledge into their embedding architectures. One salient aspect of how our GrpNN architecture is guided by human input is its use of independent trainable weights to learn distinct representations of each time-series sequence, as shown in Fig. 3, since there is no a priori reason to assume a relationship between the utility of features across different laboratory tests. This is in contrast to architectures for natural language processing, where the feature vectors for each word are learned through a single set of shared weights, since there it is assumed that the representation for each word contributes the same type of information to downstream tasks.
This GrpNN architecture thus represents a systematic way of leveraging the complete temporal information from EHR in deep learning models in a manner that resembles a clinical workflow and further improves interpretability. For example, the feature vectors learned by the embedding layer closely resemble the way a clinician might analyze a sequence of blood-cell-count measurements to determine what their trending values might imply for the probability of being diagnosed with PC. Similarly, the process of combining different feature vectors for downstream tasks such as classification and prediction, often referred to as “pooling” [26], is analogous to the way a clinician might weigh distilled data from different sources toward making a final diagnosis. The structured graphical neural network architectures that are characteristic of our GrpNN model are often referred to in the literature as “explainable algorithms”, to distinguish them from the “black-box” architectures (e.g., GrpNN without the embedding layer, Fig. 3A) of traditional deep learning-based approaches.
3.3.2. Random masking strategy
Our training protocol employs a novel strategy for data augmentation which randomly applies a mask to measurements observed at proximal times within each time-series. This random masking strategy, described in Fig. 3B, serves a dual purpose. The first is mitigating bias that arises from discrepancies in the number of lab measurements between the PC and nonPC populations. The injection of randomness into the length of each time-series effectively eliminates any dependence of the predictions on this feature. Analogous techniques are standard for modern image classifiers, which apply a random rotation to each image in order to remove the dependence of predictions on the orientation of the picture. The second is incorporating an emphasis on early risk identification into the training incentives of the GrpNN model. The use of a Poisson distribution ensures that the model is only rarely exposed to examples of patients that contain the full recorded history of measurements. This training objective is consequently similar to a real clinical early diagnosis scenario, in which a doctor is tasked with diagnosing patients based on a limited amount of clinical information depending on the number of measurements available prior to PC diagnosis (Fig. 5A).
Fig. 5.
Early detection performance. We evaluated the early detection performance of the GrpNN model with and without the random masking strategy (GrpNN vs GrpNN+RM). (A) A flowchart describing dataset preparation for estimating the prediction score at x months prior to the diagnosis date. (B) The early detection performance significantly improves with the random masking strategy. (C) The post hoc analysis of GrpNN+RM on the hold-out set stratified by race. The early detection performance was consistent across reported race. The race abbreviations [W, A, B, O, U] represent [White, Asian, Black, Other combinations not described, Unknown].
For the implementation of this random masking strategy, we assumed that the distribution of the number of longitudinal measurements followed a Poisson distribution, and computed the observed rate parameter for each of the 33 variables independently. Based on these rates we drew N random samples from the corresponding Poisson distribution, where N equals the number of patient samples in the train or test set. We then applied the values in the random samples to mask the patients’ lab measurements (i.e., hiding original measurements according to the value drawn). Thus, if a patient has 20 measurements for a specific lab variable and the value generated by the random mask is 7, the algorithm is only exposed to the first 7 measurements in the sequence while the remaining 13 measurements are masked. These random samples were regenerated at the beginning of each epoch in the training loop. By injecting variance into each measurement variable independently rather than uniformly (e.g., according to hospital visit dates), we additionally mitigated biases that could arise from potential confounding factors affecting the number of lab tests ordered per visit.
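A minimal NumPy sketch of this masking step is shown below; the array layout and the zero-fill masking convention are our assumptions.

```python
import numpy as np

def apply_random_mask(batch: np.ndarray, rates: np.ndarray,
                      rng: np.random.Generator) -> np.ndarray:
    """batch: (n_patients, n_vars, seq_len) measurements, time-ordered and
    zero-padded; rates: per-variable Poisson rates estimated from the
    observed numbers of measurements."""
    n_patients, n_vars, _ = batch.shape
    masked = batch.copy()
    for v in range(n_vars):
        # One independent Poisson draw per patient for this variable.
        keep = rng.poisson(rates[v], size=n_patients)
        for p in range(n_patients):
            # Expose only the first keep[p] measurements; hide the rest.
            masked[p, v, keep[p]:] = 0.0
    return masked
```

In training, this function would be called with freshly drawn samples at the start of each epoch, as described above.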
3.3.3. Model training strategy
The final dataset (834 PC and 8,223 nonPC patients) was split into a train set (80%) and a hold-out set (20%; 165 PC and 1,807 nonPC patients). We used early stopping in a 100-epoch training loop by monitoring loss on the hold-out set. We used the area under the receiver-operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) as performance evaluation metrics. We evaluated the performance of our deep learning models built from 5 trials of shuffled train sets and compared it with logistic regression and black-box neural network models. Mean scores with 95% confidence intervals (CI 95) are presented in Table 3. Since arbitrary random seeds were used to initialize GrpNN, the variation in the AUROC and AUPRC shown in Table 3 reflects model sensitivity to initialization. We also conducted a post hoc analysis of GrpNN with the random masking strategy on the hold-out set stratified by race, with categories composed of: White: 864 (77 PC), Asian: 45 (3 PC), Black: 249 (25 PC), Other combinations: 139 (12 PC), and Unknown: 493 (48 PC).
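A minimal sketch of this training procedure, continuing the `GrpNN` and `apply_random_mask` sketches above (the patience value, learning rate, and full-batch updates are illustrative assumptions):

```python
import copy
import numpy as np
import torch
import torch.nn.functional as F

def train(model, train_x, train_y, val_x, val_y, rates,
          max_epochs=100, patience=10, lr=1e-3):
    """train_y/val_y: LongTensors of class indices (0 = nonPC, 1 = PC)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    rng = np.random.default_rng()
    best_loss, best_state, wait = float("inf"), None, 0
    for _ in range(max_epochs):
        # Random masks are regenerated at the start of every epoch.
        masked = torch.as_tensor(
            apply_random_mask(train_x.numpy(), rates, rng), dtype=torch.float32)
        model.train()
        opt.zero_grad()
        loss = F.nll_loss(model(masked), train_y)  # model emits log-probabilities
        loss.backward()
        opt.step()
        # Early stopping on hold-out loss.
        model.eval()
        with torch.no_grad():
            val_loss = F.nll_loss(model(val_x), val_y).item()
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0
            best_state = copy.deepcopy(model.state_dict())
        else:
            wait += 1
            if wait >= patience:
                break
    model.load_state_dict(best_state)
    return model
```

AUROC and AUPRC on the hold-out set can then be computed with standard utilities such as scikit-learn's `roc_auc_score` and `average_precision_score`.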
Table 3.
Comparative analysis of prediction scores. We split the data into a train set (80%) and a hold-out set (20%) and present mean AUROC and AUPRC with 95% confidence intervals. The GrpNN model outperformed the logistic regression model (LR). The GrpNN model reduced the overfitting phenomenon seen in the black-box model and thus improved generalization, but no significant difference in performance was found between GrpNN and the black-box model (i.e., GrpNN without the embedding layer, Fig. 3A). With the random masking strategy incorporated, the GrpNN model showed statistically better performance than the GrpNN model alone, as well as the LR and black-box models.
| Prediction model | AUROC (train) | AUPRC (train) | AUROC (0 months, hold-out) | AUPRC (0 months, hold-out) | AUROC (12 months, hold-out) | AUPRC (12 months, hold-out) |
|---|---|---|---|---|---|---|
| Logistic regression model | 0.655 ± 0.000 | 0.641 ± 0.000 | 0.629 ± 0.000 | 0.525 ± 0.000 | 0.491 ± 0.000 | 0.094 ± 0.000 |
| XGBoost | 0.965 ± 0.000 | 0.996 ± 0.000 | 0.656 ± 0.000 | 0.530 ± 0.000 | 0.501 ± 0.000 | 0.089 ± 0.000 |
| Black-box model | 0.919 ± 0.005 | 0.671 ± 0.012 | 0.826 ± 0.003 | 0.524 ± 0.021 | 0.644 ± 0.004 | 0.097 ± 0.001 |
| GrpNN | 0.903 ± 0.005 | 0.636 ± 0.016 | 0.827 ± 0.003 | 0.543 ± 0.017 | 0.649 ± 0.006 | 0.099 ± 0.002 |
| GrpNN + random masking strategy | 0.891 ± 0.022 | 0.604 ± 0.054 | 0.818 ± 0.022 | 0.530 ± 0.034 | 0.671 ± 0.004 | 0.115 ± 0.005 |
The effectiveness of our GrpNN model specifically on the task of early prediction was measured by testing model performance as a function of the number of observed measurements, which we used as a metric for further optimizing the masking strategy toward this task. In particular, we found that strictly limiting the random mask to a set of independently estimated Poisson distributions results in poor performance on predictions in the limit where many measurements are observed, because the model sees such examples exceedingly rarely. To address this issue we applied a randomized multiplicative factor between 1 and 300 (300 being the maximum length of the time-series measurements for each patient) to the Poisson rates that determine the random mask, resampled at the start of each epoch. Thus, each epoch alternates between training objectives that focus on distal versus proximal predictions. The early diagnosis capability of the GrpNN model with and without the random masking strategy was tested on the hold-out set by measuring prediction scores at 3, 6, 12, 24, and 36 months prior to the diagnosis date (Fig. 5B).
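Continuing the sketches above, this epoch-level rate scaling amounts to a one-line change to the masking call; the text does not specify the sampling distribution for the factor, so the uniform draw below (and sharing one factor across variables) is our assumption.

```python
# Resampled at the start of each epoch, inside the training loop.
factor = rng.uniform(1.0, 300.0)
masked_batch = apply_random_mask(train_x.numpy(), rates * factor, rng)
```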
4. Findings
The final dataset for evaluating our framework consisted of the pre-diagnosis lab data of patients without chronic pancreatitis for whom propensity scores were successfully matched with respect to baseline characteristics (i.e., race, ethnicity, sex, zip code, patient language, age, smoking, obesity, and diabetes) (Fig. 4).
The baseline characteristics are potential confounders as they can be reflected in lab measurement values. In the dataset composed of 126,655 nonPC and 835 PC patients (Fig. 1), before the propensity score matching procedure, the separability resulting from the baseline characteristics was 72.9% (Fig. 4). We reduced this separability to 54.6% through propensity score matching and constructed the final dataset (Fig. 1). Another potential confounder is data structure discrepancies between PC and nonPC. For example, there will be different patterns of “missingness” since not all patients have the same number of hospital visits or examinations. After filtering patients by the selected lab variables and configuring the pre-diagnosis data of nonPC patients based on the PC patients’ pre-diagnosis data, we were able to minimize discrepancies in data completeness and the average number of measurements between PC and nonPC (Fig. 2). The separability assessment with respect to the data structure was 65.7%, which was further addressed using the random masking strategy (Fig. 3B and C).
The GrpNN model reduced the overfitting phenomenon seen in the black-box model and improved generalization (Table 3). Using the random masking strategy, we built a GrpNN model that simulates hypothetical scenarios of doctors providing an early diagnosis based on a limited number of laboratory test results (Fig. 5A). The early detection capability significantly improved with the random masking strategy, as shown in Fig. 5B. We achieved an AUROC of 0.671 (CI 95, 0.667–0.675) and an AUPRC of 0.115 (CI 95, 0.110–0.120) at 12 months prior to diagnosis, statistically better performance than the GrpNN model alone as well as the LR and black-box models (Table 3). The performance scores of GrpNN with and without the random masking strategy diverged as the time from diagnosis increased, suggesting that the efficacy of the random masking strategy is greater the more distal the time prior to diagnosis. However, at 0 months prior to diagnosis, when the maximum pre-diagnosis clinical information is available for each patient, the results from GrpNN with the random masking strategy showed no statistical difference from either GrpNN without the random masking strategy or the black-box model. The early detection performance was consistent across reported race (Fig. 5C).
5. Discussion
The incorporation of longitudinal laboratory test data into AI-based modeling requires an architecture for fusing irregular multivariate time-series data. We thus developed a representation-based approach for feature learning on this modality of data and performed the first controlled experiments on the feasibility of risk identification for pancreatic cancer using the full, un-simplified set of laboratory measurements. While much progress has been made on, e.g., detection algorithms operating on imaging data, the use of AI-based methods on longitudinal laboratory data from EHR to improve risk assessment or stratification of PC remains largely unexplored [29]. This is primarily due to the lack of an unambiguous protocol for performing data fusion and controlling bias on these complex heterogeneous time-series feature sets. Recently, deep learning strategies have been developed to address the multi-modal complexity of EHR data, the most promising of which are based on the idea of representation learning. Representation learning strategies have been well studied in the AI literature with respect to computer vision and natural language processing, where they have demonstrated superiority in their ability to generalize as well as their amenability to multi-modal data fusion. However, to our knowledge these methods have not been adapted to the task of risk identification for PC.
Defining controls in observational studies of EHR data is a significant and open research challenge. We therefore introduced a novel random masking strategy, incorporated into a GrpNN prediction model, to resolve discrepancies between cases and controls in classification problems. We used the masking strategy to remodel the curated datasets (e.g., cutting off information based on actual or arbitrarily chosen diagnosis dates) into ones closer to “real-world” data by introducing randomness while reducing data structure discrepancies between the PC and nonPC datasets. Applying this novel method, our results showed that the GrpNN architecture and the random masking strategy reduce overfitting and improve early predictive performance by mitigating confounders (Table 3 and Fig. 5).
Our focused analysis on the pre-diagnosis data of a targeted subgroup of patients (those identified as RF who also received either imaging or biopsy) was designed to enable a prospective study on the nonPC patients in the RF group with no record related to imaging or biopsies. Specifically, model inference on this subgroup will identify patients at high risk for PC and inform targeted screening strategies that could be validated to produce compelling evidence for the utility of these models. We also considered the patient population who may benefit most from our model, which aims to identify high-risk patients who should undergo targeted screening for PC. For example, patients with chronic pancreatitis already undergo numerous imaging studies to assess disease activity and were therefore deemed not relevant to our model aim and clinical application. A total of 82 of the final 835 PC patients received a chronic pancreatitis diagnosis, as did 2,028 of the nonPC patients. We excluded PC patients who received a chronic pancreatitis diagnosis before a PC diagnosis and all 2,028 nonPC patients from our analysis. However, only one of the 82 PC patients received a chronic pancreatitis diagnosis before a PC diagnosis. Most of the PC patients with chronic pancreatitis received this diagnosis on the same day as their PC diagnosis (47 of 82) or later (34 of 82). It is likely that these patients had their first visits to the CUIMC-NYP hospital with full medical histories taken from other hospitals.
The central limitation of our study was the small sample size resulting from our targeted patient selection, which reduced the number of PC patients from 7,124 to 2,792. Selection of the 418 variables further reduced the PC patients to 2,249, of whom only 1,200 had pre-diagnosis data. The final small number of variables selected for the analysis (33 variables) led to a total of 835 PC patients, while minimizing discrepancies in data structure (e.g., data completeness) between PC and nonPC (Fig. 2B). Thus, our study design decisions, aimed at constructing a clinically useful model for early detection of PC, led to a subpopulation analysis with a limited dataset.
6. Conclusion
We presented a GrpNN model incorporating a novel protocol for data augmentation during model training, which leverages the time-series nature of the data (specifically, laboratory data from a patient prior to cancer diagnosis) to improve the accuracy of its predictions. Our experiments show that this protocol improves model generalizability and performance on the early identification of PC risk. Although further evaluations with external data will be required to extend the generalizability of the proposed methods, we believe that our study provides new insights for integrating longitudinal clinical data from EHR while controlling for potential bias in the selection of controls, leading to improved early risk predictions in PC and potentially other cancers. A future prospective study that applies our model to the nonPC patient group with no record related to imaging or biopsies will further corroborate the validity and clinical utility of our model.
The success of our representation-based learning strategy suggests a number of promising directions for future exploration. For example, in a forthcoming study we extend the variable grouping strategy of the GrpNN architecture beyond the time-series representations utilized in this analysis to incorporate higher-level domain knowledge about known associations between measured observables and organ system interactions, in addition to ICD codes. The natural by-product of these extended subgrouping strategies is a set of interpretable high-level composite variables that inform model predictions (analogous to the body mass index, which is a composite variable of height and weight). The design of these subgrouping strategies will provide a rich source of experimentation and exploration on the clinical utility of new predictors represented by groups of related data as determined by human experts. These composite predictors additionally motivate an analysis under “explainable” AI frameworks, in which their clinical utility may be inferred from deep learning-based methods for evaluating their relative importance, such as attention-based weighting and SHapley Additive exPlanations [30].
Finally, this work advances the ongoing research efforts to leverage the full set of multi-modal data elements from EHR toward a unified system that can be utilized for risk stratification and identification, not limited to PC. This analysis demonstrates that longitudinal laboratory test results contain important predictors relevant to the early detection of PC, and the representation-based learning approach we employ is amenable to use within existing strategies for EHR data fusion, such as those involving hierarchical embeddings. In conjunction with the comparatively advanced progress in representation-based AI modeling for imaging data (e.g., CT, MRI) and signals-based data (e.g., ECG, EEG), the models and protocols developed in this study for irregular longitudinal data will advance the development of comprehensive models that incorporate multi-modal clinical data sets.
Funding
This work was supported by National Cancer Institute of the National Institutes of Health under 1R21CA265400-01.
Footnotes
CRediT authorship contribution statement
Jiheum Park: Data curation, Formal analysis, Funding acquisition, Investigation, Methodology, Project administration, Resources, Software, Validation, Visualization, Writing – original draft, Writing – review & editing. Michael G. Artin: Formal analysis, Investigation, Methodology, Validation, Writing – original draft, Writing – review & editing. Kate E. Lee: Investigation, Methodology, Validation, Writing – review & editing. Yoanna S. Pumpalova: Investigation, Methodology, Validation. Myles A. Ingram: Methodology, Writing – review & editing. Benjamin L. May: Investigation, Data curation, Software. Michael Park: Investigation, Methodology, Validation, Visualization, Software, Writing – original draft, Writing – review & editing, Supervision. Chin Hur: Funding acquisition, Investigation, Methodology, Validation, Project administration, Resources, Writing – review & editing, Supervision. Nicholas P. Tatonetti: Funding acquisition, Investigation, Methodology, Validation, Project administration, Resources, Writing – review & editing, Supervision.
Declaration of Competing Interest
The authors declare the following financial interests/personal relationships which may be considered as potential competing interests: Chin Hur reports financial support was provided by National Cancer Institute of the National Institutes of Health.
Data and materials availability
Our EHR data sharing is not allowed by HIPAA. The code used in the development of our model is available under CUMC-HIRE in the PC_GrpNNmodel repository at https://github.com/CUMC-HIRE/PC_GrpNNmodel.
References
- [1] Rahib L, Smith BD, Aizenberg R, Rosenzweig AB, Fleshman JM, Matrisian LM, Projecting cancer incidence and deaths to 2030: the unexpected burden of thyroid, liver, and pancreas cancers in the United States, Cancer Res. 74 (11) (2014) 2913–2921.
- [2] Wagner M, Redaelli C, Lietz M, Seiler CA, Friess H, Büchler MW, Curative resection is the single most important factor determining outcome in patients with pancreatic adenocarcinoma, Br. J. Surg. 91 (5) (2004) 586–594.
- [3] Goggins M, Overbeek KA, Brand R, Syngal S, Del Chiaro M, Bartsch DK, Bassi C, Carrato A, Farrell J, Fishman EK, Fockens P, Gress TM, van Hooft JE, Hruban RH, Kastrinos F, Klein A, Lennon AM, Lucas A, Park W, Rustgi A, Simeone D, Stoffel E, Vasen HFA, Cahen DL, Canto MI, Bruno M, Management of patients with increased risk for familial pancreatic cancer: updated recommendations from the International Cancer of the Pancreas Screening (CAPS) Consortium, Gut 69 (1) (2020) 7–17.
- [4] Daly MB, Pal T, Berry MP, Buys SS, Dickson P, Domchek SM, Elkhanany A, Friedman S, Goggins M, Hutton ML, Karlan BY, Khan S, Klein C, Kohlmann W, Kurian AW, Laronga C, Litton JK, Mak JS, Menendez CS, Merajver SD, Norquist BS, Offit K, Pederson HJ, Reiser G, Senter-Jamieson L, Shannon KM, Shatsky R, Visvanathan K, Weitzel JN, Wick MJ, Wisinski KB, Yurgelun MB, Darlow SD, Dwyer MA, Genetic/Familial High-Risk Assessment: Breast, Ovarian, and Pancreatic, Version 2.2021, NCCN Clinical Practice Guidelines in Oncology, J. Natl. Compr. Canc. Netw. 19 (1) (2021) 77–102.
- [5] Stoffel EM, McKernin SE, Brand R, Canto M, Goggins M, Moravek C, Nagarajan A, Petersen GM, Simeone DM, Yurgelun M, Khorana AA, Evaluating susceptibility to pancreatic cancer: ASCO Provisional Clinical Opinion, J. Clin. Oncol. 37 (2) (2019) 153–164.
- [6] Iodice S, Gandini S, Maisonneuve P, Lowenfels AB, Tobacco and the risk of pancreatic cancer: a review and meta-analysis, Langenbecks Arch. Surg. 393 (4) (2008) 535–545.
- [7] Klein AP, Genetic susceptibility to pancreatic cancer, Mol. Carcinog. 51 (2012) 14–24.
- [8] Genkinger JM, Spiegelman D, Anderson KE, Bernstein L, van den Brandt PA, Calle EE, English DR, Folsom AR, Freudenheim JL, Fuchs CS, Giles GG, Giovannucci E, Horn-Ross PL, Larsson SC, Leitzmann M, Männistö S, Marshall JR, Miller AB, Patel AV, Rohan TE, Stolzenberg-Solomon RZ, Verhage BAJ, Virtamo J, Willcox BJ, Wolk A, Ziegler RG, Smith-Warner SA, A pooled analysis of 14 cohort studies of anthropometric factors and pancreatic cancer risk, Int. J. Cancer 129 (7) (2011) 1708–1717.
- [9] Farrell JJ, Intraductal papillary mucinous neoplasm to pancreas ductal adenocarcinoma sequence and pancreas cancer screening, Endosc. Ultrasound 7 (5) (2018) 314–318.
- [10] Patra KC, Bardeesy N, Mizukami Y, Diversity of precursor lesions for pancreatic cancer: the genetics and biology of intraductal papillary mucinous neoplasm, Clin. Transl. Gastroenterol. 8 (4) (2017) e86.
- [11] Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G, Potential biases in machine learning algorithms using electronic health record data, JAMA Int. Med. 178 (11) (2018) 1544–1547.
- [12] Haneuse S, Daniels M, A general framework for considering selection bias in EHR-based studies: what data are observed and why?, eGEMs 4 (1) (2014).
- [13] Decker GA, Batheja MJ, Collins JM, Silva AC, Mekeel KL, Moss AA, et al., Risk factors for pancreatic adenocarcinoma and prospects for screening, Gastroenterol. Hepatol. (N Y) 6 (2010) 246–254.
- [14] Pandharipande PV, Heberle C, Dowling EC, Kong CY, Tramontano A, Perzan KE, Brugge W, Hur C, Targeted screening of individuals at high risk for pancreatic cancer: results of a simulation model, Radiology 275 (1) (2015) 177–187.
- [15] Madadjim R, Using an integrative machine learning approach to study microRNA regulation networks in pancreatic cancer progression, 2021.
- [16] Al-Fatlawi A, Malekian N, García S, Henschel A, Kim I, Dahl A, Jahnke B, Bailey P, Bolz SN, Poetsch AR, Mahler S, Grützmann R, Pilarsky C, Schroeder M, Deep learning improves pancreatic cancer diagnosis using RNA-based variants, Cancers 13 (11) (2021).
- [17] Tonozuka R, Itoi T, Nagata N, Kojima H, Sofuni A, Tsuchiya T, Ishii K, Tanaka R, Nagakawa Y, Mukai S, Deep learning analysis for the detection of pancreatic cancer on endosonographic images: a pilot study, J. Hepato-Bil-Pan Sci. 28 (1) (2021) 95–104.
- [18] Pocė I, Arsenjeva J, Kielaitė-Gulla A, Samuilis A, Strupas K, Dzemyda G, Pancreas segmentation in CT images: state of the art in clinical practice, Baltic J. Modern Comput. 9 (2021) 25–34.
- [19] Lara JA, Lizcano D, Pérez A, Valente JP, A general framework for time series data mining based on event analysis: application to the medical domains of electroencephalography and stabilometry, J. Biomed. Inform. 51 (2014) 219–241.
- [20] Muhammad W, Hart GR, Nartowt B, Farrell JJ, Johung K, Liang Y, et al., Pancreatic cancer prediction through an artificial neural network, Front. Artif. Intell. 2 (2019) 2.
- [21] Appelbaum L, Cambronero JP, Stevens JP, Horng S, Pollick K, Silva G, Haneuse S, Piatkowski G, Benhaga N, Duey S, Stevenson MA, Mamon H, Kaplan ID, Rinard MC, Development and validation of a pancreatic cancer risk model for the general population using electronic health records: an observational study, Eur. J. Cancer 143 (2021) 19–30.
- [22] Cao S, Lu W, Xu Q, Deep neural networks for learning graph representations, in: Proceedings of the AAAI Conference on Artificial Intelligence, 2016.
- [23] Zhang CW, Jia DY, Wu NK, Guo ZG, Ge HR, Quantitative detection of cervical cancer based on time series information from smear images, Appl. Soft Comput. 112 (2021).
- [24] Wang S, Celebi ME, Zhang Y-D, Yu X, Lu S, Yao X, Zhou Q, Miguel M-G, Tian Y, Gorriz JM, Tyukin I, Advances in data preprocessing for biomedical data fusion: an overview of the methods, challenges, and prospects, Inform. Fusion 76 (2021) 376–421.
- [25] Azad TD, Ehresman J, Ahmed AK, Staartjes VE, Lubelski D, Stienen MN, Veeravagu A, Ratliff JK, Fostering reproducibility and generalizability in machine learning for clinical prediction modeling in spine surgery, Spine J. 21 (10) (2021) 1610–1616.
- [26] López-Zambrano J, Lara JA, Romero C, Improving the portability of predicting students’ performance models by using ontologies, J. Comput. High. Educ. 34 (1) (2022) 1–19.
- [27] Gutiérrez R, Rampérez V, Paggi H, Lara JA, Soriano J, On the use of information fusion techniques to improve information quality: taxonomy, opportunities and challenges, Inform. Fusion 78 (2022) 102–137.
- [28] Luo Y, Szolovits P, Dighe AS, Baron JM, Using machine learning to predict laboratory test results, Am. J. Clin. Pathol. 145 (2016) 778–788.
- [29] Kenner BJ, Abrams ND, Chari ST, Field BF, Goldberg AE, Hoos WA, Klimstra DS, Rothschild LJ, Srivastava S, Young MR, Go VLW, Early detection of pancreatic cancer: applying artificial intelligence to electronic health records, Pancreas 50 (7) (2021) 916–922.
- [30] Nohara Y, Matsumoto K, Soejima H, Nakashima N, Explanation of machine learning models using improved Shapley Additive Explanation, in: Proceedings of the 10th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2019, p. 546.