Stable clinical risk prediction against distribution shift in electronic health records

Seungyeon Lee; Changchang Yin; Ping Zhang

doi:10.1016/j.patter.2023.100828

2023 Aug 22;4(9):100828. doi: 10.1016/j.patter.2023.100828


PMCID: PMC10499849 PMID: 37720334

Summary

The availability of large-scale electronic health record datasets has led to the development of artificial intelligence (AI) methods for clinical risk prediction that help improve patient care. However, existing studies have shown that AI models suffer from severe performance decay after several years of deployment, which might be caused by various temporal dataset shifts. When the shift occurs, we have access to large-scale pre-shift data and small-scale post-shift data that are not enough to train new models in the post-shift environment. In this study, we propose a new method to address the issue. We reweight patients from the pre-shift environment to mitigate the distribution shift between pre- and post-shift environments. Moreover, we adopt a Kullback-Leibler divergence loss to force the models to learn similar patient representations in pre- and post-shift environments. Our experimental results show that our model efficiently mitigates temporal shifts, improving prediction performance.

Keywords: deep learning, stable learning, clinical risk prediction, EHR study, distribution shift, patient representation learning, sample reweighting

Highlights

• We study temporal shifts within EHRs and the performance decay caused by the shifts
• We propose a method that reweights patients to reduce the temporal data shift
• The proposed method can efficiently leverage the data from different environments

The bigger picture

Artificial intelligence (AI) methods that rely on large electronic health record datasets have proved effective at predicting different kinds of “clinical risk,” the likelihood of a disease or adverse outcome given past clinical history. While these methods can help improve patient care, there are still substantial obstacles to their widespread use. In particular, studies have shown that these kinds of AI models perform more poorly after several years of deployment due to shifts in the underlying data distributions, caused by, for example, changes in patient populations or medical practice. Here, the authors propose a method that reweights patients from older data sources and show that this can substantially reduce the impact of these “temporal shifts” on model performance. Methods like this one may ultimately help make AI risk prediction models a more regular and reliable part of clinical care.

Electronic health records (EHRs) are used in clinical risk prediction to enhance early patient care. Ensuring promising performance, however, generally requires large-scale data for training predictive models. Unfortunately, EHRs are susceptible to temporal data shifts that negatively affect predictive performance, causing challenges in directly leveraging large-scale data. To address this issue, the authors propose a method to mitigate data shifts and their associated impact and experimentally validate that their approach effectively leverages the shifted data, resulting in improved predictive performance.

Introduction

The availability of large-scale electronic health record (EHR) datasets has led to the development of machine-learning methods for clinical risk prediction that help improve patient care.¹,² Patients’ health records included in EHRs provide useful information for personal health tracking and monitoring³,⁴,⁵,⁶ in various tasks in the medical domain.⁷ In this study, we focus on clinical risk prediction, which predicts the risks of future diseases by analyzing previously observed EHR information.

Many deep-learning models have been proposed to predict future diagnoses and have achieved promising results. Choi et al.⁸ developed a recurrent neural-network-based model with reverse time attention modules (RETAIN) to model reverse time-ordered EHR sequences and learn weights for all medical codes, which are used to analyze the codes’ contributions to the prediction. Ma et al.⁹ proposed a bidirectional recurrent neural network (RNN)-based model using different attention mechanisms (Dipole) to model patients’ visits in both time-ordered and reverse time-ordered ways and calculate the weights for previous visits with the attention. Ma et al.¹⁰ incorporated RNN and multi-head self-attention to consider the personal patient’s health context, extracting interdependencies between clinical features to learn the personal health context. Choi et al.¹¹ constructed a graph-based attention model using RNN to model patient visits in the sequential context. Gao et al.³ developed a model composed of an RNN and a convolutional module to model disease-stage information for risk prediction. Luo et al.¹² proposed a time-aware transformer model for health risk prediction. Figure 1A presents a basic diagram of clinical risk prediction using a neural-network-based model. In this diagram, the historical EHRs are fed as input to the model, which then predicts the future diagnosis as an output.

Figure 1.

Illustrations of sample reweighting, clinical risk prediction, and the proposed method

(A) Diagram of clinical risk prediction.

(B) Changes in the distribution of medical codes after sample reweighting to mitigate the distribution shift.

(C) Architecture of the proposed method for sample reweighting.

Despite their successes, a fundamental challenge in EHR studies that these works do not address is distribution shift. Most machine-learning models rest on the strong assumption that training and test data points are independently and identically distributed. However, this assumption can be violated in real-world applications, where out-of-distribution (OOD) problems often occur (e.g., when the clinical data distribution changes over time). OOD problems cause significant performance degradation in the testing environment,¹³,¹⁴,¹⁵,¹⁶ which raises serious concerns about applying machine-learning models in real-world clinical settings.

Distribution shift can appear in EHRs in various ways: (1) differences in the patient population; (2) changes in the practice of medical care; and (3) differences in data formats.¹⁷ In Figure 2, we investigate whether such shifts exist in a real-world EHR dataset. Figures 2A and 2B show the distribution of patient demographics (i.e., gender and age). Figure 2C shows that the occurrence rates of some diseases gradually change over time. The accumulation of these changes could cause a critical data shift after several years. Moreover, the transition between versions of the International Classification of Diseases (ICD) codes (e.g., from ICD-9-CM to ICD-10-CM) could also cause data shifts. ICD codes are widely used and play important roles in clinical risk prediction models.⁷,⁸,⁹ The list of potential diagnosis codes in ICD-10-CM is five times larger than its ICD-9-CM counterpart. When mapping the codes from ICD-9-CM to ICD-10-CM, 27% of the diagnosis codes were convoluted and 3% were found to have no mapping.¹⁸ Figures 3A and 3B show that the occurrence rates of some diseases change suddenly after the transition from ICD-9 to ICD-10. The frequencies of CEI, CIH, and DMD codes have increased by approximately two times or more since the ICD transition. Decision models built on earlier EHRs coded in ICD-9 should therefore not be applied directly to the latest EHRs without accounting for these changes in distribution, which can result in data shifts, performance decay, and inaccurate predictions. It is thus necessary to address the temporal and/or ICD version shifts inherent in EHRs to effectively utilize historical data for predictive models.

Figure 2.

Statistical analysis

(A) Gender distribution.

(B) Age distribution.

(C) Occurrence rates of important diseases that gradually change over time. DOR, dorsalgia; EH, essential (primary) hypertension; DLML, disorders of lipoprotein metabolism and other lipidemias; CIHD, chronic ischemic heart disease; SSD, segmental and somatic dysfunction.

Figure 3.

Changes in the occurrence rates of diseases after the transition from ICD-9-CM to ICD-10-CM

(A) Changes in important diseases for stroke patients.

(B) Changes in important diseases for heart failure patients.

CEI, general examination and investigation; CIH, chronic ischemic heart disease; SMN, encounter for screening for malignant neoplasms; DOR, dorsalgia; DMD, dependence on enabling machines and devices; AMC, encounter for other aftercare and medical care.

Several studies have addressed the OOD problems in medical environments. For instance, Ulmer et al.¹⁹ investigated uncertainty estimation methods for detecting OOD samples in medical tabular data. However, the study demonstrated that uncertainty estimation methods may not be reliable for OOD detection, since the data are high-dimensional, complex, and noisy. In another work, Luo et al.⁶ proposed a causal representation learning model based on variable decorrelation for diagnosis prediction. This model discovers stable correlations that reflect the causal effect of each feature in different environments, thereby mitigating the bias caused by distribution shifts between training and inference. Other works have focused specifically on data shifts in medical environments. Guo et al.²⁰ proposed a domain generalization (DG)²¹-based model that treats time as the domain and learns robust, domain-invariant properties across time to mitigate temporal shift. Zhang et al.²² proposed AdaDiag, which is based on domain adaptation (DA), to handle domain shift. AdaDiag consists of a joint feature extractor that maps inputs from the source and target domains to a shared feature space, a classifier that performs predictions, and a discriminator that distinguishes the source domain from the target domain.

In this paper, we propose a new method for stable clinical risk prediction to tackle these challenges. We treat the observed EHRs before October 2015 (when the codes are recorded as ICD-9-CM) as pre-shift data and the EHRs observed after October 2015 (when the codes are recorded as ICD-10-CM) as post-shift data. We reweight training patients’ records in pre-shift data to mitigate the distribution shift between the pre- and post-shift data. Figure 1 illustrates the main concepts of the proposed method. Figure 1B presents an example of a distribution shift of medical codes in the post-shift data. After sample reweighting, the distribution changes toward mitigating the distribution shift. Figure 1C shows an architecture of the proposed model for sample reweighting. The proposed model not only directly equalizes the occurrence rate of codes in pre- and post-shift data using mean squared error but also equalizes the probability distribution in the latent space using Kullback-Leibler divergence (KL-divergence).
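The occurrence-rate matching idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: patients are binary multi-hot code vectors, the weights are parameterized with a softmax so that they stay positive and sum to one, and plain gradient descent minimizes the mean squared error between the weighted pre-shift rates and the post-shift rates. The function names, toy data, step count, and learning rate are all hypothetical, and the KL-divergence term on latent representations is omitted.

```python
import math

def occurrence_rates(patients, weights=None):
    """Weighted fraction of patients carrying each code (weights sum to 1)."""
    n, d = len(patients), len(patients[0])
    if weights is None:
        weights = [1.0 / n] * n  # uniform weights: the plain occurrence rate
    return [sum(w * p[j] for w, p in zip(weights, patients)) for j in range(d)]

def softmax(theta):
    m = max(theta)
    exps = [math.exp(t - m) for t in theta]
    total = sum(exps)
    return [e / total for e in exps]

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def learn_weights(pre, post, steps=500, lr=5.0):
    """Gradient descent on softmax logits so weighted pre-shift rates match post-shift rates."""
    target = occurrence_rates(post)
    d = len(target)
    theta = [0.0] * len(pre)
    for _ in range(steps):
        w = softmax(theta)
        rates = occurrence_rates(pre, w)
        # dL/d(rate_j) for L = mean_j (rate_j - target_j)^2
        dl_dr = [2.0 * (r - t) / d for r, t in zip(rates, target)]
        # dL/d(w_i): each patient contributes through the codes it carries
        g = [sum(dl_dr[j] * p[j] for j in range(d)) for p in pre]
        # chain rule through the softmax: dL/d(theta_i) = w_i * (g_i - sum_k w_k g_k)
        g_bar = sum(wi * gi for wi, gi in zip(w, g))
        theta = [t - lr * wi * (gi - g_bar) for t, wi, gi in zip(theta, w, g)]
    return softmax(theta)

# Toy data (hypothetical): two codes; the second code becomes more frequent after the shift.
pre = [[1, 0], [1, 0], [1, 0], [0, 1]]
post = [[1, 0], [0, 1], [0, 1], [0, 1]]
weights = learn_weights(pre, post)
```

In this toy case the learned weights concentrate on the single pre-shift patient who carries the code that is more frequent after the shift, which is the behavior the reweighting is meant to produce.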

Note that all the ICD-9-CM codes are mapped to ICD-10-CM codes according to the General Equivalence Mappings developed by the Centers for Medicare & Medicaid Services (CMS).²³ We conduct a comprehensive empirical study on a real-world EHR dataset under different scenarios to test our hypothesis and to evaluate the effectiveness of our method. To test our hypothesis that distribution differences exist between pre- and post-shift data, we conduct experiments with the following scenarios: (1) we train existing clinical risk prediction models (e.g., RETAIN, Dipole) for heart failure and stroke risk prediction tasks only with patients in the pre-shift training data and report the performance on the post-shift test data; (2) we apply our method to the models to evaluate whether it reduces the distribution shift and improves the performance on the post-shift test data. Experimental results support our hypothesis and show that our method improves all the baselines.

Our contributions are summarized as follows.

• We investigate the temporal distribution shift on medical codes and the performance differences caused by the shift.
• We design a new method that reweights the pre-shift samples to reduce the distribution shift between the pre- and post-shift samples, learning stable representations for both the pre- and post-shift samples.
• We show that the proposed method not only boosts the prediction performance by sample reweighting but also efficiently leverages the pre-shift historical data through stable learning.
• We conduct a comprehensive experiment to demonstrate our hypothesis and to evaluate the effectiveness of our method.

Experimental results show that our method improves existing predictive models for heart failure and stroke risk, mitigating the distribution shift in diagnosis codes between the pre- and post-shift samples.

Results

Data

We conduct our experiments on a real clinical EHR data warehouse, MarketScan Commercial Claims and Encounters (CCAE),²⁴ which contains individual-level and de-identified healthcare claims information. MarketScan claims data are primarily used to evaluate health utilization and services. We identify coronary artery disease (CAD) cohorts for which criteria are defined based on ICD codes. There are 1,178,997 patients in total. All patients have a set of medical records including demographic characteristics, time information, drugs, procedures, diagnoses, and other clinically relevant indicators. We consider three categories, namely demographic characteristics, diagnosis, and procedure codes, for study variables. Demographic characteristics consist of age and gender information. Diagnosis codes are defined as ICD codes and consist of 57,089 unique ICD-9/10 codes in MarketScan data.

Study design

CAD represents a major risk factor for both heart failure²⁵,²⁶ and stroke.²⁷,²⁸ In this work, we focus on predicting whether a patient will suffer heart failure or stroke in the future. The definitions of heart failure and stroke are presented in Tables S1 and S2. We conduct a case-control study, a type of epidemiological observational study, on clinical risk prediction tasks. A case-control study identifies two groups of subjects that differ in outcome but are otherwise similar and compares them to discover factors that contribute to the difference. Patients diagnosed with heart failure or stroke are collected as case patients. Then, for each case patient, a control patient with matching characteristics, such as the same age, gender, and number of visits, is selected.
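The control selection step can be illustrated with a small exact-matching sketch. It assumes each patient is a dictionary with hypothetical keys `id`, `age`, `gender`, and `n_visits`, and that each candidate control is used at most once; the source does not state how ties or unmatched cases are handled, so those choices are assumptions made here for illustration.

```python
from collections import defaultdict

def match_controls(cases, candidates):
    """Pair each case with one unused candidate that has the same matching key."""
    pool = defaultdict(list)
    for cand in candidates:
        pool[(cand["age"], cand["gender"], cand["n_visits"])].append(cand)

    pairs = []
    for case in cases:
        bucket = pool.get((case["age"], case["gender"], case["n_visits"]), [])
        if bucket:  # assumption: cases without an exact match are left unpaired
            pairs.append((case["id"], bucket.pop()["id"]))
    return pairs

# Hypothetical toy records.
cases = [
    {"id": "case-1", "age": 60, "gender": "F", "n_visits": 12},
    {"id": "case-2", "age": 60, "gender": "F", "n_visits": 12},
    {"id": "case-3", "age": 71, "gender": "M", "n_visits": 15},
]
candidates = [
    {"id": "ctrl-1", "age": 60, "gender": "F", "n_visits": 12},
    {"id": "ctrl-2", "age": 55, "gender": "M", "n_visits": 20},
]
pairs = match_controls(cases, candidates)
```

Additional matching characteristics can be added to the key tuple in the same way.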

To predict the diagnosis of heart failure or stroke at some future time, it is necessary to define an operation criterion date and a prediction date. Figure 4 shows the settings used to construct the experimental EHR data from the large database for early prediction tasks. The operation criterion date is the date of the future diagnosis to be predicted. The prediction date is the date on which the prediction is made; it precedes the operation criterion date by the length of the prediction window. Each patient’s EHR data are then split into an observation window and a prediction window. The prediction window includes the medical records for the last 360, 180, or 90 days tracing back from the operation criterion date. The observation window contains all the records before the prediction window and is used for analysis. For example, if a patient is diagnosed with heart failure on October 5, 2014, the records up to October 10, 2013 (360 days earlier) are included in the observation window for predicting heart failure with a prediction window of 360 days. For case patients, the date of the EHR record in which heart failure or stroke is diagnosed is set as the operation criterion date. For control patients, the last date of the EHR is set as the operation criterion date. When selecting control patients for the case-control study, the prediction date is also included among the characteristics matched to those of the case patients, so that EHR data are compared accurately over time. In addition, to ensure that there are sufficient medical events to predict the future diagnosis, only patients with more than ten records (visits) in the observation window are selected for analysis.
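The window construction can be sketched with standard date arithmetic. The helper names below are hypothetical, visits are assumed to be `(visit_date, codes)` tuples, and treating a record dated exactly on the prediction date as part of the observation window is an assumption, since the text does not specify the boundary.

```python
from datetime import date, timedelta

def prediction_date(operation_criterion_date, prediction_window_days):
    """Date on which the prediction is made: the criterion date minus the window length."""
    return operation_criterion_date - timedelta(days=prediction_window_days)

def observation_window(visits, operation_criterion_date, prediction_window_days):
    """Keep visits dated on or before the prediction date (boundary handling is an assumption)."""
    cutoff = prediction_date(operation_criterion_date, prediction_window_days)
    return [visit for visit in visits if visit[0] <= cutoff]

def is_eligible(observed_visits, min_visits=10):
    """Patients need more than `min_visits` records in the observation window."""
    return len(observed_visits) > min_visits

# Hypothetical patient with a diagnosis on October 5, 2014 and a 360-day prediction window.
visits = [
    (date(2013, 3, 2), ["I10"]),
    (date(2013, 9, 30), ["E78"]),
    (date(2014, 6, 1), ["I25"]),
]
cutoff = prediction_date(date(2014, 10, 5), 360)
observed = observation_window(visits, date(2014, 10, 5), 360)
```

Only the first two visits fall into the observation window here; the third lies inside the prediction window and is excluded from the model input.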

Figure 4.

Settings to construct the experimental EHR data for clinical risk prediction tasks

The operation criterion date refers to the date of the EHR diagnosed with target diseases (case patients) or the end date of the EHR (control patients). The prediction date represents the date before the prediction window, tracking from the operation criterion date.

Data pre-processing

We pre-process the EHR data by chronologically concatenating the medical records for each patient, following previous works,⁷,²⁹ as the temporal information is critical. Thus, each patient is represented as a variable-length sequence of records whose length equals the patient’s number of visits. For convenience, all patients’ records are padded to the same size based on the maximum number of visits; the padding records are not medically meaningful. For equivalence between codes of the ICD-9-CM and ICD-10-CM versions, all medical codes in the dataset originally coded as ICD-9-CM are converted into ICD-10-CM codes before the experiments according to the General Equivalence Mappings developed by CMS.²³ To reduce the number of diagnosis codes, we consider only the first three letters of each code, which identify a representative category that groups more detailed codes. To address the potential loss of information resulting from shortening ICD codes, we conducted a validation process to ensure that the codes retained sufficient granularity to capture meaningful differences between patients’ diagnoses. Specifically, we compared the performances of models trained with full-length ICD codes and shortened codes, ranging from five-letter to one-letter codes. Our results show that using the full-length codes led to a lower area under the receiver operating characteristic curve (AUROC) than using the shortened codes. This can be attributed to the lower frequency of five-letter codes, which may pose challenges in effectively learning their embeddings. Conversely, using shortened codes did not adversely impact the model’s performance. For the heart failure prediction problem with a 360-day prediction window, the number of unique codes is 6,629 for full-length codes and 1,474 for three-letter codes. We found that using the first three or two letters of the ICD codes resulted in optimal performance. Because three-letter codes retain category information about the disease codes, we use only the first three letters of the ICD codes in our study. The results of the experiment can be found in Table S3.
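A minimal sketch of the code shortening and padding steps follows. It assumes the codes have already been mapped to ICD-10-CM, that visits are `(date_string, codes)` tuples with ISO-formatted dates, and that padding is appended at the end of each sequence as empty visits; the function names and these layout choices are illustrative assumptions rather than the authors' pipeline.

```python
def to_category(icd10_code, n_chars=3):
    """Keep the leading characters of an ICD-10-CM code (its category when n_chars=3)."""
    return icd10_code.replace(".", "").upper()[:n_chars]

def encode_patient(visits, n_chars=3):
    """Sort visits chronologically and reduce each visit to its unique shortened codes."""
    ordered = sorted(visits, key=lambda visit: visit[0])  # ISO date strings sort by time
    return [sorted({to_category(code, n_chars) for code in codes}) for _, codes in ordered]

def pad_patients(patients):
    """Right-pad every visit sequence with empty visits up to the longest sequence."""
    max_len = max(len(p) for p in patients)
    return [p + [[] for _ in range(max_len - len(p))] for p in patients]

# Hypothetical toy patients.
patient_a = encode_patient([
    ("2014-02-01", ["I10"]),
    ("2013-05-01", ["I25.10", "I25.2", "E78.5"]),
])
patient_b = encode_patient([("2014-07-15", ["I50.9"])])
padded = pad_patients([patient_a, patient_b])
```

Shortening merges `I25.10` and `I25.2` into the single category `I25`, which is how the vocabulary shrinks when detailed codes are collapsed.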

Data shift

Figures 2 and 3 show that the occurrence rates of some important diseases change gradually over time and also change suddenly after the transition from ICD-9 to ICD-10. These changes could cause distribution shifts and severe performance decay. To demonstrate the existence of the distribution shift in EHRs and how it affects model performance, we report the prediction performance over time of a neural-network-based model that is trained and optimized only on patients whose prediction date is on or before December 31, 2013. Figure 5 shows the prediction performance per month, based on the prediction date, for heart failure and stroke risk prediction tasks. The x axis indicates the months and the y axis represents AUROC scores. As illustrated in the graph, the score gradually decreases over time, with a rapid decline observed from October to December 2015. This finding indicates a significant distribution shift between the data before and after October 2015, highlighting the need to address temporal shifts when working with EHRs. To further investigate the potential influence of gender distribution on clinical risk prediction, we also compare the average AUROC scores for the overall population, males, and females by year. Figure S1 shows the results for the model trained with patients up to 2013. Our analysis reveals no significant difference in performance based on gender. As a result, we focus on the data shift rather than the gender distribution.
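The per-month evaluation can be sketched as follows. The sketch assumes each test record is a hypothetical `(prediction_date, label, score)` tuple and computes AUROC directly from its pairwise definition, which is quadratic in the number of records and only meant for illustration; a library routine would normally be used in practice.

```python
from collections import defaultdict
from datetime import date

def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # undefined when a month contains only one class
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def monthly_auroc(records):
    """Group (prediction_date, label, score) records by calendar month and score each group."""
    by_month = defaultdict(list)
    for pred_date, label, score in records:
        by_month[(pred_date.year, pred_date.month)].append((label, score))
    return {
        month: auroc([y for y, _ in rows], [s for _, s in rows])
        for month, rows in sorted(by_month.items())
    }

# Hypothetical predictions from a model trained on earlier data.
records = [
    (date(2015, 9, 3), 1, 0.9), (date(2015, 9, 12), 0, 0.2),
    (date(2015, 10, 5), 1, 0.4), (date(2015, 10, 20), 0, 0.6),
    (date(2015, 10, 28), 1, 0.7),
]
trend = monthly_auroc(records)
```

Plotting the values of `trend` in month order gives a curve of the kind described for the performance trend.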

Figure 5.

Visualization of performance per month for heart failure and stroke risk prediction

The x axis indicates the months and the y axis represents AUROC scores. The model is trained only with patients up to 2013. HF, heart failure; ST, stroke.

Experimental setting

Based on our findings, we treat EHRs before and after October 2015 as pre-shift and post-shift data, respectively. Aiming to reduce the significant performance difference between the pre-shift and post-shift data, we design our experiments with the following settings. Figure 6 shows the experimental settings of the clinical risk prediction tasks for our model. The EHRs with a prediction date prior to October 1, 2015 are used as the pre-shift data. The pre-shift data are further split into pre-shift training, validation, and test data to train, optimize, and evaluate the predictive model, respectively. To mitigate the distribution shift between the pre-shift and post-shift data, the post-shift data with a prediction date from October 1, 2015 to December 31, 2015 are used as the post-shift training data to reweight the pre-shift training data. The post-shift data with a prediction date from January 1, 2016 onward are then used as the post-shift test data to evaluate the prediction performance. The statistics of the dataset for heart failure and stroke risk prediction tasks are described in Tables 1 and 2.
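The date-based partition described above can be expressed as a small helper. The function and constant names are hypothetical, and assigning a prediction date of exactly January 1, 2016 to the post-shift test data is an assumption about the boundary.

```python
from datetime import date

SHIFT_DATE = date(2015, 10, 1)       # first day of the post-shift period
POST_TEST_START = date(2016, 1, 1)   # assumed first day of the post-shift test period

def assign_split(pred_date):
    """Map a patient's prediction date to the data partition it belongs to."""
    if pred_date < SHIFT_DATE:
        return "pre_shift"            # later divided into train/validation/test
    if pred_date < POST_TEST_START:
        return "post_shift_train"     # used to reweight the pre-shift training data
    return "post_shift_test"

# Hypothetical prediction dates.
splits = [assign_split(d) for d in (date(2015, 9, 30), date(2015, 10, 1),
                                    date(2015, 12, 31), date(2016, 3, 14))]
```

Because the partition depends only on the prediction date, no patient record from the post-shift test period can leak into reweighting or training.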

Table 1.

Statistics of the dataset for heart failure prediction

Prediction window	360 days					180 days					90 days
Data	Pre-shift			Post-shift		Pre-shift			Post-shift		Pre-shift			Post-shift
	Train	Valid	Test	Train	Test	Train	Valid	Test	Train	Test	Train	Valid	Test	Train	Test
No. of unique codes	1,474					1,490					1,521
No. of patients	26,408	8,940	8,926	1,706	3,418	30,290	10,200	10,178	2,616	7,408	28,926	9,740	9,750	3,024	9,798
No. of visits	524,898	177,628	175,674	34,544	67,732	544,648	184,014	185,064	51,624	147,744	524,262	176,252	175,692	59,002	194,122
Avg. no. of visits per patient	18	18	18	20	19	18	18	18	19	19	18	18	18	19	19
Avg. no. of codes per visit	2.25	2.26	2.24	2.43	2.53	2.31	2.32	2.30	2.42	2.53	2.34	2.36	2.34	2.42	2.52
Max. no. of codes per visit	28	34	25	16	19	31	34	22	15	29	28	20	34	27	29

Table 2.

Statistics of the dataset for stroke prediction

Prediction window	360 days					180 days					90 days
Data	Pre-shift			Post-shift		Pre-shift			Post-shift		Pre-shift			Post-shift
	Train	Valid	Test	Train	Test	Train	Valid	Test	Train	Test	Train	Valid	Test	Train	Test
No. of unique codes	1,472					1,476					1,500
No. of patients	24,738	8,278	8,314	1,380	3,234	26,408	8,940	8,926	2,100	6,346	24,866	8,348	8,342	2,394	8,248
No. of visits	458,674	152,372	153,298	27,674	64,018	483,676	162,976	163,564	41,572	126,478	455,476	153,700	152,200	47,378	164,056
Avg. no. of visits per patient	18	18	18	20	19	18	18	18	19	19	18	18	18	19	19
Avg. no. of codes per visit	2.26	2.25	2.25	2.42	2.46	2.31	2.31	2.31	2.41	2.51	2.34	2.34	2.35	2.43	2.53
Max. no. of codes per visit	38	25	28	30	44	34	32	38	30	21	34	38	28	30	21

We compare the prediction performances of models trained with the original pre-shift training data and with the reweighted pre-shift training data. We apply our method to existing clinical risk prediction models. Our method reweights the pre-shift patients’ EHRs to make their distribution similar to that of the post-shift patients, mitigating the distribution shift between them for stable learning. Moreover, we adopt a KL loss to learn stable and similar patient representations from the pre-shift and post-shift data. In Tables 3, 4, and 5, Basic and Weighted represent the results of the existing methods and the proposed method, respectively. Accuracy, area under the precision-recall curve (AUPRC), and AUROC are used as performance measurements.

Table 3.

Comparison of prediction performance on the post-shift test set for heart failure prediction

Prediction window		360 days		180 days		90 days
		AUPRC	Accuracy	AUPRC	Accuracy	AUPRC	Accuracy
LSTM	Basic	0.5730 $\pm$ 0.017	0.5301 $\pm$ 0.002	0.6677 $\pm$ 0.005	0.5840 $\pm$ 0.006	0.7018 $\pm$ 0.006	0.6247 $\pm$ 0.007
	Weighted	0.5865 $\pm$ 0.015	0.5319 $\pm$ 0.002	0.6763 $\pm$ 0.007	0.5859 $\pm$ 0.007	0.7133 $\pm$ 0.009	0.6344 $\pm$ 0.008
GRU	Basic	0.5781 $\pm$ 0.005	0.5309 $\pm$ 0.002	0.6718 $\pm$ 0.004	0.5889 $\pm$ 0.005	0.7095 $\pm$ 0.003	0.6309 $\pm$ 0.004
	Weighted	0.5964 $\pm$ 0.006	0.5336 $\pm$ 0.003	0.6803 $\pm$ 0.004	0.5912 $\pm$ 0.004	0.7144 $\pm$ 0.005	0.6348 $\pm$ 0.005
Dipole	Basic	0.5905 $\pm$ 0.002	0.5322 $\pm$ 0.002	0.6757 $\pm$ 0.002	0.5937 $\pm$ 0.004	0.7095 $\pm$ 0.002	0.6308 $\pm$ 0.002
	Weighted	0.5968 $\pm$ 0.003	0.5330 $\pm$ 0.002	0.6781 $\pm$ 0.003	0.5977 $\pm$ 0.005	0.7171 $\pm$ 0.002	0.6375 $\pm$ 0.003
RETAIN	Basic	0.5934 $\pm$ 0.006	0.5414 $\pm$ 0.003	0.6726 $\pm$ 0.002	0.5912 $\pm$ 0.004	0.7128 $\pm$ 0.003	0.6362 $\pm$ 0.003
	Weighted	0.5971 $\pm$ 0.006	0.5428 $\pm$ 0.003	0.6763 $\pm$ 0.003	0.5983 $\pm$ 0.004	0.7156 $\pm$ 0.004	0.6422 $\pm$ 0.003
ConCare	Basic	0.5946 $\pm$ 0.004	0.5421 $\pm$ 0.001	0.6756 $\pm$ 0.002	0.5866 $\pm$ 0.002	0.7123 $\pm$ 0.003	0.6353 $\pm$ 0.003
	Weighted	0.5965 $\pm$ 0.005	0.5491 $\pm$ 0.001	0.6781 $\pm$ 0.002	0.5906 $\pm$ 0.003	0.7140 $\pm$ 0.003	0.6437 $\pm$ 0.003
StageNet	Basic	0.5911 $\pm$ 0.003	0.5305 $\pm$ 0.001	0.6743 $\pm$ 0.003	0.5829 $\pm$ 0.002	0.7057 $\pm$ 0.002	0.6326 $\pm$ 0.001
	Weighted	0.5946 $\pm$ 0.004	0.5441 $\pm$ 0.002	0.6787 $\pm$ 0.005	0.5899 $\pm$ 0.004	0.7148 $\pm$ 0.002	0.6410 $\pm$ 0.002

The baseline and proposed method are denoted by Basic and Weighted, respectively. The average score and standard deviation over ten trials are reported. The results for other metrics can be found in Table S4.

Table 4.

Comparison of prediction performance on the post-shift test set for stroke prediction

Prediction window		360 days		180 days		90 days
		AUPRC	Accuracy	AUPRC	Accuracy	AUPRC	Accuracy
LSTM	Basic	0.5610 $\pm$ 0.011	0.5212 $\pm$ 0.003	0.5972 $\pm$ 0.008	0.5522 $\pm$ 0.002	0.6340 $\pm$ 0.006	0.5685 $\pm$ 0.008
	Weighted	0.5801 $\pm$ 0.014	0.5253 $\pm$ 0.004	0.6145 $\pm$ 0.011	0.5573 $\pm$ 0.003	0.6441 $\pm$ 0.009	0.5792 $\pm$ 0.009
GRU	Basic	0.5666 $\pm$ 0.006	0.5210 $\pm$ 0.002	0.6136 $\pm$ 0.004	0.5574 $\pm$ 0.006	0.6452 $\pm$ 0.006	0.5815 $\pm$ 0.006
	Weighted	0.5746 $\pm$ 0.008	0.5278 $\pm$ 0.003	0.6294 $\pm$ 0.005	0.5608 $\pm$ 0.007	0.6492 $\pm$ 0.008	0.5843 $\pm$ 0.006
Dipole	Basic	0.5702 $\pm$ 0.003	0.5275 $\pm$ 0.002	0.6157 $\pm$ 0.003	0.5592 $\pm$ 0.003	0.6460 $\pm$ 0.003	0.5827 $\pm$ 0.003
	Weighted	0.5900 $\pm$ 0.005	0.5290 $\pm$ 0.003	0.6260 $\pm$ 0.005	0.5601 $\pm$ 0.003	0.6528 $\pm$ 0.006	0.5920 $\pm$ 0.004
RETAIN	Basic	0.5756 $\pm$ 0.003	0.5259 $\pm$ 0.003	0.6222 $\pm$ 0.003	0.5563 $\pm$ 0.004	0.6382 $\pm$ 0.005	0.5781 $\pm$ 0.003
	Weighted	0.5869 $\pm$ 0.004	0.5279 $\pm$ 0.002	0.6339 $\pm$ 0.005	0.5598 $\pm$ 0.005	0.6519 $\pm$ 0.007	0.5986 $\pm$ 0.003
ConCare	Basic	0.5762 $\pm$ 0.006	0.5261 $\pm$ 0.005	0.6261 $\pm$ 0.002	0.5606 $\pm$ 0.003	0.6464 $\pm$ 0.004	0.5852 $\pm$ 0.002
	Weighted	0.5862 $\pm$ 0.008	0.5343 $\pm$ 0.005	0.6356 $\pm$ 0.004	0.5669 $\pm$ 0.003	0.6517 $\pm$ 0.007	0.5872 $\pm$ 0.003
StageNet	Basic	0.5684 $\pm$ 0.006	0.5201 $\pm$ 0.001	0.6263 $\pm$ 0.005	0.5594 $\pm$ 0.004	0.6419 $\pm$ 0.004	0.5780 $\pm$ 0.002
	Weighted	0.5776 $\pm$ 0.007	0.5216 $\pm$ 0.002	0.6323 $\pm$ 0.006	0.5606 $\pm$ 0.005	0.6511 $\pm$ 0.007	0.5849 $\pm$ 0.003

Table 5.

Comparison of prediction performance on the post-shift test set for heart failure and stroke prediction

Prediction window		360 days		180 days		90 days
		AUPRC	Accuracy	AUPRC	Accuracy	AUPRC	Accuracy
HF	Basic	0.5905 $\pm$ 0.002	0.5322 $\pm$ 0.002	0.6757 $\pm$ 0.002	0.5937 $\pm$ 0.004	0.7095 $\pm$ 0.002	0.6308 $\pm$ 0.002
	AdaDiag	0.5896 $\pm$ 0.028	0.5296 $\pm$ 0.011	0.6760 $\pm$ 0.013	0.5935 $\pm$ 0.002	0.7104 $\pm$ 0.007	0.6319 $\pm$ 0.002
	DG	0.5906 $\pm$ 0.009	0.5323 $\pm$ 0.005	0.6769 $\pm$ 0.003	0.5962 $\pm$ 0.005	0.7127 $\pm$ 0.002	0.6282 $\pm$ 0.001
	Weighted	0.5968 $\pm$ 0.003	0.5330 $\pm$ 0.002	0.6781 $\pm$ 0.003	0.5977 $\pm$ 0.005	0.7171 $\pm$ 0.002	0.6375 $\pm$ 0.003
ST	Basic	0.5702 $\pm$ 0.003	0.5275 $\pm$ 0.002	0.6157 $\pm$ 0.003	0.5592 $\pm$ 0.003	0.6460 $\pm$ 0.003	0.5827 $\pm$ 0.003
	AdaDiag	0.5697 $\pm$ 0.009	0.5290 $\pm$ 0.003	0.6180 $\pm$ 0.014	0.5594 $\pm$ 0.002	0.6472 $\pm$ 0.011	0.5830 $\pm$ 0.003
	DG	0.5726 $\pm$ 0.007	0.5283 $\pm$ 0.003	0.6254 $\pm$ 0.002	0.5603 $\pm$ 0.001	0.6503 $\pm$ 0.003	0.5832 $\pm$ 0.001
	Weighted	0.5900 $\pm$ 0.005	0.5290 $\pm$ 0.003	0.6260 $\pm$ 0.005	0.5601 $\pm$ 0.003	0.6528 $\pm$ 0.006	0.5920 $\pm$ 0.004

Basic, AdaDiag, and DG are baseline methods, and Weighted refers to the proposed method. We use Dipole as the backbone network for both DG and Weighted. The average score and standard deviation over ten trials are reported. Results of statistical tests can be found in Table S6.

Results for clinical risk prediction

Tables 3 and 4 show the performances of clinical risk prediction on the post-shift test set as measured by AUPRC and accuracy scores for heart failure and stroke, respectively. The proposed method (marked as Weighted in the tables) improves all baselines (marked as Basic) on both AUPRC and accuracy scores. The results demonstrate that the proposed method mitigates the distribution shift and thus provides more robust performance for new patients who differ from the training patients. Such findings indicate the advantage of the proposed method in learning stable representations for the post-shift data by sample reweighting. When comparing the performance of baseline models, the advanced models generally exhibit better overall performance than simpler models such as GRU and LSTM. In particular, attention-based models such as ConCare, RETAIN, and Dipole are among the best-performing baselines in most settings, although no single baseline is best in every setting. The results of the experiment on other metrics, including AUROC, precision, and recall, can be found in Tables S4 and S5.

We also compare the performance of the proposed method with the DG and AdaDiag methods, which are existing approaches to alleviating temporal data shifts. For a fair comparison, both DG and AdaDiag utilize the post-shift training data for model training. DG and the proposed method (Weighted) employ the Dipole model as the backbone network. Table 5 shows the performance results on AUPRC and accuracy. While the comparative models outperform the basic model, which does not utilize the post-shift training data, in most cases, the proposed method exhibits the highest improvement in almost all cases. This indicates that the proposed method effectively mitigates data distribution shifts through the sample reweighting approach. To assess the statistical significance of the differences between the performances of the proposed method and existing works, we conduct Friedman and Wilcoxon tests on AUPRC scores. We apply the Friedman test with the null hypothesis ( $H_{0}$ ) that there is no statistically significant difference between the performances of the methods, while the alternative hypothesis ( $H_{1}$ ) assumes the presence of a difference. In addition, the Wilcoxon test is applied to test the null hypothesis $H_{0}$ that there is no statistically significant difference between the performances of the top two methods, Weighted and DG, against the alternative hypothesis $H_{1}$ that there is a significant difference. Table S6 presents the results of both tests, including the p values obtained from ten repeated experiments. Based on the results, we reject the null hypothesis at a significance level of $\alpha = 0.05$ , indicating statistically significant differences among the performances of the methods.
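For readers unfamiliar with the Friedman test, the statistic it is based on can be computed from within-block ranks, as in the sketch below. The layout is an assumption: `scores[b][m]` holds the score of method `m` in block `b`, where a block could be one repeated trial; the source does not spell out the blocking. The sketch uses the textbook formula without a tie correction and does not compute a p value, which would come from comparing the statistic with a chi-square distribution with k - 1 degrees of freedom.

```python
def average_ranks(values):
    """Rank values ascending (1 = smallest); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks

def friedman_statistic(scores):
    """Friedman chi-square statistic (no tie correction) for n blocks and k methods."""
    n, k = len(scores), len(scores[0])
    rank_sums = [0.0] * k
    for row in scores:
        for method, rank in enumerate(average_ranks(row)):
            rank_sums[method] += rank
    return 12.0 / (n * k * (k + 1)) * sum(r * r for r in rank_sums) - 3.0 * n * (k + 1)

# Synthetic scores (not taken from the paper): three methods, four blocks, identical ordering.
consistent = [[0.10, 0.20, 0.30]] * 4
# Synthetic scores where every method receives the same total rank.
balanced = [[1, 2, 3], [2, 3, 1], [3, 1, 2]]
stat_consistent = friedman_statistic(consistent)
stat_balanced = friedman_statistic(balanced)
```

When every block orders the methods identically, the statistic reaches its maximum of n(k - 1); when the rank sums are equal, it is zero, which corresponds to no evidence against the null hypothesis.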

The usefulness of the proposed method

We observe the temporal distribution shift in EHR records as the prediction performance changes over time. In particular, the performance decreases significantly as of October 2015, so we present our method to mitigate the distribution shift based on that time. Although we have demonstrated the effectiveness of our method through previous experiments, we further conduct an additional experiment to prove the usefulness of the proposed method. The settings for the additional experiment are as follows. (1) We randomly split the post-shift data (EHRs after October 2015) into the training, validation, and test data, then train the model only with the training data. The prediction performance is reported on the post-shift test data. (2) We further train the model with the pre-shift data reweighted by the proposed method using the post-shift training data. The prediction performance is also reported on the post-shift test data. As shown in Table 6, the experimental results using the weighted pre-shift data (denoted as pre-shift training) achieve higher performance compared to only using the post-shift training data (denoted as post-shift training) by about 17.2% on AUROC. This experiment shows that our method not only efficiently leverages large amounts of historical pre-shift data for model training but also improves performance.

Table 6.

Comparison of prediction performances on AUROC and accuracy using the post-shift data and both pre- and post-shift data as training sets

Prediction window		360 days		180 days		90 days
		AUROC	Accuracy	AUROC	Accuracy	AUROC	Accuracy
HF	Post-shift training	0.5821 $\pm$ 0.013	0.5399 $\pm$ 0.015	0.5795 $\pm$ 0.009	0.5593 $\pm$ 0.010	0.6182 $\pm$ 0.015	0.5782 $\pm$ 0.020
	Pre-shift training	0.6597 $\pm$ 0.006	0.6062 $\pm$ 0.008	0.7029 $\pm$ 0.004	0.6490 $\pm$ 0.008	0.7282 $\pm$ 0.003	0.6630 $\pm$ 0.006
ST	Post-shift training	0.5325 $\pm$ 0.020	0.5059 $\pm$ 0.006	0.5357 $\pm$ 0.022	0.5149 $\pm$ 0.022	0.5661 $\pm$ 0.012	0.5200 $\pm$ 0.020
	Pre-shift training	0.6088 $\pm$ 0.008	0.5642 $\pm$ 0.014	0.6317 $\pm$ 0.007	0.5960 $\pm$ 0.008	0.6716 $\pm$ 0.005	0.6255 $\pm$ 0.004


The average score and standard deviation over ten trials are reported. Note that we have access to small-scale post-shift data (i.e., 3 months of records) in the post-shift training setting and large-scale pre-shift data (i.e., more than 3 years of records) in the pre-shift training setting. We use the GRU model in the two settings.
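The improvement of about 17.2% on AUROC can be reproduced from the mean values in Table 6 under one reading: the relative AUROC gain is computed for each of the six task and prediction-window combinations and then averaged. That reading is an assumption; the text does not state how the figure was aggregated.

```python
# Mean AUROC values copied from Table 6: (post-shift training, pre-shift training).
auroc = {
    ("HF", 360): (0.5821, 0.6597),
    ("HF", 180): (0.5795, 0.7029),
    ("HF", 90): (0.6182, 0.7282),
    ("ST", 360): (0.5325, 0.6088),
    ("ST", 180): (0.5357, 0.6317),
    ("ST", 90): (0.5661, 0.6716),
}

# Relative gain in percent for each task / prediction window.
rel_gain = {key: 100.0 * (pre - post) / post for key, (post, pre) in auroc.items()}

# Unweighted average over the six settings (assumed aggregation).
mean_gain = sum(rel_gain.values()) / len(rel_gain)
print(round(mean_gain, 1))  # 17.2
```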

Distribution shift

The proposed method mitigates the distribution shift in EHRs, especially in the medical codes. Figures 7A–7C show the code distributions for the pre-shift training set, post-shift test set, and reweighted training set, respectively. Here, the x and y axes indicate the codes and their ratios, respectively. The x axis is sorted in descending order of the ratios on the pre-shift training data. As shown in Figures 7A and 7B, there exists a distribution shift between the pre-shift training set and the post-shift test set. Notably, the distribution of the reweighted training set (Figure 7C) is much closer to that of the post-shift test set than the distribution in Figure 7A is. This result also confirms that the sample reweighting mitigates the distribution shift.

Visualization of code distribution

The x and y axes indicate the codes and ratios, respectively. x is set in descending order of the ratios on the pre-shift training data.

(A) Distribution of the pre-shift training data.

(B and C) Post-shift test data (B) and the reweighted pre-shift training data (C).

Ablation study

We conduct an ablation study to investigate whether each component of our model actually contributes to the predictive performance. Starting from the original version of the proposed model, each component is independently excluded to construct two model variants: the proposed method without $L_{mse}$ and the proposed method without $L_{KL}$. Table 7 shows the results of the ablation study. The prediction performance drops when either component is removed. These results demonstrate the effectiveness of directly equalizing the distributions of the codes and of reducing the difference between the latent distributions in the sequential context.

Table 7.

Ablation study for the proposed method

Model	AUROC
Proposed method	0.6185
Proposed method without $L_{mse}$	0.6057
Proposed method without $L_{KL}$	0.6031


The model is based on GRU, and the prediction period is 360 days.

Discussion

Principal results

In this study, we investigate the temporal distribution shift in diagnosis codes and the performance degradation that accompanies the shift. Prediction performance tends to decrease slightly over time but decreases significantly after October 2015, when the ICD version was changed from ICD-9-CM to ICD-10-CM. We find that a predictive model trained on the pre-shift data (EHRs before October 2015) achieves significantly lower performance on the post-shift data (EHRs after October 2015) due to the distribution shift. Conversely, a model trained only with the post-shift data also performs poorly because of the small amount of data. This suggests that a model trained with past EHRs coded in ICD-9-CM does not generalize to EHRs coded in ICD-10-CM, so the past EHRs cannot be exploited directly.

In this work, we address the challenges of performance degradation over time and the ICD version change with stable learning, which learns stable representations for both pre- and post-shift data and mitigates the distribution shift between them. Experiments on the real-world dataset demonstrate that our method not only improves state-of-the-art models but also generalizes prediction performance to new patients that differ from training patients. Our experimental findings are significant because they create new opportunities for EHR studies. The experimental results showing that past EHRs improve prediction performance provide many research opportunities to explore and pursue the benefits of past EHRs. Furthermore, our method builds a bridge between different datasets, providing generalized performance and thus allowing the data to be cross-used.

Conclusion

Clinical risk prediction is crucial for improving healthcare quality. We find that the distributions of the diagnosis codes are inconsistent across time and ICD versions, resulting in a distribution shift between them. In this paper, we propose a novel method to address these issues for clinical risk prediction, learning the sample weights in pre-shift data to mitigate the distribution shift between the pre- and post-shift data. The proposed method not only directly equalizes the occurrence rate of codes in pre- and post-shift data but also equalizes the probability distribution in latent space using KL-divergence. The experimental results demonstrate that our proposed method reduces the distribution shift and thus improves the prediction performance.

Experimental procedures

Resource availability

Lead contact

Further information and requests for resources and reagents should be directed to and will be fulfilled by the lead contact, Ping Zhang (zhang.10631@osu.edu).

Materials availability

This study did not generate any new materials.

Clinical risk prediction definitions and basic notations

We use uppercase and bold letters (e.g., X) for matrices, lowercase and bold letters (e.g., x) for vectors, and lowercase letters (e.g., x) for scalars. Table 8 summarizes the notations used in our method.

Table 8.

Notation definitions

Notation	Description
$D_{pre} \equiv \{X_{i}, \hat{y}_{i}\}_{i=1}^{|D_{pre}|}$	pre-shift training data
$D_{post} \equiv \{X_{i}, \hat{y}_{i}\}_{i=1}^{|D_{post}|}$	post-shift training data
$X_{i}$	ith patient’s EHR sequence
$x_{i,t}$	ith patient’s tth EHR
$w$	sample weights
$\hat{y}_{i}$	label for $X_{i}$
$y_{i}$	prediction for $X_{i}$
$d^{pre}, d^{post}$	code distributions for $X^{pre}, X^{post}$
$h^{pre}, h^{post}$	latent distributions for $X^{pre}, X^{post}$
$Z_{i}$	latent representation for $X_{i}$
$α, β$	weights to control losses
Q	encoder network
P	decoder network
F	classifier


EHR sequence

The EHR data for each patient are represented as a sequence of visits in the order of their occurrence. Each visit of the sequence has a set of varying numbers of diagnosis codes. Thus the vth visit of the ith patient is expressed as a binary vector $x_{i,v} \in \{0,1\}^{C}$, where C is the number of unique diagnosis codes, and a value of 1 for the kth coordinate (i.e., $x_{i,v,k} = 1$) indicates that the kth code is recorded at the vth visit of the ith patient. The EHR sequence for the ith patient is denoted by $X_{i} = [x_{i,1}, x_{i,2}, \dots, x_{i,t_{i}}]$, where $t_{i}$ is the number of visits for the ith patient.
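The representation above can be sketched as a multi-hot encoding of each visit. The four-code vocabulary and the three-visit patient below are toy assumptions chosen for illustration, and `encode_patient` is a hypothetical helper name.

```python
def encode_patient(visits, code_to_index):
    """Turn a list of visits (each a list of diagnosis codes) into binary vectors."""
    num_codes = len(code_to_index)
    sequence = []
    for visit in visits:
        x = [0] * num_codes
        for code in visit:
            x[code_to_index[code]] = 1  # kth coordinate is 1 if code k is recorded
        sequence.append(x)
    return sequence

# Toy vocabulary of C = 4 diagnosis codes (illustrative only).
vocab = {"I10": 0, "E11.9": 1, "I50.9": 2, "N18.3": 3}

# One hypothetical patient with three visits, in order of occurrence.
visits = [["I10"], ["I10", "E11.9"], ["I50.9", "I10"]]

X = encode_patient(visits, vocab)
print(X)  # [[1, 0, 0, 0], [1, 1, 0, 0], [1, 0, 1, 0]]
```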

Clinical risk prediction

Given the EHR sequence $X_{i} = [x_{i,1}, x_{i,2}, \dots, x_{i,T}]$, the goal of health risk predictive modeling in this paper is to predict the target disease at the end of the sequence. The label for the ith patient is denoted by $\hat{y}_{i} \in \{0,1\}$, because we focus on two separate tasks: predicting heart failure and predicting stroke.

Architecture

The proposed framework consists of two steps: (1) sample reweighting that learns the sample weights for the pre-shift training patients using the corresponding EHR sequences to mitigate the temporal distribution shift between the pre- and post-shift training data; (2) classification that learns stable representations from the EHR sequences with the sample weights to predict the best future diagnosis. Figure 1C shows the architecture of the proposed method for sample reweighting.

Sample reweighting

We propose to learn sample weights for the pre-shift training samples to mitigate the distribution shift on diagnosis codes between the pre- and post-shift training sets. We use two approaches: directly equalizing the occurrence rates of codes in the pre- and post-shift training samples, and equalizing their probability distributions in latent space.

To directly equalize the distributions of the codes, we first compute the target distribution of the codes for the post-shift samples by Equations 1 and 2:

$$s_{k}^{post} = \sum_{X_{i} \in D_{post}} \sum_{j=1}^{T} x_{i,j,k},$$

(Equation 1)

$$d_{k}^{post} = \frac{s_{k}^{post}}{\sum_{k'=1}^{|C|} s_{k'}^{post}},$$

(Equation 2)

where $D_{post}$ is the post-shift training data and T is the number of visits for the corresponding patient. We use $w \in \mathbb{R}_{+}^{|D_{pre}|}$ to denote the sample weights, where $D_{pre}$ is the pre-shift training data. The code distribution $d^{pre}$ for $D_{pre}$ can be obtained by Equations 3 and 2:

$$s_{k}^{pre} = \sum_{i=1}^{|D_{pre}|} \sum_{j=1}^{T} w_{i} \cdot x_{i,j,k}.$$

(Equation 3)

The difference between the pre- and post-shift training distributions is then computed using mean squared error (MSE). The loss is as follows:

$$L_{mse} = \frac{1}{C-1} \sum_{k=1}^{C} (d_{k}^{pre} - d_{k}^{post})^{2}.$$

The MSE loss directly adjusts the occurrence rate of the diagnosis codes and thus mitigates the distribution differences between training and test sets, but it ignores the sequential context of EHRs. That is, the relation between a patient’s visits is not considered.
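The code-distribution matching described above can be sketched in plain Python. This is a minimal illustration on a toy dataset with three codes, not the released implementation; the helper names and the example weights are assumptions. The loss keeps the 1/(C − 1) normalization given in the text.

```python
def code_counts(patients, weights=None):
    """s_k: (weighted) number of occurrences of code k over all visits."""
    if weights is None:
        weights = [1.0] * len(patients)
    num_codes = len(patients[0][0])
    s = [0.0] * num_codes
    for w, visits in zip(weights, patients):
        for x in visits:
            for k in range(num_codes):
                s[k] += w * x[k]
    return s

def normalize(s):
    """d_k = s_k / sum over all codes of s_k."""
    total = sum(s)
    return [v / total for v in s]

def mse_loss(d_pre, d_post):
    """Squared difference between code distributions, divided by C - 1."""
    num_codes = len(d_pre)
    return sum((a - b) ** 2 for a, b in zip(d_pre, d_post)) / (num_codes - 1)

# Toy data: each patient is a list of multi-hot visit vectors over 3 codes.
pre = [[[1, 0, 0], [1, 1, 0]], [[0, 0, 1]]]
post = [[[0, 0, 1], [0, 1, 1]]]

d_post = normalize(code_counts(post))
d_pre_uniform = normalize(code_counts(pre))
d_pre_weighted = normalize(code_counts(pre, weights=[0.5, 1.5]))  # hypothetical weights

loss_uniform = mse_loss(d_pre_uniform, d_post)
loss_weighted = mse_loss(d_pre_weighted, d_post)
print(loss_uniform, loss_weighted)
```

In this toy case, up-weighting the second pre-shift patient, whose codes resemble the post-shift data, lowers the loss.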

To address this issue and further force the distributions to be similar, we map the samples to latent representations via an auto-encoder network.³⁰ The main idea is to construct an embedding space from which the abstract information of the sequence for all visits is generated and to learn robust weights in the latent space. After embedding, the latent features for the training samples are weighted. We then minimize Kullback-Leibler divergence (KL-divergence) between two distributions in the latent space.

We first map pre- and post-shift training samples to the sequence of latent representations, z, with the auto-encoder model whose encoder network is $Q : \mathbb{R}^{T \times |C|} \to \mathbb{R}^{T \times F}$ and decoder network is $P : \mathbb{R}^{T \times F} \to \mathbb{R}^{T \times |C|}$. Here T and F are the number of visits and the dimension of latent features from Q, respectively. The auto-encoder model is first trained with both pre- and post-shift data before training the sample weights to learn useful latent representations of the input code space. The reconstruction loss is as follows:

$$\begin{array}{l} \hat{x}_{i} = P(Q(x_{i})) \\ L_{reconst} = \sum_{x_{i} \in D_{pre} \cup D_{post}} (x_{i} - \hat{x}_{i})^{2} \end{array}.$$

(Equation 4)

After training the auto-encoder model with Equation 4, the sequence of latent representations for the ith patient is obtained as follows:

$$Z_{i} = [z_{i,1}, z_{i,2}, \dots, z_{i,T}] = [Q(x_{i,1}), Q(x_{i,2}), \dots, Q(x_{i,T})],$$

(Equation 5)

where $Z_{i}$ reflects the sequence of diagnosis codes for all visits in the order of their occurrence. The pre- and post-shift training distributions in the latent space are then computed as

$$h^{pre} = \frac{1}{|D_{pre}|} \sum_{i=1}^{|D_{pre}|} w_{i} \cdot Z_{i},$$

(Equation 6)

$$h^{post} = \frac{1}{|D_{post}|} \sum_{i=1}^{|D_{post}|} Z_{i}.$$

(Equation 7)

The KL loss between two latent distributions is expressed in Equation 8:

$$L_{KL} = h^{post} \cdot \log \frac{h^{post}}{h^{pre}}.$$

(Equation 8)
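Equations 6 to 8 can be illustrated with a small sketch. Two simplifications are assumptions made here and are not stated in the text: each patient's latent sequence is already pooled into a single vector of length F, and the averaged latent vectors are passed through a softmax so that they form valid probability distributions before the KL-divergence is taken.

```python
import math

def weighted_mean(Z, weights):
    """(1 / n) * sum_i w_i * z_i, computed feature by feature."""
    n, num_features = len(Z), len(Z[0])
    return [sum(w * z[f] for w, z in zip(weights, Z)) / n for f in range(num_features)]

def softmax(v):
    """Normalize a vector into a probability distribution (assumed step)."""
    m = max(v)
    e = [math.exp(a - m) for a in v]
    total = sum(e)
    return [a / total for a in e]

def kl_divergence(p, q):
    """KL(p || q) = sum_f p_f * log(p_f / q_f)."""
    return sum(pf * math.log(pf / qf) for pf, qf in zip(p, q))

# Toy pooled latent vectors (F = 2); values are illustrative only.
Z_pre = [[1.0, 0.0], [0.0, 1.0]]
Z_post = [[0.0, 1.0], [0.2, 0.8]]

h_post = softmax(weighted_mean(Z_post, [1.0, 1.0]))
h_pre_uniform = softmax(weighted_mean(Z_pre, [1.0, 1.0]))
h_pre_weighted = softmax(weighted_mean(Z_pre, [0.2, 1.8]))  # hypothetical weights

kl_before = kl_divergence(h_post, h_pre_uniform)
kl_after = kl_divergence(h_post, h_pre_weighted)
print(kl_before, kl_after)
```

With these toy weights the reweighted pre-shift latent mean coincides with the post-shift mean, so the divergence drops to approximately zero.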

We iteratively optimize the sample weights by Equations 9 and 10. Here α and β are the coefficients that control the MSE and KL-divergence constraints, respectively, $N = |D_{pre}|$ is the number of pre-shift training samples, and $\Delta = \{w \in \mathbb{R}_{+}^{N}\}$. We consistently consider non-negative weights. Positive weights represent the relative importance of samples, enabling the model to effectively learn from significant samples. Conversely, negative weights may cause the model to treat samples in the opposite manner, which could lead to confusion and misinterpretation of the intended meaning of the weights. $w$ is also regularized so that the sum of $w$ equals N. The reason for this regularization is that sample weights that are too small or too large can cause instability or non-convergence during training. Constraining the sum of sample weights stabilizes training and helps it converge, thereby enhancing the performance and robustness of the model:

$$L_{w} = \alpha \cdot L_{mse} + \beta \cdot L_{KL} + \left(\sum_{i=1}^{N} w_{i} - N\right)^{2},$$

(Equation 9)

$$w^{t+1} = \arg\min_{w \in \Delta} L_{w}.$$

(Equation 10)
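The constrained minimization in Equations 9 and 10 can be sketched with projected gradient descent. This is a simplified illustration under several assumptions: the KL term is dropped (β = 0), gradients are approximated by finite differences rather than automatic differentiation, a backtracking step is used so that the loss never increases, and the toy data, α, and step sizes are arbitrary. It is not the authors' optimizer.

```python
def code_distribution(patients, weights):
    """Weighted code distribution d^pre (Equations 3 and 2)."""
    num_codes = len(patients[0][0])
    s = [0.0] * num_codes
    for w, visits in zip(weights, patients):
        for x in visits:
            for k in range(num_codes):
                s[k] += w * x[k]
    total = sum(s)
    return [v / total for v in s]

def weight_loss(w, pre, d_post, alpha):
    """alpha * L_mse + (sum(w) - N)^2, i.e., Equation 9 with beta = 0."""
    d_pre = code_distribution(pre, w)
    mse = sum((a - b) ** 2 for a, b in zip(d_pre, d_post)) / (len(d_post) - 1)
    return alpha * mse + (sum(w) - len(w)) ** 2

def numerical_grad(w, loss_fn, eps=1e-6):
    """Central finite-difference approximation of the gradient."""
    grad = []
    for i in range(len(w)):
        up, down = w[:], w[:]
        up[i] += eps
        down[i] -= eps
        grad.append((loss_fn(up) - loss_fn(down)) / (2 * eps))
    return grad

def learn_weights(pre, d_post, alpha=10.0, steps=200, lr=0.5):
    w = [1.0] * len(pre)  # start from uniform weights
    loss_fn = lambda v: weight_loss(v, pre, d_post, alpha)
    for _ in range(steps):
        g = numerical_grad(w, loss_fn)
        step = lr
        while step > 1e-8:
            # Gradient step followed by projection onto non-negative weights.
            cand = [max(wi - step * gi, 0.0) for wi, gi in zip(w, g)]
            if loss_fn(cand) < loss_fn(w):
                w = cand
                break
            step /= 2  # backtrack until the loss decreases
    return w

# Toy data: two pre-shift patients, target distribution from post-shift data.
pre = [[[1, 0, 0], [1, 1, 0]], [[0, 0, 1]]]
d_post = [0.0, 1.0 / 3.0, 2.0 / 3.0]

initial_loss = weight_loss([1.0, 1.0], pre, d_post, alpha=10.0)
w = learn_weights(pre, d_post, alpha=10.0)
final_loss = weight_loss(w, pre, d_post, alpha=10.0)
print(w, initial_loss, final_loss)
```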

Classification

The clinical risk prediction is conducted with a classification network $f : \mathbb{R}^{T \times |C|} \to \mathbb{R}$. Given the trained sample weights, the weights are fixed and then multiplied by the classification losses for the corresponding training data to train the classification model. Samples with smaller weights have less impact on the model training, and samples with larger weights have more impact. The weighted losses allow learning stable representations for both the pre- and post-shift training data.

Our algorithm iteratively optimizes the prediction function f as follows:

$$f^{t+1} = \arg\min_{f} \sum_{X_{i} \in D_{pre}} w_{i} \cdot L_{label}(f(X_{i}), \hat{y}_{i}),$$

(Equation 11)

where $L_{label}(\cdot)$ represents the binary cross-entropy loss function.

In the training phase, we optimize the predictive model parameters with the weighted training samples. On the other hand, in the inference phase, the model directly predicts the label without any sample weights.
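The weighted objective in Equation 11 can be illustrated with a plain-Python sketch of a sample-weighted binary cross-entropy. The labels, predicted probabilities, and weights below are toy values, and the function names are hypothetical.

```python
import math

def bce(y_true, p, eps=1e-12):
    """Binary cross-entropy for one sample; p is the predicted probability."""
    p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1.0 - p))

def weighted_bce(labels, probs, weights):
    """Sum over samples of w_i * L_label(f(X_i), y_i), as in Equation 11."""
    return sum(w * bce(y, p) for y, p, w in zip(labels, probs, weights))

# Toy example: the second sample has weight 0 and does not affect training.
labels = [1, 0]
probs = [0.5, 0.5]
weights = [2.0, 0.0]

loss = weighted_bce(labels, probs, weights)
print(loss)  # 2 * ln(2), about 1.386

# At inference time no weights are involved: the model output is used directly.
predictions = [1 if p >= 0.5 else 0 for p in probs]
```

In PyTorch, one way to obtain the same quantity would be to compute `BCELoss` with `reduction="none"` and multiply the per-sample losses by the fixed weights before summing.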

Optimization

To apply the proposed method, we use a two-stage optimization process as follows. First, the sample weights $w$ are trained by minimizing $L_{w}$ on the pre- and post-shift training data, $D_{pre}$ and $D_{post}$. The trained weights $w$ are then used in the training of the classification network f, in which the classification losses for $D_{pre}$ are multiplied by the corresponding weights. The loss $L_{label}$ is minimized for prediction.

Baseline methods

We apply our method to several deep-learning-based models for health risk prediction to validate its effectiveness. For a fair comparison, all models use only historical diagnoses as input, without additional information such as ontology and temporal intervals. The baseline models are described as follows.

LSTM³¹: a variant of RNN with a long short-term memory gating mechanism.

GRU³²: a variant of RNN with gated recurrent units.

Dipole⁹: a bidirectional recurrent-neural-network-based model with attention mechanisms. Dipole models patients’ visits in both time-ordered and reverse time-ordered ways and calculates the weights for previous visits with attention.

RETAIN⁸: an RNN-based model with reverse time attention modules to model reverse time-ordered EHRs. The attention learns weights for all medical codes, which are used to analyze the codes’ contributions to the prediction.

ConCare¹⁰: an RNN-based model with multi-head self-attention to consider the patient’s personal health context. ConCare extracts interdependencies between clinical features to learn the personal health context.

StageNet³: a neural-network-based model with an LSTM module and a convolutional module to model disease-stage information for risk prediction.

To further evaluate our method, we compare our method with existing methods for mitigating temporal data shift. DG refers to a DG-based model that learns robust representation over time.²⁰ DG leverages the aforementioned baseline model as its backbone network and has a one-layer adversarial network after the last hidden layer. Each year is set in a different domain, and both pre- and post-shift training sets are utilized for the model training phase. AdaDiag is a DA-based model that consists of a transformer encoder, domain discriminator, and disease classifier. The pre- and post-shift training sets are set to the source and target domains, respectively.

Implementation and evaluation

All models are implemented in PyTorch.³³ We use the Adam algorithm on mini-batches of 32 patients to optimize the predictive model. The optimal hyper-parameters are found with the validation data in the training phase. Training stops when the validation metric has not improved for ten epochs, and the test performance is then reported. Hyper-parameters used by all baseline methods include the learning rate, the number of hidden nodes, and the number of hidden layers. The ranges of the hyper-parameters are {1e−3, 1e−4} for the learning rate, {128, 256, 512} for the number of hidden nodes, and {2, 3} for the number of layers. For the proposed method, the hyper-parameters used to optimize the auto-encoder include the number of hidden nodes. The learning rate and the number of epochs for training the auto-encoder are fixed at 0.001 and 1,000, respectively. Additionally, the hyper-parameters used to learn the sample weights are the learning rate, the number of epochs, and the coefficients (i.e., α and β). The ranges of the hyper-parameters are {16, 32, 64, 128} for the hidden nodes, {0.001, 0.01} for the learning rate, and {100, 300, 500} for the epochs. Both α and β are selected from {1, 1e+4, 1e+7, 1e+10}. The effect of hyper-parameter tuning for our method is visualized in Figure S2. All neural-network models, including the auto-encoder for the proposed model, are initialized with a uniform distribution. We use BCELoss as the loss function for classification.
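The baseline search space and the stopping rule described above can be sketched as follows. The grid values are taken from the text; the dictionary keys, the `EarlyStopping` class, and the validation history are hypothetical, and the sketch assumes a validation metric where higher is better.

```python
from itertools import product

# Baseline hyper-parameter ranges as listed in the text.
baseline_grid = {
    "learning_rate": [1e-3, 1e-4],
    "hidden_nodes": [128, 256, 512],
    "num_layers": [2, 3],
}

def configurations(grid):
    """Enumerate every combination of hyper-parameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]

class EarlyStopping:
    """Stop when the validation metric has not improved for `patience` epochs."""

    def __init__(self, patience=10):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, metric):
        if metric > self.best:
            self.best = metric
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

configs = configurations(baseline_grid)
print(len(configs))  # 12 combinations

# Made-up validation history: improves once, then stalls.
history = [0.60, 0.61] + [0.60] * 15
stopper = EarlyStopping(patience=10)
stopped_at = None
for epoch, metric in enumerate(history, start=1):
    if stopper.step(metric):
        stopped_at = epoch
        break
print(stopped_at)  # 12
```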

Acknowledgments

This work was funded in part by the National Science Foundation under award numbers IIS-2145625 and CBET-2037398.

Author contributions

Conceptualization, C.Y. and P.Z.; methodology, S.L., C.Y., and P.Z.; formal analysis, S.L., C.Y., and P.Z.; writing – review & editing, S.L., C.Y., and P.Z.; supervision, P.Z.

Declaration of interests

The authors declare no competing interests.

Inclusion and diversity

We support inclusive, diverse, and equitable conduct of research.

Published: August 22, 2023

Footnotes

Supplemental information can be found online at https://doi.org/10.1016/j.patter.2023.100828.

Supplemental information

Document S1. Tables S1–S6 and Figures S1 and S2

mmc1.pdf (1.1 MB, pdf)

Document S2. Article plus supplemental information

mmc2.pdf (4.4 MB, pdf)

Data and code availability

The data analyzed in this paper are from MarketScan Commercial Claims and Encounters, with more than 100 million patients from 2012 to 2017. Access to the MarketScan data is provided by the Ohio State University. The dataset is available from IBM at MarketScan: https://www.ibm.com/products/marketscan-research-databases. The source code is available from the GitHub repository at https://github.com/yeon-lab/stable-prediction or the Zenodo repository at https://doi.org/10.5281/zenodo.7826125.

References

1. Guo L.L., Steinberg E., Fleming S.L., Posada J., Lemmon J., Pfohl S.R., Shah N., Fries J., Sung L. EHR Foundation Models Improve Robustness in the Presence of Temporal Distribution Shift. Preprint at medRxiv. 2022. doi: 10.1101/2022.04.15.22273900.
2. Rajkomar A., Oren E., Chen K., Dai A.M., Hajaj N., Hardt M., Liu P.J., Liu X., Marcus J., Sun M., et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018;1:18. doi: 10.1038/s41746-018-0029-1.
3. Gao J., Xiao C., Wang Y., Tang W., Glass L.M., Sun J. Proceedings of The Web Conference 2020. 2020. Stagenet: Stage-aware neural networks for health risk prediction; pp. 530–540.
4. Ma T., Xiao C., Wang F. Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM; 2018. Health-atm: A deep architecture for multifaceted patient health record representation and risk prediction; pp. 261–269.
5. Zhang X.S., Tang F., Dodge H.H., Zhou J., Wang F. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019. Metapred: Meta-learning for clinical risk prediction with limited patient electronic health records; pp. 2487–2495.
6. Luo Y., Liu Z., Liu Q. Deep stable representation learning on electronic health records. Preprint at arXiv. 2022. doi: 10.48550/arXiv.2209.01321.
7. Yin C., Zhao R., Qian B., Lv X., Zhang P. 2019 IEEE International Conference on Data Mining (ICDM). IEEE; 2019. Domain knowledge guided deep learning with electronic health records; pp. 738–747.
8. Choi E., Bahadori M.T., Sun J., Kulas J., Schuetz A., Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Adv. Neural Inf. Process. Syst. 2016;29.
9. Ma F., Chitta R., Zhou J., You Q., Sun T., Gao J. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks; pp. 1903–1911.
10. Ma L., Zhang C., Wang Y., Ruan W., Wang J., Tang W., Ma X., Gao X., Gao J. Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 2020. Concare: Personalized clinical feature embedding via capturing the healthcare context; pp. 833–840.
11. Choi E., Bahadori M.T., Song L., Stewart W.F., Sun J. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017. GRAM: graph-based attention model for healthcare representation learning; pp. 787–795.
12. Luo J., Ye M., Xiao C., Ma F. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. Hitanet: Hierarchical time-aware attention networks for risk prediction on electronic health records; pp. 647–656.
13. Duchi J., Namkoong H. Learning models with uniform performance via distributionally robust optimization. Preprint at arXiv. 2018. doi: 10.48550/arXiv.1810.08750.
14. Creager E., Jacobsen J.-H., Zemel R. International Conference on Machine Learning. PMLR; 2021. Environment inference for invariant learning; pp. 2189–2200.
15. Shen Z., Cui P., Zhang T., Kuang K. Stable learning via sample reweighting. Proc. AAAI Conf. Artif. Intell. 2020;34:5692–5699.
16. Shen Z., Liu J., He Y., Zhang X., Xu R., Yu H., Cui P. Towards out-of-distribution generalization: A survey. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2108.13624.
17. Avati A., Seneviratne M., Xue E., Xu Z., Lakshminarayanan B., Dai A.M. BEDS-Bench: Behavior of EHR-models under Distributional Shift–A Benchmark. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2107.08189.
18. Grief S.N., Patel J., Kochendorfer K.M., Green L.A., Lussier Y.A., Li J., Burton M., Boyd A.D. Simulation of ICD-9 to ICD-10-CM transition for family medicine: simple or convoluted? J. Am. Board Fam. Med. 2016;29:29–36. doi: 10.3122/jabfm.2016.01.150146.
19. Ulmer D., Meijerink L., Cinà G. Machine Learning for Health. PMLR; 2020. Trust issues: Uncertainty estimation does not enable reliable ood detection on medical tabular data; pp. 341–354.
20. Guo L.L., Steinberg E., Fleming S.L., Posada J., Lemmon J., Pfohl S.R., Shah N., Fries J., Sung L. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 2023;13:3767. doi: 10.1038/s41598-023-30820-8.
21. Zhou K., Liu Z., Qiao Y., Xiang T., Loy C.C. Domain Generalization: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023;45:4396–4415. doi: 10.1109/TPAMI.2022.3195549.
22. Zhang T., Chen M., Bui A.A.T. AdaDiag: Adversarial Domain Adaptation of Diagnostic Prediction with Clinical Event Sequences. J. Biomed. Inform. 2022;134:104168. doi: 10.1016/j.jbi.2022.104168.
23. National Bureau of Economic Research. 2023. General Equivalence Mappings.
24. IBM. 2020. MarketScan Research Databases. https://www.ibm.com/products/marketscan-research-databases
25. Gheorghiade M., Bonow R.O. Chronic heart failure in the United States: a manifestation of coronary artery disease. Circulation. 1998;97:282–289. doi: 10.1161/01.cir.97.3.282.
26. American Heart Association. 2017. Causes of Heart Failure. https://www.heart.org/en/health-topics/heart-failure/causes-and-risks-for-heart-failure/causes-of-heart-failure
27. Centers for Disease Control and Prevention. 2018. Conditions that increase risk for stroke. https://www.cdc.gov/stroke/conditions.htm
28. Heart and Stroke Foundation of Canada. 2019. Coronary Artery Disease. https://www.heartandstroke.ca/heart/conditions/coronary-artery-disease
29. Zhu Z., Yin C., Qian B., Cheng Y., Wei J., Wang F. In 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE; 2016. Measuring patient similarities via a deep architecture with medical concept embedding; pp. 749–758.
30. Rumelhart D.E., Hinton G.E., Williams R.J. Learning internal representations by error propagation. Technical report. California Univ San Diego La Jolla Inst for Cognitive Science. 1985.
31. Graves A. 2012. Long Short-Term Memory. Supervised Sequence Labelling with Recurrent Neural Networks; pp. 37–45.
32. Cho K., Van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. Preprint at arXiv. 2014. doi: 10.48550/arXiv.1406.1078. In EMNLP’14.
33. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. In: Advances in Neural Information Processing Systems 32. Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., editors. Curran Associates, Inc; 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library; pp. 8024–8035.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

Document S1. Tables S1–S6 and Figures S1 and S2

mmc1.pdf^{(1.1MB, pdf)}

Document S2. Article plus supplemental information

mmc2.pdf^{(4.4MB, pdf)}

Data Availability Statement

[bib1] 1.Guo L.L., Steinberg E., Fleming S.L., Posada J., Lemmon J., Pfohl S.R., Shah N., Fries J., Sung L. EHR Foundation Models Improve Robustness in the Presence of Temporal Distribution Shift. medRxiv. 2022 doi: 10.1101/2022.04.15.22273900. Preprint at. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib2] 2.Rajkomar A., Oren E., Chen K., Dai A.M., Hajaj N., Hardt M., Liu P.J., Liu X., Marcus J., Sun M., et al. Scalable and accurate deep learning with electronic health records. NPJ Digit. Med. 2018;1 doi: 10.1038/s41746-018-0029-1. 18–10. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib3] 3.Gao J., Xiao C., Wang Y., Tang W., Glass L.M., Sun J. Proceedings of The Web Conference 2020. 2020. Stagenet: Stage-aware neural networks for health risk prediction; pp. 530–540. [Google Scholar]

[bib4] 4.Ma T., Xiao C., Wang F. Proceedings of the 2018 SIAM International Conference on Data Mining. SIAM; 2018. Health-atm: A deep architecture for multifaceted patient health record representation and risk prediction; pp. 261–269. [Google Scholar]

[bib5] 5.Zhang X.S., Tang F., Dodge H.H., Zhou J., Wang F. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2019. Metapred: Meta-learning for clinical risk prediction with limited patient electronic health records; pp. 2487–2495. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib6] 6.Luo Y., Liu Z., Liu Q. Deep stable representation learning on electronic health records. arXiv. 2022 doi: 10.48550/arXiv.2209.01321. Preprint at. [DOI] [Google Scholar]

[bib7] 7.Yin C., Zhao R., Qian B., Lv X., Zhang P. 2019 IEEE International Conference on Data Mining (ICDM) IEEE; 2019. Domain knowledge guided deep learning with electronic health records; pp. 738–747. [Google Scholar]

[bib8] 8.Choi E., Bahadori M.T., Sun J., Kulas J., Schuetz A., Stewart W. Retain: An interpretable predictive model for healthcare using reverse time attention mechanism. Adv. Neural Inf. Process. Syst. 2016;29 [Google Scholar]

[bib9] 9.Ma F., Chitta R., Zhou J., You Q., Sun T., Gao J. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017. Dipole: Diagnosis prediction in healthcare via attention-based bidirectional recurrent neural networks; pp. 1903–1911. [Google Scholar]

[bib10] 10.Ma L., Zhang C., Wang Y., Ruan W., Wang J., Tang W., Ma X., Gao X., Gao J. Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 34. 2020. Concare: Personalized clinical feature embedding via capturing the healthcare context; pp. 833–840. [Google Scholar]

[bib11] 11.Choi E., Bahadori M.T., Song L., Stewart W.F., Sun J. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 2017. GRAM: graph-based attention model for healthcare representation learning; pp. 787–795. [DOI] [PMC free article] [PubMed] [Google Scholar]

[bib12] 12.Luo J., Ye M., Xiao C., Ma F. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2020. Hitanet: Hierarchical time-aware attention networks for risk prediction on electronic health records; pp. 647–656. [Google Scholar]

[bib13] 13.Duchi J., Namkoong H. Learning models with uniform performance via distributionally robust optimization. arXiv. 2018 doi: 10.48550/arXiv.1810.08750. Preprint at. [DOI] [Google Scholar]

[bib14] 14. Creager E., Jacobsen J.-H., Zemel R. International Conference on Machine Learning. PMLR; 2021. Environment inference for invariant learning; pp. 2189–2200.

[bib15] 15. Shen Z., Cui P., Zhang T., Kuang K. Stable learning via sample reweighting. Proc. AAAI Conf. Artif. Intell. 2020;34:5692–5699.

[bib16] 16. Shen Z., Liu J., He Y., Zhang X., Xu R., Yu H., Cui P. Towards out-of-distribution generalization: A survey. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2108.13624.

[bib17] 17. Avati A., Seneviratne M., Xue E., Xu Z., Lakshminarayanan B., Dai A.M. BEDS-Bench: Behavior of EHR-models under Distributional Shift–A Benchmark. Preprint at arXiv. 2021. doi: 10.48550/arXiv.2107.08189.

[bib18] 18. Grief S.N., Patel J., Kochendorfer K.M., Green L.A., Lussier Y.A., Li J., Burton M., Boyd A.D. Simulation of ICD-9 to ICD-10-CM transition for family medicine: simple or convoluted? J. Am. Board Fam. Med. 2016;29:29–36. doi: 10.3122/jabfm.2016.01.150146.

[bib19] 19. Ulmer D., Meijerink L., Cinà G. Machine Learning for Health. PMLR; 2020. Trust issues: Uncertainty estimation does not enable reliable ood detection on medical tabular data; pp. 341–354.

[bib20] 20. Guo L.L., Steinberg E., Fleming S.L., Posada J., Lemmon J., Pfohl S.R., Shah N., Fries J., Sung L. EHR foundation models improve robustness in the presence of temporal distribution shift. Sci. Rep. 2023;13:3767. doi: 10.1038/s41598-023-30820-8.

[bib21] 21. Zhou K., Liu Z., Qiao Y., Xiang T., Loy C.C. Domain Generalization: A Survey. IEEE Trans. Pattern Anal. Mach. Intell. 2023;45:4396–4415. doi: 10.1109/TPAMI.2022.3195549.

[bib22] 22. Zhang T., Chen M., Bui A.A.T. AdaDiag: Adversarial Domain Adaptation of Diagnostic Prediction with Clinical Event Sequences. J. Biomed. Inform. 2022;134:104168. doi: 10.1016/j.jbi.2022.104168.

[bib23] 23. National Bureau of Economic Research. 2023. General Equivalence Mappings.

[bib24] 24. IBM. 2020. MarketScan Research Databases. https://www.ibm.com/products/marketscan-research-databases

[bib25] 25. Gheorghiade M., Bonow R.O. Chronic heart failure in the United States: a manifestation of coronary artery disease. Circulation. 1998;97:282–289. doi: 10.1161/01.cir.97.3.282.

[bib26] 26. American Heart Association. 2017. Causes of Heart Failure. https://www.heart.org/en/health-topics/heart-failure/causes-and-risks-for-heart-failure/causes-of-heart-failure

[bib27] 27. Centers for Disease Control and Prevention. 2018. Conditions that increase risk for stroke. https://www.cdc.gov/stroke/conditions.htm

[bib28] 28. Heart and Stroke Foundation of Canada. 2019. Coronary Artery Disease. https://www.heartandstroke.ca/heart/conditions/coronary-artery-disease

[bib29] 29. Zhu Z., Yin C., Qian B., Cheng Y., Wei J., Wang F. 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE; 2016. Measuring patient similarities via a deep architecture with medical concept embedding; pp. 749–758.

[bib30] 30. Rumelhart D.E., Hinton G.E., Williams R.J. Learning internal representations by error propagation. Technical report. California Univ San Diego La Jolla Inst for Cognitive Science. 1985.

[bib31] 31. Graves A. 2012. Long Short-Term Memory. Supervised Sequence Labelling with Recurrent Neural Networks; pp. 37–45.

[bib32] 32. Cho K., Van Merriënboer B., Gulcehre C., Bahdanau D., Bougares F., Schwenk H., Bengio Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP’14. Preprint at arXiv. 2014. doi: 10.48550/arXiv.1406.1078.

[bib33] 33. Paszke A., Gross S., Massa F., Lerer A., Bradbury J., Chanan G., Killeen T., Lin Z., Gimelshein N., Antiga L., et al. In: Advances in Neural Information Processing Systems 32. Wallach H., Larochelle H., Beygelzimer A., d'Alché-Buc F., Fox E., Garnett R., editors. Curran Associates, Inc; 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library; pp. 8024–8035.
