AMIA Summits on Translational Science Proceedings. 2024 May 31;2024:95–104.

Pathophysiological Features in Electronic Medical Records Sustain Model Performance under Temporal Dataset Shift

Raphael Brosula 1,2, Conor K Corbin 3, Jonathan H Chen 4
PMCID: PMC11141811  PMID: 38827052

Abstract

Access to real-world data streams like electronic medical records (EMRs) has accelerated the development of supervised machine learning (ML) models for clinical applications. However, few studies investigate the differential impact of particular features in the EMR on model performance under temporal dataset shift. To explain how features in the EMR impact models over time, this study aggregates features into feature groups by their source (e.g. medication orders, diagnosis codes and lab results) and feature categories based on their reflection of patient pathophysiology or healthcare processes. We adapt Shapley values to explain feature groups’ and feature categories’ marginal contribution to initial and sustained model performance. We investigate three standard clinical prediction tasks and find that while feature contributions to initial performance differ across tasks, pathophysiological features help mitigate temporal discrimination deterioration. These results provide interpretable insights on how specific feature groups contribute to model performance and robustness to temporal dataset shift.

Introduction

Supervised machine learning applications in healthcare trained using real-world electronic medical record (EMR) data have the potential to greatly improve clinical processes as diverse as patient diagnosis, prognosis, and treatment selection.1,2 Despite their promise, clinical machine learning-based models often underperform in the deployment setting compared to the training setting.3 Addressing the causes of model underperformance and developing interpretable statistical methods to assess them are key technical challenges in the safe deployment, adoption, and evaluation of ML models in healthcare.

Most evaluation methods for ML models assume that the training and test datasets are independent and identically distributed; however, models deployed in clinical settings often experience dataset shift.4 Dataset shift occurs when the joint distribution P(X, Y) over features X and labels Y changes throughout a model’s life cycle.5 Temporal dataset shift, also described in the literature as nonstationarity, occurs when distributions evolve over time.6 Temporal dataset shift has been extensively documented as a major contributor to the degradation of model performance in real-world clinical settings, with heterogeneous impacts depending on the task, model, and metrics used to measure model performance.3,6–9 For example, when the COVID-19 pandemic led to changes in patient case load and acuity, the Epic Sepsis Model (ESM) began generating spurious alerts, increasing alert fatigue and clinician burden and substantially deteriorating the model’s clinical utility.10,11

Approaches to improve model robustness to temporal dataset shift have typically focused on post-deployment model-level methods, such as model recalibration and retraining. However, empirical evaluations of model-level methods demonstrate failure to mitigate deteriorating discrimination ability across several common clinical ML tasks.7 Additionally, model-level methods are computationally expensive and can cause overfitting to more recent training data that only worsens generalization to future observations.12,13 Further, model-level methods that fail to consider feedback effects (changes to the distribution of future data caused by use of the model in practice) can yield dangerous spurious correlations. Naive retraining of a prognostic model that has become a “victim of its own success” due to successful interventions will yield a model that classifies high-risk patients as low risk because it assumes the intervention will occur. Such feedback effects have been empirically shown to cause worse performance and amplify errors over time.14–16

In contrast, feature-level temporal robustness methods attempt to bypass the need for model retraining by including and representing clinical features in a way that produces a model less prone to performance decay.9 Developing feature-level methods poses challenges due to the heterogeneity of clinical data stored in the EMR. Common structured features include patient demographics, diagnosis codes, procedure orders, vital signs, and lab results. These features are often sparse, multimodal, and asynchronously sampled within a patient’s medical timeline.17 Few studies have analyzed the use of specific feature types in readily available EMR data and their impact on model performance and robustness to temporal dataset shift.7 Recent work on feature-level methods has focused on two approaches: 1) feature selection and 2) feature representation. Feature selection approaches that remove individual features purely on the basis of collinearity fail to leverage the natural structure of EMR feature types and the causal implications linked to their underlying generating process.18–20 Methods focused on feature representation have found more success in creating models robust to performance decay. For example, studies have shown that engineered features based on clinical domain knowledge produce models more robust to temporal dataset shift.6,21–23 More recent work has found that foundation models, trained on representations leveraging temporal information, are robust to temporal dataset shift.9,24

Feature-level approaches do not explicitly consider the joint distribution between features X and the target variables Y. This makes it difficult to assess how changes in feature types and representations impact model performance for clinically relevant tasks. Such considerations are broadly applicable for clinicians, data scientists, and engineers interested in deploying and monitoring ML applications in healthcare.12,13,25 Statistical methods that explicitly consider the relationship between the feature types included from the EMR and model performance over time are needed to guide the successful translation and maintenance of clinical ML models at the bedside.

In this study, we propose an interpretable, model-agnostic, and task-independent approach that leverages the natural structure of the EMR to determine the role of feature types in model performance and robustness to temporal dataset shift. Our approach aggregates individual features into feature groups organized by the source of data, such as diagnosis codes and medications. It further organizes these feature groups into pathophysiological and process-oriented feature categories, recognizing that features in the electronic medical record reflect both patient pathophysiology and the complex processes that underlie healthcare delivery and provision.19,20 Our objective was to determine whether the use of pathophysiological features yields greater performance and temporal robustness than the use of process-oriented features. We find that while contributions to initial performance are heterogeneous and task-dependent, the use of pathophysiological features mitigates temporal discrimination deterioration across three common clinical ML tasks. The method will be of interest to clinicians and ML practitioners interested in the eventual deployment of clinical models, and can be linked to other paradigms for retrospective assessment of models pre-deployment as well as prospective assessment for the continuous monitoring of deployed models.

Methods

Data Sources: We used the STAnford Research Repository (STARR) which contains de-identified electronic medical records for over 2.4 million unique patients from 2009 to 2021 across Stanford Hospital, Valleycare Hospital, and Stanford University Healthcare Alliance affiliated ambulatory clinics.26 Use of STARR data was approved by the institutional review board of the Stanford University School of Medicine.

Cohort Definitions: We constructed three different cohorts for three different binary clinical prediction tasks: inpatient mortality, long length of stay, and thirty-day readmission. These tasks were selected for their frequency in literature and to demonstrate the generalizability of our approach across different clinical prediction tasks. The unit of analysis is the patient encounter. Descriptive statistics for each cohort are described in Table 1.

Table 1:

Demographic characteristics across each cohort. Each cohort is constructed by sampling 2,000 observations from each year between 2009 and 2021.

| | Inpatient Mortality | Long Length of Stay | Thirty-Day Readmission |
| --- | --- | --- | --- |
| Number of unique patients | 23,662 | 23,738 | 23,712 |
| Mean age (SD) | 59.3 (18.7) | 59.1 (18.7) | 58.9 (18.7) |
| Sex, n (%): Female | 13,194 (50.7) | 13,199 (50.7) | 13,371 (51.4) |
| Sex, n (%): Male | 12,803 (49.2) | 12,799 (49.2) | 12,637 (48.6) |
| Sex, n (%): Unknown | 3 (0.0) | 2 (0.0) | 2 (0.0) |
| Number of positive labels (%) | 629 (2.4) | 5,633 (21.7) | 3,728 (14.3) |

Inpatient Mortality: We defined a positive label for this task as any patient whose recorded date of death occurs during a given admission event.

Long Length of Stay: We defined a positive label for this task as any patient admitted to the hospital who stayed in the inpatient setting for at least seven days.

Thirty-Day Readmission: We defined a positive label for this task as any admission for which the patient’s next admission in the clinical record occurs within 30 days.
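As a concrete illustration of these label definitions, the sketch below derives the three binary labels from a hypothetical admissions table; the column names (patient_id, admit_time, discharge_time, death_date) and the choice to anchor the 30-day readmission window at discharge are our assumptions, not the authors' cohort code.

```python
import pandas as pd

def add_labels(admissions: pd.DataFrame) -> pd.DataFrame:
    """Derive inpatient mortality, long length of stay, and 30-day readmission labels."""
    adm = admissions.sort_values(["patient_id", "admit_time"]).copy()

    # Inpatient mortality: recorded date of death falls within the admission.
    adm["inpatient_mortality"] = (
        adm["death_date"].notna()
        & (adm["death_date"] >= adm["admit_time"])
        & (adm["death_date"] <= adm["discharge_time"])
    )

    # Long length of stay: at least seven days in the inpatient setting.
    adm["long_los"] = (adm["discharge_time"] - adm["admit_time"]) >= pd.Timedelta(days=7)

    # Thirty-day readmission: the patient's next admission starts within 30 days
    # (anchored here at discharge, which is an assumption on our part).
    next_admit = adm.groupby("patient_id")["admit_time"].shift(-1)
    adm["readmit_30d"] = (next_admit - adm["discharge_time"]) <= pd.Timedelta(days=30)
    return adm
```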

Feature Groups and Feature Categories: Features were extracted from the EMR using the DEPLOYR framework.25 All of the features derived from the structured data are grouped into seven feature groups based on the data source and location in the electronic medical record: demographics, lab results, vital signs, diagnosis codes, procedures, medications, and lab orders. These feature groups are further aggregated into two feature categories: pathophysiological and process-oriented. The pathophysiological feature category encompasses features which directly measure a patient’s current state at the time of recording. This includes demographics, lab results, and vital signs. For this study, demographics encompassed age and sex.

In contrast, the process-oriented feature category encompasses features whose raw representation in the electronic medical record pertain to dynamic clinical and recording practices. Process-oriented features reflect the model of healthcare as a process influenced by the practices and policies of stakeholders and rely on varied ontologies for their recording.19 These include diagnosis codes, procedures, medications, and lab orders. The classification of feature groups into feature categories and the number of features per feature group are shown in Table 2.

Table 2:

Classification of features and number of features across all clinical cohorts

| Feature Group | Feature Category | Inpatient Mortality | Long Length of Stay | Thirty-Day Readmission |
| --- | --- | --- | --- | --- |
| Demographics | Pathophysiological | 4 | 4 | 4 |
| Lab Results | Pathophysiological | 29 | 29 | 29 |
| Vital Signs | Pathophysiological | 11 | 11 | 11 |
| Diagnosis Codes | Process-Oriented | 8,726 | 8,701 | 8,933 |
| Procedures | Process-Oriented | 539 | 543 | 549 |
| Lab Orders | Process-Oriented | 1,961 | 1,967 | 2,250 |
| Medications | Process-Oriented | 11,487 | 11,219 | 13,921 |
| Total | | 22,757 | 22,474 | 25,697 |

Feature representation: For each task, we constructed a timeline of events occurring before the index time using the structured data available within the electronic medical record. This data contains both categorical and numerical elements. Categorical elements included diagnosis codes, procedure orders, lab orders, medication orders, and sex. Numerical elements included vital signs, lab results, and patient age. Age and sex were combined into the “demographics” feature group. Besides sex, all numerical features are in the pathophysiological feature group. All categorical features are in the process-oriented feature group, owing to their representation in the electronic medical record through specific and heterogeneous ontologies. All diagnosis codes within a patient record were considered. All medication, procedure, and lab orders within 28 days before the index time were considered. All vital signs within 3 days before the index time and lab results within 14 days before the index time were considered.
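A minimal sketch of these per-group lookback windows is shown below; it assumes long-format event tables with columns patient_id, time, feature, and value, which are illustrative names rather than the underlying STARR schema.

```python
import pandas as pd

# Lookback window per feature group, as described above (None = entire patient record).
LOOKBACK = {
    "vital_signs": pd.Timedelta(days=3),
    "lab_results": pd.Timedelta(days=14),
    "medications": pd.Timedelta(days=28),
    "procedures": pd.Timedelta(days=28),
    "lab_orders": pd.Timedelta(days=28),
    "diagnosis_codes": None,
}

def window_events(events: pd.DataFrame, index_time: pd.Timestamp, group: str) -> pd.DataFrame:
    """Keep only the events of one feature group that fall inside its lookback window."""
    before_index = events[events["time"] < index_time]
    window = LOOKBACK[group]
    if window is None:
        return before_index
    return before_index[before_index["time"] >= index_time - window]
```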

To construct the feature representation, numerical features were represented by a set of summary statistics pertaining to each patient observation and feature: minimum, maximum, first, last, mean, standard deviation, and slope. Each observation-feature group’s values were then scaled to have a mean of 0 using scikit-learn’s StandardScaler. Values were imputed using scikit-learn’s SimpleImputer. Categorical variables were represented using a counts-based (“bag-of-words”) representation normalized by term frequency–inverse document frequency (TF-IDF).27 The summary statistic representation of numerical features and the counts-based representation of categorical features were concatenated together to form the final feature representation used as input to the model.
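A minimal sketch of this representation is given below, assuming numeric features arrive as per-encounter lists of values and categorical features as whitespace-joined code strings; the function names, shapes, and preprocessing details are illustrative rather than the DEPLOYR implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

def summarize(values):
    """Summary statistics (min, max, first, last, mean, SD, slope) for one numeric feature."""
    if len(values) == 0:
        return [np.nan] * 7
    v = np.asarray(values, dtype=float)
    slope = np.polyfit(np.arange(len(v)), v, 1)[0] if len(v) > 1 else 0.0
    return [v.min(), v.max(), v[0], v[-1], v.mean(), v.std(), slope]

def build_matrix(numeric_lists, categorical_docs):
    # numeric_lists: per encounter, a list of per-feature value lists.
    num = np.array([[s for feat in enc for s in summarize(feat)] for enc in numeric_lists])
    num = StandardScaler().fit_transform(num)             # NaNs are ignored during fit
    num = SimpleImputer(strategy="mean").fit_transform(num)

    # categorical_docs: one whitespace-joined "document" of codes per encounter.
    cat = TfidfVectorizer(token_pattern=r"\S+").fit_transform(categorical_docs)

    # Concatenate the numeric summary statistics with the TF-IDF counts representation.
    return hstack([csr_matrix(num), cat]).tocsr()
```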

The separate representations of numerical and counts-based features roughly correspond to the division between pathophysiological and process-oriented feature categories. Aside from categorical demographic data, the pathophysiological features are numerical. This separation arises because pathophysiological feature groups record clinically informative numerical values, while process-oriented feature groups represent clinical variables through differing ontologies. By using counts-based representations for process-oriented features and summary statistics-based representations for pathophysiological features, we aimed to disentangle the feature categories as much as possible.

Training and evaluation: All cohorts were constructed by sampling 2,000 random observations from each year between 2009 and 2021. For all tasks, the training set comprised the 2009 observations, and the test set spanned 2010 to 2021. The validation set was set to 15% of the training set. This one-year training regime was chosen to demonstrate a scenario where models are trained on historical data and deployed without model revision or recalibration.22 The prediction time (index time) for each task is the start of an admission.

For each task, we train 2^N − 1 LightGBM gradient-boosted decision tree models,28 each using a unique non-empty subset of the extracted feature groups, where N is the number of extracted feature groups. We then evaluate each model by calculating Shapley value contributions of each feature group for the following metrics: the initial area under the receiver operating characteristic curve (AUROC), and the full AUROC difference, i.e. the difference in AUROC between the initial and final test set years. We normalize these measures against the sum of the absolute Shapley values, yielding a feature group’s scaled contribution to the metric. We bootstrap the test set 1000 times to estimate the 95% confidence interval for each Shapley value. A graphical flowchart of the training and evaluation process is shown in Figure 1.
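A sketch of this subset-training loop follows; the train_blocks and test_by_year structures, as well as the LightGBM hyperparameters, are illustrative assumptions rather than the study's exact pipeline.

```python
from itertools import combinations

import lightgbm as lgb
from scipy.sparse import hstack
from sklearn.metrics import roc_auc_score

def train_all_subsets(train_blocks, y_train, test_by_year):
    """train_blocks: {group: sparse feature block}; test_by_year: {year: ({group: block}, labels)}."""
    groups = sorted(train_blocks)
    results = {}  # frozenset of feature-group names -> metrics
    for k in range(1, len(groups) + 1):
        for subset in combinations(groups, k):
            X_train = hstack([train_blocks[g] for g in subset]).tocsr()
            model = lgb.LGBMClassifier(n_estimators=200)
            model.fit(X_train, y_train)

            # AUROC for every test year (2010-2021 in the study).
            aurocs = {}
            for year, (blocks, y) in test_by_year.items():
                X_test = hstack([blocks[g] for g in subset]).tocsr()
                aurocs[year] = roc_auc_score(y, model.predict_proba(X_test)[:, 1])

            years = sorted(aurocs)
            results[frozenset(subset)] = {
                "initial_auroc": aurocs[years[0]],
                # Full AUROC difference: initial minus final year (positive = deterioration).
                "full_auroc_difference": aurocs[years[0]] - aurocs[years[-1]],
            }
    return results
```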

Figure 1: Flowchart of the method for calculating Shapley values. After extracting N feature groups, 2^N − 1 models are trained, accounting for all non-empty combinations of feature groups. Each model is evaluated on metrics such as initial area under the receiver operating characteristic curve (AUROC) and full AUROC difference. Shapley values are then calculated for each feature group.

Shapley Values: Shapley values are derived from cooperative game theory to assign value based on a “player’s” effective marginal contribution to all possible coalitions of players.29 The Shapley value for player i is calculated as:

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,(N - |S| - 1)!}{N!} \left( v(S \cup \{i\}) - v(S) \right)$$

where S ranges over all subsets of F that exclude player i, F is the set of all players, N is the total number of players, and v(S) is the value achieved by subset S. In this setting, F is the set of feature groups, and v(S) is the relevant performance metric obtained by the model trained on the feature groups in S. While approximations of Shapley values have been used in the machine learning literature to enhance model interpretability (e.g. the contribution of features towards a prediction),30 we use exact Shapley values to attribute the marginal contribution of feature groups and feature categories to initial performance and robustness over time.
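The exact computation can be written directly from the formula above; in the sketch below, v maps frozensets of feature-group names to a metric value, and the value assigned to the empty coalition (e.g. 0.5 for chance-level AUROC, or 0 for the AUROC difference) is an assumption we make explicit rather than something specified in the text.

```python
from itertools import combinations
from math import factorial

def shapley_values(v, groups, empty_value=0.0):
    """Exact Shapley values. v: {frozenset(subset of groups): metric}; returns {group: value}."""
    n = len(groups)
    phi = {}
    for g in groups:
        others = [x for x in groups if x != g]
        total = 0.0
        for k in range(len(others) + 1):
            for subset in combinations(others, k):
                s = frozenset(subset)
                weight = factorial(len(s)) * factorial(n - len(s) - 1) / factorial(n)
                v_s = v[s] if s else empty_value  # value of the empty coalition is assumed
                total += weight * (v[s | {g}] - v_s)
        phi[g] = total
    return phi

def scaled_contribution(phi):
    """Normalize by the sum of absolute Shapley values (the 'scaled contribution')."""
    denom = sum(abs(x) for x in phi.values())
    return {g: x / denom for g, x in phi.items()}
```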

Shapley values were estimated using 1000 bootstrap iterations. The 95% confidence interval from the generated distribution was then used to estimate the uncertainty of each Shapley value. To estimate the Shapley values of feature categories, the bootstrapped distributions of feature groups within the specific feature category were summed, and their Shapley values were averaged based on the number of features that were in that category. The mean and the 95% confidence interval of this averaged Shapley value distribution was then calculated, and a two-tailed p-value was used to test for significance at p < 0.05.
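A sketch of this aggregation step is shown below; boot_phi is assumed to hold one Shapley value per feature group per bootstrap iteration, and the within-category averaging convention (summing the group distributions, then dividing by the number of groups in the category) is our reading of the description above.

```python
import numpy as np

# Feature-group-to-category mapping following Table 2.
CATEGORY = {
    "demographics": "pathophysiological",
    "lab_results": "pathophysiological",
    "vital_signs": "pathophysiological",
    "diagnosis_codes": "process_oriented",
    "procedures": "process_oriented",
    "lab_orders": "process_oriented",
    "medications": "process_oriented",
}

def category_shapley(boot_phi, groups, category):
    """boot_phi: array of shape (n_bootstrap, n_groups) of per-group Shapley values."""
    idx = [i for i, g in enumerate(groups) if CATEGORY[g] == category]
    # Sum the bootstrapped group distributions, then average over groups in the category.
    dist = boot_phi[:, idx].sum(axis=1) / len(idx)
    lo, hi = np.percentile(dist, [2.5, 97.5])  # 95% confidence interval
    return dist.mean(), (lo, hi)
```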

Results

Temporal Dataset Shift and Performance Over Time: First, we measured AUROC over time on the test set for the models trained with the full feature set for each task. The models exhibit performance decay over time, with drops of 0.03 to 0.13 AUROC across the test set (Fig. 2). AUROC decreased from 0.76 to 0.63 for inpatient mortality, 0.65 to 0.62 for long length of stay, and 0.70 to 0.61 for thirty-day readmission. Understandably, the initial AUROC is lower than known performance baselines31 for these tasks due to the small training set and the relatively large test set spanning over a decade.

Figure 2: AUROC over time on the test set for each clinical prediction task. The models evaluated were trained with the full feature set for each task using the one-year training regime.

Contributions by Feature Group: To estimate the contribution of different feature groups to initial model performance, we calculated the Shapley value of each feature group for each clinical prediction task (Fig. 3). The scaled Shapley values for initial AUROC reveal that the feature group contributing the most to initial performance differs by clinical prediction task. For inpatient mortality, lab results have the highest Shapley value contribution to the initial AUROC (mean: 0.226, 95% CI: (0.195, 0.258)). In contrast, for the long length of stay and thirty-day readmission tasks, the feature groups with the highest contributions are lab orders (mean: 0.212, 95% CI: (0.200, 0.225)) and medications (mean: 0.173, 95% CI: (0.160, 0.185)), respectively.

Figure 3: Shapley values for initial AUROC and full AUROC difference on the entire test set for each feature group and each clinical prediction task.

Additionally, we find that some pathophysiological feature groups provide robustness to performance decay over time, represented by a negative Shapley value for the full AUROC difference. Since the full AUROC difference captures deterioration of model performance over time, a negative Shapley value indicates that the feature group resists this deterioration and therefore aids in its mitigation. Vital signs have negative Shapley values for both the inpatient mortality (mean: -0.268, 95% CI: (-0.386, -0.155)) and long length of stay tasks (mean: -0.294, 95% CI: (-0.385, -0.202)). For thirty-day readmission, the feature group with the most negative Shapley value was lab results (mean: -0.141, 95% CI: (-0.287, 0.008)). In general, feature groups that contributed more to the full AUROC difference tended to be process-oriented. For inpatient mortality, diagnosis codes had the highest Shapley value contribution to the full AUROC difference (mean: 0.185, 95% CI: (0.068, 0.296)). For long length of stay, lab orders contributed the most to the full AUROC difference (mean: 0.249, 95% CI: (0.158, 0.336)). For thirty-day readmission, medications contributed the most to the full AUROC difference (mean: 0.276, 95% CI: (0.133, 0.419)). We explore these trends further by aggregating Shapley values for feature groups into their feature categories.

Contributions by Feature Category: After calculating Shapley values across all feature groups, we averaged the Shapley values across feature groups according to their respective feature categories (Fig. 4). For initial performance, we find that the feature category which contributes the most is dependent on the task. For example, in inpatient mortality, pathophysiological features contributed significantly more to initial AUROC compared to process-oriented features (mean: 0.116, 95% CI: (0.103, 0.131)). Process-oriented features significantly contributed more to initial AUROC performance than pathophysiological features in the long length of stay (mean: 0.099, 95% CI: (0.093, 0.103)) and thirty day readmission tasks (mean: 0.108, 95% CI: (0.103, 0.114)).

Figure 4: Averaged Shapley values for initial AUROC and full AUROC difference across the entire test set for each feature category and each clinical prediction task.

For the Shapley value contributions to the full AUROC difference, pathophysiological features had a significantly lower Shapley value compared to process-oriented features. For long length of stay, the pathophysiological feature category’s average Shapley value and the corresponding 95% confidence interval were below 0 (mean: -0.017, 95% CI: (-0.025, -0.009)), implying that for this task, pathophysiological features mitigated performance deterioration.

Discussion

In this study, we proposed and evaluated an interpretable, model-agnostic, and task-independent application and extension of Shapley values to assess how particular feature groups and categories found in the EMR differentially contribute to model performance and its preservation over time across three common clinical ML tasks. Our results indicate that while feature contributions to initial performance vary, pathophysiological features mitigate model deterioration due to temporal dataset shift. This study provides initial evidence that features more reflective of a patient’s state confer greater robustness than variables more subject to the complex processes of medical practice.

Across our three tasks, lab results, lab orders, and medications contributed the most to initial performance. The inclusion of vital signs as a feature group sustained model performance the most over time in the inpatient mortality and long length of stay tasks, while lab results sustained model performance the most for predicting thirty-day readmission. Previous studies have shown that lab results are informative and predictive of inpatient mortality and thirty-day readmission across specific patient cohorts.20,32–34 Models incorporating vital signs have been shown to predict inpatient mortality and long length of stay.35–37 Several studies have also identified polypharmacy and complex drug regimens as strong predictors of unplanned readmissions.38,39 However, it is difficult to compare feature groups across previous studies because they limit the feature groups used in their models a priori, which skews the relative importance of each feature group for model performance. Our results indicate that while feature contributions toward initial performance are heterogeneous and task-dependent, pathophysiological features such as vital signs and lab results mitigate temporal discrimination deterioration.

The method presented in this study provides insights into the dynamics of features and their influence on initial and sustained performance amid temporal dataset shift. By leveraging the structure in how electronic medical records are organized, it provides an explainable link between feature groups and categories and model performance. This has applications for model development and continuous monitoring in clinical practice, because knowing which feature groups contribute to performance or mitigate performance deterioration can inform which features are selected to best avoid potentially costly and dangerous retraining events.13,25,40 The method can thus guide future model development under conditions of dataset shift, as it provides a retrospective assessment of which features are important without prior removal or automated selection of features. Additionally, by building on existing technical frameworks, this method can be combined with efforts to monitor clinical models in deployment by jointly accounting for features and their impact on model performance, rather than performing feature selection purely on the basis of existing collinearity.

Additionally, this method provides a way to measure the true marginal contribution of each feature group: the Shapley value computation considers all subsets of features and utilizes the performance of all 2^N − 1 models, each trained on a specific subset of feature groups. This contrasts with other methods like ablation studies, in which individual features or subsets are left out. Due to the substantial collinearity between feature types in the EMR, ablation studies that only measure the impact of a feature group conditioned on the presence of others fail to reflect its true marginal contribution. Furthermore, many clinical ML workflows limit the features a model is trained on via expert knowledge or automated feature selection algorithms. However, knowing a priori which features contribute most to model performance and make models robust to temporal dataset shift is a difficult challenge that cannot be addressed without specifically accounting for changes across input features, clinical outcomes, and desired performance metrics.

Limitations and Future Work: There are several limitations to the study. First, we evaluated our method on three commonly researched clinical prediction tasks. Though this limits the generalizability of our findings, our method is task-agnostic and can be used on a case-by-case basis by ML practitioners evaluating models for deployment.

Second, we limit our analysis to structured EMR data, excluding clinical notes and temporally-derived features (i.e. the recording of specific intervals). While this similarly limits the scope of our analysis, the use of structured data enables the immediate assignment of features into distinct feature groups, as opposed to representations that combine the data into source-agnostic forms. Additionally, the development of explainable empirical methods for assessing the contributions of temporally-derived features will be necessary to maximize the utility of EMR-based methods and to aid model interpretability.20

Third, we limit our analysis to a single model performance measure, AUROC, across two contexts: initial model performance and performance over time. While not explicitly done in this study, our approach generalizes to any performance measure, such as calibration, for estimating Shapley values that attribute predictive ability and temporal robustness to particular feature groups and categories.7,8,41 This even includes estimating Shapley values that attribute performance gains and temporal robustness over subpopulations for fairness evaluations, which is important given the increasing recognition that in-depth evaluations are necessary to equitably develop and monitor clinical decision support algorithms while preventing the amplification of bias.12,40,42–46 Additionally, future work may take a more integrated approach by applying this method across a variety of model architectures and clinically relevant prediction tasks, allowing for a more robust evaluation of feature contributions to initial and sustained model performance under temporal dataset shift.

Lastly, due to the need to train 2^N − 1 models, one for each non-empty subset of feature groups, the method can be computationally expensive, especially with a large feature set and a large number of patients across time. Nevertheless, model training is massively parallelizable, and computational resources can be saved by calculating Shapley values on a small sample of patients over a smaller, more recent time window (e.g. the past month or the past year). This design would be more appropriate for prospective validations of the method, as opposed to the retrospective design of this study, or for more focused assessments of specific patient cohorts or subpopulations of interest.
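As an illustration of this point, the subset-training step sketched earlier could be parallelized with a generic tool such as joblib; this is an assumption on our part rather than the study's implementation, and fit_and_score is a hypothetical stand-in for the per-subset training and evaluation routine.

```python
from itertools import chain, combinations

from joblib import Parallel, delayed

def all_nonempty_subsets(groups):
    """Enumerate the 2^N - 1 non-empty subsets of feature groups."""
    return chain.from_iterable(combinations(groups, k) for k in range(1, len(groups) + 1))

# Hypothetical usage: fit_and_score(subset) trains and evaluates one model per subset.
# results = Parallel(n_jobs=-1)(
#     delayed(fit_and_score)(subset) for subset in all_nonempty_subsets(groups)
# )
```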

Conclusions

In this study, we develop a method that extends Shapley values to determine the contribution of feature groups and feature categories to model performance and robustness to temporal dataset shift. Findings from this retrospective study reveal that the inclusion of pathophysiological features mitigates temporal discrimination deterioration, making models more robust to temporal dataset shift. The method is broadly applicable for ML practitioners and clinicians interested in developing and deploying healthcare ML solutions that exhibit greater robustness to performance decay over time, and can be applied across several use cases and modes of evaluation.


References

  • [1].Jiang F, Jiang Y, Zhi H, Dong Y, Li H, Ma S, et al. Artificial intelligence in healthcare: past, present and future. Stroke and vascular neurology. 2017;2(4) doi: 10.1136/svn-2017-000101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [2].Rajkomar A, Oren E, Chen K, Dai AM, Hajaj N, Hardt M, et al. Scalable and accurate deep learning with electronic health records. NPJ digital medicine. 2018;1(1):18. doi: 10.1038/s41746-018-0029-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [3].Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2019 Nov. p. kxz041. [DOI] [PubMed]
  • [4].Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, Zittrain J, et al. The Clinician and Dataset Shift in Artificial Intelligence. New England Journal of Medicine. 2021 Jul;385(3):283–286. doi: 10.1056/NEJMc2104626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [5].Quinonero-Candela J, Sugiyama M, Schwaighofer A, Lawrence ND. Dataset shift in machine learning. Mit Press; 2008. [Google Scholar]
  • [6].Jung K, Shah NH. Implications of non-stationarity on predictive modeling using EHRs. Journal of biomedical informatics. 2015;58:168–74. doi: 10.1016/j.jbi.2015.10.006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [7].Guo LL, Pfohl SR, Fries J, Posada J, Fleming SL, Aftandilian C, et al. Systematic review of approaches to preserve machine learning performance in the presence of temporal dataset shift in clinical medicine. Applied clinical informatics. 2021;12(04):808–15. doi: 10.1055/s-0041-1735184. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [8].Davis SE, Lasko TA, Chen G, Siew ED, Matheny ME. Calibration drift in regression and machine learning models for acute kidney injury. Journal of the American Medical Informatics Association. 2017 Nov;24(6):1052–1061. doi: 10.1093/jamia/ocx030. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [9].Guo LL, Steinberg E, Fleming SL, Posada J, Lemmon J, Pfohl SR, et al. EHR foundation models improve robustness in the presence of temporal distribution shift. Scientific Reports. 2023 Mar;13(11):3767. doi: 10.1038/s41598-023-30820-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [10].Wong A, Cao J, Lyons PG, Dutta S, Major VJ, Ötleş E, et al. Quantification of Sepsis Model Alerts in 24 US Hospitals Before and During the COVID-19 Pandemic. JAMA Network Open. 2021 Nov;4(11):e2135286. doi: 10.1001/jamanetworkopen.2021.35286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [11].Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, et al. External Validation of a Widely Implemented Proprietary Sepsis Prediction Model in Hospitalized Patients. JAMA Internal Medicine. 2021 Aug;181(8):1065–1070. doi: 10.1001/jamainternmed.2021.2626. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [12].Davis SE, Walsh CG, Matheny ME. Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings. Frontiers in Digital Health. 2022;4 doi: 10.3389/fdgth.2022.958284. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [13].Feng J, Phillips RV, Malenica I, Bishara A, Hubbard AE, Celi LA, et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. npj Digital Medicine. 2022 May;5(11):1–9. doi: 10.1038/s41746-022-00611-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [14].Lenert MC, Matheny ME, Walsh CG. Prognostic models will be victims of their own success, unless. Journal of the American Medical Informatics Association. 2019 Dec;26(12):1645–1650. doi: 10.1093/jamia/ocz145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [15].Adam GA, Chang CHK, Haibe-Kains B, Goldenberg A. Hidden Risks of Machine Learning Applied to Healthcare: Unintended Feedback Loops Between Models and Future Data Causing Model Degradation. Proceedings of the 5th Machine Learning for Healthcare Conference. PMLR. 2020. pp. 710–731.
  • [16].Adam GA, Chang CHK, Haibe-Kains B, Goldenberg A. Error Amplification When Updating Deployed Machine Learning Models. Proceedings of the 7th Machine Learning for Healthcare Conference. PMLR. 2022. pp. 715–740.
  • [17].Xiao C, Choi E, Sun J. Opportunities and challenges in developing deep learning models using electronic health records data: a systematic review. Journal of the American Medical Informatics Association. 2018 Oct;25(10):1419–1428. doi: 10.1093/jamia/ocy068. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [18].Lemmon J, Guo LL, Posada J, Pfohl SR, Fries J, Fleming SL, et al. Evaluation of Feature Selection Methods for Preserving Machine Learning Performance in the Presence of Temporal Dataset Shift in Clinical Medicine. Methods of Information in Medicine. 2023 May;62(1/2):60–70. doi: 10.1055/s-0043-1762904. [DOI] [PubMed] [Google Scholar]
  • [19].Hripcsak G, Albers DJ. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association: JAMIA. 2013 Jan;20(1):117–121. doi: 10.1136/amiajnl-2012-001145. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [20].Agniel D, Kohane IS, Weber GM. Biases in electronic health record data due to processes within the healthcare system: retrospective observational study. Bmj. 2018;361 doi: 10.1136/bmj.k1479. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [21].Nestor B, McDermott M, Chauhan G, Naumann T, Hughes MC, Goldenberg A, et al. Rethinking clinical prediction: why machine learning must consider year of care and feature aggregation. arXiv preprint arXiv:181112583. 2018.
  • [22].Nestor B, McDermott MB, Boag W, Berner G, Naumann T, Hughes MC, et al. Feature robustness in non-stationary health records: caveats to deployable model performance in common clinical machine learning tasks. Machine Learning for Healthcare Conference. PMLR. 2019. pp. 381–405.
  • [23].Gong JJ, Naumann T, Szolovits P, Guttag JV. Predicting Clinical Outcomes Across Changing Electronic Health Record Systems. Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Halifax, NS, Canada: ACM; 2017. pp. 1497–1505. [Google Scholar]
  • [24].Steinberg E, Jung K, Fries JA, Corbin CK, Pfohl SR, Shah NH. Language models are an effective representation learning technique for electronic health record data. Journal of Biomedical Informatics. 2021 Jan;113:103637. doi: 10.1016/j.jbi.2020.103637. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [25].Corbin CK, Maclay R, Acharya A, Mony S, Punnathanam S, Thapa R, et al. DEPLOYR: a technical framework for deploying custom real-time machine learning models into the electronic medical record. Journal of the American Medical Informatics Association. 2023 Aug;30(9):1532–1542. doi: 10.1093/jamia/ocad114. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [26].Datta S, Posada J, Olson G, Li W, O’Reilly C, Balraj D, et al. A new paradigm for accelerating clinical data science at Stanford Medicine. arXiv preprint arXiv:200310534. 2020.
  • [27].Ramos JE. Using TF-IDF to Determine Word Relevance in Document Queries. 2003. Available from: https://www.semanticscholar.org/paper/Using-TF-IDF-to-Determine-Word-Relevance-in-Queries-Ramos/b3bf6373ff41a115197cb5b30e57830c16130c2c.
  • [28].Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems. 2017;30 [Google Scholar]
  • [29].Shapley LS, et al. A value for n-person games. 1953.
  • [30].Lundberg SM, Lee SI. A unified approach to interpreting model predictions. Advances in neural information processing systems. 2017;30 [Google Scholar]
  • [31].Pfohl SR, Zhang H, Xu Y, Foryciarz A, Ghassemi M, Shah NH. A comparison of approaches to improve worst-case predictive model performance over patient subpopulations. Scientific Reports. 2022 Feb;12(11):3254. doi: 10.1038/s41598-022-07167-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [32].Blanco N, Leekha S, Magder L, Jackson SS, Tamma PD, Lemkin D, et al. Admission Laboratory Values Accurately Predict In-hospital Mortality: a Retrospective Cohort Study. Journal of General Internal Medicine. 2020 Mar;35(3):719–723. doi: 10.1007/s11606-019-05282-2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [33].Zhang Z, Goyal H, Lange T, Hong Y. Healthcare processes of laboratory tests for the prediction of mortality in the intensive care unit: a retrospective study based on electronic healthcare records in the USA. BMJ Open. 2019 Jun;9(6):e028101. doi: 10.1136/bmjopen-2018-028101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [34].Golas SB, Shibahara T, Agboola S, Otaki H, Sato J, Nakae T, et al. A machine learning model to predict the risk of 30-day readmissions in patients with heart failure: a retrospective analysis of electronic medical records data. BMC Medical Informatics and Decision Making. 2018 Dec;18(11):1–17. doi: 10.1186/s12911-018-0620-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [35].Alghatani K, Ammar N, Rezgui A, Shaban-Nejad A. Predicting Intensive Care Unit Length of Stay and Mortality Using Patient Vital Signs: Machine Learning Model Development and Validation. Journal of Medical Internet Research. 2021 May;9:e21347. doi: 10.2196/21347. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [36].Ljunggren M, Castrén M, Nordberg M, Kurland L. The association between vital signs and mortality in a retrospective cohort study of an unselected emergency department population. Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine. 2016 Mar;24(1):21. doi: 10.1186/s13049-016-0213-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [37].Candel BG, Duijzer R, Gaakeer MI, ter Avest E, Sir Ö, Lameijer H, et al. The association between vital signs and clinical outcomes in emergency department patients of different age categories. Emergency Medicine Journal. 2022 Dec;39(12):903–911. doi: 10.1136/emermed-2020-210628. [DOI] [PubMed] [Google Scholar]
  • [38].Picker D, Heard K, Bailey TC, Martin NR, LaRossa GN, Kollef MH. The number of discharge medications predicts thirty-day hospital readmission: a cohort study. BMC Health Services Research. 2015 Jul;15(1):282. doi: 10.1186/s12913-015-0950-9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [39].Pereira F, Verloo H, Zhivko T, Giovanni SD, Meyer-Massetti C, Gunten Av, et al. Risk of 30-day hospital readmission associated with medical conditions and drug regimens of polymedicated, older inpatients discharged home: a registry-based cohort study. BMJ Open. 2021 Jul;11(7):e052755. doi: 10.1136/bmjopen-2021-052755. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [40].Dockès J, Varoquaux G, Poline JB. Preventing dataset shift from breaking machine-learning biomarkers. GigaScience. 2021 Sep;10(9):giab055. doi: 10.1093/gigascience/giab055. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [41].Davis SE, Greevy RA, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. Journal of Biomedical Informatics. 2020 Dec;112:103611. doi: 10.1016/j.jbi.2020.103611. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [42].Ji CX, Alaa AM, Sontag D. Large-Scale Study of Temporal Shift in Health Insurance Claims. Conference on Health, Inference, and Learning. PMLR. 2023. pp. 243–78.
  • [43].Ghassemi M, Nsoesie EO. In medicine, how do we machine learn anything real? Patterns. 2022 Jan;3(1):100392. doi: 10.1016/j.patter.2021.100392. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [44].Dankwa-Mullan I, Scheufele EL, Matheny ME, Quintana Y, Chapman WW, Jackson G, et al. A Proposed Framework on Integrating Health Equity and Racial Justice into the Artificial Intelligence Development Lifecycle. Journal of Health Care for the Poor and Underserved. 2021;32(2):300–317. [Google Scholar]
  • [45].Rajkomar A, Hardt M, Howell MD, Corrado G, Chin MH. Ensuring Fairness in Machine Learning to Advance Health Equity. Annals of internal medicine. 2018 Dec;169(12):866–872. doi: 10.7326/M18-1990. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • [46].Robinson WR, Renson A, Naimi AI. Teaching yourself about structural racism will improve your machine learning. Biostatistics. 2019 Nov;21(2):339–344. [DOI] [PMC free article] [PubMed] [Google Scholar]

