The Journal of Infectious Diseases. 2024 Jul 12;230(5):1073–1082. doi: 10.1093/infdis/jiae348

Applications of Machine Learning on Electronic Health Record Data to Combat Antibiotic Resistance

Samuel E Blechman 1, Erik S Wright 2,3,✉
PMCID: PMC11565868  PMID: 38995050

Abstract

There is growing excitement about the clinical use of artificial intelligence and machine learning (ML) technologies. Advancements in computing and the accessibility of ML frameworks enable researchers to easily train predictive models using electronic health record data. However, several practical factors must be considered when employing ML on electronic health record data. We provide a primer on ML and approaches commonly taken to address these challenges. To illustrate how these approaches have been applied to address antimicrobial resistance, we review the use of electronic health record data to construct ML models for predicting pathogen carriage or infection, optimizing empiric therapy, and aiding antimicrobial stewardship tasks. ML shows promise in promoting the appropriate use of antimicrobials, although clinical deployment is limited. We conclude by describing the potential dangers of, and barriers to, implementation of ML models in the clinic.

Keywords: antibiotic resistance, antimicrobial stewardship, artificial intelligence, electronic health record, machine learning


The abundance of clinical publications using machine learning models places demands on clinicians seeking to properly critique and interpret these results. This review illustrates best practices and pitfalls inherent to handling electronic health record data and employing machine learning.


The rise in popular media coverage of artificial intelligence (AI) in recent years has led many to believe that AI will revolutionize every aspect of our society. In particular, the use of AI in health care has garnered attention for its potential to enhance clinical decision making [1], automatically generate clinical text (eg, radiology reports [1]), and assist biomedical research (eg, drug discovery [2]). While AI might be able to fulfill such hopes in the future, its current implementations are constrained by several practical factors impeding translatability, as well as the potential legal and ethical complications of its application to clinical data [3]. Further progress in applying AI in the clinic would be facilitated by a shared understanding of the merits and challenges inherent to the health care setting.

AI is an imprecise term that encompasses any use of algorithms for performing tasks that approach human-level intelligence, including planning, reasoning, language processing, and perception [4]. Machine learning (ML) is a subset of AI that uses data and algorithms to build models that “learn” to complete a task (eg, classification) through mathematical optimization. There is confusion in the medical community about AI and ML [5], as these terms are often used interchangeably. Despite this confusion, some simple ML algorithms have been used by clinical researchers for decades (eg, logistic regression). The use of more complicated ML models has gained traction in data-intensive fields, such as pathology and radiology, with some models matching expert-level performance in diagnostic tasks that utilize whole-slide images [6]. The number of studies applying ML to clinical data is growing [7], in part due to the increasing volume and standardization of medical data, breakthroughs in ML research, and the accessibility of ML frameworks (eg, user-friendly ML libraries [8]).

Electronic health record (EHR) data are a particularly useful form of clinical data, as they capture health care interactions at the patient and visit level. ML offers the hope of automatically detecting and harnessing novel insights from EHR data that can guide improvements to patient care. Antimicrobial stewardship teams, which seek to lessen the burden of antimicrobial resistance (AMR) in the clinic, may benefit from the application of ML to EHR data. In this article, we review literature that describes ML models trained on EHR data for combating AMR. In the next section, Practical Crash Course in ML, we introduce ML and review practical considerations for leveraging EHR data for ML. In the section thereafter, Case Studies in the Application of EHR Data and ML to Combat AMR, we describe studies that used ML trained on antibiotic resistance variables present in the EHR. Finally, in Conclusions and Future Directions, we describe barriers to clinical implementation, summarize the current state of clinical ML models, and discuss future directions that the field may take toward causal analysis.

PRACTICAL CRASH COURSE IN ML

The strengths and limitations of ML can best be understood in contrast to mechanistic models. Mechanistic models are useful when the causal structure of a system is known. They require very few data to calibrate model parameters and can make predictions far outside the range in which data were collected. In contrast, ML models require a large amount of “training” data to “learn” the statistical relationship between input variables (ie, features, covariates) and output values (ie, outcomes, labels). Importantly, ML model performance may suffer outside the scope of training data. ML models also differ from expert systems that use manually curated rules to assign labels to samples, as ML algorithms learn rules directly from data. The goal of ML is to use a learned model to predict outcomes for unseen samples (eg, patient whose outcome is unknown) rather than explain the relationship between features and outcomes as done for purely statistical models.

Table 1 offers brief descriptions of 4 commonly used ML algorithms as well as their advantages and disadvantages. Many others exist and are covered elsewhere [4]. ML algorithms differ in their capacity to learn relationships between features and outcomes, interpretability (ie, ability of a human to understand why a model made a given prediction), as well as their advantages and disadvantages (Figure 1A). No matter the algorithm, the quality and quantity of training data are paramount for producing useful and reliable ML models. Curating training data is one of the key difficulties of ML.

Table 1.

A Nonexhaustive List of Machine Learning Algorithms/Models

Algorithm/Model: Logistic regression
Brief description: A statistical method that models the log odds of an outcome as a function of a linear combination of input variables. The simplest form has one parameter (coefficient) per input feature.
Advantages
  • Highly interpretable
  • Easily trained
  • Can yield novel relationships (“risk factors”) between features and outcomes
  • Methods exist (eg, Lasso regularization) to limit model overfitting
Disadvantage
  • Does not model interactions among input variables unless explicitly included

Algorithm/Model: Decision tree
Brief description: A flowchart-like tree structure in which each split is a learned “decision rule” that optimally separates samples with different outcome labels.
Advantages
  • Can learn nonlinear relationships between features and outcome and interactions among features
  • Rules are easily interpreted and visualized for simple, shallow trees
Disadvantages
  • Without limiting tree depth, decision trees can easily be overfit to training data
  • Sensitive to class imbalances

Algorithm/Model: Random forest
Brief description: An ensemble (ie, collection) of decision trees. Each tree is made from a subset of the data (ie, subset of the samples and/or features). The final prediction is made by combining all decision tree outputs.
Advantages
  • Have a high capacity for learning complex relationships
  • Generally, more resistant to overfitting than decision trees
Disadvantage
  • Less easily interpretable than decision trees, but methods exist to estimate feature importance

Algorithm/Model: Artificial neural network (aNN or NN)
Brief description: Composed of connected layers (ie, input, hidden, output) in which each layer is a set of nodes. Each node's input is a linear combination of multiple nodes in the previous layer. After application of an activation function, each node outputs a value to multiple nodes in the subsequent layer. Node-specific parameters are learned during model training.
Advantages
  • Can learn any arbitrary function between input features and outcomes due to combining linear mappings and nonlinear activation functions
  • Can perform multiclass classification
  • Can be modified to model sequential data (recurrent NNs) and spatial data (convolutional NNs)
Disadvantages
  • Black-box method that is difficult or impossible to interpret
  • Computationally expensive
  • Long training times
  • Generally, needs lots of training data to resist overfitting

Figure 1.

Algorithm and training data curation. A, Machine learning algorithms differ in their structure, complexity, and interpretability. B, Raw patient variables can be mined from electronic health record databases. Features may be selected from the available variables if they are deemed relevant to the predicted outcome. Additional features can be engineered from existing raw variables. C, Electronic health record–derived data may have extensive missing values. This can be handled by, for example, (i) removing observations with missing values or (ii) imputing missing values.

Challenges of Leveraging EHR Data for ML

EHR data are an attractive source for training data due to their quantity and real-world nature. However, their quality, accuracy, and meaning pose major limitations on secondary uses [9]. A major challenge of using EHR data is understanding when and how variables are reliable or meaningful for the application at hand. The first steps in curating a training data set from the EHR are defining inclusion criteria based on, for example, disease state or phenotype and then determining what a “sample” represents (eg, a hospital visit or a patient). Another key step is assigning outcome labels to each sample, which could include the occurrence of a clinical event, such as disease progression or development. Fields from the EHR can be used for inclusion criteria and outcome label assignment, such as diagnosis codes. Yet, the timing of diagnosis code entry may not correspond to disease onset [10]. Additionally, due to the influence of reimbursement policies on medical billing, many diagnosis codes have low sensitivity [11, 12]. For syndromic conditions such as sepsis, the optimal diagnostic criteria are the subject of ongoing debates among experts [13], and many studies used alternative definitions [14]. Biases in training data due to flawed cohort composition or inaccurate outcome labels may limit the applicability or performance of an ML model, particularly if the use of diagnosis codes changes over time or differs among institutions.

The retrospective nature of EHR data poses additional caveats to its use as training data, as changes in clinical care, definitions, and patient populations may cause data set “shifts” (ie, temporal inconsistencies). For example, an apparent increase in the rate of death from falls in one data set was due to transitioning to the ICD-10 coding system [15]. System-wide adoption of new EHR software can lead to dramatic data changes due to incompatibility between old and new systems (eg, see Johnson et al [16]). As another example, antibiotic susceptibility test results may not include the measured minimum inhibitory concentration. Revisions to clinical breakpoints will therefore lead to temporal inconsistencies in the definitions of resistance and susceptibility [17].
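The breakpoint example can be made concrete with a short sketch. It assumes that the measured minimum inhibitory concentration (MIC) was stored, in which case historical results can be relabeled under a single breakpoint; when only the susceptible/resistant label was stored, this relabeling is not possible. The function name and both breakpoint values are hypothetical and chosen purely for illustration; they are not actual clinical breakpoints.

```python
def interpret_mic(mic_mg_per_l, susceptible_max):
    """Label an isolate 'S' if its MIC is at or below the breakpoint, else 'R'.

    Simplification: real breakpoint tables may also define an
    intermediate category, which is ignored here.
    """
    return "S" if mic_mg_per_l <= susceptible_max else "R"


# Hypothetical breakpoints (mg/L) before and after a revision; illustrative only.
OLD_BREAKPOINT = 8.0
NEW_BREAKPOINT = 2.0

# Hypothetical stored MIC values for three isolates.
mics = [1.0, 4.0, 16.0]

labels_old = [interpret_mic(m, OLD_BREAKPOINT) for m in mics]
labels_new = [interpret_mic(m, NEW_BREAKPOINT) for m in mics]

# Isolates whose label changes purely because the definition changed.
relabelled = [m for m, a, b in zip(mics, labels_old, labels_new) if a != b]
```

In this toy case the isolate with an MIC of 4.0 switches from susceptible to resistant without any biological change, which is the kind of temporal inconsistency a model trained on stored labels would inherit.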

Some aspects of clinical care might be entirely missing from the EHR or not easily accessed by traditional methods. Unstructured EHR data contain a wealth of patient information in the form of imaging, signal (eg, electrocardiogram), and free-text data but require manual abstraction to derive useful variables. Researchers have recently begun to automatically extract features from progress notes, evaluation reports, and discharge summaries using natural language processing (NLP), which seeks to allow computers to “understand” language by using rules-based or ML approaches [4]. For example, Yao et al developed an NLP model that annotated patient eviction status from clinical notes [18]. Other researchers have used NLP to identify patients with a particular phenotype not captured by diagnosis codes alone [19].

EHR data will be missing when patients switch providers or get out-of-system care [20]. Madden et al combined insurance claims and EHR data and found that patients with psychiatric diagnoses had extensive missing psychiatric service data because they visited external institutions [21]. Drug exposure data may also be missing from the EHR when insurance claims are not submitted (for example, when patients pay out-of-pocket for drug costs) [22]. Many ML algorithms are unable to handle missing data, necessitating removal of patients or variables with missingness. Missing values can be imputed [23], but this topic is beyond the scope of this article. Importantly, the missingness of data is often not random, potentially resulting in biases upon removal or imputation.
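The two basic options named in Figure 1C, removing observations with missing values or imputing them, can be sketched as follows. This is a minimal illustration under stated assumptions: the patient records and feature names are hypothetical, and mean imputation is used only because it is the simplest choice, not because the article recommends it.

```python
from statistics import mean


def complete_cases(rows):
    """Keep only the rows in which every feature is observed (not None)."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows, feature):
    """Return copies of the rows with missing `feature` values set to the observed mean."""
    observed = [r[feature] for r in rows if r[feature] is not None]
    if not observed:
        raise ValueError(f"no observed values for {feature!r}")
    fill = mean(observed)
    return [{**r, feature: fill if r[feature] is None else r[feature]} for r in rows]


# Hypothetical records; None marks a value missing from the EHR.
patients = [
    {"age": 70, "creatinine": 1.5},
    {"age": 55, "creatinine": None},
    {"age": 40, "creatinine": 0.9},
]

dropped = complete_cases(patients)             # loses one patient
imputed = impute_mean(patients, "creatinine")  # keeps all three
```

Neither option is safe by default: if the value is missing for a reason related to the outcome, dropping rows changes the cohort and imputing the mean hides that signal.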

Model Overfitting and Evaluation

Ideally, an ML model learns robust and meaningful relationships between features and the outcome of interest such that the model will generalize to other data sources and perform well upon real-world implementation. However, models often become “overfit” to training data due to learning spurious correlations between features and outcome (noise). This may occur if the training data are very high dimensional (ie, many features) or the model contains many parameters. There are methods that attempt to limit model overfitting by penalizing model complexity (eg, Lasso or Ridge regularization [24]), but the topic is beyond the scope of this article. It was recently demonstrated that the apparent success of a neural network model that diagnosed COVID-19 from chest radiographs was due to the model learning noise in the image background rather than clinical characteristics [25]. Significant emphasis should be placed on applying domain knowledge (ie, clinical and biological) during model development (Figure 1B), requiring collaboration between ML researchers and clinicians.

Overfitting can occur when sample sizes are small—for instance, when cohorts are derived from the EHR system of a single institution. Combining EHR data across institutions is one solution but is complicated by differences in recording practices and data infrastructure [26]. Collaborative efforts are underway to collect and store patient data from millions of individuals of diverse backgrounds in a standardized format (ie, the Observational Medical Outcomes Partnership Common Data Model), including the databases All of Us [27] and Our Future Health [28]. However, ML models trained on heavily curated databases will be difficult to deploy in the clinic if the data at the time of prediction are not in the same format as the training data.

A key challenge of developing ML models is determining how well the model will perform in the real world. To estimate this, researchers may set aside a portion of the available labeled data for model testing (ie, test set or holdout set [24]). Model validation should mimic real-world deployment—for example, by withholding test data from later periods or different hospital sites. However, good performance in validation does not rule out overfitting to training data. Several metrics exist to evaluate model predictions on the test set (Figure 2). Model outputs can be “calibrated” during validation, allowing researchers to balance false positives and negatives according to their context-specific costs. The receiver operating characteristic (ROC) curve depicts the true- and false-positive rates across a range of model output cutoffs. The area under the ROC curve (AUROC or area under the curve) is often reported as the single metric describing performance.
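A temporal holdout and the AUROC can be sketched without any ML library. The sketch below assumes each sample is a dictionary with a hypothetical `year` field, and it computes the AUROC through its rank interpretation: the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
def temporal_split(samples, cutoff_year):
    """Train on samples before `cutoff_year`; test on that year and later."""
    train = [s for s in samples if s["year"] < cutoff_year]
    test = [s for s in samples if s["year"] >= cutoff_year]
    return train, test


def auroc(labels, scores):
    """Area under the ROC curve from pairwise comparisons of scores.

    labels: 1 for positive samples, 0 for negative samples.
    scores: model outputs, where higher means "more likely positive".
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical samples from three calendar years.
samples = [{"year": 2017}, {"year": 2018}, {"year": 2019}, {"year": 2019}]
train, test = temporal_split(samples, cutoff_year=2019)

# Hypothetical labels and model scores for four test samples.
example_auroc = auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

The pairwise form is quadratic in the number of samples, so it suits small illustrations; sorting-based implementations are preferable for large test sets.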

Figure 2.

Model training, testing, and evaluation. A, Data are split into train and test sets, commonly 80% for training and 20% for testing. Class imbalances can be handled, for example, by (i) undersampling the majority class or (ii) weighting observations by the inverse of the class frequency. Imbalances in testing data are left unchanged. B, During training, the model parameters (eg, β) are tuned to learn the relationship between features and outcome. Many models output a probability of a positive class label, requiring a cutoff to be chosen to assign a label. A graphic depiction of a hypothetical logistic regression model is shown as an example. C, The trained model is evaluated on the held-out test set (Xtest), and the predicted labels (Ŷtest) are compared with the actual labels (Ytest). At the chosen cutoff, a confusion matrix can be used to assess model performance, and several metrics can be derived from it. Sensitivity (or recall) and specificity are the proportions of positive and negative samples accurately classified, respectively. PPV (or precision) and NPV are the proportions of predicted positive and negative samples that are actually positive and negative, respectively. Receiver operating characteristic (ROC) and precision-recall (PR) curves can be constructed to assess performance at each possible cutoff. FN, false negative; FP, false positive; N, negative; NPV, negative predictive value; P, positive; PPV, positive predictive value; TN, true negative; TP, true positive.

Class imbalance refers to data in which positive and negative samples do not occur at equal rates (Figure 2), potentially leading to biased models. When there are class imbalances in the test set, the area under the ROC curve is not an appropriate measure of performance [29]. In such cases, a precision-recall curve should be used. Data should be balanced during model training, while testing data should reflect real-world prevalence. Imbalanced training data can be handled by undersampling the majority class, oversampling the minority class, or weighting observations by the inverse of their class frequency during model training [30]. In clinical contexts, class imbalances are common due to the low prevalence of certain diseases or phenotypes, resulting in relatively few “positive” cases. Importantly, models with clinical applications should be evaluated according to their impact on clinical outcomes (mortality, length of stay, health care costs, etc). Many studies report only model evaluation metrics; others estimate clinical outcomes on testing data, and fewer still measure clinical outcomes via prospective or external validation [31].
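As a rough illustration of the ideas above, the sketch below derives inverse-frequency class weights for training and computes the confusion-matrix metrics from Figure 2C at a chosen cutoff. The function names, labels, and scores are hypothetical; the weight formula (total samples divided by the number of classes times the class count) is one common convention, not necessarily the one used in the cited studies.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Weight each class by n / (k * count), so rarer classes weigh more."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}


def confusion_metrics(labels, scores, cutoff):
    """Sensitivity, specificity, PPV, and NPV when predicting positive at score >= cutoff."""
    preds = [1 if s >= cutoff else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }


# Hypothetical training labels with 10% positives.
weights = inverse_frequency_weights([0] * 9 + [1])

# Hypothetical test labels and scores, evaluated at a cutoff of 0.5.
metrics = confusion_metrics([1, 1, 0, 0, 0, 0], [0.9, 0.4, 0.6, 0.2, 0.1, 0.3], 0.5)
```

The weights are applied only during training; the test set is left at its natural prevalence so that PPV and NPV reflect what a clinician would see.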

CASE STUDIES IN THE APPLICATION OF EHR DATA AND ML TO COMBAT AMR

We searched the literature for studies that used ML trained on antibiotics-related data from the EHR to combat AMR. Studies generally used ML for one of three tasks: (1) predicting carriage of, or infection with, a specific antibiotic-resistant pathogen; (2) predicting antibiotic susceptibility test results at the patient level; or (3) assisting antimicrobial stewardship tasks. The wide variety of studies necessitated selection of only the most relevant to highlight core concepts in applying ML to EHR data. Table 2 and Figure 3 describe the input data, algorithms used, and performance of the ML models published in each article.

Table 2.

Characteristics of Machine Learning Models Used in Selected Papers

First Author (Year) | Input (Features, Covariates) | Algorithm | Predicted Outcome | Performance Relative to Existing Standard
Carriage/infection
Goodman et al (2016, 2019) [32, 34] | Demographics, antibiotic use history, infection history | LR, DT | Infection with an ESBL-producing pathogen | No clinical interpretation provided by authors
McGuire et al (2021) [35] | 67 features, including demographics, body mass index, number of prior admissions, prior carbapenem resistance, positive MRSA/VRE swab within 24 h of admission, Charlson score, etc | XGBoost | Infection with a carbapenem-resistant pathogen | No clinical interpretation provided by authors
Robicsek et al (2011) [37] | Demographics, nursing home status, admission service type, feeding tube status, hemoglobin level, cystic fibrosis diagnosis, etc | LR | MRSA colonization | Model detected MRSA colonization in patients accounting for 72% of MRSA-related patient-days, reducing costs by approximately 40%–60%
Multiple antibiotics
Kanjilal et al (2020) [38] | In previous 90 d: resistance frequency in hospital urine samples, number of prescriptions of each antibiotic | LR model for each antibiotic | Resistance to NIT, TMP-SMX, CIP, and LVX | Model would have reduced CIP/LVX use by 67% and the rate of mismatched treatment by 18%
Corbin et al (2022) [39] | Demographics, insurance, laboratory values, vital measurements, antibiotic use history, infection history | RF or GBDT, depending on antibiotic | Resistance to 12 antibiotics (8 single, 4 combinations) | Model would have reduced the spectrum of many prescriptions in the test set (eg, 69% of VAN + PIP-TAZ could have been narrowed to PIP-TAZ alone)
Aiding stewardship
Bystritsky et al (2020) [40] | Visit-level characteristics: infectious disease consultation, positive culture, intensive care unit admission, etc | LR, GBDT | Patient requires stewardship intervention | Model would have reduced number of patients to review by 54%
Goodman et al (2022) [33] | Demographics, prior multidrug-resistant organism history, clinical characteristics (infection site, diagnoses), infectious disease consultation, positive culture, etc | RF | Patient requires stewardship intervention | Models would have reduced number of cases requiring review by 34% and 31% and resulted in 161 and 171 “missed” interventions, respectively
Beaudoin et al (2014, 2016) [41, 42] | Visit-level clinical “sequence”: time-encoded laboratory values, prescriptions, diagnoses, etc | TIM | Antibiotic prescription was inappropriate | The learning module derived novel rules and triggered more alerts than the baseline system alone

Abbreviations: CIP, ciprofloxacin; DT, decision tree; ESBL, extended spectrum β-lactamase; GBDT, gradient-boosted decision tree; LR, logistic regression; LVX, levofloxacin; MRSA, methicillin-resistant Staphylococcus aureus; NIT, nitrofurantoin; PIP-TAZ, piperacillin/tazobactam; RF, random forest; TIM, temporal induction of classification models; TMP-SMX, trimethoprim/sulfamethoxazole; VAN, vancomycin; VRE, vancomycin-resistant Enterococcus.

Figure 3.

Model metrics in published papers. The performance of machine learning models differs across domains and predictive tasks. The prevalence column indicates the proportion of the test set that was susceptible to the antibiotic of interest (3.1 and 3.2) or was flagged for stewardship intervention (3.3). In 3.3, a flag (full, A, or B) was added to indicate that multiple reduced data sets were used for model training; refer to Goodman et al [33] for further details. Metrics printed in gray indicate model performance at the cutoff considered “optimal” by the authors, even if the performance was reported at multiple cutoffs. AUROC and AUPRC summarize performance across all model cutoffs. White cells indicate a metric that was not reported in the published article. alg, algorithm; AMP, ampicillin; AUPRC, area under the precision-recall curve; AUROC, area under the receiver operating characteristic curve; CEF, cefazolin; CFP, cefepime; CFT, ceftriaxone; CIP, ciprofloxacin; DT, decision tree; GBDT, gradient-boosted decision trees; LR, logistic regression; LVX, levofloxacin; MER, meropenem; NIT, nitrofurantoin; NPV, negative predictive value; PIP/TAZ, piperacillin/tazobactam; PPV, positive predictive value; RF, random forest; TIM, temporal induction of classification models; TMP/SMX, trimethoprim/sulfamethoxazole; VAN, vancomycin.

Predicting Carriage of, or Infection With, Resistant Pathogens

Multidrug-resistant organisms pose a significant burden on patients, but timely identification of these pathogens at the patient level is difficult. Goodman et al sought to use ML to predict whether patients who were bacteremic were infected with an extended-spectrum β-lactamase–producing pathogen [34]. The authors derived a risk score from their logistic regression model and chose a cutoff that limited false positives. The risk score had excellent precision (94.6%) but poor sensitivity (49.5%), which resulted in few false positives and many false negatives. However, the risk score used features that were not hard coded in the EHR, requiring extensive patient interviews. To overcome this, the authors trained a decision tree only on features present in the EHR. The decision tree showed similar performance and was simpler to use at the bedside. Yet, the authors did not estimate how use of the model would improve the appropriateness of prescriptions if implemented into clinical practice.

McGuire et al sought to predict carbapenem resistance in patients to avoid prescribing an ineffective therapy [35]. Their resulting tree-based model (ie, XGBoost [36]) had excellent specificity and negative predictive value (both approximately 99%) but poor sensitivity and precision (both approximately 30%). They chose a cutoff to limit the number of false positives—that is, patients who would unnecessarily be given broad-spectrum alternatives. A comparison of the model with existing carbapenem resistance prediction tools was not provided by the authors.

Identification of patients colonized but not infected with multidrug-resistant organisms could minimize the transmission of resistant pathogens in the hospital. Robicsek et al sought to predict methicillin-resistant Staphylococcus aureus (MRSA) colonization in patients admitted to the hospital in an effort to reduce the costs associated with MRSA transmission [37]. With such a low prevalence (2%–4%), the false-positive rate of polymerase chain reaction assay precluded its clinical utility despite its low cost. However, the resulting model outperformed existing MRSA colonization risk scores and was able to identify a high-risk cohort of patients in which polymerase chain reaction assay could be used with high precision. The authors assessed the generalizability and transferability of their model by training it on data from one hospital and validating on two other sites.
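The effect of prevalence on precision follows directly from Bayes' rule, and a short sketch shows why screening a preselected high-risk cohort helps. The sensitivity and specificity below are hypothetical placeholders (the article does not report the assay's values), and the 30% prevalence for the high-risk cohort is likewise an assumption for illustration; only the 2%–4% population prevalence comes from the text.

```python
def ppv(prevalence, sensitivity, specificity):
    """Positive predictive value of a test, from Bayes' rule."""
    if not 0.0 <= prevalence <= 1.0:
        raise ValueError("prevalence must be between 0 and 1")
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical assay characteristics; not taken from the cited study.
SENSITIVITY = 0.95
SPECIFICITY = 0.95

# Universal screening at roughly the prevalence quoted in the text (2%-4%).
ppv_all_admissions = ppv(0.03, SENSITIVITY, SPECIFICITY)   # about 0.37

# Screening only a model-selected cohort with an assumed 30% prevalence.
ppv_high_risk = ppv(0.30, SENSITIVITY, SPECIFICITY)        # about 0.89
```

Under these assumed numbers, most positive results in universal screening would be false positives, whereas the same assay applied to an enriched cohort yields mostly true positives.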

Predicting Resistance to Multiple Antibiotics

ML models that predict resistance across a panel of antibiotics could aid clinicians in selecting an empiric therapy in the face of potential resistance. Several articles describe such work and go beyond resistance prediction by recommending an empiric therapy while taking stewardship efforts into account. To this end, Kanjilal et al trained logistic regression models to predict resistance to multiple antibiotics in urinary tract infection cases and used model outputs to recommend an optimal empiric therapy [38]. The authors engineered 2 population-level features: (1) the prevalence of resistance to each antibiotic in urine samples in the 90 days preceding a patient's sample and (2) the hospital-wide number of prescriptions of each antibiotic in the preceding 90 days. Relative to what was prescribed by clinicians, the model's recommendations reduced the use of broad-spectrum therapies (ciprofloxacin and levofloxacin) by 67% and reduced the rate of treatment mismatch by 18%. Cases in which the clinician and model recommendations differed were manually reviewed and occurred due to therapy contraindications not coded in the model (eg, allergies).
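A population-level feature of this kind can be sketched as a rolling-window calculation. The code below is an illustration under assumptions, not the authors' implementation: the record layout, field names, and example values are hypothetical. The window deliberately excludes the sample date itself so that the feature uses only information available before the prediction is made.

```python
from datetime import date, timedelta


def prior_resistance_rate(history, sample_date, antibiotic, window_days=90):
    """Fraction of prior isolates resistant to `antibiotic` in the preceding window.

    Only results dated within [sample_date - window_days, sample_date) are used,
    so results from the sample date onward cannot leak into the feature.
    Returns None when the window contains no results for that antibiotic.
    """
    start = sample_date - timedelta(days=window_days)
    window = [
        h for h in history
        if h["antibiotic"] == antibiotic and start <= h["date"] < sample_date
    ]
    if not window:
        return None
    return sum(1 for h in window if h["resistant"]) / len(window)


# Hypothetical hospital-wide urine culture results.
history = [
    {"date": date(2020, 1, 1), "antibiotic": "CIP", "resistant": True},   # too old
    {"date": date(2020, 2, 10), "antibiotic": "CIP", "resistant": True},
    {"date": date(2020, 3, 5), "antibiotic": "CIP", "resistant": False},
    {"date": date(2020, 3, 20), "antibiotic": "NIT", "resistant": True},  # other drug
    {"date": date(2020, 4, 1), "antibiotic": "CIP", "resistant": True},   # same day
]

cip_rate = prior_resistance_rate(history, date(2020, 4, 1), "CIP")
```

The prescription-count feature would follow the same pattern, counting prescriptions instead of averaging resistance flags.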

Similarly, Corbin et al sought to predict resistance to multiple antibiotics and recommend an optimal empiric therapy using data from Stanford and Boston hospitals [39]. Their recommendation algorithm revealed that a substantial portion of broad-spectrum prescriptions could have been exchanged with narrow-spectrum therapies while maintaining the same coverage rate as physicians. For example, 69% of vancomycin + piperacillin/tazobactam prescriptions could have been exchanged for piperacillin/tazobactam alone. The authors dealt with missing antibiotic susceptibility test values by constructing a set of rules for imputation. For example, Streptococcus agalactiae is susceptible to cephalosporins, and gram-negative rods are resistant to vancomycin. Resistance to each antibiotic was predicted with tree-based models. To assess the temporal reliability of model predictions, the models were trained on Stanford data from 2009 to 2018 and then tested on data from 2019.
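Rule-based imputation of untested susceptibility results might look like the sketch below. It encodes only the two example rules given in the text; the data layout, field names, and function name are hypothetical, and the actual rule set used by Corbin et al is not reproduced here.

```python
# Each rule: (predicate on the isolate, antibiotic key, imputed label).
# Only the two examples mentioned in the text are encoded.
RULES = [
    (lambda iso: iso.get("organism") == "Streptococcus agalactiae", "cephalosporins", "S"),
    (lambda iso: iso.get("morphology") == "gram-negative rod", "vancomycin", "R"),
]


def impute_ast(isolate, results):
    """Fill untested antibiotics using expert rules; measured results are never overwritten."""
    filled = dict(results)
    for applies, antibiotic, label in RULES:
        if filled.get(antibiotic) is None and applies(isolate):
            filled[antibiotic] = label
    return filled


# Hypothetical isolate for which vancomycin was never tested.
isolate = {"organism": "Escherichia coli", "morphology": "gram-negative rod"}
measured = {"ciprofloxacin": "S", "vancomycin": None}

completed = impute_ast(isolate, measured)
```

Keeping the rules as data rather than hard-coded branches makes it easier for clinicians to review and extend them, which matters because every imputed label becomes a training target.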

Aiding Antimicrobial Stewardship Tasks

Several studies used ML to improve the efficiency of stewardship tasks that require significant time commitment by experienced pharmacists and physicians, such as postprescription review with feedback. Bystritsky et al hypothesized that ML could more accurately identify patients in need of a stewardship intervention than existing rules-based clinical decision support systems and manual patient review [40]. They tuned the model cutoff to limit false negatives (ie, patients who required an intervention but were not flagged), resulting in poor precision but excellent sensitivity. Relative to manual review, the model reduced by 54% the number of patients who needed to be reviewed to identify one patient in need of an intervention. In a similar study, Goodman et al trained a random forest model on a reduced set of features to emphasize ease of extraction from the EHR [33]. The resulting model had predictive performance similar to that of the model by Bystritsky et al but would be more feasible to implement. In the test set, their model would have reduced the number of cases requiring review by 31% to 34% while maintaining high sensitivity.
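The trade-off described here, choosing a cutoff to cap false negatives and then asking how much review work remains, can be sketched as follows. The number of patients reviewed per patient who truly needs an intervention is the reciprocal of precision. The function names, labels, and scores are hypothetical and do not reproduce either study's procedure.

```python
def cutoff_for_sensitivity(labels, scores, target):
    """Highest cutoff (flag if score >= cutoff) whose sensitivity meets the target."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive sample")
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        if tp / n_pos >= target:
            return cutoff
    raise ValueError("target sensitivity cannot be reached")


def number_needed_to_review(labels, scores, cutoff):
    """Flagged patients per true positive, ie, 1 / precision."""
    flagged = [y for y, s in zip(labels, scores) if s >= cutoff]
    tp = sum(flagged)
    return len(flagged) / tp if tp else float("inf")


# Hypothetical labels (1 = intervention needed) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]

strict = cutoff_for_sensitivity(labels, scores, target=1.0)
nnr_model = number_needed_to_review(labels, scores, strict)

# Reviewing everyone is equivalent to a cutoff at or below the lowest score.
nnr_all = number_needed_to_review(labels, scores, min(scores))
```

In this toy data the lowest-scoring patient is a true positive, so demanding perfect sensitivity forces review of every patient and the model saves no work; relaxing the target slightly is what buys the reduction in review burden.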

Beaudoin et al sought to use ML to improve the prescribing rules in the knowledge base of the antimicrobial prescription surveillance system, a clinical decision support system that automatically identifies inappropriate antimicrobial prescriptions [42]. The developed learning module of the antimicrobial prescription surveillance system used a rule induction ML algorithm to derive novel prescription rules from a set of patient visits labeled as having inappropriate prescriptions. In a prospective study, the learning module derived novel rules and identified several inappropriate prescriptions for which the baseline system did not trigger an alert [42].

CONCLUSIONS AND FUTURE DIRECTIONS

The medical community has expressed concerns about the lack of interpretability and accountability of AI/ML-based technologies [5]. Novel medicolegal problems arise, such as who is responsible in the event that such a product yields poor patient outcomes [3], particularly when models with limited interpretability (eg, neural networks) are used. Even if models are interpretable, their outputs may suggest counterintuitive interventions. In one case, pneumonia mortality prediction models learned that asthma diagnosis predicted lower mortality risk [43]. Blind trust in outputs of this model would have caused less aggressive treatment of patients who were asthmatic, but additional treatment is precisely what led to their superior survival. ML models trained on EHR data may output biased results for underrepresented and vulnerable populations (see examples in the article by Gianfrancesco et al [44]), thereby perpetuating existing sociodemographic health care disparities. Such results may erode patient trust in AI/ML-based technologies, which is key to their widespread acceptance in medicine.

ML models could be implemented as clinical decision support system tools to be used at the bedside. Implementation requires extensive collaboration among informaticists, clinicians, and information technology personnel. Clinicians can provide domain knowledge during model development to ensure that relevant features are included and to provide a baseline for model performance. EHR systems experts can build the necessary infrastructure for data processing that will occur at the time of prediction. If a model requires complicated data preprocessing (eg, some NLP approaches) or extensive patient interviewing to derive features, model predictions may not be immediately available. Many open questions related to the clinical deployment of ML models remain, such as how well models will generalize across institutions or whether this should be a focus. Some argue that temporal reliability within an institution is more important than generalizability across institutions [45]. Even within an institution, models may show a “performance gap” across time, due to changes in the way that data are extracted/transformed and to changes in the patient population [26].

Despite these concerns, the number of AI/ML-based devices and algorithms approved by the Food and Drug Administration (FDA) has risen sharply in the last decade and is expected to continue growing [46]. ML-based prediction models have been deployed clinically for predicting sepsis onset [14, 47], low ejection fraction [48, 49], and acute kidney injury [50], to name a few. However, the field currently suffers from a lack of external and prospective evaluation of model impact on clinical outcomes. The FDA considers ML models and algorithms to be medical devices, which face a lower approval standard than, for example, pharmaceutical drugs. Plana et al found that many clinically deployed FDA-approved ML models were not assessed by randomized clinical trials, and that those that were assessed showed mixed results and were potentially biased [31]. Overarching issues with translation include training on cohorts of limited diversity, lack of validation across multiple sites, and poor study design [31].

ML has great utility but often falls short of the panacea that many had hoped for. This is due to fundamental constraints such as the large volume of data required and the difficulty of predicting outside the range of the training data. These issues arise, in part, because ML models learn associations between features and outcomes, and correlation does not indicate causation. In contrast, causal ML models attempt to infer the underlying causal structure of a system [51]. Causal models provide additional means of enforcing prior clinical knowledge and offer greater transferability and generalizability, areas where traditional ML struggles [52]. Causal models are highly interpretable, and the effect of an intervention can be directly tested, making them particularly amenable to clinical research. Causal models have only recently been applied to clinical research [52], in part because they are more difficult to train [51]. In the future, we believe that causal modeling will play a role in inference from EHR data because EHR data are of limited size, highly heterogeneous, temporally changing, and incomplete.
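The difference between an association and the effect of an intervention can be illustrated with stratified adjustment, one standard approach in causal inference. The sketch below is an assumption-laden toy and not a method taken from references 51 or 52: the cohort counts are invented, the function names are hypothetical, and the adjustment is valid only if illness severity is the sole confounder and is fully recorded.

```python
def build_toy_cohort():
    """Return (severe, treated, died) rows from invented counts.

    Severe patients are treated more often and die more often, so severity
    confounds the treatment-outcome association.
    """
    # (severe, treated, deaths, group size); all numbers are hypothetical.
    spec = [
        (1, 1, 16, 80),
        (1, 0, 8, 20),
        (0, 1, 1, 20),
        (0, 0, 8, 80),
    ]
    rows = []
    for severe, treated, deaths, size in spec:
        rows += [(severe, treated, 1)] * deaths
        rows += [(severe, treated, 0)] * (size - deaths)
    return rows


def naive_risk(rows, treated):
    """Observed death rate among patients with the given treatment status."""
    outcomes = [died for _, t, died in rows if t == treated]
    return sum(outcomes) / len(outcomes)


def adjusted_risk(rows, treated):
    """Severity-adjusted risk: sum over strata of P(died | treated, z) * P(z)."""
    total = len(rows)
    risk = 0.0
    for z in sorted({severe for severe, _, _ in rows}):
        stratum = [(t, died) for severe, t, died in rows if severe == z]
        outcomes = [died for t, died in stratum if t == treated]
        if not outcomes:
            raise ValueError("no patients with this treatment status in a stratum")
        risk += (sum(outcomes) / len(outcomes)) * (len(stratum) / total)
    return risk


if __name__ == "__main__":
    cohort = build_toy_cohort()
    print("naive risk, treated vs untreated:",
          naive_risk(cohort, 1), naive_risk(cohort, 0))
    print("adjusted risk, treated vs untreated:",
          adjusted_risk(cohort, 1), adjusted_risk(cohort, 0))
```

In this toy cohort the unadjusted comparison makes treatment look slightly harmful, while the severity-adjusted comparison shows that treatment halves the risk. A purely associational model trained on such data would learn the first pattern.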

Contributor Information

Samuel E Blechman, Department of Biomedical Informatics, University of Pittsburgh, Pennsylvania.

Erik S Wright, Department of Biomedical Informatics, University of Pittsburgh, Pennsylvania; Center for Evolutionary Biology and Medicine, University of Pittsburgh, Pennsylvania.

Notes

Acknowledgments. Figures 1 and 2 were made with Biorender.com. We thank Andrew Beckley for preliminary work on this topic not shown here.

Author contributions. Study conception and design: S. E. B. and E. S. W. Draft manuscript preparation: S. E. B.

Financial support. This work was supported by the National Institute of Allergy and Infectious Diseases at the National Institutes of Health (1R21AI144769-01A1).

References

  • 1. Moor M, Banerjee O, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023; 616:259–65.
  • 2. Stokes JM, Yang K, Swanson K, et al. A deep learning approach to antibiotic discovery. Cell 2020; 180:688–702.e13.
  • 3. Yoon CH, Torrance R, Scheinerman N. Machine learning in medicine: should the pursuit of enhanced interpretability be abandoned? J Med Ethics 2022; 48:581–5.
  • 4. Chowdhary KR. Fundamentals of artificial intelligence. New Delhi: Springer India, 2020.
  • 5. Chen M, Zhang B, Cai Z, et al. Acceptance of clinical artificial intelligence among physicians and medical students: a systematic review with cross-sectional survey. Front Med 2022; 9:990604.
  • 6. Bera K, Schalper KA, Rimm DL, Velcheti V, Madabhushi A. Artificial intelligence in digital pathology—new tools for diagnosis and precision oncology. Nat Rev Clin Oncol 2019; 16:703–15.
  • 7. Patton MJ, Liu VX. Predictive modeling using artificial intelligence and machine learning algorithms on electronic health record data. Crit Care Clin 2023; 39:647–73.
  • 8. Raschka S, Patterson J, Nolet C. Machine learning in Python: main developments and technology trends in data science, machine learning, and artificial intelligence. Information 2020; 11:193.
  • 9. Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care 2013; 51(suppl 3):S30–7.
  • 10. Ford E, Carroll J, Smith H, et al. What evidence is there for a delay in diagnostic coding of RA in UK general practice records? An observational study of free text. BMJ Open 2016; 6:e010393.
  • 11. Liu B, Hadzi-Tosev M, Liu Y, et al. Accuracy of International Classification of Diseases, 10th Revision codes for identifying sepsis: a systematic review and meta-analysis. Crit Care Explor 2022; 4:e0788.
  • 12. Carlson KF, Barnes JE, Hagel EM, Taylor BC, Cifu DX, Sayer NA. Sensitivity and specificity of traumatic brain injury diagnosis codes in United States Department of Veterans Affairs administrative data. Brain Inj 2013; 27:640–50.
  • 13. Singer M, Deutschman CS, Seymour CW, et al. The third international consensus definitions for sepsis and septic shock (Sepsis-3). JAMA 2016; 315:801–10.
  • 14. Fleuren LM, Klausch TLT, Zwager CL, et al. Machine learning for the prediction of sepsis: a systematic review and meta-analysis of diagnostic test accuracy. Intensive Care Med 2020; 46:383–400.
  • 15. Hu G, Baker SP. An explanation for the recent increase in the fall death rate among older Americans: a subgroup analysis. Public Health Rep 2012; 127:275–81.
  • 16. Johnson AEW, Pollard TJ, Shen L, et al. MIMIC-III, a freely accessible critical care database. Sci Data 2016; 3:160035.
  • 17. Catalán P, Wood E, Blair JMA, Gudelj I, Iredell JR, Beardmore RE. Seeking patterns of antibiotic resistance in ATLAS, an open, raw MIC database with patient metadata. Nat Commun 2022; 13:2917.
  • 18. Yao Z, Tsai J, Liu W, et al. Automated identification of eviction status from electronic health record notes. J Am Med Inform Assoc 2023; 30:1429–37.
  • 19. Halpern Y, Horng S, Choi Y, Sontag D. Electronic medical record phenotyping using the anchor and learn framework. J Am Med Inform Assoc 2016; 23:731–40.
  • 20. Lin KJ, Glynn RJ, Singer DE, Murphy SN, Lii J, Schneeweiss S. Out-of-system care and recording of patient characteristics critical for comparative effectiveness research. Epidemiology 2018; 29:356–63.
  • 21. Madden JM, Lakoma MD, Rusinak D, Lu CY, Soumerai SB. Missing clinical and behavioral health data in a large electronic health record (EHR) system. J Am Med Inform Assoc 2016; 23:1143–9.
  • 22. Choudhry NK, Shrank WH. Four-dollar generics—increased accessibility, impaired quality assurance. N Engl J Med 2010; 363:1885–7.
  • 23. Emmanuel T, Maupong T, Mpoeleng D, Semong T, Mphago B, Tabona O. A survey on missing data in machine learning. J Big Data 2021; 8:140.
  • 24. Hastie T, Tibshirani R, Friedman J. The elements of statistical learning. 2nd ed. New York: Springer, 2017.
  • 25. Dhar S, Shamir L. Evaluation of the benchmark datasets for testing the efficacy of deep convolutional neural networks. Vis Inform 2021; 5:92–101.
  • 26. Ötleş E, Oh J, Li B, et al. Mind the performance gap: examining dataset shift during prospective validation. 2021. Available at: https://arxiv.org/abs/2107.13964. Accessed 13 May 2024.
  • 27. The All of Us Research Program Investigators. The “All of Us” research program. N Engl J Med 2019; 381:668–76.
  • 28. Our Future Health. 2024. https://ourfuturehealth.org.uk/. Accessed May 2024.
  • 29. Asnicar F, Thomas AM, Passerini A, Waldron L, Segata N. Machine learning for microbiologists. Nat Rev Microbiol 2024; 22:191–205.
  • 30. He H, Garcia EA. Learning from imbalanced data. IEEE Trans Knowl Data Eng 2009; 21:1263–84.
  • 31. Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJY, Kann BH. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Netw Open 2022; 5:e2233946.
  • 32. Goodman KE, Lessler J, Cosgrove SE, et al. A clinical decision tree to predict whether a bacteremic patient is infected with an extended-spectrum β-lactamase–producing organism. Clin Infect Dis 2016; 63:896–903. doi:10.1093/cid/ciw425
  • 33. Goodman KE, Heil EL, Claeys KC, Banoub M, Bork JT. Real-world antimicrobial stewardship experience in a large academic medical center: using statistical and machine learning approaches to identify intervention “Hotspots” in an antibiotic audit and feedback program. Open Forum Infect Dis 2022; 9(7). doi:10.1093/ofid/ofac289
  • 34. Goodman KE, Lessler J, Harris AD, Milstone AM, Tamma PD. A methodological comparison of risk scores versus decision trees for predicting drug-resistant infections: a case study using extended-spectrum beta-lactamase (ESBL) bacteremia. Infect Control Hosp Epidemiol 2019; 40:400–7.
  • 35. McGuire RJ, Yu SC, Payne PRO, et al. A pragmatic machine learning model to predict carbapenem resistance. Antimicrob Agents Chemother 2021; 65:e0006321.
  • 36. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: KDD ’16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, CA: ACM, 2016:785–94. https://dl.acm.org/doi/10.1145/2939672.2939785
  • 37. Robicsek A, Beaumont JL, Wright M-O, Thomson RB, Kaul KL, Peterson LR. Electronic prediction rules for methicillin-resistant Staphylococcus aureus colonization. Infect Control Hosp Epidemiol 2011; 32:9–19.
  • 38. Kanjilal S, Oberst M, Boominathan S, Zhou H, Hooper DC, Sontag D. A decision algorithm to promote outpatient antimicrobial stewardship for uncomplicated urinary tract infection. Sci Transl Med 2020; 12:eaay5067.
  • 39. Corbin CK, Sung L, Chattopadhyay A, et al. Personalized antibiograms for machine learning driven antibiotic selection. Commun Med 2022; 2:38.
  • 40. Bystritsky RJ, Beltran A, Young AT, Wong A, Hu X, Doernberg SB. Machine learning for the prediction of antimicrobial stewardship intervention in hospitalized patients receiving broad-spectrum agents. Infect Control Hosp Epidemiol 2020; 41:1022–7.
  • 41. Beaudoin M, Kabanza F, Nault V, Valiquette L. An antimicrobial prescription surveillance system that learns from experience. AI Mag 2014; 35:15–25. doi:10.1609/aimag.v35i1.2500
  • 42. Beaudoin M, Kabanza F, Nault V, Valiquette L. Evaluation of a machine learning capability for a clinical decision support system to enhance antimicrobial stewardship programs. Artif Intell Med 2016; 68:29–36.
  • 43. Caruana R, Lou Y, Gehrke J, Koch P, Sturm M, Elhadad N. Intelligible models for healthcare: predicting pneumonia risk and hospital 30-day readmission. In: KDD ’15: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Sydney, Australia: ACM, 2015:1721–30. https://dl.acm.org/doi/10.1145/2783258.2788613
  • 44. Gianfrancesco MA, Tamang S, Yazdany J, Schmajuk G. Potential biases in machine learning algorithms using electronic health record data. JAMA Intern Med 2018; 178:1544–7.
  • 45. Burns ML, Kheterpal S. Machine learning comes of age: local impact versus national generalizability. Anesthesiology 2020; 132:939–41.
  • 46. Benjamens S, Dhunnoo P, Meskó B. The state of artificial intelligence-based FDA-approved medical devices and algorithms: an online database. NPJ Digit Med 2020; 3:118.
  • 47. Wong A, Otles E, Donnelly JP, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med 2021; 181:1065–70.
  • 48. Yao X. Artificial intelligence–enabled electrocardiograms for identification of patients with low ejection fraction: a pragmatic, randomized clinical trial. Nat Med 2021; 27:815–9.
  • 49. Lin C-S, Liu W-T, Tsai D-J, et al. AI-enabled electrocardiography alert intervention and all-cause mortality: a pragmatic randomized clinical trial. Nat Med 2024; 30:1461–70.
  • 50. Wainstein M, Flanagan E, Johnson DW, Shrapnel S. Systematic review of externally validated machine learning models for predicting acute kidney injury in general hospital patients. Front Nephrol 2023; 3:1220214.
  • 51. Shanmugam R. Elements of causal inference: foundations and learning algorithms. J Stat Comput Simul 2018; 88:3248.
  • 52. Sanchez P, Voisey JP, Xia T, Watson HI, O’Neil AQ, Tsaftaris SA. Causal machine learning for healthcare and precision medicine. R Soc Open Sci 2022; 9:220638.

Articles from The Journal of Infectious Diseases are provided here courtesy of Oxford University Press
