Abstract
Predictive artificial intelligence (AI) models enhance clinical workflows with applications such as prognostication and decision support, yet they face postdeployment performance challenges due to dataset shift. Regulatory guidelines emphasize the need for continuous monitoring, but actionable strategies are lacking. A significant obstacle is the postdeployment assessment of predictive AI models in the presence of confounding medical interventions: when effective interventions modify the outcomes a model predicts, performance estimates become biased. This bias can falsely suggest model decay, prompting unwarranted updates or decommissioning that harm clinical outcomes.
Proposed solutions include withholding model outputs for a subset of patients, monitoring clinical outcomes as surrogates for performance, and including clinician interventions as model terms, each with ethical or practical limitations. Without effective solutions to this problem, health systems may accumulate models that cannot later be evaluated, tuned, or withdrawn if they become ineffective, putting patients at risk. Advanced causal modeling to assess counterfactual outcomes may offer a reliable validation method. Until effective methods for postdeployment monitoring of predictive models are developed and validated, decisions on model updates should consider the causal pathways involved and be evidence based, ensuring the sustained utility of AI models in dynamic clinical environments.
Introduction
Artificial intelligence (AI)–based predictive models have seen rising adoption over the past decade, with applications including clinical prognostication, early detection of adverse events, and clinical decision support. These models guide the allocation of interventions based on a predicted risk. It is well known that such models may not generalize across settings or over time because of changes in patient populations, practice patterns, and data processes, a phenomenon known as dataset shift.1 The Good Machine Learning Practice for Medical Device Development guidance from the U.S. Food and Drug Administration (FDA)2 recommends that “deployed models are monitored for performance,” and a prior White House executive order3 called for “post-market oversight of AI-enabled health care-technology algorithmic system performance against real-world data.” The presumption underlying these recommendations is that monitored models, if their performance decays, should be updated to restore accuracy or decommissioned.
Although these policies are well-meaning, they are not actionable. Several open questions remain about translating them into actions that ensure predictive models continue to work as intended after deployment. The most pertinent issue is that once a model is implemented in a clinical workflow, it becomes part of a causal pathway linking the model’s predictions to clinical actions and the resulting outcomes. As a result, postdeployment estimates of model performance become unreliable. A common objective of deploying predictive models is to prompt interventions that reduce the risk of the adverse outcomes the model predicts. If the interventions are effective, adverse events will be prevented, which in turn alters the rates of the very events being predicted. These interventions, also called confounding medical interventions (CMIs), make model predictions appear incorrect on subsequent evaluation. Based on simulations using a previously published model,4 the resulting bias in observed performance estimates grows as the model becomes more successful at improving clinical outcomes (Fig. 1). Interpreted naively, this artifactual performance decay may prompt actions such as updating the model or abandoning it entirely, both to the detriment of clinical outcomes.
Figure 1. Relationships between Intervention Frequency, Intervention Success, and Observed (Biased) Model Performance Estimates.
Panel A shows the area under the receiver operating characteristic curve (AUROC), and Panel B shows the area under the precision–recall curve (AUPRC), for a previously published model with an event rate of 4.6%.4 Intervention frequency refers to the added probability of an intervention among patients who receive a model alert; this probability reflects the level of model adoption, clinicians’ trust in the model, timely delivery of model alerts, and other factors affecting the influence of model outputs on clinician behavior. Intervention effectiveness is the probability that the resulting intervention prevents the adverse event being predicted. Panel C shows the observed event rates after incorporating simulated model-recommended interventions. The original AUROC (0.86) and AUPRC (0.36) correspond to the true model performance when either intervention frequency or effectiveness is zero. Every other point corresponds to the AUROC and AUPRC for simulated interventions under varying postdeployment intervention frequencies and effectiveness rates. As intervention frequency and effectiveness increase, the observed AUROC decreases from 0.86 to 0.24 and the observed AUPRC from 0.36 to 0.00; that is, the biases in the observed metrics grow even though the true model performance remains unchanged.
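The mechanism behind Figure 1 can be reproduced in a few lines of simulation. The sketch below is illustrative only: the synthetic risk distribution, event rate, alert threshold, and parameter grid are our assumptions, not the data or parameters of the published deterioration-index model. Averted events are recorded as non-events, so the observed AUROC and AUPRC fall even though the model’s true discrimination is unchanged.

```python
# Minimal simulation of how confounding medical interventions (CMIs)
# bias observed postdeployment performance. All quantities are synthetic.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n = 200_000

# Risk scores from a hypothetical well-calibrated model, skewed toward low risk.
score = rng.beta(1, 12, size=n)
event = rng.binomial(1, score)             # true (pre-intervention) outcomes

alert = score >= np.quantile(score, 0.90)  # alert on the top decile of risk

def observed_performance(p_intervene, p_effective):
    """Observed AUROC/AUPRC after model-triggered interventions avert events."""
    treated = alert & (rng.random(n) < p_intervene)
    averted = treated & (event == 1) & (rng.random(n) < p_effective)
    observed_event = np.where(averted, 0, event)  # averted events are never recorded
    return (roc_auc_score(observed_event, score),
            average_precision_score(observed_event, score))

print("true AUROC/AUPRC:",
      roc_auc_score(event, score), average_precision_score(event, score))
for p_i, p_e in [(0.25, 0.25), (0.5, 0.5), (0.9, 0.9)]:
    auroc, auprc = observed_performance(p_i, p_e)
    print(f"freq={p_i:.2f} eff={p_e:.2f} -> AUROC={auroc:.2f}, AUPRC={auprc:.2f}")
```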
Previously Proposed Solutions for Postdeployment Surveillance
Several solutions have been proposed for the postdeployment surveillance of health care AI models; however, all are fraught with challenges.
First, it is possible to withhold the model outputs for a randomly selected subset of patients and later validate model performance within this subset. This amounts to a perpetual or recurring randomized controlled trial. Although this approach enables accurate ongoing estimation of a model’s performance and effects, it raises ethical concerns: once clinicians have adopted a model into their workflows, withholding its output for a small group of patients may lead to substandard care. This is particularly problematic when the absence of model output may put patients at risk or when the model is used to allocate scarce clinical resources.
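The mechanics of such a silent holdout are simple; the difficulty is ethical rather than technical. A minimal sketch follows, in which the holdout fraction, patient identifiers, field names, and alert threshold are all hypothetical:

```python
# Sketch of a randomized silent-holdout design for ongoing validation.
import hashlib

HOLDOUT_FRACTION = 0.05  # model outputs withheld for 5% of patients (illustrative)

def in_holdout(patient_id: str) -> bool:
    """Deterministically assign a stable random subset of patients to the holdout."""
    digest = hashlib.sha256(patient_id.encode()).hexdigest()
    return int(digest, 16) % 10_000 < HOLDOUT_FRACTION * 10_000

def deliver_prediction(patient_id: str, risk: float, alert_threshold: float = 0.2):
    """Log every prediction; surface alerts only outside the holdout."""
    record = {"patient_id": patient_id, "risk": risk,
              "holdout": in_holdout(patient_id)}
    if record["holdout"]:
        return record                 # score retained for later unbiased evaluation
    if risk >= alert_threshold:
        record["alert_sent"] = True   # triggers the clinical workflow
    return record
```

Because holdout patients receive no alerts, their outcomes are unaffected by the model, so performance estimated on them remains unbiased; the design question is whether that unbiasedness justifies the withheld alerts.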
A second proposed solution involves monitoring clinical outcomes as a surrogate for model performance. While an accurate model paired with an effective intervention can improve clinical outcomes, a lack of observed improvements after model deployment is not necessarily due to poor model performance. It may instead reflect low levels of clinician trust or model adoption, or a lack of effective interventions for model-targeted conditions. Observed improvements in patient outcomes may also be due to seasonal or epidemiologic changes or other shifts in clinical practice.
A third potential solution is to include clinician intervention as a model term.5 This approach, which is akin to adjusting away the effects of clinician interventions, assumes that subjects receiving and not receiving treatment are conditionally exchangeable given the other model covariates. For this assumption to hold, the data-generating process must be faithfully replicated in the prediction model, and there must be no unmeasured confounders. This is rarely true of published prediction models, which are not typically developed the way causal models for estimating unbiased treatment effects are. Including the intervention as a predictor can exacerbate the problem when the exchangeability assumption is not satisfied: because clinicians typically administer interventions to patients predicted to be at high risk of adverse outcomes, the intervention variable often contributes positively to model-predicted risks, widening the discrepancy between the model output and the observed labels against which it is evaluated. A simulation illustrating this effect is sketched below.
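In this sketch (entirely synthetic, with an assumed noisy severity proxy standing in for unmeasured confounding), a treatment that truly halves the odds of the event nevertheless receives a positive fitted coefficient, because treated patients are systematically sicker than the measured covariates reveal:

```python
# Confounding by indication when a clinician intervention is a model covariate.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 100_000
severity = rng.normal(size=n)                              # partly unmeasured in practice
treat = rng.binomial(1, 1 / (1 + np.exp(-2 * severity)))   # sicker patients get treated

# Treatment truly lowers the log-odds of the event by 0.7 (protective effect).
logit = -2.0 + 1.5 * severity - 0.7 * treat
event = rng.binomial(1, 1 / (1 + np.exp(-logit)))

# The deployed model sees the treatment flag but only a noisy severity proxy.
proxy = severity + rng.normal(scale=1.5, size=n)
X = np.column_stack([proxy, treat])
fit = LogisticRegression().fit(X, event)
print("fitted coefficient on treatment:", fit.coef_[0][1])  # positive, despite the true -0.7
```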
Feng et al. proposed a score-based cumulative sum (CUSUM) procedure within a causal framework to identify changes in model calibration after deployment.6 However, simulation results suggest that the effectiveness of the CUSUM procedure decreases as clinician trust in the model predictions increases. This is consequential because the severity of the CMI problem also worsens as clinician trust in the model grows (Fig. 1). Feng et al. also proposed a CUSUM procedure to monitor changes in the positive and negative predictive values.7 This approach addresses scenarios in which the intervention that ensues from model alerts was not part of routine practice before deployment. However, it does not apply in more general scenarios, including when a model’s goal is to increase the use of treatments already in active use (e.g., allocating existing clinical interventions to patients at high risk of some adverse outcome). In addition, both approaches provide a statistical test of whether a change in model performance has occurred but do not quantify the model’s performance or the magnitude of the change. Strategies for more complex scenarios, such as recurrent predictions or multiple interventions, have yet to be developed.
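For intuition, the flavor of such a monitoring statistic can be conveyed with a generic one-sided CUSUM on calibration residuals. This is not the score-based procedure of Feng et al.; the drift allowance and alarm threshold below are arbitrary placeholders that would need calibration to a target false-alarm rate:

```python
# Generic one-sided CUSUM on calibration residuals (observed minus predicted).
import numpy as np

def cusum_alarm(y_true, y_pred, drift_allowance=0.01, threshold=5.0):
    """Flag sustained upward drift in mean calibration error."""
    residuals = np.asarray(y_true, float) - np.asarray(y_pred, float)
    stat, history = 0.0, []
    for r in residuals:                               # process patients in time order
        stat = max(0.0, stat + r - drift_allowance)   # accumulate excess residual
        history.append(stat)
        if stat > threshold:
            return True, history                      # alarm: apparent miscalibration
    return False, history
```

Note that under CMIs this statistic would also alarm when the model is working as intended, since averted events push observed outcomes below predicted risks; this is precisely the failure mode described above.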
Considerations for Effective Postdeployment Surveillance
Modern approaches leveraging advances in causal inference may enable postdeployment monitoring that is less susceptible to these challenges. Key to these approaches is separately considering the effects of model recommendations on clinician decisions and the effects of those decisions on predicted outcomes. If a model’s degraded performance over time is driven by clinicians acting on model predictions to avert adverse outcomes, then validating models against counterfactual outcomes (the outcomes that would have occurred had a clinician not intervened) may enable consistent postdeployment validation despite effective interventions. Such methods can also isolate changes in clinical outcomes that are directly attributable to a model rather than to secular trends or contemporaneous practice changes; these effects remain the best gauge of a model’s impact. Future work should embed postdeployment monitoring into randomized trials to assess whether computational techniques applying these approaches yield similar observed performance across trial arms (e.g., arms with CMIs vs. arms without CMIs).
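As one concrete, assumption-laden instance of this idea: if treatment assignment is recorded and depends only on measured covariates (no unmeasured confounding), performance absent intervention can be estimated from untreated patients reweighted by their inverse probability of remaining untreated. The sketch below (all variable names hypothetical) computes such an inverse-probability-weighted AUROC; it is offered as one possible estimator, not as a validated monitoring method:

```python
# Inverse-probability-weighted AUROC among untreated patients, as a rough
# estimate of the counterfactual (no-intervention) discrimination.
import numpy as np
from sklearn.linear_model import LogisticRegression

def ipw_auroc(scores, outcomes, treated, covariates):
    """Weighted probability that a random event outranks a random non-event."""
    # Propensity of treatment given covariates (which may include the model score).
    propensity = (LogisticRegression()
                  .fit(covariates, treated)
                  .predict_proba(covariates)[:, 1])
    keep = treated == 0
    # Clip to stabilize weights when the probability of remaining untreated is small.
    w = 1.0 / np.clip(1.0 - propensity[keep], 1e-3, None)
    s, y = scores[keep], outcomes[keep]
    pos, neg = y == 1, y == 0
    # Pairwise comparison; for large cohorts a rank-based computation is preferable.
    num = np.sum(w[pos][:, None] * w[neg][None, :] * (s[pos][:, None] > s[neg][None, :]))
    den = np.sum(w[pos]) * np.sum(w[neg])
    return num / den
```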
Until these methodological challenges have been addressed, we caution against using apparent model performance, collected as part of model monitoring, to drive decisions about whether a model needs to be retrained, recalibrated, or otherwise updated. Although the FDA encourages predetermined change control plans for AI, such plans should not set arbitrary criteria for when a model should or should not be updated. If such thresholds must be set, they should explicitly consider the causal pathway between the model’s recommendation and the resulting intervention, with supporting evidence pointing to the model as the root cause of the apparent performance decay. Ideally, these decisions should be made on a case-by-case basis in partnership among model developers, implementers, and the appropriate governing bodies.
Strategies to assess the ongoing accuracy of predictive models in dynamic clinical environments are critical. Causal modeling offers a promising way forward. As these approaches are developed, dedicated efforts to refine and validate them will be crucial for sustaining the utility and trustworthiness of predictive models and capitalizing on the promise of AI in health care.
Footnotes
Disclosures
Author disclosures are available at ai.nejm.org.
The views expressed are those of the authors and do not reflect the views of the United States government.
References
- 1. Finlayson SG, Subbaswamy A, Singh K, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med 2021;385:283–286. DOI: 10.1056/NEJMc2104626.
- 2. U.S. Food and Drug Administration. Good machine learning practice for medical device development. October 27, 2021 (https://www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-device-development-guiding-principles).
- 3. Biden JR. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. October 30, 2023 (https://www.federalregister.gov/documents/2023/11/01/2023-24283/safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence). Rescinded on January 20, 2025.
- 4. Cummings BC, Blackmer JM, Motyka JR, et al. External validation and comparison of a general ward deterioration index between diversely different health systems. Crit Care Med 2023;51:775–786. DOI: 10.1097/CCM.0000000000005837.
- 5. Lenert MC, Matheny ME, Walsh CG. Prognostic models will be victims of their own success, unless…. J Am Med Inform Assoc 2019;26:1645–1650. DOI: 10.1093/jamia/ocz145.
- 6. Feng J, Gossmann A, Pennello GA, Petrick N, Sahiner B, Pirracchio R. Monitoring machine learning-based risk prediction algorithms in the presence of performativity. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics. Valencia, Spain: PMLR, 2024:919–927 (https://proceedings.mlr.press/v238/feng24b.html).
- 7. Feng J, Subbaswamy A, Gossmann A, et al. Designing monitoring strategies for deployed machine learning algorithms: navigating performativity through a causal lens. Proceedings of the Third Conference on Causal Learning and Reasoning. Los Angeles, CA: PMLR, 2024:587–608 (https://proceedings.mlr.press/v236/feng24a.html).