Abstract
Background
As the enthusiasm for integrating artificial intelligence (AI) into clinical care grows, so has our understanding of the challenges associated with deploying impactful and sustainable clinical AI models. Complex dataset shifts resulting from evolving clinical environments strain the longevity of AI models as predictive accuracy and associated utility deteriorate over time.
Objective
Responsible practice thus necessitates that the lifecycle of AI models be extended to include ongoing monitoring and maintenance strategies within health system algorithmovigilance programs. We describe a framework encompassing a 360° continuum of preventive, preemptive, responsive, and reactive approaches that address model monitoring and maintenance from critically different angles.
Discussion
We describe the complementary advantages and limitations of these four approaches and highlight the importance of such a coordinated strategy to help ensure the promise of clinical AI is not short-lived.
Keywords: algorithmovigilance, dataset shift, predictive analytics, model updating, performance drift, artificial intelligence
As artificial intelligence (AI) continues to mature towards broad implementation within clinical systems, successful integration requires comprehensive, system-based approaches.1 We have seen hundreds of predictive models developed targeting important health outcomes, yet for any number of reasons, few are deployed in clinical tools. The advent of large language models has further generated a flurry of proposed applications whose implementation realities are yet to be determined. What we know, however, is that training accurate models is not enough to ensure those models can support decision-making or improve patient outcomes. Successful clinical AI tools must ensure algorithmic fairness and develop user trust. They must provide actionable information when and how that information best supports decision-making. They must consistently deliver clinical utility. Certainly, there is excitement in integrating AI into clinical care for the benefit of patients, providers, and healthcare systems, but many challenges remain.
Coordinating the clinical, technical, ethical, and sociotechnical expertise needed to implement impactful AI-based tools is no small feat. However, even when these efforts initially succeed, such tools may face challenges in remaining effective and safe as the performance of the underlying models is disrupted over time by evolving clinical environments.2 Patient populations, environmental exposures, clinical care practices, healthcare policies, and patient preferences and care goals can all change over time. Even how we collect patient information shifts, both from a technical perspective and in terms of how we capture information within workflows. This process, referred to as dataset shift or concept drift, influences, in both predictable and unpredictable ways, how well a model trained on previous clinical encounters applies to new patients. As a result, model accuracy deteriorates, reducing utility and potentially leading to safety concerns.2–5
Responsible practice necessitates that the lifecycle of both analytic and generative AI models extend beyond development, validation, implementation, and impact assessment.1,6–9 If we dedicate the resources to integrate AI into clinical tools and ask both patients and clinicians to trust and rely on these tools, then it is incumbent upon us to ensure they consistently perform as promised. Our work cannot end when we turn a model on; rather, that is simply when we enter a new phase of ongoing monitoring and maintenance—a key component of algorithmovigilance.8
By default, model maintenance efforts have long relied on complaints from end users. Given the challenge of regaining user trust after perceived model failure and the potential impact on patient care, clinical AI may be more sustainable and successful over time if we can restore struggling models before users are affected. Recent recommendations recognize the role of model governance,3,9 and data-driven maintenance methods are expanding.9–12 In isolation, however, none of these approaches will be sufficient given the complexity of dataset shift in clinical environments. Some shifts may be intentional and announced, such as software updates or the release of new clinical guidelines. Some, maybe most, will be more nuanced or the unintended consequence of other healthcare and information system priorities. Successfully responding to these varying forces requires that algorithmovigilance programs have a suite of tools at their disposal.
In support of healthcare organizations developing model maintenance programs, we propose a 360° continuum of approaches that address model monitoring and maintenance from critically different angles (see Figure 1 and Table 1). We posit that comprehensive algorithmovigilance programs leveraging preventive, preemptive, responsive, and reactive tactics in coordination can sustain clinical AI models, minimize user disruptions, and reliably support patient care.
Figure 1.
Continuum of algorithmovigilance approaches to ongoing model monitoring and maintenance.
Table 1.
Overview, advantages, and limitations of perspectives on model monitoring and maintenance.
| | Preventive (stability-focused design) | Preemptive (technical oversight) | Responsive (data-driven oversight) | Reactive (end-user reporting) |
|---|---|---|---|---|
| Approaches | | | | |
| Benefits | | | | |
| Limitations | | | | |
We may be able to prevent some model deterioration through careful planning during development. Stability-focused feature selection and learning algorithms minimize model susceptibility to dataset shift.2 Such models are expected to be relatively consistent over time and less affected by changing clinical settings. Replacing traditional static models with online, continuous learning models, where appropriate, may also minimize the impact of some dataset shift by actively incorporating new information over time.12,13 However, no model will be robust to all the dataset shifts it may encounter, and continuous learning models must be scrutinized to ensure errant performance trends do not derail model utility.
We can preemptively surveil informatics and clinical landscapes to plan for upcoming technical changes or revisions to clinical guidance. Such technical oversight can allow teams to plan for which—of potentially many—models deployed in their organization may be impacted. These teams could preempt model failures by making backend modifications prior to system updates or initiating necessary revisions to specific models. However, ongoing scrutiny of technical and clinical landscapes is resource intensive, requiring significant expertise and situational awareness. Even when such oversight is well conducted, complex and nuanced dataset shifts may not be foreseeable and may defy preemptive measures.
We can be responsive to observed deterioration in model accuracy and impact through data-driven surveillance. Running behind the scenes, surveillance systems can actively monitor performance and impact metrics, triggering updating as needed to maintain models in response to unanticipated dataset shift.10,11,14 While not all updates can be automated and updating may not always restore acceptable performance, responsive data-driven oversight can help sustain multiple models and free up data science teams to concentrate on those models most in need of their intervention.
And of course, we must continue to react when end users notice accuracy issues or diminished utility of AI-enabled tools. User feedback may reveal changes unanticipated through technical oversight and not yet detected through data-driven monitoring. User feedback, particularly in coordination with monitoring of process metrics related to model deployments, may also reveal shifts in model utility not directly related to accuracy, such as the need to adjust prediction delivery within clinical workflows. In response, model managers can investigate, update, and even disable models as needed. To sustain user trust and promote stable use of these technologies in healthcare, reactive approaches should be reserved as the mechanism of last resort.
Using this 360° continuum of algorithmovigilance approaches as a conceptual framework may allow healthcare organizations to sustain clinical AI tools more consistently and efficiently, while also limiting the inevitable need for costly, high-resource interventions. Such efforts will require organizational commitment and the establishment of multidisciplinary teams bringing together clinicians, informaticians, data scientists, and health IT professionals.
Preventive and responsive tactics may be led by data scientists who collaborate with clinical champions to tailor model training and updating around clinical requirements. Successful preventive and responsive approaches may minimize periods of instability or inaccuracy; increase maintenance efficiency; aid in prioritizing data science and health IT workloads; and be nearly invisible to end users, helping sustain trust in AI-enabled tools.
Preemptive and reactive tactics may be led by teams of clinicians, informaticians, and health IT professionals who maintain situational awareness of changes both upstream and downstream of model implementations. Consistently scanning the landscape for upcoming changes and investigating end-user concerns may be costly in terms of human resources; however, these approaches are as critical as more automated and less resource-intensive approaches.
Research and policies are needed to develop systems encompassing these tactical perspectives. Practical recommendations for customizing strategies around local resources are also necessary to ensure the benefits of AI-enabled healthcare are available to patients whether they receive care at small community hospitals or large academic medical centers. By embracing comprehensive systems for monitoring and maintenance as a priority within our clinical AI deployments and algorithmovigilance programs, we can help ensure the opportunity and value of clinical AI are realized for patients over the long term.
Acknowledgments
Not applicable.
Contributor Information
Sharon E Davis, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States.
Peter J Embí, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States.
Michael E Matheny, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN 37232, United States; Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States; Geriatric Research, Education, and Clinical Care, Tennessee Valley Healthcare System VA Medical Center, Veterans Health Administration, Nashville, TN 37212, United States.
Author contributions
Sharon E. Davis conceptualized the manuscript and developed the initial manuscript. Peter J. Embí and Michael E. Matheny made substantial contributions to the framework and critically revised intellectual content. All authors read and approved the final manuscript.
Funding
None declared.
Conflicts of interest
All authors declare no financial or non-financial competing interests.
Data availability
No new data were generated or analyzed in support of this research.
References
- 1. De Silva D, Alahakoon D. An artificial intelligence life cycle: from conception to production. Patterns (N Y). 2022;3(6):100489.
- 2. Subbaswamy A, Saria S. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics. 2020;21(2):345-352.
- 3. Finlayson SG, Subbaswamy A, Singh K, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med. 2021;385(3):283-286.
- 4. Wong A, Cao J, Lyons PG, et al. Quantification of sepsis model alerts in 24 US hospitals before and during the COVID-19 pandemic. JAMA Netw Open. 2021;4(11):e2135286.
- 5. Van Calster B, Steyerberg EW, Wynants L, van Smeden M. There is no such thing as a validated prediction model. BMC Med. 2023;21(1):70.
- 6. Matheny ME, Thadaney Israni S, Ahmed M, Whicher D. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. National Academy of Medicine; 2019.
- 7. Davis SE, Walsh CG, Matheny ME. Open questions and research gaps for monitoring and updating AI-enabled tools in clinical settings. Front Digit Health. 2022;4:958284.
- 8. Embi PJ. Algorithmovigilance-advancing methods to analyze and monitor artificial intelligence-driven health care for effectiveness and equity. JAMA Netw Open. 2021;4(4):e214622.
- 9. Feng J, Phillips RV, Malenica I, et al. Clinical artificial intelligence quality improvement: towards continual monitoring and updating of AI algorithms in healthcare. NPJ Digit Med. 2022;5(1):66.
- 10. Guo LL, Pfohl SR, Fries J, et al. Systematic review of approaches to preserve machine learning performance in the presence of temporal dataset shift in clinical medicine. Appl Clin Inform. 2021;12(4):808-815.
- 11. Davis SE, Greevy RA Jr, Lasko TA, Walsh CG, Matheny ME. Detection of calibration drift in clinical prediction models to inform model updating. J Biomed Inform. 2020;112:103611.
- 12. Jenkins DA, Sperrin M, Martin GP, Peek N. Dynamic models to predict health outcomes: current status and methodological challenges. Diagn Progn Res. 2018;2(23):23.
- 13. Pianykh OS, Langs G, Dewey M, et al. Continuous learning AI in radiology: implementation principles and early applications. Radiology. 2020;297(1):6-14.
- 14. Davis SE, Brown JR, Dorn C, Westerman D, Solomon RJ, Matheny ME. Maintaining a national acute kidney injury risk prediction model to support local quality benchmarking. Circ Cardiovasc Qual Outcomes. 2022;15(8):e008635.