Artificial intelligence (AI) aims to mimic human intelligence through computer programs. Machine learning (ML), especially deep learning, infers insights from complex data through mathematical modeling and offers an effective way of achieving AI; it has achieved great success in many disciplines, such as computer vision and natural language processing. Over the past decade, many ML models have also been developed with the goal of improving healthcare, such as predicting the risk of septic shock for patients in critical care [1], identifying patients who are at high risk of developing postpartum depression from their historical clinical records [2], and screening patients for SARS-CoV-2 infection according to their routine blood test results [3].
Real-world clinical trials are essential for proving that AI applications are safe, effective, and fit for use in healthcare by assessing their performance across diverse conditions and populations, ensuring regulatory compliance, and addressing ethical concerns. Despite the need for clinical trials and the promising results reported in research papers, the proportion of these models that have been implemented in real-world clinical workflows is relatively small. One of the inherent reasons is the complex interactions among multiple stakeholders in the healthcare system, including patients, providers, policy makers and insurance companies. In a recent review, Li et al. [4] identified 19 Technical/Algorithm, Stakeholder, and Society (TASS) barriers to the application of AI in healthcare and called for future endeavors to address them. In response, a growing number of efforts have focused on particular aspects of these barriers [5,6] or on exemplar implementations in different disease contexts [1,2,7], but guidelines for the holistic process of implementing AI models in clinical workflows are still sporadic. To fill this gap, in this perspective we provide a roadmap for AI model implementation in clinical workflows, which includes three main phases: pre-implementation, peri-implementation and post-implementation. We discuss the key modules in each phase and how they are interconnected to impact the overall outcome of the entire solution, with the goal of providing a comprehensive picture of the lifecycle of AI model implementation. Fig. 1 summarizes these stages and the critical components discussed in the following sections.
Fig 1. Stages across the deployment of AI models in clinical workflows.
Pre-Implementation
Pre-implementation refers to the stage when the model has been developed and has demonstrated strong promise during retrospective analysis. Before we integrate the model into the actual clinical workflow, we need to address the following items.
Model Performance.
The model’s performance needs to be extensively evaluated before it can be deployed. In a recent paper, Wong et al. [8] reported a significant drop in the performance of the sepsis risk prediction model integrated in the Epic system. Finlayson et al. [6] stated that “this was a case in which dataset shift fundamentally altered the relationship between fevers and bacterial sepsis”. Although external validation has been emphasized as an important step to ensure the generalizability of the developed model, researchers have recently argued that such external validation could be unrealistic for various reasons, including population and measurement differences, and have suggested conducting repeated local validation instead. Therefore, retrospective evaluation using local data from the site where the model will be deployed is critical. During localization, the operating characteristics and the alert threshold can be determined based on the specific use case.
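As an illustration of threshold determination during localization, the following minimal sketch computes operating characteristics on a local retrospective cohort and selects the highest alert threshold that still meets a site-specific sensitivity target. The function names, the toy scores and labels, and the 80% sensitivity target are all illustrative assumptions; none of them come from the cited studies.

```python
from typing import Dict, List, Tuple


def operating_point(scores: List[float], labels: List[int],
                    threshold: float) -> Dict[str, float]:
    """Sensitivity, specificity and PPV when an alert fires at score >= threshold."""
    tp = fp = fn = tn = 0
    for score, label in zip(scores, labels):
        alert = score >= threshold
        if alert and label == 1:
            tp += 1
        elif alert:
            fp += 1
        elif label == 1:
            fn += 1
        else:
            tn += 1
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "ppv": tp / (tp + fp) if tp + fp else 0.0,
    }


def pick_threshold(scores: List[float], labels: List[int],
                   min_sensitivity: float) -> Tuple[float, Dict[str, float]]:
    """Return the highest threshold whose sensitivity meets the local target."""
    for threshold in sorted(set(scores), reverse=True):
        point = operating_point(scores, labels, threshold)
        if point["sensitivity"] >= min_sensitivity:
            return threshold, point
    raise ValueError("no threshold reaches the requested sensitivity")


# Toy local cohort (hypothetical): model scores and observed outcomes.
local_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
local_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

threshold, point = pick_threshold(local_scores, local_labels, min_sensitivity=0.8)
print(threshold, point)
```

In practice the target would be set with the clinical team, since a higher sensitivity target generally means more alerts and a lower positive predictive value.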
Data and infrastructure.
After the model is developed and appropriately evaluated for performance and bias, we need to map out the entire data flow of the model deployment cycle and understand where the data will be fed into the model and how the model output will be presented to the end user. For example, a clinical risk prediction model can be implemented within the electronic health record (EHR) system, such as Epic, through the application programming interfaces (APIs) that the vendor provides. During this process, the model developers need to work closely with the information technology service (ITS) team to build appropriate connectors (e.g., through the Fast Healthcare Interoperability Resources, or FHIR, standard) so that the EHR data can be fed into the model and the model outputs can be transmitted back to the EHR system. We also need to consider where the model will be hosted and how frequently model inference will be needed. The costs and resources required to complete this work should be incorporated into the value assessment of the tool.
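To make the connector idea concrete, the sketch below turns a FHIR Bundle of Observation resources into the feature dictionary a model might consume. It is a minimal illustration under stated assumptions: the mapping from LOINC codes to feature names is hypothetical, the bundle is assumed to be already retrieved as parsed JSON, and timestamps are assumed to share one ISO 8601 format and time zone so that they can be compared as strings. It does not reflect any specific vendor API.

```python
from typing import Dict, Tuple

LOINC_SYSTEM = "http://loinc.org"

# Hypothetical mapping from LOINC codes to the feature names a model expects.
LOINC_TO_FEATURE = {
    "8867-4": "heart_rate",
    "8310-5": "body_temperature",
}


def features_from_bundle(bundle: dict) -> Dict[str, float]:
    """Keep the most recent numeric value for each mapped Observation code."""
    latest: Dict[str, Tuple[str, float]] = {}  # feature -> (timestamp, value)
    for entry in bundle.get("entry", []):
        resource = entry.get("resource", {})
        if resource.get("resourceType") != "Observation":
            continue  # ignore other resource types in the bundle
        value = resource.get("valueQuantity", {}).get("value")
        if value is None:
            continue  # skip observations without a numeric result
        when = resource.get("effectiveDateTime", "")
        for coding in resource.get("code", {}).get("coding", []):
            if coding.get("system") != LOINC_SYSTEM:
                continue
            feature = LOINC_TO_FEATURE.get(coding.get("code"))
            if feature is None:
                continue
            # Assumes same-format ISO 8601 timestamps, so string order is time order.
            if feature not in latest or when > latest[feature][0]:
                latest[feature] = (when, float(value))
    return {name: value for name, (_, value) in latest.items()}
```

A feature missing from the returned dictionary signals that no usable observation was found, which the calling code would need to handle explicitly (for example, by imputing or by declining to score the patient).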
Model Integration.
In addition to the technical aspects involving the model, data and infrastructure, incentives for the integration of the solution should be aligned, as the stakeholder who made the request may not be the same as those who will be responsible for acting on the results. It is imperative to understand the current-state and future-state care delivery processes, as adoption of the tool will depend on its fit into a given workflow. The five rights of clinical decision support can be used as a guide: the right person, information, time, context, and channel [9]. A user-centered design approach should be taken and an effector arm should be implemented [10]. Patient and provider input provides valuable insights into the user-friendliness, effectiveness, and overall impact of AI applications on care. At this stage it is appropriate to consider engaging the community for feedback through groups such as a patient advisory council.
Peri-Implementation
Peri-implementation refers to the stage immediately before and during the implementation of the model in the clinical workflow. During this phase, the following items are critical.
Measurement of success.
It is critical to define the measurement of success during model deployment and to ensure that the data needed to quantify this measurement are captured during implementation. Typically, such a measurement is not the model performance itself, but an outcome that follows from acting on the model’s inference. For example, Adams et al. [1] used mortality reduction to measure the effectiveness of a sepsis prediction algorithm: clinicians who act on the resulting alert (a best practice advisory, or BPA) would prescribe antibiotics earlier, which may improve patient outcomes such as mortality. In clinical operations, metrics in the electronic health record, such as Epic’s “Pajama Time,” are used to track interventions aimed at reducing physician administrative burden [11]. The measurement of success should be compared against the pre-deployment standard of care to understand the impact of the tool.
Implementation management.
The oversight of medical artificial intelligence is crucial to ensure its safety and effectiveness, not only at a centralized level, such as the US Food and Drug Administration, but also at the local level to address variations in care, patients, and system performance [12]. A clear local governance structure is needed during the model deployment process, as this will involve coordination and collaboration across multiple teams. These teams may include information technology, informatics, data science, health equity, legal, compliance, and information security. An efficient and effective communication mechanism is also required across these teams and with the leadership and end users. Well-organized documentation is needed so that problems can be resolved in a timely manner.
Silent validation and initial pilot.
Before the model is integrated in the actual clinical workflow, a silent validation and a pilot study are needed to check production data feeds and understand how such a model will impact the clinical workflow. Here “silent” validation means the end users do not have access to the model results, with the goal of recording information on the data input and the model output to ensure it is in line with the retrospective evaluation. A subsequent pilot study, typically in a smaller subset of the final intended population, allows for assessment of the education materials, communication plan, user interface, and potential effector arm.
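A silent validation can be summarized with a few simple checks on the logged records. The sketch below is one possible form, assuming each logged record holds the model inputs and the score produced in production; the record layout, the function name, and the tolerance are hypothetical choices for illustration. It reports how often inputs were missing in the production feed and whether the alert rate is in line with the rate seen in retrospective evaluation.

```python
from typing import Dict, List


def silent_validation_report(records: List[dict], threshold: float,
                             expected_alert_rate: float,
                             tolerance: float) -> Dict[str, object]:
    """Compare silent-period logs against the retrospective alert rate."""
    if not records:
        raise ValueError("no records were logged during the silent period")
    n = len(records)
    # A record counts as incomplete if any input value failed to arrive.
    incomplete = sum(
        1 for r in records if any(v is None for v in r["inputs"].values())
    )
    alerts = sum(1 for r in records if r["score"] >= threshold)
    alert_rate = alerts / n
    return {
        "n_records": n,
        "missing_input_rate": incomplete / n,
        "alert_rate": alert_rate,
        "alert_rate_in_line": abs(alert_rate - expected_alert_rate) <= tolerance,
    }


# Hypothetical silent-period log: scores are recorded but never shown to users.
silent_log = [
    {"inputs": {"heart_rate": 110.0, "wbc": 14.2}, "score": 0.9},
    {"inputs": {"heart_rate": 72.0, "wbc": None}, "score": 0.2},
    {"inputs": {"heart_rate": 80.0, "wbc": 6.1}, "score": 0.1},
    {"inputs": {"heart_rate": 98.0, "wbc": 11.8}, "score": 0.7},
]
report = silent_validation_report(silent_log, threshold=0.5,
                                  expected_alert_rate=0.4, tolerance=0.15)
print(report)
```

A high missing-input rate usually points to a problem in the production data feed, while an alert rate outside the tolerance suggests a difference between the retrospective and production populations or data definitions.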
Post-Implementation
AI model deployment is not a one-time procedure. After the model is deployed, its performance and its impact on the entire workflow need to be closely monitored. Necessary actions, such as model updating, retraining and even decommissioning, should be taken when the model’s behavior deviates from its original intention or becomes harmful to patients.
Monitoring and surveillance.
Most disease conditions evolve over time, and thus a model trained using patient data collected during a certain period may not work in the future. For example, with COVID-19, the different SARS-CoV-2 variant waves have been associated with different acute infection outcomes. Therefore, a clinical risk prediction model built during the first wave, which was associated with the most severe clinical outcomes in the acute phase for patients who were infected, may not work for later waves. In addition, public health policies and resource availability may also impact model performance. For instance, Yang et al. [3] created a COVID-19 risk prediction model using patient blood test results collected during the first wave of the pandemic in the New York City area. During that time, the resources needed for conducting the reverse transcription polymerase chain reaction (RT-PCR) test, the gold standard for confirming that a patient is infected by SARS-CoV-2, were limited. Consequently, patients could only take the test if they had relevant symptoms such as fever and cough, which led to a high positive rate (close to 50%). However, after the first wave, such resources became much more available and the policy also changed so that anyone could take the test if they wanted, which reduced the positive rate to around 2%. Yang et al. [13] found that the routine blood test profile distributions of the patients who took the RT-PCR test had changed significantly and that the model performance had decreased drastically. Therefore, the model performance needs to be closely monitored, and appropriate actions are needed when abnormal observations arise.
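The effect of a falling positive rate can be illustrated with Bayes' rule. The sketch below computes the positive predictive value (PPV) at two prevalence levels and includes a simple check that flags a change in the observed positive rate. The sensitivity and specificity of 0.9 and the 0.05 change limit are assumed values chosen only for illustration; they are not figures reported by Yang et al., and the calculation holds sensitivity and specificity fixed even though the text above notes that the input distributions also changed.

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value from Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


def rate_shifted(baseline_rate: float, current_rate: float,
                 max_abs_change: float = 0.05) -> bool:
    """Flag when the observed positive rate moves beyond an agreed limit."""
    return abs(current_rate - baseline_rate) > max_abs_change


# Assumed operating characteristics, held fixed across both periods.
first_wave = ppv(sensitivity=0.9, specificity=0.9, prevalence=0.50)
later_wave = ppv(sensitivity=0.9, specificity=0.9, prevalence=0.02)
print(round(first_wave, 2), round(later_wave, 2))  # about 0.90 versus about 0.16
print(rate_shifted(0.50, 0.02))
```

Even with unchanged sensitivity and specificity, most positive predictions become false positives once the positive rate drops to 2%, which is one reason the positive rate itself is worth tracking alongside model performance.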
Solution Performance.
After deployment, the model’s outputs will interact with clinicians’ practice, which may affect the model’s performance and create a need for further model tuning or retraining. Vaid et al. [14] systematically studied this problem in a simulation framework and found that such model adjustment could further degrade the model’s performance and lead to unintended consequences. Therefore, it is critical to carefully log all details of the model deployment process, including when the model was deployed, how it interacted with clinicians and how the model performance changed over time. Liu et al. [15] proposed a medical algorithmic audit framework to better understand the mechanisms of AI model failure and to encourage feedback among the end user, model developer and ITS team, which can better ensure a safe model deployment process.
Bias.
Evaluation of bias should be done at each phase of model deployment to ensure that the model does not introduce or perpetuate healthcare inequities. During retrospective evaluation, model developers should review the training data to ensure that the patients represented in the data match the intended target population [16]. If race or inputs from other protected classes are used as features, then the rationale for including each such input should be clearly understood and communicated. The use of surrogate variables for inputs or outcome labels should be reviewed [17]. Model performance should be measured across demographic groups retrospectively and prospectively to identify potential disparate performance across groups, which could lead to the introduction or perpetuation of bias. Lastly, the favorable outcome (e.g., resource, intervention) should be identified, and during the post-implementation period, the distribution of the favorable outcome should be measured to determine whether the model interventions are equitable or as expected. Xu et al. [18] summarized the various potential causes of biased decisions made by algorithms. To deal with this challenge, researchers have developed different checklists for potential algorithmic bias. For example, Finlayson et al. [6] developed an “AI safety checklist” to recognize and mitigate dataset shifts in AI models. Wolff et al. [19] created the Prediction model Risk Of Bias ASsessment Tool (PROBAST) for assessing the risk of bias of predictive models. These checklists and tools should be used as references for assessing the potential bias in AI algorithms.
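Measuring performance across demographic groups can start from a very small calculation. The sketch below computes sensitivity and alert rate per group at a fixed threshold and reports the largest sensitivity gap between groups. The group labels, the toy data, and the choice of sensitivity as the metric are illustrative assumptions; a real audit would follow the checklists cited above and would typically examine several metrics with confidence intervals.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[str, float, int]  # (group, model score, observed outcome)


def sensitivity_by_group(rows: List[Row], threshold: float) -> Dict[str, float]:
    """Share of true cases in each group that receive an alert."""
    positives: Dict[str, int] = defaultdict(int)
    detected: Dict[str, int] = defaultdict(int)
    for group, score, label in rows:
        if label == 1:
            positives[group] += 1
            if score >= threshold:
                detected[group] += 1
    return {g: detected[g] / positives[g] for g in positives}


def alert_rate_by_group(rows: List[Row], threshold: float) -> Dict[str, float]:
    """Share of all patients in each group that receive an alert."""
    totals: Dict[str, int] = defaultdict(int)
    alerts: Dict[str, int] = defaultdict(int)
    for group, score, _ in rows:
        totals[group] += 1
        if score >= threshold:
            alerts[group] += 1
    return {g: alerts[g] / totals[g] for g in totals}


def max_gap(metric: Dict[str, float]) -> float:
    """Largest difference in a metric between any two groups."""
    return max(metric.values()) - min(metric.values()) if metric else 0.0


# Hypothetical evaluation rows for two groups, "A" and "B".
rows = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.2, 0),
    ("B", 0.9, 1), ("B", 0.3, 1), ("B", 0.1, 0),
]
sens = sensitivity_by_group(rows, threshold=0.5)
print(sens, max_gap(sens), alert_rate_by_group(rows, threshold=0.5))
```

The alert rate per group serves as a simple proxy for the distribution of the favorable outcome when an alert is what triggers the resource or intervention.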
In summary, we have provided an overview of the lifecycle of implementing AI models in clinical workflows. Unlike existing studies that focus on model development or on a particular phase of the model implementation process, we have presented a complete picture of the aspects involved at each phase and how they are interconnected to impact the outcome of the overall solution, which aligns well with the real-world scenario of actually implementing these models. We hope our paper can provide a roadmap and trigger holistic thinking in our communities.
Funding Acknowledgement
F.W. would like to acknowledge the support from NIH awards R01MH124740, RF1AG072449, R01AG080991, R01AG080624, R01AG076448, R01AG076234, as well as NSF awards 1750326 and 2212175.
Footnotes
Competing Interests
The authors declare no competing interests for this paper.
Ethics
This paper presents perspectives and discussion and does not include any studies of any kind; thus, ethics approval is not needed.
Patient and Public Involvement statement
This study does not involve patient participants.
References
- 1. Adams Roy, Henry Katharine E., Sridharan Anirudh, Soleimani Hossein, Zhan Andong, Rawat Nishi, Johnson Lauren, Hager David N., Cosgrove Sara E., Markowski Andrew, Klein Eili Y., Chen Edward S., Saheed Mustapha O., Henley Maureen, Miranda Sheila, Houston Katrina, Linton Robert C., Ahluwalia Anushree R., Wu Albert W., Saria Suchi. “Prospective, multi-site study of patient outcomes after implementation of the TREWS machine learning-based early warning system for sepsis.” Nature Medicine 28, no. 7 (2022): 1455–1460.
- 2. Liu Yifan, Joly Rochelle, Turchioe Meghan Reading, Benda Natalie, Hermann Alison, Beecy Ashley, Pathak Jyotishman, and Zhang Yiye. “Preparing for the bedside—optimizing a postpartum depression risk prediction model for clinical implementation in a health system.” Journal of the American Medical Informatics Association (2024): ocae056.
- 3. Yang He Sarina, Hou Yu, Vasovic Ljiljana V, Steel Peter, Chadburn Amy, Racine-Brzostek Sabrina E, Velu Priya, Cushing Melissa, Loda Massimo, Kaushal Rainu, Zhao Zhen, Wang Fei. “Routine laboratory blood tests predict SARS-CoV-2 infection using machine learning.” Clinical Chemistry 66, no. 11 (2020): 1396–1404.
- 4. Li Linda T., Haley Lauren C., Boyd Alexandra K., and Bernstam Elmer V. “Technical/Algorithm, Stakeholder, and Society (TASS) Barriers to the Application of Artificial Intelligence in Medicine: A Systematic Review.” Journal of Biomedical Informatics (2023): 104531.
- 5. Reddy Sandeep, Rogers Wendy, Makinen Ville-Petteri, Coiera Enrico, Brown Pieta, Wenzel Markus, Weicken Eva et al. “Evaluation framework to guide implementation of AI systems into healthcare settings.” BMJ Health & Care Informatics 28, no. 1 (2021).
- 6. Finlayson Samuel G., Subbaswamy Adarsh, Singh Karandeep, Bowers John, Kupke Annabel, Zittrain Jonathan, Kohane Isaac S., and Saria Suchi. “The clinician and dataset shift in artificial intelligence.” New England Journal of Medicine 385, no. 3 (2021): 283–286.
- 7. Boag William, Hasan Alifia, Kim Jee Young, Revoir Mike, Nichols Marshall, Ratliff William, Gao Michael, Zilberstein Shira, Samad Zainab, Hoodbhoy Zahra, Ali Mushyada, Khan Nida Saddaf, Patel Manesh, Balu Suresh, Sendak Mark. “The algorithm journey map: a tangible approach to implementing AI solutions in healthcare.” npj Digital Medicine 7, no. 1 (2024): 87.
- 8. Wong Andrew, Otles Erkin, Donnelly John P., Krumm Andrew, McCullough Jeffrey, DeTroyer-Cooley Olivia, Pestrue Justin, Phillips Marie, Konye Judy, Penoza Carleen, Ghous Muhammad, Singh Karandeep. “External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients.” JAMA Internal Medicine 181, no. 8 (2021): 1065–1070.
- 9. Osheroff Jerome A., Teich Jonathan M., Middleton Blackford, Steen Elaine B., Wright Adam, and Detmer Don E. “A roadmap for national action on clinical decision support.” Journal of the American Medical Informatics Association 14, no. 2 (2007): 141–145.
- 10. Martinez Vanessa A., Betts Robin K., Scruth Elizabeth A., Buckley Jacqueline D., Cadiz Vilma R., Bertrand Linda D., Paulson Shirley S., Dummett Brian Alex, Abhyankar Stella S, Reyes Vivian M, Hatton Joeffrey R, Sulit Reynaldo, Liu Vincent X. “The Kaiser Permanente Northern California Advance Alert Monitor Program: An Automated Early Warning System for Adults at Risk for In-Hospital Clinical Deterioration.” The Joint Commission Journal on Quality and Patient Safety 48, no. 8 (2022): 370–375.
- 11. Arndt Brian G., Micek Mark A., Rule Adam, Shafer Christina M., Baltus Jeffrey J., and Sinsky Christine A. “Refining Vendor-Defined Measures to Accurately Quantify EHR Workload Outside Time Scheduled With Patients.” The Annals of Family Medicine 21, no. 3 (2023): 264–268.
- 12. Nicholson Price W, Sendak Mark, Balu Suresh, and Singh Karandeep. “Enabling collaborative governance of medical AI.” Nature Machine Intelligence 5, no. 8 (2023): 821–823.
- 13. Yang He S, Hou Yu, Zhang Hao, Chadburn Amy, Westblade Lars F, Fedeli Richard, Steel Peter AD, Racine-Brzostek Sabrina E, Velu Priya, Sepulveda Jorge L, Satlin Michael J, Cushing Melissa M, Kaushal Rainu, Zhao Zhen, Wang Fei. “Machine Learning Highlights Downtrending of COVID-19 Patients with a Distinct Laboratory Profile.” Health Data Science 2021 (2021): Article ID 7574903.
- 14. Vaid Akhil, Sawant Ashwin, Suarez-Farinas Mayte, Lee Juhee, Kaul Sanjeev, Kovatch Patricia, Freeman Robert, Jiang Joy, Jayaraman Pushkala, Fayad Zahi, Argulian Edgar, Lerakis Stamatios, Charney Alexander W, Wang Fei, Levin Matthew, Glicksberg Benjamin, Narula Jagat, Hofer Ira, Singh Karandeep, Nadkarni Girish N. “Implications of the Use of Artificial Intelligence Predictive Models in Health Care Settings: A Simulation Study.” Annals of Internal Medicine 176, no. 10 (2023): 1358–1369.
- 15. Liu Xiaoxuan, Glocker Ben, McCradden Melissa M., Ghassemi Marzyeh, Denniston Alastair K., and Oakden-Rayner Lauren. “The medical algorithmic audit.” The Lancet Digital Health 4, no. 5 (2022): e384–e397.
- 16. Jamali Haya, Castillo Lauren T., Morgan Chelsea Cosby, Coult Jason, Muhammad Janice L., Osobamiro Oyinkansola O., Parsons Elizabeth C., and Adamson Rosemary. “Racial disparity in oxygen saturation measurements by pulse oximetry: evidence and implications.” Annals of the American Thoracic Society 19, no. 12 (2022): 1951–1964.
- 17. Obermeyer Ziad, Powers Brian, Vogeli Christine, and Mullainathan Sendhil. “Dissecting racial bias in an algorithm used to manage the health of populations.” Science 366, no. 6464 (2019): 447–453.
- 18. Xu Jie, Xiao Yunyu, Wang Wendy Hui, Ning Yue, Shenkman Elizabeth A., Bian Jiang, and Wang Fei. “Algorithmic fairness in computational medicine.” EBioMedicine 84 (2022): 104250.
- 19. Wolff Robert F., Moons Karel GM, Riley Richard D., Whiting Penny F., Westwood Marie, Collins Gary S., Reitsma Johannes B., Kleijnen Jos, Mallett Sue, and PROBAST Group. “PROBAST: a tool to assess the risk of bias and applicability of prediction model studies.” Annals of Internal Medicine 170, no. 1 (2019): 51–58.
