Abstract
Integrating artificial intelligence (AI) into health care offers the potential to address critical challenges related to access to care, workforce burnout, and health inequities. Despite its promise, AI adoption remains limited due to safety, efficacy, and equity concerns. This paper presents a novel and comprehensive framework for responsible AI development, evaluation, and deployment in health care, encompassing four key phases: (1) AI Solution Design and Development, (2) AI Solution Qualification, (3) AI Solution Efficacy and Safety Evaluation, and (4) AI Solution Impact. By establishing rigorous standards for operational, clinical, and technical quality, the framework aims to guide AI developers and health care professionals toward creating AI solutions that are ethical, effective, and scalable. This structured approach fosters collaboration and mitigates risks to help AI achieve its full potential in improving patient outcomes and health care efficiency.
Many barriers exist in today’s health care system to delivering high-quality, accessible, and affordable care. Such barriers include inaccessibility to quality services, inability to pay for services, missed or delayed diagnoses, and workforce burnout. Although policymakers continue to explore alternative care delivery and payment models to address these issues, many of these solutions will likely bring only incremental improvement. As such, solutions that leverage artificial intelligence (AI) are being examined for their transformative potential in the hope that they can better address some of these challenges.
Studies have shown that some AI models can perform better than clinicians at diagnosing certain diseases.1 Similarly, many AI solutions promise to reduce health care professionals’ burden by automating specific tasks.2 However, many of these solutions are not being implemented for various reasons, including limited access to high-quality and diverse data, inconsistent efficacy or validation of solutions, and lack of ethical and responsible processes. Together, these challenges limit the development of high-quality AI solutions, which can ultimately pose risks to patient safety and health equity. AI systems are also not immune to the biases present within the health care system and broader society. Such risks to health equity and patient safety have been well documented. Paul et al3 described algorithmic bias and summarized examples of AI algorithms that performed differently across race, ethnicity, gender, geographic location, and socioeconomic status and ultimately led to worse outcomes in specific subgroups. AI models are built using data that are generated, recorded, and labeled by humans, meaning they can embed and amplify existing social, technological, and clinician biases if left unchecked.4 Similarly, technological limitations, such as design flaws in medical devices, can lead to systematic underperformance in specific populations, further entrenching disparities in care.5
These layers of bias can create a cycle in which AI tools perpetuate or worsen existing health inequities, even when they appear to perform well in aggregate. For instance, models trained predominantly on data from majority populations may fail to accurately identify disease in underrepresented groups,4 as seen in dermatology AI models performing less accurately on darker skin tones.6 Additionally, algorithms that initially demonstrate fair performance in one setting may develop new biases when deployed in different clinical environments or patient populations.7 Alongside ethical and equity considerations, it is also important to examine how the design of machine learning (ML) models can encode subtle clinical biases. Harris et al8 draw attention to a critical issue: many AI models are trained to predict clinical decisions such as intensive care unit (ICU) admissions, code blue events, or specialist referrals. Even when a model is trained on demographically representative data, it can still absorb and reinforce harmful clinical biases if its prediction target reflects human behavior rather than biological reality. Clinical decisions such as ICU admissions, diagnostic testing, and prescribing patterns are shaped by more than clinical need; they reflect provider judgment, institutional practices, and broader cultural norms that may carry implicit or systematic bias. This reveals a critical limitation in model development: fairness cannot be assessed solely by looking at the inputs and outputs. It also requires scrutiny of what the model is being asked to learn and whether that reflects clinical truth or biased behavior.
ML models hold the promise of transforming health care by turning large volumes of clinical data into actionable insights but can inevitably reflect the biases embedded within the data and systems that generate them. As AI continues to be explored for its potential to improve health care delivery, it is essential to recognize that these technologies are not value-neutral and require careful review, transparent evaluation, and ongoing monitoring to mitigate bias and protect patient safety and equity.
With few defined and universally accepted standards and testing mechanisms to ensure safety, efficacy, and equity, we have created a novel and comprehensive strategy to help guide the health care audience in the responsible creation and delivery of AI solutions (see Figure 1). This strategy comprises four phases:
Figure 1.
A proposed phased approach to responsible AI development and deployment.
Phase 1: AI Solution Design and Development
Phase 2: AI Solution Qualification
Phase 3: AI Solution Efficacy and Safety Evaluation
Phase 4: AI Solution Impact
Phase 1: AI Solution Design and Development
Algorithm design and development is a multistep process; best practices and processes are well established within the discipline of data science. Each step involves thoughtful assumptions and diligent practices that, if not completed carefully, can potentially produce ineffective, unsafe, or biased algorithms that pose downstream challenges to health equity.9,10 Health care-focused AI algorithms should follow these same principles; however, nuances specific to health care require careful consideration. We provide a general summary of many of the established data science practices and provide additional context specific to health care predictive and generative AI solutions. Specifically, we discuss solution design, data procurement, data analysis and preprocessing, model selection and training, internal model validation, and model tuning. Although much of this framing relates to predictive AI solutions (algorithms that analyze and learn from historical data to predict outcomes or classify data), there is considerable overlap with algorithms leveraging generative AI (algorithms that learn patterns from underlying data to create new content).
Solution Design
As shown in Figure 2, solution design is a crucial first step in AI development, requiring more than simple pattern recognition in datasets; it must also consider clinical context. Identifying the problem involves a deep understanding of the clinical domain, data availability, and workflow complexities that could bias the data. Collaborating with domain experts and end users provides valuable insight into existing challenges and the needs of proposed solutions. Developers should partner with clinical teams to create a detailed AI solution proposal that defines the problem, anticipated impact, potential biases, workflow integration, stakeholder engagement, and plans for evaluating efficacy and key performance indicators. Insights from this design process will guide the subsequent stages of AI algorithm development.
Figure 2.
Multistep approach to AI solution design and development.
Data Procurement
High-quality data underpins the development of any high-quality model. Dataset “depth,” “breadth,” and “spread” can significantly impact model performance. “Depth” refers to the volume or size of the dataset (ie, the number of patients represented [ie, the rows]). “Breadth” refers to the dimensionality of information captured and the types of sources represented (ie, the columns), and “spread” refers to the heterogeneity and diversity, particularly of the population that is represented. The lack of any of the three can result in algorithmic bias.3 In other words, if the data used to train a model are biased, the resultant model will likely represent and further propagate those same biases.
Consider an ML algorithm that predicts hospitalized patients needing escalation of care to the ICU. If the training dataset does not represent a specific minority subgroup, the model may have lower accuracy when applied to this group. Poorer performance may delay appropriate triage of this group to the ICU and result in disproportionately worse outcomes.11,12 Or consider an alternate example of an ML algorithm developed to predict depression in a population. If depression is systematically underdiagnosed in specific populations, the data procured to represent depression within this population likely reflect this systemic misdiagnosis. An ML algorithm trained on these data to predict depression may propagate this bias and lead to inappropriate misdiagnosis in this subgroup of patients, thus further contributing to health inequity.11,13 Generative AI algorithms, such as large language models, require large volumes of textual data, and the source and quality of such data should be carefully considered, especially for health care applications.
These examples illustrate the critical role of underlying data in AI development. The first example emphasizes the need for data with sufficient depth, breadth, and spread, which are essential for creating equitable AI systems. Although finding a single dataset embodying all these attributes may be challenging, understanding the training dataset’s representativeness can help model developers anticipate potential biases and address them in later development stages. In the second example, biases inherent in society and clinical practices influence data capture, leading to potentially unfair ML algorithms, as seen in depression diagnosis. Although datasets with sufficient depth, breadth, and spread may help counteract some biases, they may also reflect systemic societal issues. Recognizing such limitations, model developers might choose alternative proxy variables that reduce the impact of biased predictors. Awareness of these potential bias sources can help enable developers to create more reliable algorithms and transparently communicate limitations to end users.
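As a concrete illustration of auditing a dataset’s “spread,” subgroup representation can be tabulated before training. The sketch below is illustrative only: the field name and the minimum-share threshold are hypothetical assumptions, and a real audit would examine many dimensions (race, ethnicity, gender, geography, payer status) rather than one.

```python
from collections import Counter

def audit_spread(records, group_key, min_share=0.10):
    """Flag subgroups whose share of the dataset falls below min_share.

    records: list of dicts, one per patient; group_key names a
    demographic field (hypothetical schema, for illustration).
    Returns a dict mapping each underrepresented group to its share.
    """
    counts = Counter(r[group_key] for r in records)
    total = sum(counts.values())
    return {g: n / total for g, n in counts.items() if n / total < min_share}

# Toy cohort in which group "B" makes up only 5% of the data.
cohort = [{"race": "A"} for _ in range(95)] + [{"race": "B"} for _ in range(5)]
print(audit_spread(cohort, "race"))  # {'B': 0.05}
```

A flagged subgroup does not by itself mean the model will be biased, but it tells developers where to concentrate later validation and bias-mitigation effort.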
Data Analysis: Exploratory Data Analysis, Data Preprocessing, and Feature Engineering
Before building an AI model, data must be thoroughly prepared through three primary phases: (1) exploratory data analysis (EDA), (2) data preprocessing, and (3) feature engineering. EDA enables the AI developer to understand data complexity and limitations using visualizations, summaries, and exploratory calculations. Next, data preprocessing transforms raw health care data into a consistent, usable format, ensuring uniformity in elements such as date formats and biometric scales. This phase includes four steps: data cleaning (addressing irrelevant and missing data), data transformation (standardizing data for analysis), data integration (combining data from different sources), and data reduction. Data reduction involves reducing dataset size while preserving data integrity.14,15 The full scope of these steps is detailed elsewhere and should be approached rigorously when developing a model.
Finally, feature engineering means turning cleaned data into meaningful variables that better capture the clinical information needed for the model. For example, body mass index (BMI) is a surrogate measure of adiposity and can be captured by creating a new variable that calculates BMI from height and weight.
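To make the BMI example concrete, the following minimal sketch derives the new variable from raw height and weight, treating missing or implausible inputs as missing data. It is a simplified stand-in for a full preprocessing pipeline, not a clinical-grade implementation.

```python
def bmi(height_m, weight_kg):
    """Engineer a BMI feature (kg/m^2) from raw height and weight.

    Returns None when either input is missing or implausible, so
    downstream preprocessing can treat the value as missing data
    rather than propagate a garbage feature into training.
    """
    if height_m is None or weight_kg is None:
        return None
    if height_m <= 0 or weight_kg <= 0:
        return None
    return round(weight_kg / height_m ** 2, 1)

print(bmi(1.80, 81.0))  # 25.0
print(bmi(None, 81.0))  # None (missing height)
```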
Data analysis is a critical and often complex step that shapes the final AI algorithm. If not handled carefully, it can introduce bias. To prevent this, developers need a thorough understanding of the dataset and its clinical context. This helps them make informed decisions during preprocessing and anticipate potential impacts on health equity.
Model Selection and Training
During model selection, developers choose the most appropriate model on the basis of prediction type, performance, complexity, and resource availability. Depending on the problem being solved, specific models may be better suited. The computational requirements of specific models may determine whether an entity can logistically support them from a cost and resource perspective. There are many more considerations not covered here.
Although methodologies often differ across AI algorithm developers and specific use cases, models typically proceed through an iterative process of model comparison, assessment, and hyperparameter tuning until a model developer ultimately chooses the optimal model for the given task.
Internal Model Evaluation: Model Validation
To assess an AI model’s performance and generalizability, it is crucial to evaluate it on an unseen portion of the initial data, known as internal model validation. In predictive AI, a portion of the dataset, called “holdout” data, is typically set aside for this. Key metrics, such as sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), accuracy, precision, and area under the receiver operating characteristic curve (AUC), measure the model’s performance. A common approach is a “random split” method, where 70% of the data are used for training and 30% are held out for validation. This step reveals any biases or issues from data preparation and training. Internal validation for generative algorithms may extend beyond traditional measures of performance and include quality, fairness, safety, and other metrics that may be automated or performed by expert human review. If performance meets expectations, the model moves to external validation; if not, methods such as model tuning may be applied.
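The 70/30 random split and the confusion-matrix metrics above can be sketched in a few lines. This is an illustrative, standard-library-only example rather than a production validation pipeline, and it assumes binary labels and predictions.

```python
import random

def random_split(rows, train_frac=0.70, seed=42):
    """Shuffle and split rows into training and holdout sets (70/30)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

def validation_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, and accuracy for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

train, holdout = random_split(range(10))
print(len(train), len(holdout))  # 7 3
print(validation_metrics([1, 1, 0, 0], [1, 0, 0, 1]))
```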
Internal Model Evaluation: Model Tuning
During algorithm tuning, hyperparameters are adjusted to improve model performance. Hyperparameters are settings that guide how a model learns and are set before training begins. These values can be refined through iteration to achieve better results. After tuning, the model should be retrained on the dataset and re-evaluated through internal validation. If the model still underperforms, developers should review the data analysis and preprocessing steps to identify and address any issues before proceeding.
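The iterate-evaluate-keep-best loop described above can be sketched as a simple grid search. For brevity, this example tunes a single decision threshold against validation data rather than true model hyperparameters, which is an acknowledged simplification; a real search would retrain the model at each hyperparameter setting but follow the same control flow.

```python
def tune_threshold(scores, labels, candidates=(0.3, 0.4, 0.5, 0.6, 0.7)):
    """Grid-search one tunable value against validation data.

    For each candidate setting, evaluate performance (here, accuracy
    of thresholded predictions) and keep the best-performing value.
    """
    best, best_acc = None, -1.0
    for t in candidates:
        preds = [1 if s >= t else 0 for s in scores]
        acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
        if acc > best_acc:
            best, best_acc = t, acc
    return best, best_acc

# Toy validation scores and true labels.
val_scores = [0.2, 0.35, 0.55, 0.8]
val_labels = [0, 0, 1, 1]
print(tune_threshold(val_scores, val_labels))  # (0.4, 1.0)
```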
Phase 2: AI Solution Qualification
Once developers create a model, they must take several steps to ensure it is safe and effective before it can impact clinical care. Solutions that fall under FDA regulation must provide the FDA with the model’s intended use, indications for use, proposed labeling, a narrative description, a physical description, substantial equivalence to an existing device (if pursuing a 510(k) submission), performance data, and any other special considerations requested by the FDA to aid in the approval process.16 However, AI algorithms that do not fall within FDA regulations do not typically undergo formalized assessments. Instead, the end user is responsible for assessing an algorithm’s safety and efficacy and establishing formalized risk management procedures that the AI solution developer must adhere to for proper use and monitoring.
This leads to significant variability in AI solution assessment, which can result in suboptimal or biased AI solutions and inefficiencies in their adoption. We advocate for a more clearly defined AI solution qualification process to assess solutions for operational, clinical, and technical (quantitative) quality as shown in Figure 3. Insights gleaned from this process should guide and inform health systems and solution developers of what further assessments might be needed to ensure the safety and efficacy of the AI model being examined.
Figure 3.
Proposed components of AI solution qualification to help articulate and clarify a solution’s safety and effectiveness.
Operational Qualification
It is well accepted that AI solutions must pass through various operational assessments to ensure compliance with privacy and security requirements. We refer to “operational quality” as an AI solution’s ability to comply with such standards, laws, and best practices. Innovators must adhere to these requirements and standards through their policies and procedures. Before deployment, AI algorithm developers must determine whether a solution requires FDA approval. For solutions that do not require FDA approval, adherence to data privacy, security, and advertising laws is still necessary. Although the FDA may not have jurisdiction over nondevice algorithm applications, following FDA guidance on Good Machine Learning Practices and postmarket management of medical devices is still highly advisable and necessary for ensuring a nonmedical device solution is safe for use in patients.17
The entities intending to implement the AI solution will conduct an operational risk assessment to ensure that the innovators creating the algorithm demonstrate operational quality and utilize sound practices that are compliant with laws, regulations, and institutional requirements. This is often referred to as a third-party risk management assessment.
Clinical Qualification
Although operational quality is critical to protecting the most fundamental risks to patient privacy and security, it is only one component of AI solution quality. “Clinical quality” describes the clinical value and risks of the AI solution as they relate to patient outcomes and proper usage. There are many lenses through which to view clinical value and risk, and approaches might vary widely. Regardless, stakeholders should take a systematic approach to both principles.
When assessing clinical quality, several elements must be comprehensively evaluated and communicated to end users in easy-to-understand language appropriate to their expertise. These include:
• Intended end users and individuals most impacted
• Intended use and workflows implemented
• Intended and anticipated clinical, financial, and operational impact on individual patients and health systems
• Information surrounding algorithm design and development
• Data surrounding the solution’s performance
• Results of external validation assessments
• Potential solution limitations, areas of bias, and risks
• Anticipatory guidance for clinicians or administrators
This information should collectively inform “clinical risk,” a concept that has not yet been formally defined. We define clinical risk as the potential for negative impact on patient and population-level health outcomes and health equity. Such risks are important to characterize and communicate so that health systems, clinicians, and patients understand solutions’ potential positive and negative impacts.
Technical Qualification
Next, we must consider the “technical quality” of an AI algorithm, which reflects technical value and technical risk. Like clinical quality, technical quality has not yet been formally defined. We propose that technical quality be demonstrated through a combination of metrics that inform an algorithm’s performance and stability over time. Some of these metrics, as they relate to predictive and/or generative AI, may include:
• Model type
• Performance metrics from internal and external validation
• Training population demographics
• Population stability
• Volatility features
• Variance in data collection
• Data latency
• Model interdependencies
• Hallucination risk
• Safety
• Consistency
Risk Mitigation Needs Assessment
Understanding these factors is crucial in determining how often a model needs reassessment, tuning, and monitoring. For example, a model predicting hypertension on the basis of “last blood pressure” might be more volatile than one using a “6-month average blood pressure,” as single readings can vary daily and contextually. This could suggest that a different variable for blood pressure might be more reliable. Additionally, if the model was trained on a population with a mean age of 40 but applied to a population with a mean age of 65, its generalizability may be uncertain. External validation studies would help assess its performance across different demographics.
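One widely used way to quantify the kind of population shift described above (eg, a training population with mean age 40 deployed on a population with mean age 65) is the population stability index (PSI), computed over binned feature distributions. The sketch below uses hypothetical age bins and the common rule-of-thumb cutoffs; treat it as an illustration rather than a prescribed monitoring standard.

```python
import math

def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions (fractions summing to 1).

    Common rule of thumb: PSI < 0.1 suggests a stable population,
    0.1-0.25 a moderate shift, and > 0.25 a major shift that
    warrants model review before continued use.
    """
    psi = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        psi += (a - e) * math.log(a / e)
    return psi

train_age_bins = [0.30, 0.40, 0.30]   # e.g., <40, 40-65, >65 at training
deploy_age_bins = [0.10, 0.30, 0.60]  # markedly older deployment population
print(round(population_stability_index(train_age_bins, deploy_age_bins), 3))  # 0.456
```

A PSI this far above 0.25 would flag the deployment population as meaningfully different from the training population, prompting external validation before relying on the model.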
The output of these processes should explain if and how an AI algorithm should be further tested and assessed. For example, reviewers may identify a need for additional clinical studies to be completed prior to deployment of the model within their health system. Or they may determine that the solution should be monitored over time across a set of clinical and quantitative metrics. Algorithm remediation or further steps to mitigate bias may also be identified.
We encourage health care end users to develop appropriate processes to help qualify solutions prior to implementation. The outcome of such a process should be used to inform a set of recommendations into algorithm evaluation and implementation, which will be discussed in Phase 3 and Phase 4 below.
Phase 3: AI Solution Efficacy and Safety Evaluation
Numerous methodologies can be utilized to assess an algorithm’s efficacy and appropriateness as shown in Figure 4. These range from pilot studies to highly intensive and rigorous evaluations, such as randomized controlled trials (RCTs). If an AI solution falls under FDA’s regulatory oversight, the evidence required to demonstrate safety and efficacy varies by the level of risk that the “device” presents. Class I devices require FDA registration but typically do not require FDA notification and review. For Class II devices, which encompass most AI solutions submitted to the FDA at this time, manufacturers must show that the device is as safe and effective as a substantially equivalent existing device that the FDA previously cleared. This can include site validation testing, clinical trials, and other validation methods.
Figure 4.
Methodologies of efficacy and appropriateness for the assessment of an AI solution.
For AI solutions that do not fall under FDA oversight, determining the appropriate level of assessment is less defined. Clinical and quantitative quality should guide how much evaluation is needed. Although not every solution may require an RCT, some end users may prefer this level of evidence for higher-risk solutions. Decisions about the level of assessment should be based on clinical and quantitative risk profiles, which still require clearer definitions. Depending on priorities and resource limitations, the level of evaluation will likely vary across institutions. Below, we discuss various types of assessment and provide general guidance on when each layer may be appropriate.
Mathematical External Validation
Previously, we discussed the internal mathematical validation process conducted by model developers of algorithms. This type of assessment is performed on a portion of data not used during the training process, referred to as the “holdout” data. Internal validation helps inform an iterative calibration, tuning, and data preprocessing process. It is also a standard practice to perform independent external mathematical validation.18 This means the algorithm is validated on an entirely different dataset, ideally by an independent party.18, 19, 20 In addition to independent party evaluation, health care institutions that are interested in using a particular algorithm may also find benefit in independently assessing model performance on their local dataset.
External validation significantly benefits health care AI algorithms, chiefly by assessing a model’s generalizability. A model trained on a limited dataset may not perform consistently across diverse populations. Similarly, a large language model trained on unstructured data from an outpatient population, for example, may not perform as well on hospitalized patients.
As discussed earlier, AI models can reflect, perpetuate, or propagate bias and can potentially negatively impact health outcomes. Systematically assessing for bias across various dimensions can help improve understanding of an algorithm’s differential impact in practice and guide steps to mitigate potential negative effects. Such systematic assessments can also help differentiate between unexpected and expected bias, improving transparency of an algorithm’s performance. For example, consider an algorithm developed on a large set of diverse data to predict beta-thalassemia, an inherited blood disorder that is more common in people of Mediterranean, Middle Eastern, and Asian descent. Because of the higher prevalence of this condition across these groups, the algorithm may preferentially learn from these patients and thus overrepresent their demographics, resulting in a performance bias that is expected given the training data and the epidemiological characteristics of the disease.
Although some differences in model performance or prediction may stem from underlying variation, such as when disease is more prevalent in certain ancestry groups, other disparities may reflect harmful bias rather than clinically appropriate variation. Distinguishing between the two is essential during external validation, as it helps determine whether subgroup differences are expected based on underlying biology or signal potential inequities in care that the model may unintentionally reinforce.21 For instance, if historical data shows that certain patient groups consistently received delayed treatment for the same condition, an algorithm trained on those patterns might learn to replicate those disparities. In such cases, the model is not identifying a clinically appropriate difference but instead mirroring biased human decision-making. Careful evaluation of subgroup performance is therefore critical to ensure that AI models do not perpetuate unjust patterns of care and that their recommendations align with equitable, evidence-based clinical standards.
Performing such subgroup analyses early in the AI development life cycle during external validation can expose potential biases, helping developers improve the algorithm and aiding clinicians in understanding its appropriate use and associated risks.
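Subgroup analysis of this kind can be sketched by computing a discrimination metric separately for each demographic group. The example below uses a rank-based AUC (equivalent to the Mann-Whitney U statistic) and a hypothetical (subgroup, score, outcome) schema; a real analysis would also report confidence intervals and calibration per subgroup.

```python
def auc(scores_pos, scores_neg):
    """Rank-based AUC: probability that a positive case outscores a
    negative case, counting ties as half a win."""
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in scores_pos for n in scores_neg
    )
    return wins / (len(scores_pos) * len(scores_neg))

def subgroup_auc(rows):
    """rows: (subgroup, model_score, outcome) triples (hypothetical schema).
    Returns AUC per subgroup so differential performance is visible."""
    out = {}
    for g in {g for g, _, _ in rows}:
        pos = [s for gg, s, y in rows if gg == g and y == 1]
        neg = [s for gg, s, y in rows if gg == g and y == 0]
        out[g] = auc(pos, neg)
    return out

rows = [
    ("A", 0.9, 1), ("A", 0.8, 1), ("A", 0.2, 0), ("A", 0.1, 0),
    ("B", 0.6, 1), ("B", 0.4, 1), ("B", 0.7, 0), ("B", 0.3, 0),
]
print(subgroup_auc(rows))  # AUC 1.0 for group A but only 0.5 for group B
```

A gap like the one in this toy output (perfect discrimination in one group, chance-level in another) is exactly the kind of finding that should trigger remediation or explicit anticipatory guidance to end users.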
However, external validation is not always routinely conducted. Privacy and security constraints limit access to health care data, making external validation challenging. Innovators may also resist sharing their intellectual property with independent parties. Finally, not all health care partners and users require external validation for these algorithms.
Clinical Studies
Clinical studies can and should further evaluate many AI algorithms. These study types may include pilot and feasibility studies, retrospective analyses, prospective studies, RCTs, and real-world evidence evaluation. These study types are extensively described elsewhere; however, it is sufficient to note that they vary in rigor and subsequent reliability of results. The type of clinical study required should be commensurate with a given AI solution’s clinical and quantitative risk level. Although retrospective analyses may be helpful in providing an early understanding of how an AI algorithm is expected to perform, RCTs will be more likely to serve as the preferred standard for clinical validation and should be performed when feasible.
When an AI model is used to predict a harmful clinical event, for example deterioration, cardiac arrest, or ICU transfer, it is often paired with an intervention aimed at preventing that outcome. If the intervention is successful, the event may no longer occur, which weakens the association between the model’s predictions and observed outcomes in retrospective data. As a result, the model may appear less effective when evaluated using historical outcomes, even though it is functioning properly and benefiting patients in real time.22 This “Early Warning Paradox” presents a unique challenge for retrospective validation efforts, which may underestimate a model’s utility or lead to incorrect conclusions about its performance. Only through prospective studies, where the model is deployed in a live setting and its influence on patient care and outcomes is directly observed, can these dynamics be properly captured.
Pre-Translation Evaluation
For an AI solution to effectively impact clinical care or administrative processes, it must integrate seamlessly into real-life clinical workflows with real end users. Successful translation requires thorough testing and evaluation, including internal and external validation, local validation, clinical studies, and trials. Although these tests are valuable, they often rely on retrospective data or limited study populations. Assessing model efficacy in live workflows—where algorithms interact with diverse users, processes, and systems like electronic health records—is essential for a comprehensive evaluation.
The Scaled Agile Framework (SAFe) is a structured software development and deployment approach that provides guidelines for quality practices and lean-agile principles for continuous delivery, inspection, and risk reduction.23 Although the entirety of the approach to software development and deployment is out of scope here, several critical components of SAFe can be helpful to clinicians, informaticists, and health care leaders as they assess AI solutions. Ideally, these groups would partner with software developers, IT specialists, and data scientists throughout this proposed modified SAFe that is specific to health care AI.
We draw from SAFe and other frameworks to outline three phases of pre-translation evaluation: (1) Pre-Deployment, (2) Passive Deployment, and (3) Active Deployment.
Pre-deployment
Pre-deployment assessments can be useful in anticipating potential challenges during solution translation. Although there are many types of pre-deployment assessments, we describe a few that might be particularly useful. These include usability studies, clinical bias assessments, and systematic workflow analyses. Although clinical bias assessment and systematic workflow analyses are not formally defined, we propose these areas be further characterized and standardized in the future.
Usability studies assess how easily users interact with a product, technology, or system. Metrics typically examined can include time taken for task completion, objective scoring measures, task success rate, error rates, user satisfaction, and other factors that approximate ease of use.24 A technology’s usability can impact adoption and translation of a solution at scale. For example, if an algorithm designed to predict decompensation to the ICU displays its output in a flowsheet that must be manually searched, the algorithm may experience limited adoption—even if it performs well and has limited bias. Further, if an AI solution has poor usability, it may be inconsistently or inappropriately used, contributing to reduced adoption, worse outcomes, or bias.
Although bias can be measured quantitatively during internal and external validation, these metrics must be interpreted within relevant clinical contexts. For example, if an algorithm predicting hypertension shows an AUC of 0.9 in nonobese males aged 40-65 but only 0.6 in obese males of the same age group, this raises important clinical considerations. Should the algorithm be used only with specific patients, or should end users be alerted to possible performance limitations in certain demographics? Such decisions require clinical judgment beyond the numbers alone. Translating these statistical findings into practical guidance helps ensure fair and effective use, providing clinicians with valuable anticipatory insights.
Systematic workflow analysis is a critical part of pre-deployment assessment. Clinical and administrative workflows vary widely across health systems, specialties, and individual users. It is important to understand how a solution will fit into these workflows and what configurations may be needed at different sites. Whenever possible, human factors specialists and implementation scientists should be involved in this process. Creating detailed workflow diagrams before deployment can help capture workflow nuances and identify potential sources of bias.
Passive Deployment
In AI, passive deployment refers to operationalizing models without requiring active user engagement: the system can be integrated into existing workflows with minimal user interaction. We propose three types of passive deployment assessment that are most relevant to clinical end users: staging, simulation, and silent deployment.
Staging refers to the practice of testing and verifying algorithm functionality before release into live workflows, or production.25 Staging environments replicate production settings, providing a realistic space to verify that a model functions as intended before deployment. This step helps prevent faulty features from reaching clinical use. Staging also allows clinicians and other health care providers to review an AI solution's outputs to ensure they are clinically appropriate and do not reflect bias or safety issues.
Simulation offers a way to test an AI model's performance, impact, and potential interactions in a controlled environment that does not affect live care. Whereas staging verifies functionality in a fixed replica of production, simulation lets developers assess AI behavior under adjustable conditions. For example, an ICU decompensation algorithm can be tested in both inpatient and outpatient contexts to see whether it falsely triggers in the outpatient setting. Simulation can also reveal performance biases across patient groups.
Silent deployment, or shadow mode, is a passive assessment method where an AI algorithm runs in a real clinical environment but remains hidden from end users. This method allows for extended monitoring of the AI’s performance, offering a valuable opportunity to evaluate adequacy prior to deployment in live workflows that impact end users and patients.
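The shadow-mode pattern can be sketched in a few lines: the model runs on live data, but its output is logged for offline review rather than surfaced to clinicians. This is a minimal, hypothetical illustration; the function names and record fields are assumptions, not part of any specific system:

```python
import json
import logging

logger = logging.getLogger("shadow_model")

def score_patient(model, patient, shadow=True):
    """Run the model on live clinical data. In shadow mode, log the output
    for silent monitoring instead of surfacing it to end users."""
    prediction = model(patient)
    record = {"patient_id": patient["id"], "prediction": prediction}
    if shadow:
        logger.info(json.dumps(record))  # captured for offline evaluation
        return None                      # nothing is shown to clinicians
    return prediction                    # active-deployment path
```

For example, calling `score_patient(risk_model, patient, shadow=True)` would accumulate a log of real-world predictions that can later be compared against observed outcomes before the algorithm is made visible.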
Active Deployment
Active deployment is defined as implementing AI algorithms in live workflows so that predicted outputs are visible to end users. Here, algorithms can be systematically assessed and optimized in real-world workflows.26 One assessment performed during this phase is A/B testing, also referred to as a soft launch. A/B testing exposes a select group of users to a newly deployed algorithm; algorithm performance, user behavior, and feedback are collected and analyzed to help optimize the AI solution. By limiting the end users and contexts, potential risks and harms are minimized and can be closely monitored. A/B testing is critical to ensuring solutions are deployed thoughtfully and carefully in the real world.27
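One common way to limit exposure during a soft launch is deterministic cohort assignment: a salted hash of the user identifier places a fixed fraction of users in the exposed arm, so each user sees a consistent experience across sessions. The following is a hypothetical sketch; the salt, exposure rate, and identifiers are illustrative:

```python
import hashlib

def ab_cohort(user_id, exposure=0.10, salt="soft-launch-v1"):
    """Deterministically place a user in the exposed cohort ('B') with
    probability `exposure`. The salted hash keeps assignment stable
    across sessions without storing per-user state."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0xFFFFFFFF  # uniform value in [0, 1]
    return "B" if bucket < exposure else "A"
```

Changing the salt re-randomizes assignments for a new experiment, while keeping the same salt guarantees a clinician stays in the same arm for the duration of the test.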
Phase 4: AI Solution Impact
Translation and Adoption
The preceding stages of algorithm design, qualification, and evaluation are essential to preparing for safe and responsible implementation in clinical practice. Next, solutions must be translated and adopted; this phase involves the components shown in Figure 5. First, algorithms undergo workflow integration: they are embedded into production environments, guided by Phase 3 findings and executed by technology experts. Each algorithm's suitability depends on its specific clinical context; for example, a chronic kidney disease prediction tool might be best used in an outpatient intake process, whereas a hemodynamic decompensation predictor might be integrated into an inpatient provider's patient list.
Figure 5.
Proposed elements to prepare for solution translation and adoption.
Next, the deployed algorithm is assessed in workflow validation to confirm its expected performance. Real-world integration may reveal challenges, such as data field mismatches—like an algorithm expecting “blood pressure” data when the system uses “latest blood pressure”—which could impact functionality.
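A simple automated check can surface data field mismatches of this kind during workflow validation, before predictions are trusted. The expected field names below are hypothetical placeholders for whatever inputs a given algorithm was trained on:

```python
# Hypothetical schema check for workflow validation: flag fields the
# algorithm expects but the live feed does not supply, and vice versa.
EXPECTED_FIELDS = {"blood_pressure", "heart_rate", "age"}

def validate_input(record):
    """Return a report of missing and unexpected fields so naming
    mismatches (eg, 'latest_blood_pressure' vs 'blood_pressure')
    surface before the algorithm's outputs are relied upon."""
    missing = EXPECTED_FIELDS - record.keys()
    unexpected = record.keys() - EXPECTED_FIELDS
    return {
        "ok": not missing,
        "missing": sorted(missing),
        "unexpected": sorted(unexpected),
    }
```

Run against a live record that uses a different field name, the report flags the mismatch rather than letting the algorithm silently receive a null input.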
Finally, successful translation is heavily dependent on effective change management. Achieving widespread adoption of a novel technology or intervention is not trivial, and there are many challenges to consider that are comprehensively described elsewhere.28,29 Change management becomes critically important in surmounting many adoption hurdles, and both solution developers and health systems must work together to provide sufficient education, transparency, governance, and incentive alignment.
Continuous Monitoring and Refinement
After an AI solution is deemed safe for clinical use, it is deployed in live workflows, where its outputs are accessible to a broader user base and can significantly influence patient outcomes. Ongoing monitoring and periodic refinement are essential for maintaining safety and efficacy. Performance metrics and clinical outcomes should be tracked both retrospectively and in real time to monitor for safety, reliability, efficacy, and equity.
We recommend tracking metrics across patient subgroups and transparently communicating results to quickly identify potential safety and fairness issues. Developers and clinicians should reassess each development and evaluation phase to ensure best practices are followed if discrepancies arise. Refinements should be made to enhance performance and reduce bias, with user feedback actively incorporated. Effective refinement relies on collaboration between AI developers and end user partners.
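Subgroup tracking of this kind can be reduced to a periodic comparison against pre-deployment baselines. The sketch below is a hypothetical illustration; the subgroup names, metric values, and tolerance are assumptions:

```python
def check_drift(baseline, recent, tolerance=0.05):
    """Flag subgroups whose recent performance metric (eg, AUC) has
    dropped more than `tolerance` below the pre-deployment baseline,
    so potential safety or fairness issues are surfaced quickly."""
    alerts = {}
    for group, base in baseline.items():
        drop = base - recent.get(group, 0.0)
        if drop > tolerance:
            alerts[group] = round(drop, 3)
    return alerts
```

An alert for a specific subgroup would then trigger the reassessment described above: revisiting each development and evaluation phase to locate the source of the discrepancy before refinement.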
Transparent AI Lifecycle Documentation
Fragmented and limited documentation can hinder the adoption of AI in clinical practice by creating transparency gaps that affect evidence sharing, bias mitigation, accountability, and explainability. In contrast, strategic documentation aligned with FDA guidelines can support Quality Management Systems (QMS) and risk-based practices, promoting communication, rigor, and ethical standards. Effective governance and transparent documentation enable health care organizations to implement AI responsibly, reduce burden, and achieve ethical AI use at scale. Using a QMS framework, teams can proactively manage risks while supporting transparent development and deployment. The following recommended strategies characterize this proactive approach:
(1) Formulating documentation delineating translational procedures and fostering a collaborative, multidisciplinary approach.
(2) Placing paramount importance on patient safety considerations and aligning AI applications with clinically appropriate use cases.
(3) Ensuring the thorough completion of imperative testing and validation processes before deployment.
(4) Crafting a meticulously detailed deployment roadmap for AI-based tools, commencing with identifying problem areas and extending to encompass comprehensive deployment and ongoing maintenance phases.
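The four strategies above can be made concrete as a structured lifecycle record that travels with each solution. The following is a hypothetical sketch of such a record; the field names and example values are illustrative, not a prescribed QMS schema:

```python
from dataclasses import dataclass, field, asdict
from datetime import date

@dataclass
class LifecycleRecord:
    """Hypothetical minimal documentation entry covering the strategies
    above: procedures, safety alignment, validation, and a roadmap."""
    solution: str
    intended_use: str
    safety_review_passed: bool
    validation_summary: str
    roadmap: list = field(default_factory=list)
    recorded_on: str = field(default_factory=lambda: date.today().isoformat())

# Illustrative entry for a hypothetical inpatient early-warning tool
record = LifecycleRecord(
    solution="icu-decompensation-v1",
    intended_use="inpatient early warning, adult ICU",
    safety_review_passed=True,
    validation_summary="passed internal and external validation",
    roadmap=["problem definition", "deployment", "maintenance"],
)
```

Serializing such records (eg, via `asdict`) gives governance teams a consistent, auditable artifact for each phase of development and deployment.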
Conclusion
This paper explores the multifaceted approach required for integrating responsible AI within health care by outlining four key phases: 1) AI Solution Design and Development, 2) AI Solution Qualification, 3) AI Solution Efficacy and Safety Evaluation, and 4) AI Solution Impact. Together, these phases offer a comprehensive roadmap for health care professionals, AI developers, and policymakers to follow, ensuring the ethical, safe, and effective use of AI technologies in health care settings. The proposed framework emphasizes both the technical and clinical aspects of AI development while underscoring the importance of ethical considerations, patient safety, and health equity.
As health care stands on the brink of an AI-driven transformation, realizing the full benefits of these technologies requires responsible development and implementation guided by principles such as those outlined here. Adopting guidelines and guardrails will mitigate AI-related risks while maximizing AI’s potential to improve patient outcomes, reduce health care costs, and elevate care quality. This journey toward responsible AI in health care is ongoing and demands continuous effort, collaboration, and stakeholder commitment. Following this structured approach, we can navigate AI integration’s complexities and move closer to its transformative potential.
Potential Competing Interest
Given her role as Guest Editor for this special issue, Dr Overgaard had no involvement in the peer review of this article and had no access to information regarding its peer review. Full responsibility for the editorial process for this article was delegated to an unaffiliated editor.
Ethics Statement
The authors declare no competing interests. This publication did not involve human participants, living animals, or biological samples. All data used were sourced from publicly available sources; therefore, no ethical approval from an Institutional Review Board (IRB) or Institutional Animal Care and Use Committee (IACUC) was required.
References
- 1. Topol E.J. High-performance medicine: the convergence of human and artificial intelligence. Nat Med. 2019;25(1):44-56. doi:10.1038/s41591-018-0300-7
- 2. Yu K.H., Beam A.L., Kohane I.S. Artificial intelligence in healthcare. Nat Biomed Eng. 2018;2(10):719-731. doi:10.1038/s41551-018-0305-z
- 3. Paul C., John H., Michael P. A proposal for developing a platform that evaluates algorithmic equity and accuracy. BMJ Health Care Inform. 2022;29(1). doi:10.1136/bmjhci-2021-100423
- 4. Saint James Aquino Y. Making decisions: bias in artificial intelligence and data-driven diagnostic tools. Aust J Gen Pract. 2023;52(7):439-442. doi:10.31128/AJGP-12-22-6630
- 5. Cross J.L., Choma M.A., Onofrey J.A. Bias in medical AI: implications for clinical decision-making. PLoS Digit Health. 2024;3(11). doi:10.1371/journal.pdig.0000651
- 6. Kamulegeya L., Bwanika J., Okello M., et al. Using artificial intelligence on dermatology conditions in Uganda: a case for diversity in training data sets for machine learning. Afr Health Sci. 2023;23(2):753-763. doi:10.4314/ahs.v23i2.86
- 7. Yang J., Soltan A.A.S., Eyre D.W., Clifton D.A. Algorithmic fairness and bias mitigation for clinical machine learning with deep reinforcement learning. Nat Mach Intell. 2023;5(8):884-894. doi:10.1038/s42256-023-00697-3
- 8. Harris S. I don't want my algorithm to die in a paper: detecting deteriorating patients early. Am J Respir Crit Care Med. 2021;204(1):4-5. doi:10.1164/rccm.202102-0459ED
- 9. Rajkomar A., Hardt M., Howell M.D., et al. Ensuring fairness in machine learning to advance health equity. Ann Intern Med. 2018;169(12):866-872. doi:10.7326/m18-1990
- 10. Amaya J., Holweg M. Using algorithms to improve knowledge work. J Oper Manag. 2024;70(3):482-513. doi:10.1002/joom.1296
- 11. Makhni S., Chin M.H., Fahrenbach J., Rojas J.C. Equity challenges for artificial intelligence algorithms in health care. Chest. 2022;161(5):1343-1346. doi:10.1016/j.chest.2022.01.009
- 12. Haixiang G., Yijing L., Shang J., et al. Learning from class-imbalanced data: review of methods and applications. Expert Syst Appl. 2017;73:220-239. doi:10.1016/j.eswa.2016.12.035
- 13. Park Y., Hu J., Singh M., et al. Comparison of methods to reduce bias from clinical prediction models of postpartum depression. JAMA Netw Open. 2021;4(4). doi:10.1001/jamanetworkopen.2021.3909
- 14. ur Rehman M.H., Liew C.S., Abbas A., et al. Big data reduction methods: a survey. Data Sci Eng. 2016;1(4):265-284. doi:10.1007/s41019-016-0022-0
- 15. What is data reduction? IBM. https://www.ibm.com/topics/data-reduction
- 16. Content of a 510(k). U.S. Food & Drug Administration. https://www.fda.gov/medical-devices/premarket-notification-510k/content-510k
- 17. Center for Devices and Radiological Health. Good machine learning practice for medical device development: guiding principles. FDA. www.fda.gov/medical-devices/software-medical-device-samd/good-machine-learning-practice-medical-dev
- 18. Ramspek C.L., Jager K.J., Dekker F.W., et al. External validation of prognostic models: what, why, how, when and where? Clin Kidney J. 2021;14(1):49-58. doi:10.1093/ckj/sfaa188
- 19. Binuya M.A.E., Engelhardt E.G., Schats W., et al. Methodological guidance for the evaluation and updating of clinical prediction models: a systematic review. BMC Med Res Methodol. 2022;22(1):316. doi:10.1186/s12874-022-01801-8
- 20. Steyerberg E.W., Harrell F.E. Prediction models need appropriate internal, internal-external, and external validation. J Clin Epidemiol. 2016;69:245-247. doi:10.1016/j.jclinepi.2015.04.005
- 21. Jain A., Brooks J.R., Alford C.C., et al. Awareness of racial and ethnic bias and potential solutions to address bias with use of health care algorithms. JAMA Health Forum. 2023;4(6). doi:10.1001/jamahealthforum.2023.1197
- 22. Logan Ellis H., Palmer E., Teo J.T., et al. The early warning paradox. NPJ Digit Med. 2025;8(1):81. doi:10.1038/s41746-024-01408-x
- 23. Marinho M.L., Camara R., Sampaio S. Toward unveiling how SAFe framework supports agile in global software development. IEEE Access. 2021;9:109671-109692. doi:10.1109/ACCESS.2021.3101963
- 24. Keenan H.L., Duke S.L., Wharrad H.J., et al. Usability: an introduction to and literature review of usability testing for educational resources in radiation oncology. Tech Innov Patient Support Radiat Oncol. 2022;24:67-72. doi:10.1016/j.tipsro.2022.09.001
- 25. van der Lans R.F. Chapter 7: Deploying data virtualization in business intelligence systems. In: van der Lans R.F., ed. Data Virtualization for Business Intelligence Systems. Morgan Kaufmann; 2012:147-176.
- 26. Davenport T., Malone K. Deployment as a critical business data science discipline. Harvard Data Sci Rev. 2021;3(1). doi:10.1162/99608f92.90814c32
- 27. Gallo A. A refresher on A/B testing. Harvard Business Review. hbr.org/2017/06/a-refresher-on-ab-testing
- 28. Varghese J. Artificial intelligence in medicine: chances and challenges for wide clinical adoption. Visc Med. 2020;36(6):443-449. doi:10.1159/000511930
- 29. Kelly C.J., Karthikesalingam A., Suleyman M., et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med. 2019;17(1):195. doi:10.1186/s12916-019-1426-2