Author manuscript; available in PMC: 2025 Aug 12.
Published in final edited form as: Annu Rev Biomed Data Sci. 2025 Feb 19;8(1):81–99. doi: 10.1146/annurev-biodatasci-103123-095824

Evaluation and Regulation of Artificial Intelligence Medical Devices for Clinical Decision Support

Gary E Weissman 1
PMCID: PMC12339208  NIHMSID: NIHMS2073954  PMID: 39971383

Abstract

Artificial intelligence (AI) methods were first developed nearly seven decades ago. Only in recent years have they demonstrated their potential to improve clinical care at the bedside. AI systems are now capable of interpreting, predicting, and even generating important medical information. AI medical devices share many similarities with traditional medical devices but also diverge from them in important ways. Despite widespread optimism and enthusiasm surrounding the use of such devices to improve care processes, patient outcomes, and the healthcare experience for patients, caregivers, and clinicians alike, little evidence exists so far for their effectiveness in practice. Even less is known about the safety or equity of AI medical devices. As with any new technology, this exciting time is accompanied by appropriate questions regarding whether, how much, when, and whom such AI systems really help. Different stakeholders, ranging from patients to clinicians to industry device developers, may have divergent preferences or assessments of risk and benefits, warranting an informed public discussion to guide emerging regulatory efforts. This review summarizes the rapidly evolving recent efforts and evidence related to the regulation and evaluation of AI medical devices and highlights opportunities for future work to ensure their effectiveness, safety, and equity.

Keywords: artificial intelligence, machine learning, medical devices, regulation, clinical decision support

1. INTRODUCTION

The concept of artificial intelligence (AI) was proposed during a summer conference at Dartmouth College in 1956 (1). Since then, many attempts have been made to use algorithmic or computerized methods to improve clinical decision-making at the bedside (2, 3). AI offers a promising tool kit for this purpose because clinical decisions are complex, rely on ever-growing and ever-changing data sources, and carry a high degree of uncertainty.

The confluence of several technical and policy developments has brought the promise of AI in medicine closer to reality during the past decade. First, the Health Information Technology for Economic and Clinical Health Act has fostered nearly complete adoption of electronic health records (EHRs) among hospitals in the United States (4). This transition away from paper records has yielded a trove of easily accessible and shareable data sources amenable to the development of AI tools. Standardized EHR datasets, common data models, and robust deidentification techniques (5–9) have further enabled researchers to develop and test new approaches to medical AI models.

Second, the creation and widespread availability of open-source programming languages and software have lowered barriers to access for researchers, developers, and data scientists. The Office of Data Science Strategy of the National Institutes of Health encourages researchers to operate under FAIR (findability, accessibility, interoperability, and reusability) principles (10), to which open-source software is well-aligned. R and Python have consistently ranked among the most popular programming languages for statistical computing. Newer languages like Julia continue to gain popularity. All have active user communities with a wealth of online tutorials freely available to interested learners. Such tools have nurtured opportunities for research, innovation, and collaboration by minimizing barriers to accessing state-of-the-art computational tools.

Third, the growth of learning health systems has led to nascent institutional models in which hospitals use their clinical data to inform accurate, timely, and individualized treatment recommendations (11). These efforts have yielded new oversight models that bring together the diverse stakeholders needed to develop, deploy, and oversee the use of AI medical devices (12–14). As these models mature, they will create road maps for the needed infrastructure (including data, computing, governance, ethics, and diverse technical and clinical domains) for hospitals to safely and effectively use medical AI devices in practice.

Fourth, emerging financial incentives for the development and use of AI tools in medicine have created an attractive market for these devices. In 2020, the Centers for Medicare and Medicaid Services (CMS) approved the first New Technology Add-On Payment for an AI medical device (15). Under this ruling, hospitals could receive payments from CMS for up to $1,400 each time the device is used. Since that time, many other AI medical devices have received such a designation with growing financial incentives (16). However, the rate of investment has outpaced the rate of high-quality evidence production for the clinical effectiveness and safety of such systems, including those undergoing formal regulatory review (17–19). These trends underscore the importance of systems and practices that provide robust oversight and accountability for emerging AI medical devices. As the medical profession has seen the birth of new and revolutionary technologies (X-rays, antibiotics, and vaccines) and found ways to safely incorporate them into practice, so the challenge for AI medical devices presents itself today.

Therefore, here we review several important concepts and processes necessary to ensure that AI medical devices serve their ideal purpose of promoting clinical care that is safe, effective, and equitable. First, we review the definition of a medical device and its implications for oversight and design. Second, we review current frameworks for the regulation of AI medical devices. Third, we review the current state of knowledge regarding the evaluation of AI medical devices. Finally, through this review we identify several emerging themes and questions with associated opportunities for researchers and policymakers.

2. DEFINING AN ARTIFICIAL INTELLIGENCE MEDICAL DEVICE

AI tools typically rely on advanced statistical methods to learn patterns in large datasets and produce a set of parameters, weights, or other features that define a fitted model. Such models can be encoded using widely available computer software and easily shared, updated, and distributed. According to the International Medical Device Regulators Forum (20, p. 1), “software intended to be used for one or more medical purposes that perform these purposes without being part of a hardware medical device” can be classified as software as a medical device (SaMD). Designation as an SaMD is in contrast to other types of software related to medical devices (20). For example, there may be software in a medical device, such as software used to control the pressure and flow through a mechanical ventilator. Or software may be used for the manufacture or maintenance of a medical device, such as software used to evaluate the functioning of a medication pump. Software in these two latter categories, however, is not itself a device.

2.1. US Food and Drug Administration Device Criteria

AI medical devices fall under the broader category of SaMD. AI decision support tools might reasonably be considered statistical models, bots, agents, sociotechnical systems, or nudges, among other designations, depending on their particular clinical context and use. But it is their designation as medical devices that undergirds the current policy and regulatory framework. The term device might sometimes be used informally to describe any clinical decision support system (CDSS), especially those relying on AI or machine learning (ML) technologies. But a precise definition of what constitutes an AI medical device has evolved over the past several years and concretized into specific criteria relevant for federal oversight. Most recently, the US Food and Drug Administration (FDA) released their final guidance document in September 2022 outlining the four key criteria (see the sidebar titled Defining Device Clinical Decision Support) for what distinguishes a clinical decision support device from a nondevice CDSS (21). A device designation hinges on how a CDSS acquires and displays information, the types of recommendations it provides, and how the end user—often a healthcare professional (HCP)—is meant to use those recommendations.

Criterion 1 excludes machines like electrocardiograms (ECGs), pulse oximeters, and imaging systems (e.g., magnetic resonance imaging or ultrasound), all of which acquire signals directly from patients and are therefore considered devices. Furthermore, software functions that analyze such signals from patients, even if they do not directly capture those signals, also fail to meet Criterion 1 and would still be considered devices. For example, a system that might enhance an image taken from an ultrasound machine or detect an arrhythmia from an ECG signal, even when the original signal was acquired through another device, would still be considered a device.

Criterion 2 requires that a software function is for the purpose of displaying patient medical information, in contrast to providing specific recommendations about that information. For example, a software function that displays—in contrast to a device that captures these data directly from a patient—the latest blood pressure measurement, serum creatinine value, or a radiologist’s report from a chest radiograph would fulfill this criterion. Displaying patient demographic information and clinical resources such as textbooks and peer-reviewed studies would also qualify as nondevice functions.

Criterion 3 outlines how a software function might provide decision support that does not qualify it as a device. Examples of nondevice decision support include standardized order sets, background information on a disease, drug formulary information, and reminders about preventive care. As long as the software function is used to “enhance, inform, and/or influence” a clinical decision, it meets nondevice criteria. However, if the software function is used to “substitute, replace, or direct” a clinician’s judgment, it will not fulfill this criterion and would be considered a device (21). Although the FDA guidance document does provide examples of both device-like and nondevice decision support, the specific criteria used to make this distinction are not well-established.

Criterion 4 highlights that for a software function to receive a nondevice designation, it must enable the HCP to independently review the basis for the recommendations in such a way that the HCP is not primarily relying on those recommendations for a clinical decision. Notably, the FDA guidance document specifically interprets “critical, time-sensitive” tasks to fail this criterion (21). Therefore, all software functions for such tasks, including those related to decision support for sepsis, stroke, or other time-sensitive conditions, would be considered devices.

This last criterion may be challenging to implement in practice because there are no validated instruments for assessing whether an HCP is able to independently review the basis for an AI/ML model’s recommendations. Thus, more empiric research is needed to fully operationalize this criterion for clinical contexts not mentioned in the guidance document. This is particularly important as AI/ML models, and HCPs’ ability to review the basis for their recommendations, may change over time.

Software functions must meet all four of these criteria in order to be considered a nondevice CDSS. If a CDSS fails any one of them, it is considered a medical device and likely subject to FDA review. Notably, this recent document outlines the criteria for a CDSS that is intended for use by an HCP. The FDA has previously clarified that a CDSS intended for use by non-HCPs, such as patients or caregivers, qualifies as a device regardless of the severity of the clinical state or if the end user is able to independently review the basis of the recommendations (22).
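The all-or-nothing structure of the four criteria can be made concrete with a small sketch. The code below is an illustrative simplification only, not a regulatory determination tool: the `CdsFunction` class, its field names, and the two helper functions are hypothetical names introduced here, and each Boolean stands in for a judgment that the guidance document leaves to case-by-case interpretation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CdsFunction:
    """Hypothetical summary of a software function (field names are illustrative)."""
    analyzes_patient_signal_or_image: bool  # if True, Criterion 1 is not met
    displays_medical_information: bool      # Criterion 2
    supports_rather_than_directs: bool      # Criterion 3 ("enhance, inform, and/or influence")
    basis_independently_reviewable: bool    # Criterion 4
    time_critical_task: bool                # critical, time-sensitive tasks fail Criterion 4
    intended_for_hcp: bool                  # non-HCP end users imply a device designation


def failed_criteria(f: CdsFunction) -> list[int]:
    """Return the numbers of the criteria that the function does not meet."""
    failed = []
    if f.analyzes_patient_signal_or_image:
        failed.append(1)
    if not f.displays_medical_information:
        failed.append(2)
    if not f.supports_rather_than_directs:
        failed.append(3)
    if f.time_critical_task or not f.basis_independently_reviewable:
        failed.append(4)
    return failed


def is_device(f: CdsFunction) -> bool:
    """A function is a nondevice CDSS only if it is HCP-facing and meets all four criteria."""
    if not f.intended_for_hcp:
        return True
    return len(failed_criteria(f)) > 0


# A hypothetical sepsis alert: meets Criteria 1 to 3 but supports a time-critical task.
sepsis_alert = CdsFunction(
    analyzes_patient_signal_or_image=False,
    displays_medical_information=True,
    supports_rather_than_directs=True,
    basis_independently_reviewable=True,
    time_critical_task=True,
    intended_for_hcp=True,
)
print(failed_criteria(sepsis_alert), is_device(sepsis_alert))
```

Under these assumptions, failing a single criterion is enough to produce a device designation, which is why the hypothetical sepsis alert is classified as a device even though it meets the first three criteria.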

2.2. Exclusions to Device Designation

In the United States, the 21st Century Cures Act laid out a more detailed framework for the FDA’s role in regulating AI medical devices (23). A section entitled “Clarifying Medical Software Regulation” [Section 3060(a)] amended section 520(o)(1)(A) of the Federal Food, Drug, and Cosmetic Act (FD&C Act) to make several exceptions for software that might otherwise be classified as a device and therefore subject to FDA review. Software functions that provide “administrative support of a health care facility” or those “intended for maintaining or encouraging a healthy lifestyle,” among several other categories, are considered exempt from designation as a device. Software functions intended for “population health management” are also exempted from classification as a device (23, p. 2). It is unclear if widely used commercial population health management algorithms, such as those shown to reinforce racial disparities (24), would meet this exemption criterion.

3. ARTIFICIAL INTELLIGENCE MEDICAL DEVICE REGULATION

There are several evolving layers of oversight that are intended to ensure the safety, effectiveness, and equity of AI medical devices. While AI medical devices bear some similarities to traditional medical devices, they also possess several unique characteristics that require new approaches to regulation and oversight. Thus, federal agencies, AI researchers, and hospitals have all so far proposed and, in some cases, implemented novel approaches of overseeing AI medical devices. More input from diverse stakeholders, creative solutions from policymakers, and rigorous evaluations of oversight approaches will be needed to achieve the optimal balance of safety and innovation for AI medical devices.

3.1. Federal Oversight

The US government oversees some AI medical devices through a multifaceted and evolving strategy.

3.1.1. US Food and Drug Administration.

How does the FDA have authority to determine which AI systems qualify as medical devices? The FD&C Act, passed in 1938, gave the FDA authority to regulate food, drugs, cosmetics, and some medical devices. The FDA only began to actively regulate medical devices in 1976 after growing public concern about the safety of the Dalkon Shield contraceptive device (25). At the time, there was no federal requirement for a premarket assessment of safety or efficacy of any medical device (26). Only after significant injury to patients was reported, including deaths, infections, and other complications, did Congress pass the Medical Device Amendments that gave the FDA authority to regulate all medical devices (27, 28). This legislation established a risk-based classification scheme for evaluating medical devices and created the premarket approval (PMA) and 510(k) pathways that are still used today (29).

Most medical devices are now regulated by the FDA through the Center for Devices and Radiological Health through one of three pathways (30, 31). First, new devices that are designated as high risk (Class III), such as ventilators and pacemakers, require a PMA with data indicating safety and effectiveness, often obtained through a clinical trial. Second, a manufacturer may petition the FDA to reevaluate a device’s risk category, potentially shifting the designation to moderate (Class II). In this case, a device may be classified as de novo and undergo review, and potentially clearance, through a less stringent premarket notification (PMN) pathway. Third, a moderate risk device (Class II) that is deemed “substantially equivalent” to a previously authorized and marketed medical device, known as a “predicate,” can also be evaluated through a PMN, or 510(k), pathway (30). Although the PMN pathway requires some assurance of safety and effectiveness, it does not require the same extensive clinical evidence needed for a PMA. While other pathways for medical devices to receive FDA authorization exist, they do not typically apply to modern AI medical devices (28–31).

However, in addition to requiring substantially less robust clinical evidence compared to a PMA, the PMN pathway is susceptible to the problem of “serial predicates” (30). This problem arises when a device receives clearance based on a predicate, which in turn received clearance based on another predicate, and so on, potentially spanning many generations of medical devices over several decades. For example, in a review of AI critical care medical devices by Lee et al. (18) published in 2023, among nine devices cleared through a PMN, only three had any AI-related predicates and the earliest predicate in the predicate tree stretched back to 1982. Another review of the genealogy of PMNs in the FDA AI database found one device with 15 generations of predicates and highlighted the key safety concern of “predicate creep,” or “iterative design changes [across generations] resulting in unproven devices dissimilar from original predicate” (32). Thus, the PMN pathway that requires little clinical evidence if substantial equivalence to a predicate device is established may undermine efforts to establish safety and effectiveness for modern AI medical devices.
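The notion of predicate generations can be illustrated by treating clearances as a directed graph and measuring the longest chain of predicates behind a device. The following is a minimal sketch under stated assumptions: the function name, the dictionary layout, and the device labels are all hypothetical, and the cited reviews may have counted generations differently.

```python
def predicate_generations(device, predicates):
    """Length of the longest predicate chain behind `device`.

    `predicates` maps a device label to the list of predicate labels cited in
    its clearance; a device with no listed predicates has zero generations.
    """
    memo = {}

    def depth(d, path):
        if d in memo:
            return memo[d]
        if d in path:
            raise ValueError(f"cycle detected at {d!r}")
        parents = predicates.get(d, [])
        if parents:
            result = 1 + max(depth(p, path | {d}) for p in parents)
        else:
            result = 0
        memo[d] = result
        return result

    return depth(device, frozenset())


# Hypothetical clearance records: each device cites one or more predicates.
clearances = {
    "device_D": ["device_C"],
    "device_C": ["device_B", "device_X"],
    "device_B": ["device_A"],
}
print(predicate_generations("device_D", clearances))  # longest chain is D -> C -> B -> A
```

A longer chain does not by itself show that a device is unsafe, but it marks where accumulated design changes across generations deserve closer scrutiny.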

To promote transparency in the regulation of AI medical devices, since 2020 the FDA has published a public database of all approvals and clearances for devices that rely on AI and ML tools.1 The database is updated every few months and has served as a rich data resource for public awareness and for research into AI/ML device regulation (18, 19, 32, 34). Based on the information contained in this database, most FDA-authorized AI devices are cleared through the PMN pathway.

FDA leaders have recently outlined priorities for future regulatory approaches that include life cycle management, special mechanisms for generative AI systems, and a focus on patient health outcomes (35).

3.1.2. Changing devices over time.

A key advantage of AI medical devices is that they may continue to learn from new data even after they are deployed. Although this feature is important for AI systems, it is not unique to them, and it is relevant for the oversight of non-AI medical devices, too. Since 2017, the FDA has offered guidance about how device manufacturers should deal with changes, or updates, to already authorized systems. After releasing several guidance documents and incorporating public comments, the FDA most recently released a draft guidance document in August 2024 titled “Predetermined Change Control Plans for Medical Devices” (36). This updated document outlines requirements for reasonable assurances of safety and effectiveness, and of substantial equivalence, for devices with predetermined change control plans (PCCPs). Manufacturers may submit PCCPs for devices receiving authorization through the PMA or PMN pathways, including those with de novo designation. Typically, a PCCP is expected to be submitted at the time of the submission for the medical device, and it is considered part of the PMN or PMA. The PCCP allows a device to be updated using additional data that might improve its predictive performance. However, changes such as using a new data source or predicting a new outcome are likely not appropriate for a PCCP and would instead warrant authorization as a new device.

3.1.3. Regulatory pilot program.

In 2017, the FDA developed a new regulatory framework appropriate for SaMDs because they are so different from traditional medical devices. The new Digital Health Software Precertification (Pre-Cert) Pilot Program allowed manufacturers to bring devices to market in a streamlined process after the manufacturers themselves demonstrated adherence to best practices in promoting safety, quality, and responsibility in AI device development. Over 100 companies expressed interest in the program (37). The nine manufacturers selected for the pilot precertification program included Apple (Cupertino, California), Fitbit (San Francisco, California), Johnson & Johnson (New Brunswick, New Jersey), Pear Therapeutics (Boston, Massachusetts), Phosphorus (New York, New York), Roche (Basel, Switzerland), Samsung (Seoul, South Korea), Tidepool (Palo Alto, California), and Verily (Mountain View, California). The pilot program was ultimately closed in 2022, in part because the FDA reported that it did not have the statutory authority to fully implement necessary elements of its program (38). Although the program has remained closed, newer proposals to revive firm-based regulation of AI medical devices would likely require action from Congress to provide extended statutory authority in order for such an approach to have a chance of success (39).

3.1.4. Office of the National Coordinator for Health Information Technology.

The Office of the National Coordinator for Health Information Technology (ONC) is responsible for the certification of EHR systems in the United States. The ONC released a proposal for the Health Data, Technology, and Interoperability (HTI-1) rule in April 2023. After publication and incorporation of public feedback, the ONC released its final rule in December 2023 that lays out extensive reporting requirements for predictive decision support interventions (DSIs) (40). Importantly, these requirements apply only to those DSIs embedded within certified information technology (IT), the purview of the ONC, which is most commonly the EHR. Some of the mandatory reporting elements include measures of predictive performance, a description of the approaches to ensure fairness, and details of an external validation process. These quantitative assessments and process details far exceed any other federal reporting requirements of AI medical devices. However, the reporting requirements are focused on model performance and do not explicitly require any assessments of a model’s clinical effectiveness. At this time, HTI-1 requires disclosure of these elements by the manufacturer of an ONC-certified IT system. But HTI-1 does not require public disclosure of all of these elements, potentially leaving hospital leaders, clinicians, and patients without sufficient data to inform important investment or clinical decisions.

3.1.5. Biden’s artificial intelligence announcement.

In October 2023, President Biden said that, with respect to AI, “It must be governed” (41) and released the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence (42). Although intended to address AI technologies across a broad array of industries and domains, the order had specific and sometimes uncertain implications for AI medical devices (43, 44). First, the order laid out how existing federal laws, even if not created specifically for AI, would still apply to AI systems. For example, laws related to privacy, discrimination, and unfair business practices are still relevant for protecting people against potential harms of AI medical devices. Second, the order required the establishment of standards for “trust and safety” of AI systems. This requirement would create a tracking system for medical errors committed through the use of AI systems and an “AI assurance policy” that would create an infrastructure to evaluate medical AI systems. Third, the order required the appointment of a new chief AI officer within the Department of Health and Human Services to align efforts across multiple federal agencies. Finally, the order recognized the importance of a skilled workforce to evaluate and implement these laws as they apply to AI systems in mandating the National Science Foundation to create at least four new institutes dedicated to training in AI-related skills.

There are still many open questions and uncertainties about how the federal regulatory landscape will evolve. For example, what regulatory requirements should be in place for device CDSSs that are already in widespread use but have not yet received authorization from the FDA? For those manufacturers or health systems that do not comply, how will the FDA approach enforcement? Will CMS include AI-related safety requirements in their Conditions of Participation (45)? Existing regulatory frameworks for laboratory testing provide several lessons and insights for how the federal government and local hospitals might collaborate in overseeing AI medical devices, but it would likely require expanded statutory authority to implement such a system (46, 47). Will Congress act to further empower CMS, and/or other federal agencies, to take responsibility for additional aspects of AI medical device regulation? Only through further congressional action, active public comments, and engaged discourse across stakeholders and sectors of society will the federal policy landscape realize the potential of AI medical devices.

3.2. Local Hospital Oversight

Regardless of how the federal oversight framework, currently led by the FDA and ONC, evolves, there will be a critical need for a local, hospital-level oversight infrastructure (22). There is substantial evidence that predictive AI models require ongoing validation and retraining as their performance changes over time and in new locations (48–51). Thus, if an AI medical device receives authorization from the FDA based on clinical data from one or a handful of centers and is then deployed around the country, there is no guarantee of how it will perform elsewhere or that its performance or effectiveness will remain consistent. This need for local validation and monitoring is further underscored by the limited geographic diversity in public datasets commonly used to train AI systems (52).

Thus, hospitals, clinics, and local healthcare delivery organizations will have an important role to play in the local oversight and governance of AI medical devices. Some health systems have already built sophisticated infrastructures to oversee their local AI systems (12, 14). But most of these efforts are concentrated in well-resourced, urban academic centers. Although well-suited to the task because of available infrastructure, personnel, and mission, such centers are unlikely to reflect the workflows, populations, and local clinical needs of many hospitals around the rest of the country (52). These efforts have also been accompanied by exciting organizational innovations, including creating clinical AI departments (53) and appointing chief AI officers (54).

3.3. Locally Developed Artificial Intelligence

An important area of uncertainty in the regulation of AI medical devices is how current oversight mechanisms will consider systems developed and deployed locally in a hospital. The FDA regulatory guidance is intended for industry manufacturers, while the ONC’s regulatory efforts are focused on certification of commercial EHR platforms. What about the many AI CDSSs that are developed by researchers, data scientists, informaticians, and other stakeholders for use in their own hospital? Such AI systems may not fall under the purview of any federal regulatory agency directly but still warrant review to ensure their effectiveness, safety, and equity. The local hospital oversight efforts described above are a promising first step but are likely limited to only well-resourced academic centers. Thus, new policy frameworks and oversight infrastructure will be needed to effectively balance the safety and innovation of these locally developed AI medical devices.

3.4. Generative Artificial Intelligence

Generative AI systems are those that create content (text, images, audio, and/or video) in contrast to traditional AI systems that typically provide a comparatively simpler prediction of a disease, treatment, or outcome. Such generative systems rely on stochastic sampling from probability distributions learned from massive datasets using combinations of unsupervised and supervised learning procedures. Consequently, the output of generative AI systems is often startlingly human-like, confidently composed, and seems promising to support a broad array of clinical activities and decisions (55–58). However, these systems have a tendency to confabulate (59), provide inappropriate medical advice based on a person’s race or insurance status (60, 61), and disregard prompts requiring compliance with the FDA’s medical device criteria (62). Thus, generative AI systems pose particular difficulties for regulators trying to balance innovation and safety while awaiting the growth of more mature environments, or ecosystems, with sufficient resources to operationalize mature oversight programs (63, 64).

4. ARTIFICIAL INTELLIGENCE MEDICAL DEVICE EVALUATION

Just as traditional medical devices warrant evaluation before they are used at the bedside, AI medical devices require the same (65). Many proposals from clinical and methodologic experts have outlined criteria for the reporting, evaluation, and oversight of these devices (12, 66–72). But these guidelines and frameworks remain variably adopted and adhered to. Most published evaluations of AI medical devices are in silico or are based on a model’s predicted performance in a retrospectively collected dataset. There are comparatively very few studies reporting evaluations in clinico, or how a model affects clinical decisions or patient care when deployed at the bedside (73). This gap in clinical effectiveness evaluations is due, in part, to the still nascent and evolving scientific, clinical, and regulatory knowledge about AI medical devices. Still, this gap poses current challenges for patients, clinicians, researchers, and hospital leaders who must make decisions about investment, deployment, and clinical care without substantial clinical evidence. Additionally, these stakeholders’ incentives may not align to support rigorous, randomized evaluations of AI medical devices (74). Many hospitals also lack the resources and infrastructure needed to conduct robust evaluations of AI medical devices. Thus, robust AI medical device evaluations are critically lacking. Increased understanding of the different types of AI medical device evaluations, each with a distinct role, may help inform clinical, scientific, and operational decisions.

4.1. Phases of Research

In silico evaluations of a model’s performance are essential for its comprehensive assessment. However, estimates of predictive performance should not be conflated with estimates of clinical effectiveness or safety. The framework outlined by Park et al. (75) (Figure 1) is helpful in interpreting different types of evaluations of AI medical devices aligned with more familiar evaluations of drugs and traditional medical devices.

Figure 1. Phases of research and exemplar studies at each phase in the development of drugs, traditional medical devices, and artificial intelligence (AI) medical devices. Because AI medical devices are relatively new, the clinical relevance of the distinction between phases of research may be less familiar. Figure adapted from Reference 75 (CC BY 4.0).

Each type of evaluation is important and represents a key step in the development of AI medical devices that are safe, effective, and equitable. And each step should be interpreted according to the quality of the evidence, the inherent limitations of the study design, and its consequences for patient care.

4.2. Performance Evaluation

Retrospective, in silico evaluations are essential components of any AI medical device’s comprehensive assessment. Such studies typically focus on estimating the predictive performance of an AI system or how well that system is able to guess some outcome or state compared to what was actually observed. These evaluations typically correspond to Phase 0 or Phase 1 research.

Predictive performance evaluations require investigators to choose metrics by which to assess a model (76–78). Model discrimination, or the ability of a model to distinguish between two or more outcome categories, is commonly measured with the concordance statistic (or C-statistic), sensitivity, specificity, or positive or negative predictive value. Model calibration, or the ability of a model to produce a probability of an outcome that is aligned with the observed probability in the case of a binary classifier, is commonly measured with the calibration slope, inspection of calibration curves, or integrated calibration index. There are also composite measures of model performance that simultaneously capture discrimination and calibration, such as the scaled Brier score, R² value, and the logarithmic loss. Importantly, no single measure captures all aspects of a model’s performance. Therefore, a comprehensive AI model performance evaluation will report multiple measures.
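As an illustration of the measures named above, the following minimal Python sketch computes a C-statistic, a Brier score with a scaled version, and a coarse binned calibration table for a binary outcome. The function names, the equal-width binning, the choice of the prevalence-only model as the scaling reference, and the toy data are assumptions made for illustration and are not taken from the cited references. The calibration slope and the integrated calibration index require model fitting or smoothing and are not shown.

```python
from itertools import product


def c_statistic(y_true, y_prob):
    """Share of (event, non-event) pairs in which the event received the
    higher predicted risk; tied predictions count as one half."""
    events = [p for y, p in zip(y_true, y_prob) if y == 1]
    non_events = [p for y, p in zip(y_true, y_prob) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    score = 0.0
    for p_event, p_non_event in product(events, non_events):
        if p_event > p_non_event:
            score += 1.0
        elif p_event == p_non_event:
            score += 0.5
    return score / (len(events) * len(non_events))


def brier_score(y_true, y_prob):
    """Mean squared difference between predicted risk and observed outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def scaled_brier(y_true, y_prob):
    """1 - Brier / reference, where the reference (an assumption here) is the
    Brier score of predicting the observed prevalence for every patient."""
    prevalence = sum(y_true) / len(y_true)
    reference = prevalence * (1 - prevalence)
    return 1.0 - brier_score(y_true, y_prob) / reference


def calibration_bins(y_true, y_prob, n_bins=2):
    """Return (mean predicted risk, observed event rate, n) for each
    non-empty equal-width bin of predicted risk."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, y_prob):
        index = min(int(p * n_bins), n_bins - 1)
        bins[index].append((y, p))
    rows = []
    for members in bins:
        if members:
            n = len(members)
            mean_predicted = sum(p for _, p in members) / n
            observed_rate = sum(y for y, _ in members) / n
            rows.append((mean_predicted, observed_rate, n))
    return rows


# Toy data: eight hypothetical patients with observed outcomes and predicted risks.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("C-statistic:", c_statistic(y_true, y_prob))
print("Brier score:", round(brier_score(y_true, y_prob), 3))
print("Scaled Brier:", round(scaled_brier(y_true, y_prob), 3))
for mean_predicted, observed_rate, n in calibration_bins(y_true, y_prob):
    print(f"predicted {mean_predicted:.2f}  observed {observed_rate:.2f}  n={n}")
```

In this toy example the model looks well calibrated within both bins yet ranks one quarter of the event and non-event pairs incorrectly, which is one way to see why no single measure is sufficient.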

None of these measures, however, directly addresses a model’s clinical impact. Several measures have been developed to translate a model’s predictive performance into more clinically relevant terms. These include the number needed to evaluate, also known as the workup to detection ratio (79), the net benefit (80), and formal decision analysis (when the effects of interventions deployed in response to an AI tool’s suggestions are known) (81). The use of those more clinical measures can extend the insight gained through an in silico evaluation. But they should be interpreted as tentative, or back-of-the-napkin, rather than definitive estimates of clinical impact (82).
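A minimal Python sketch of two of these measures follows, assuming a binary outcome and a single alert threshold on predicted risk. The number needed to evaluate is computed as alerts per true positive (the reciprocal of the positive predictive value), and net benefit uses the standard decision-curve form, true positives per patient minus false positives per patient weighted by the odds of the threshold probability. Function names and toy data are illustrative assumptions.

```python
def alert_counts(y_true, y_prob, threshold):
    """Count true-positive and false-positive alerts at a risk threshold."""
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    return tp, fp


def number_needed_to_evaluate(y_true, y_prob, threshold):
    """Alerts that must be worked up to find one true event (1 / PPV)."""
    tp, fp = alert_counts(y_true, y_prob, threshold)
    if tp == 0:
        return float("inf")
    return (tp + fp) / tp


def net_benefit(y_true, y_prob, threshold):
    """TP/n - FP/n * pt / (1 - pt), where pt is the threshold probability."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    tp, fp = alert_counts(y_true, y_prob, threshold)
    n = len(y_true)
    return tp / n - (fp / n) * threshold / (1 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Net benefit of acting on every patient, a usual comparison strategy."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Toy data: eight hypothetical patients with observed outcomes and predicted risks.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]

print("NNE at 0.5:", round(number_needed_to_evaluate(y_true, y_prob, 0.5), 2))
print("Net benefit at 0.5:", net_benefit(y_true, y_prob, 0.5))
print("Treat-all net benefit at 0.5:", net_benefit_treat_all(y_true, 0.5))
```

A model adds value at a given threshold only if its net benefit exceeds that of acting on everyone and that of acting on no one (which is zero by definition).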

Retrospective performance evaluations far outnumber prospective evaluations. For example, Fleuren et al. (83) reviewed 172 published manuscripts of ML models for use in critical care and found that 160 (93%) were still in the prototyping and development phase. There are several reasons why such retrospective performance evaluations are likely the most common type of AI medical device study. First, they are the least costly and time-consuming to conduct. These in silico studies typically rely on the same data on which a model is trained. These studies do not require human factors inquiries, EHR integration, clinician or patient engagement efforts, coordination with hospital operations leaders, or federal regulatory review. Second, although adherence remains variable (84), the reporting standards for studies of predictive performance are reasonably well established, with consensus on core elements of appropriate methods, outcomes, and study designs (71, 85–87).

The key threats to assessing an AI medical device’s predictive performance in a retrospective evaluation are overfitting and drift. Overfitting occurs when an AI model captures not just the signal but also the noise in a dataset during training (88). Overfitting leads to overly optimistic performance assessments if a model is evaluated only on that training data. Thus, other approaches to validation, such as random splitting, cross validation, or bootstrap correction, can provide more realistic, less optimistically biased estimates of a model’s performance. Drift occurs when an AI model’s performance degrades over time or space due to changes in the input and/or output variables (49). Thus, temporal, geographic, or other types of external validation are essential for assessing how an AI medical device may perform in a population distinct from that used for its development. However, external validation is not always necessary. If an AI tool is intended for local use only, then a targeted validation in the population where it was trained may be sufficient (89). The development of national assurance networks focused on independent testing could fill some gaps in needed AI performance evaluations (90).
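The idea behind bootstrap correction can be sketched in a few lines of Python. The example below is one common form of optimism correction, shown under stated assumptions: the "model" is a deliberately simple lookup of the observed event rate in each patient group, the performance measure is the Brier score (lower is better), and all names and data are hypothetical. In each bootstrap replicate the model is refit on a resampled dataset and evaluated both on that resample and on the original data; the average difference estimates how much the apparent performance flatters the model.

```python
import random


def fit_group_rates(groups, outcomes):
    """Toy 'training': the observed event rate within each group, plus the
    overall rate as a fallback for groups unseen during fitting."""
    totals = {}
    for g, y in zip(groups, outcomes):
        n, events = totals.get(g, (0, 0))
        totals[g] = (n + 1, events + y)
    rates = {g: events / n for g, (n, events) in totals.items()}
    overall = sum(outcomes) / len(outcomes)
    return rates, overall


def brier(model, groups, outcomes):
    """Brier score of a fitted toy model on the given data."""
    rates, overall = model
    errors = [(rates.get(g, overall) - y) ** 2 for g, y in zip(groups, outcomes)]
    return sum(errors) / len(errors)


def optimism_corrected_brier(groups, outcomes, n_boot=200, seed=0):
    """Return (apparent, optimism-corrected) Brier scores."""
    rng = random.Random(seed)
    n = len(outcomes)
    apparent = brier(fit_group_rates(groups, outcomes), groups, outcomes)
    total_optimism = 0.0
    for _ in range(n_boot):
        index = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        boot_groups = [groups[i] for i in index]
        boot_outcomes = [outcomes[i] for i in index]
        model = fit_group_rates(boot_groups, boot_outcomes)
        # Optimism: error on the original data minus error on the data
        # the replicate model was fit to.
        total_optimism += (brier(model, groups, outcomes)
                           - brier(model, boot_groups, boot_outcomes))
    return apparent, apparent + total_optimism / n_boot


# Toy data: 20 hypothetical patients in two groups (2/10 and 7/10 events).
groups = ["A"] * 10 + ["B"] * 10
outcomes = [1, 1] + [0] * 8 + [1] * 7 + [0] * 3

apparent, corrected = optimism_corrected_brier(groups, outcomes)
print("Apparent Brier score:", round(apparent, 3))
print("Optimism-corrected Brier score:", round(corrected, 3))
```

Bootstrap correction addresses overfitting only. It does not address drift, which requires evaluation on data from a different time period or setting.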

Strong predictive performance in a retrospective evaluation is a first step in assembling a record of evidence for how an AI medical device might perform when deployed in clinical practice. However, a predictive performance evaluation is necessary but not sufficient for a comprehensive assessment of an AI medical device.

4.3. Clinical Effectiveness Evaluation

Prospective, in clinico evaluations measure the effects of an AI medical device on care processes and patient outcomes when deployed in practice. While comparatively few such studies have been published, they are essential in understanding a device’s impact in the real world. These evaluations typically correspond to Phase 2 or Phase 3 research.

There are several ways in which a prospective clinical effectiveness evaluation of an AI device could differ significantly from a retrospective performance evaluation. First, the data sources may change dramatically. Pulling retrospective data and pulling prospective data may require different extraction pipelines, exhibit different missingness patterns, and be represented using different fields or ontologies. Second, the end user may respond to a model’s predictions, whether they are correct or not, in unpredictable ways. Thus, rigorous assessments of human factors and consideration of a device’s deployment as a sociotechnical system are likely to improve the chances of success (91–93). The actual effect of an AI medical device on clinical practice cannot be known until it is deployed and evaluated.

There are few randomized clinical trials of AI medical devices relative to the number of published retrospective performance evaluations (94, 95). For example, the clinical trial reported by Abràmoff et al. (96) of an AI system to detect diabetic retinopathy was the first preregistered clinical trial of a diagnostic AI medical device and led to the first FDA approval for such a system (97). Among the most rigorous clinical evaluations of an AI medical device is the randomized clinical trial by Wijnberge et al. (98). This study showed how an AI-based prediction model led to improvements in care processes and patient outcomes by detecting hypotension during noncardiac surgery and making specific recommendations for interventions. This study was described as akin to a “positive phase 2b drug trial” (99, p. 1044). Even among those AI medical devices that have already undergone review by the FDA and received an approval or clearance, there are very few reported peer-reviewed investigations or reports of clinical effectiveness (18, 19).

4.4. Safety

There is a concerning lack of data about how AI medical devices may cause harms. Even AI systems that are intended to allocate resources, rather than make care decisions directly, may harm patients. For example, a prediction model that identifies clinical deterioration among hospitalized patients is intended to prompt an emergent evaluation. In doing so, the alert focuses resources on some patients while reducing attention paid to other patients, who are then at increased risk of a deterioration event (100). Thus, the evaluation of safety of an AI medical device, regardless of its intended end user, is a critical but often missing component of AI medical device testing. Importantly, although retrospective evaluations could identify potential safety issues related to errors in discrimination or calibration of suggested treatments, they are insufficient to fully evaluate the safety of using an AI medical device.

A key concern related to the safety of AI systems is automation bias, sometimes called over-reliance (101, 102). Automation bias occurs when a clinician cedes cognitive authority to an AI system and heeds the advice of that system without adequately reviewing the appropriateness of that advice. Although having a human in the loop is a practice hypothesized to ensure the safety of AI medical devices (103, 104), the presence of automation bias undermines that as a sole strategy with growing evidence that humans may overrely on the suggestions of an AI system even when they are incorrect (105, 106). Further evidence suggests that simply explaining the predictions of an AI system, in an attempt to overcome the black box, is not sufficient to fully ameliorate automation bias (107, 108). This phenomenon is especially important in clinical settings like the intensive care unit because automation bias increases with the complexity of the decision (109) and is especially risky in time-critical environments (110).

On the other hand, algorithm aversion bias refers to the tendency to ignore an AI system even when its suggestions are correct (111). This phenomenon risks clinicians overlooking the potential benefit of AI systems when they do have insights to offer. Thus, there is a significant knowledge gap about how end users will interact with AI medical devices broadly and across varied clinical contexts. More research is needed to understand these dynamics and how their effects might be modified by clinician, context, and task-specific characteristics. More work is needed prior to AI device deployment to understand if and to what extent these interaction patterns might be present and to test optimal approaches to mitigation.

4.5. Equity

A critical component of any AI medical device evaluation is that of its effect on health equity. Many groups of people already experience significant and well-documented disparities in their medical care and outcomes in a broad range of health contexts. These disparities are apparent across sex, gender, race, ethnicity, sexual orientation, rurality, preferred language, and other characteristics (112–117). Many of these disparities are explained by large-scale, structural differences in resource allocation, historical deprivation, and policy, and/or by individual-level biases of decision-makers. AI medical devices are then trained on large datasets that are the result of social patterning in the underlying data generating process. Thus, AI medical devices simultaneously offer tremendous promise to ameliorate bias and inequity in healthcare while also carrying commensurate risk to reinforce and even exacerbate these inequities. Even clinical algorithms, which may or may not qualify as AI medical devices, may have a substantial impact on racial and ethnic disparities (24, 118–121).

Existing federal regulatory guidelines are not sufficient to ensure equitable AI medical device development (18, 21). More recent scientific reporting guidelines have emphasized the need for fairness evaluations in the reporting of AI prediction models (71). Local hospital oversight mechanisms will likely be necessary to verify fairness in AI medical device development and deployment (12). Additionally, the adoption of research practices that promote equity, inclusion, and access will foster innovative environments that are more likely to lead to the development of equitable AI medical devices (122).

DEFINING DEVICE CLINICAL DECISION SUPPORT.

In September 2022, the FDA released a final guidance document (21) outlining four criteria to distinguish device from nondevice CDSSs. That is, software may be considered not a device if it is:

Criterion 1

Not intended to acquire, process, or analyze a medical image or signal from an in vitro diagnostic device or a pattern or signal from a signal acquisition system.

Criterion 2

Intended for the purpose of displaying, analyzing, or printing medical information about a patient or other medical information.

Criterion 3

Intended for the purpose of supporting or providing recommendations to an HCP about prevention, diagnosis, or treatment of a disease or condition.

Criterion 4

Intended for the purpose of enabling an HCP to independently review the basis for the recommendations that such software presents so that it is not the intent that the HCP rely primarily on any such recommendations to make a clinical diagnosis or treatment decision regarding an individual patient.
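The four criteria above read as a conjunction, and that reading can be written as a short checklist. The Python sketch below is a simplified illustration that assumes each criterion can be reduced to a yes/no answer about the software's intended use. The class, field names, and example tools are hypothetical, and the sketch is not a substitute for the guidance text (21).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftwareFunction:
    # Hypothetical yes/no answers about the software's intended use.
    analyzes_image_or_signal: bool           # True means Criterion 1 is not met
    displays_or_analyzes_medical_info: bool  # Criterion 2
    supports_hcp_recommendations: bool       # Criterion 3
    hcp_can_review_basis: bool               # Criterion 4


def unmet_criteria(fn):
    """Return the numbers of the criteria that the software does not meet."""
    checks = {
        1: not fn.analyzes_image_or_signal,
        2: fn.displays_or_analyzes_medical_info,
        3: fn.supports_hcp_recommendations,
        4: fn.hcp_can_review_basis,
    }
    return [number for number, met in checks.items() if not met]


def may_be_non_device(fn):
    """True only when all four criteria are met."""
    return not unmet_criteria(fn)


# Hypothetical example 1: a guideline-based reminder whose logic is shown to the clinician.
reminder = SoftwareFunction(False, True, True, True)
# Hypothetical example 2: an opaque model that reads a physiologic waveform.
waveform_model = SoftwareFunction(True, True, True, False)

print(may_be_non_device(reminder), unmet_criteria(reminder))
print(may_be_non_device(waveform_model), unmet_criteria(waveform_model))
```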

SUMMARY POINTS.

  1. Medical devices relying on artificial intelligence (AI) technologies share both similarities and differences with traditional medical devices.

  2. AI medical devices show significant promise for improving care processes, clinical workflows, and patient outcomes.

  3. AI medical devices could undermine patient safety by recapitulating biased care, producing unsound clinical suggestions, or not having their output adequately reviewed by bedside clinicians.

  4. There is currently little high-quality evidence about the effectiveness, safety, or equity of using AI medical devices in clinical practice.

FUTURE ISSUES.

  1. New regulatory and oversight approaches are needed that account for the unique features of AI medical devices.

  2. Innovative federal, state, city, and hospital policies can all contribute to ensuring the safety, effectiveness, and equity of AI medical devices.

  3. Increased funding, training, infrastructure, and incentives are all needed to develop a rigorous body of evidence for the use of AI medical devices in clinical practice.

  4. More prospective research is needed to better identify safe, effective, and equitable approaches for integrating AI medical devices into clinical practice.

ACKNOWLEDGMENTS

The author was supported by grants from the National Institutes of Health (R01HL162354, R35GM155262, and R03HL171424), the Advanced Research Projects Agency for Health (D24AC00253), and the National Academy of Medicine (SCON-10001137), with additional support from the Gordon and Betty Moore Foundation and the John A. Hartford Foundation, during the preparation of this manuscript.

Footnotes

DISCLOSURE STATEMENT

The author is not aware of any affiliations, memberships, funding, or financial holdings that might be perceived as affecting the objectivity of this review.

1

As of September 1, 2024, 924 out of 950 devices (97%) in this database received clearance through a PMN (33).

LITERATURE CITED

  • 1.Moor J. 2006. The Dartmouth College Artificial Intelligence Conference: the next fifty years. AI Mag. 27(4):87–91 [Google Scholar]
  • 2.Warner HR, Toronto AF, Veasey LG, Stephenson R. 1961. A mathematical approach to medical diagnosis. Application to congenital heart disease. JAMA 177:177–83 [DOI] [PubMed] [Google Scholar]
  • 3.Shortliffe EH, Davis R, Axline SG, Buchanan BG, Green C, Cohen SN. 1975. Computer-based consultations in clinical therapeutics: explanation and rule acquisition capabilities of the MYCIN system. Comput. Biomed. Res. 8(4):303–20 [DOI] [PubMed] [Google Scholar]
  • 4.Adler-Milstein J, Jha AK. 2017. HITECH Act drove large gains in hospital electronic health record adoption. Health Aff. 36(8):1416–22 [DOI] [PubMed] [Google Scholar]
  • 5.Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, et al. 2023. MIMIC-IV, a freely accessible electronic health record dataset. Sci. Data 10(1):1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. 2018. The eICU collaborative research database, a freely available multi-center database for critical care research. Sci. Data 5(1):180178. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Lee HC, Park Y, Yoon SB, Yang SM, Park D, Jung CW. 2022. VitalDB, a high-fidelity multi-parameter vital signs database in surgical patients. Sci. Data 9(1):279. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Hripcsak G, Levine ME, Shang N, Ryan PB. 2018. Effect of vocabulary mapping for conditions on phenotype cohorts. J. Am. Med. Inform. Assoc. 25(12):1618–25 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Schneider ALC, Ginestra JC, Kerlin MP, Shashaty MGS, Miano TA, et al. 2024. The Complete Inpatient Record Using Comprehensive Electronic Data (CIRCE) project: a team-based approach to clinically validated, research-ready electronic health record data. Learn. Health Syst. 2024:e10439. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Wilkinson MD, Dumontier M, Aalbersberg IJJ, Appleton G, Axton M, et al. 2016. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 3(1):160018. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Friedman CP, Wong AK, Blumenthal D. 2010. Achieving a nationwide learning health system. Sci. Transl. Med. 2(57):57cm29. [DOI] [PubMed] [Google Scholar]
  • 12.Bedoya AD, Economou-Zavlanos NJ, Goldstein BA, Young A, Jelovsek JE, et al. 2022. A framework for the oversight and local deployment of safe and high-quality prediction models. J. Am. Med. Inform. Assoc. 29(9):1631–36 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Reddy S, Allan S, Coghlan S, Cooper P. 2019. A governance model for the application of AI in health care. J. Am. Med. Inform. Assoc. 27(3):491–97 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Nong P, Hamasha R, Singh K, Adler-Milstein J, Platt J. 2024. How academic medical centers govern AI prediction tools in the context of uncertainty and evolving regulation. NEJM AI 1(3):AIp2300048 [Google Scholar]
  • 15.Martin-Carreras T. 2020. CMS’ new technology add-on payment ruling. American College of Radiology. https://www.acr.org/Member-Resources/rfs/Resident-and-Fellow-News/November-2020/CMS-New-Technology-Add-On-Payment-Ruling
  • 16.Wu K, Wu E, Theodorou B, Liang W, Mack C, et al. 2023. Characterizing the clinical adoption of medical AI devices through U.S. insurance claims. NEJM AI 1(1):AIoa2300030 [Google Scholar]
  • 17.Cristea IA, Cahan EM, Ioannidis JP. 2019. Stealth research: lack of peer-reviewed evidence from healthcare unicorns. Eur. J. Clin. Investig. 49(4):e13072. [DOI] [PubMed] [Google Scholar]
  • 18.Lee JT, Moffett AT, Maliha G, Faraji Z, Kanter GP, Weissman GE. 2023. Analysis of devices authorized by the FDA for clinical decision support in critical care. JAMA Intern. Med. 183:1399–401 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Chouffani El Fassi S, Abdullah A, Fang Y, Natarajan S, Masroor AB, et al. 2024. Not all AI health tools with regulatory authorization are clinically validated. Nat. Med. 30(10):2718–20 [DOI] [PubMed] [Google Scholar]
  • 20.US Food Drug Adm. 2018. Software as a medical device (SaMD). US Food & Drug Administration. https://www.fda.gov/medical-devices/digital-health-center-excellence/software-medical-device-samd [Google Scholar]
  • 21.US Food Drug Adm. 2022. Clinical decision support software. Guid. Doc. FDA-2017-D-6569, US Food Drug Adm., Rockville, MD. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/clinical-decision-support-software [Google Scholar]
  • 22.Weissman GE. 2020. FDA regulation of predictive clinical decision-support tools: What does it mean for hospitals? J. Hosp. Med. 16(4):244–46 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.US Food Drug Adm. 2019. Changes to existing medical software policies resulting from Section 3060 of the 21st Century Cures Act. Guid. Doc. FDA-2017-D-6294. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/changes-existing-medical-software-policies-resulting-section-3060-21st-century-cures-act
  • 24.Obermeyer Z, Powers B, Vogeli C, Mullainathan S. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366(6464):447–53 [DOI] [PubMed] [Google Scholar]
  • 25.Pisac A, Wilson N. 2021. FDA device oversight from 1906 to the present. AMA J. Ethics 23(9):E712–20 [DOI] [PubMed] [Google Scholar]
  • 26.Rome BN, Kramer DB, Kesselheim AS. 2014. Approval of high-risk medical devices in the US: implications for clinical cardiology. Curr. Cardiol. Rep. 16(6):489. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Curfman GD, Morrissey S, Drazen JM. 2008. A pivotal medical-device case. New Engl. J. Med. 358(1):76–77 [DOI] [PubMed] [Google Scholar]
  • 28.Inst. Med. 2011. Medical Devices and the Public’s Health: The FDA 510(k) Clearance Process at 35 Years. Washington, DC: Nat. Acad. Press [Google Scholar]
  • 29.Darrow JJ, Avorn J, Kesselheim AS. 2021. FDA regulation and approval of medical devices: 1976–2020. JAMA 326(5):420–32 [DOI] [PubMed] [Google Scholar]
  • 30.Van Norman GA. 2016. Drugs, devices, and the FDA: Part 2: an overview of approval processes: FDA approval of medical devices. JACC: Basic Transl. Sci. 1(4):277–87 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Harvey HB, Gowda V. 2020. How the FDA regulates AI. Acad. Radiol. 27(1):58–61 [DOI] [PubMed] [Google Scholar]
  • 32.Muehlematter UJ, Bluethgen C, Vokinger KN. 2023. FDA-cleared artificial intelligence and machine learning-based medical devices and their 510(k) predicate networks. Lancet Digit. Health 5(9):e618–26 [DOI] [PubMed] [Google Scholar]
  • 33.US Food Drug Adm. 2024. Artificial intelligence and machine learning (AI/ML)-enabled devices. US Food & Drug Administration. https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-and-machine-learning-aiml-enabled-medical-devices [Google Scholar]
  • 34.Wu E, Wu K, Daneshjou R, Ouyang D, Ho DE, Zou J. 2021. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat. Med. 27(4):582–84 [DOI] [PubMed] [Google Scholar]
  • 35.Warraich HJ, Tazbaz T, Califf RM. 2024. FDA perspective on the regulation of artificial intelligence in health care and biomedicine. JAMA. 10.1001/jama.2024.21451 [DOI] [PubMed] [Google Scholar]
  • 36.US Food Drug Adm. 2024. Predetermined change control plans for medical devices. Draft Guid. Doc. FDA-2024-D-2338, US Food Drug Adm. Rockville, MD. https://www.fda.gov/regulatory-information/search-fda-guidance-documents/predetermined-change-control-plans-medical-devices [Google Scholar]
  • 37.US Food Drug Adm. 2017. FDA selects participants for new digital health software precertification pilot program. Press Release, Sept. 26. https://www.fda.gov/news-events/press-announcements/fda-selects-participants-new-digital-health-software-precertification-pilot-program
  • 38.US Food Drug Adm. 2022. The software precertification (pre-cert) pilot program: tailored total product lifecycle approaches and key findings. Rep., US Food Drug Adm., Rockville, MD. https://www.fda.gov/media/161815/download?attachment [Google Scholar]
  • 39.Gottlieb S. 2024. Congress must update FDA regulations for medical AI. JAMA Health Forum 5(7):e242691. [DOI] [PubMed] [Google Scholar]
  • 40.Assist. Secr. Technol. Policy. 2024. Health data, technology, and interoperability: certification program updates, algorithm transparency, and information sharing (HTI-1) final rule. Assistant Secretary for Technology Policy. https://www.healthit.gov/topic/laws-regulation-and-policy/health-data-technology-and-interoperability-certification-program
  • 41.White House. 2023. Remarks by President Biden and Vice President Harris on the administration’s commitment to advancing the safe, secure, and trustworthy development and use of artificial intelligence. https://www.whitehouse.gov/briefing-room/speeches-remarks/2023/10/30/remarks-by-president-biden-and-vice-president-harris-on-the-administrations-commitment-to-advancing-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
  • 42.White House. 2023. Executive order on the safe, secure, and trustworthy development and use of artificial intelligence. https://www.whitehouse.gov/briefing-room/presidential-actions/2023/10/30/executive-order-on-the-safe-secure-and-trustworthy-development-and-use-of-artificial-intelligence/
  • 43.Mello MM, Shah NH, Char DS. 2023. President Biden’s executive order on artificial intelligence—implications for health care organizations. JAMA 331(1):17–18 [DOI] [PubMed] [Google Scholar]
  • 44.Blumenthal D. 2024. The U.S. president’s executive order on artificial intelligence. NEJM AI 1(2):AIpc2300296 [Google Scholar]
  • 45.Fleisher LA, Economou-Zavlanos NJ. 2024. Artificial intelligence can be regulated using current patient safety procedures and infrastructure in hospitals. JAMA Health Forum 5(6):e241369. [DOI] [PubMed] [Google Scholar]
  • 46.Herman DS, Reece JT, Weissman GE. 2024. Lessons for local oversight of AI in medicine from the regulation of clinical laboratory testing. NPJ Digit. Med. 7:359. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Jackson BR, Sendak MP, Solomonides A, Balu S, Sittig DF. 2024. Regulation of artificial intelligence in healthcare: Clinical Laboratory Improvement Amendments (CLIA) as a model. J. Am. Med. Inform. Assoc. 2024:ocae296. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Lea AS, Jones DS. 2024. Mind the gap—machine learning, dataset shift, and history in the age of clinical algorithms. New Engl. J. Med. 390(4):293–95 [DOI] [PubMed] [Google Scholar]
  • 49.Finlayson SG, Subbaswamy A, Singh K, Bowers J, Kupke A, et al. 2021. The clinician and dataset shift in artificial intelligence. New Engl. J. Med. 385(3):283–86 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Riley RD, Archer L, Snell KIE, Ensor J, Dhiman P, et al. 2024. Evaluation of clinical prediction models (part 2): how to undertake an external validation study. BMJ 384:e074820. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, et al. 2021. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern. Med. 181(8):1065–70 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 52.Kaushal A, Altman R, Langlotz C. 2020. Geographic distribution of US cohorts used to train deep learning algorithms. JAMA 324(12):1212–13 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 53.Cosgriff CV, Stone DJ, Weissman G, Pirracchio R, Celi LA. 2020. The clinical artificial intelligence department: a prerequisite for success. BMJ Health Care Inform. 27(1):e100183. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 54.Beecy AN, Longhurst CA, Singh K, Wachter RM, Murray SG. 2024. The chief health AI officer—an emerging role for an emerging technology. NEJM AI 1(7):AIp2400109 [Google Scholar]
  • 55.Van Veen D, Van Uden C, Blankemeier L, Delbrouck JB, Aali A, et al. 2024. Adapted large language models can outperform medical experts in clinical text summarization. Nat. Med. 30(4):1134–42 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Han T, Adams LC, Bressem KK, Busch F, Nebelung S, Truhn D. 2024. Comparative analysis of multimodal large language model performance on clinical vignette questions. JAMA 331(15):1320–21 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Cabral S, Restrepo D, Kanjee Z, Wilson P, Crowe B, et al. 2024. Clinical reasoning of a generative artificial intelligence model compared with physicians. JAMA Intern. Med. 184(5):581–83 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 58.Goh E, Gallo R, Hom J, Strong E, Weng Y, et al. 2024. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw. Open 7(10):e2440969. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 59.Hatem R, Simmons B, Thornton JE. 2023. Chatbot confabulations are not hallucinations. JAMA Intern. Med. 183(10):1177. [DOI] [PubMed] [Google Scholar]
  • 60.Nastasi AJ, Courtright KR, Halpern SD, Weissman GE. 2023. A vignette-based evaluation of ChatGPT’s ability to provide appropriate and equitable medical advice across care contexts. Sci. Rep. 13(1):17885. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 61.Omiye JA, Lester JC, Spichak S, Rotemberg V, Daneshjou R. 2023. Large language models propagate race-based medicine. NPJ Digit. Med. 6(1):195. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 62.Weissman GE, Mankowitz T, Kanter GP. 2024. Large language model non-compliance with FDA guidance for clinical decision support devices. Preprint. https://www.researchsquare.com/article/rs-4868925/v1
  • 63.Minssen T, Vayena E, Cohen IG. 2023. The challenges for regulating medical use of ChatGPT and other large language models. JAMA 330(4):315–16 [DOI] [PubMed] [Google Scholar]
  • 64.Coiera E, Fraile-Navarro D. 2024. AI as an ecosystem—ensuring generative AI is safe and effective. NEJM AI 1(9):AIp2400611 [Google Scholar]
  • 65.Longhurst CA, Singh K, Chopra A, Atreja A, Brownstein JS. 2024. A call for artificial intelligence implementation science centers to evaluate clinical effectiveness. NEJM AI 1(8):AIp2400223 [Google Scholar]
  • 66.Mitchell M, Wu S, Zaldivar A, Barnes P, Vasserman L, et al. 2019. Model cards for model reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pp. 220–29. New York: Assoc. Comput. Mach. [Google Scholar]
  • 67.Wiens J, Saria S, Sendak M, Ghassemi M, Liu VX, et al. 2019. Do no harm: a roadmap for responsible machine learning for health care. Nat. Med. 25(9):1337–40 [DOI] [PubMed] [Google Scholar]
  • 68.Labkoff S, Oladimeji B, Kannry J, Solomonides A, Leftwich R, et al. 2024. Toward a responsible future: recommendations for AI-enabled clinical decision support. J. Am. Med. Inform. Assoc. 31(11):2730–39 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 69.CONSORT-AI SPIRIT-AI Steer. Group. 2019. Reporting guidelines for clinical trials evaluating artificial intelligence interventions are needed. Nat. Med. 25(10):1467–68 [DOI] [PubMed] [Google Scholar]
  • 70.Kwong JCC, Khondker A, Lajkosz K, McDermott MBA, Frigola XB, et al. 2023. APPRAISE-AI tool for quantitative evaluation of AI studies for clinical decision support. JAMA Netw. Open 6(9):e2335377. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 71.Collins GS, Moons KGM, Dhiman P, Riley RD, Beam AL, et al. 2024. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ 385:e078378. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 72.Vasey B, Nagendran M, Campbell B, Clifton DA, Collins GS, et al. 2022. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat. Med. 28(5):924–33 [DOI] [PubMed] [Google Scholar]
  • 73.Weissman GE. 2024. Moving from in silico to in clinico evaluations of machine learning-based interventions in critical care. Crit. Care Med. 52(7):1141–44 [DOI] [PubMed] [Google Scholar]
  • 74.Downing NL, Rolnick J, Poole SF, Hall E, Wessels AJ, et al. 2019. Electronic health record-based clinical decision support alert for severe sepsis: a randomised evaluation. BMJ Q. Saf. 28(9):762–68 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 75.Park Y, Jackson GP, Foreman MA, Gruen D, Hu J, Das AK. 2020. Evaluating artificial intelligence in medicine: phases of clinical research. JAMIA Open 3(3):326–31 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 76.Steyerberg EW, Vickers AJ, Cook NR, Gerds T, Gonen M, et al. 2010. Assessing the performance of prediction models: a framework for traditional and novel measures. Epidemiology 21(1):128–38 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 77.Van Calster B, Nieboer D, Vergouwe Y, De Cock B, Pencina MJ, Steyerberg EW. 2016. A calibration hierarchy for risk models was defined: from utopia to empirical data. J. Clin. Epidemiol. 74:167–76 [DOI] [PubMed] [Google Scholar]
  • 78.Austin PC, Steyerberg EW. 2019. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat. Med. 38(21):4051–65 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 79.Romero-Brufau S, Huddleston JM, Escobar GJ, Liebow M. 2015. Why the C-statistic is not informative to evaluate early warning scores and what metrics to use. Crit. Care 19(1):285. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 80.Vickers AJ, Elkin EB. 2006. Decision curve analysis: a novel method for evaluating prediction models. Med. Decis. Mak. 26(6):565–74 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 81.Pauker SG, Kassirer JP. 1987. Decision analysis. New Engl. J. Med. 316(5):250–58 [DOI] [PubMed] [Google Scholar]
  • 82.Weissman GE, Greer JA, Temel JS. 2024. Use of machine learning to optimize referral for early palliative care: Are prognostic predictions enough? J. Clin. Oncol. 42(14):1603–6 [DOI] [PubMed] [Google Scholar]
  • 83.Fleuren LM, Thoral P, Shillan D, Ercole A, Elbers PWG, et al. 2020. Machine learning in intensive care medicine: Ready for take-off? Intensive Care Med. 46(7):1486–88 [DOI] [PubMed] [Google Scholar]
  • 84.Andaur Navarro CL, Damen JAA, Takada T, Nijman SWJ, Dhiman P, et al. 2022. Completeness of reporting of clinical prediction models developed using supervised machine learning: a systematic review. BMC Med. Res. Methodol. 22(1):12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 85.Collins GS, Reitsma JB, Altman DG, Moons KG. 2015. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD): the TRIPOD statement. Ann. Intern. Med. 162(1):55. [DOI] [PubMed] [Google Scholar]
  • 86.Heus P, Reitsma JB, Collins GS, Damen JA, Scholten RJ, et al. 2020. Transparent reporting of multivariable prediction models in journal and conference abstracts: TRIPOD for abstracts. Ann. Intern. Med. 173:42–47 [DOI] [PubMed] [Google Scholar]
  • 87.Leisman DE, Harhay MO, Lederer DJ, Abramson M, Adjei AA, et al. 2020. Development and reporting of prediction models: guidance for authors from editors of respiratory, sleep, and critical care journals. Crit. Care Med. 48(5):623–33 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 88.James G, Witten D, Hastie T, Tibshirani R. 2021. An Introduction to Statistical Learning. New York: Springer. 2nd ed. [Google Scholar]
  • 89.Sperrin M, Riley RD, Collins GS, Martin GP. 2022. Targeted validation: validating clinical prediction models in their intended population and setting. Diagn. Prognost. Res. 6(1):24. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Shah NH, Halamka JD, Saria S, Pencina M, Tazbaz T, et al. 2023. A nationwide network of health AI assurance laboratories. JAMA 331(3):245–49 [DOI] [PubMed] [Google Scholar]
  • 91.Bates DW, Kuperman GJ, Wang S, Gandhi T, Kittler A, et al. 2003. Ten commandments for effective clinical decision support: making the practice of evidence-based medicine a reality. J. Am. Med. Inform. Assoc. 10(6):523–30 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 92.Kawamoto K, Houlihan CA, Balas EA, Lobach DF. 2005. Improving clinical practice using clinical decision support systems: a systematic review of trials to identify features critical to success. BMJ 330(7494):765. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 93.Filice RW, Ratwani RM. 2020. The case for user-centered artificial intelligence in radiology. Radiol. Artif. Intell. 2(3):e190095. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 94.Plana D, Shung DL, Grimshaw AA, Saraf A, Sung JJY, Kann BH. 2022. Randomized clinical trials of machine learning interventions in health care: a systematic review. JAMA Netw. Open 5(9):e2233946. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 95.Martindale APL, Ng B, Ngai V, Kale AU, Ferrante di Ruffano L, et al. 2024. Concordance of randomised controlled trials for artificial intelligence interventions with the CONSORT-AI reporting guidelines. Nat. Commun. 15(1):1619. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 96.Abràmoff MD, Lavin PT, Birch M, Shah N, Folk JC. 2018. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digit. Med. 1(1):39. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 97.Grzybowski A, Brona P, Lim G, Ruamviboonsuk P, Tan GSW, et al. 2020. Artificial intelligence for diabetic retinopathy screening: a review. Eye 34(3):451–60 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 98.Wijnberge M, Geerts BF, Hol L, Lemmers N, Mulder MP, et al. 2020. Effect of a machine learning–derived early warning system for intraoperative hypotension versus standard care on depth and duration of intraoperative hypotension during elective noncardiac surgery: the HYPE randomized clinical trial. JAMA 323(11):1052–60 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 99.Angus DC. 2020. Randomized clinical trials of artificial intelligence. JAMA 323(11):1043–45 [DOI] [PubMed] [Google Scholar]
  • 100.Volchenboum SL, Mayampurath A, Göksu-Gürsoy G, Edelson DP, Howell MD, Churpek MM. 2016. Association between in-hospital critical illness events and outcomes in patients on the same ward. JAMA 316(24):2674–75 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 101.Goddard K, Roudsari A, Wyatt JC. 2012. Automation bias: a systematic review of frequency, effect mediators, and mitigators. J. Am. Med. Inform. Assoc. 19(1):121–27 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 102.Parasuraman R, Manzey DH. 2010. Complacency and bias in human use of automation: an attentional integration. Hum. Factors 52(3):381–410 [DOI] [PubMed] [Google Scholar]
  • 103.Cohen IG, Babic B, Gerke S, Xia Q, Evgeniou T, Wertenbroch K. 2023. How AI can learn from the law: putting humans in the loop only on appeal. NPJ Digit. Med. 6(1):160. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 104.Mosqueira-Rey E, Hernández-Pereira E, Alonso-Ríos D, Bobes-Bascarán J, Fernández-Leal A. 2023. Human-in-the-loop machine learning: a state of the art. Artif. Intell. Rev. 56(4):3005–54 [Google Scholar]
  • 105.Dratsch T, Chen X, Rezazade Mehrizi M, Kloeckner R, Mähringer-Kunz A, et al. 2023. Automation bias in mammography: the impact of artificial intelligence BI-RADS suggestions on reader performance. Radiology 307(4):e222176. [DOI] [PubMed] [Google Scholar]
  • 106.Yu F, Moehring A, Banerjee O, Salz T, Agarwal N, Rajpurkar P. 2024. Heterogeneity and predictors of the effects of AI assistance on radiologists. Nat. Med. 30:837–49 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 107.Jabbour S, Fouhey D, Shepard S, Valley TS, Kazerooni EA, et al. 2023. Measuring the impact of AI in the diagnosis of hospitalized patients: a randomized clinical vignette survey study. JAMA 330(23):2275–84 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 108.Khera R, Simon MA, Ross JS. 2023. Automation bias and assistive AI: risk of harm from AI-driven clinical decision support. JAMA 330(23):2255–57 [DOI] [PubMed] [Google Scholar]
  • 109.Lyell D, Coiera E. 2017. Automation bias and verification complexity: a systematic review. J. Am. Med. Inform. Assoc. 24(2):423–31 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 110.Cummings ML. 2017. Automation bias in intelligent time critical decision support systems. In Decision Making in Aviation, ed. Harris D, Lu W-C, pp. 289–94. Routledge [Google Scholar]
  • 111.Burton JW, Stein MK, Jensen TB. 2020. A systematic review of algorithm aversion in augmented decision making. J. Behav. Decis. Mak. 33(2):220–39 [Google Scholar]
  • 112.Aysola J, Clapp JT, Sullivan P, Brennan PJ, Higginbotham EJ, et al. 2022. Understanding contributors to racial/ethnic disparities in emergency department throughput times: a sequential mixed methods analysis. J. Gen. Intern. Med. 37(2):341–50 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 113.Chesley CF, Chowdhury M, Small DS, Schaubel D, Liu VX, et al. 2023. Racial disparities in length of stay among severely ill patients presenting with sepsis and acute respiratory failure. JAMA Netw. Open 6(5):e239739. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 114.Ramadurai D, Kohn R, Hart JL, Scott S, Kerlin MP. 2023. Associations of race with sedation depth among mechanically ventilated adults: a retrospective cohort study. Crit. Care Explorat. 5(11):e0996. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 115.Harlan EA, Venkatesh S, Morrison J, Cooke CR, Iwashyna TJ, et al. 2024. Rural-urban differences in mortality among mechanically ventilated patients in intensive and intermediate care. Ann. Am. Thorac. Soc. 21(5):774–81 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 116.Mayer KH, Bradford JB, Makadon HJ, Stall R, Goldhammer H, Landers S. 2008. Sexual and gender minority health: what we know and what needs to be done. Am. J. Public Health 98(6):989–95 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 117.Green AR, Nze C. 2017. Language-based inequity in health care: Who is the “poor historian”? AMA J. Ethics 19(3):263–71 [DOI] [PubMed] [Google Scholar]
  • 118.Eneanya ND, Yang W, Reese PP. 2019. Reconsidering the consequences of using race to estimate kidney function. JAMA 322(2):113–14 [DOI] [PubMed] [Google Scholar]
  • 119.Moffett AT, Bowerman C, Stanojevic S, Eneanya ND, Halpern SD, Weissman GE. 2023. Global, race-neutral reference equations and pulmonary function test interpretation. JAMA Netw. Open 6(6):e2316174. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 120.Ashana DC, Anesi GL, Liu VX, Escobar GJ, Chesley C, et al. 2021. Equitably allocating resources during crises: racial differences in mortality prediction models. Am. J. Respir. Crit. Care Med. 204(2):178–86 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 121.Siddique SM, Tipton K, Leas B, Jepson C, Aysola J, et al. 2024. The impact of health care algorithms on racial and ethnic disparities: a systematic review. Ann. Intern. Med. 177(4):484–96 [DOI] [PubMed] [Google Scholar]
  • 122.Britez Ferrante E, Blady S, Sheu D, Maitra MR, Drakes J, et al. 2024. Operationalizing equity, inclusion, and access in research practice at a large academic institution. J. Gen. Intern. Med. 39(6):1037–47 [DOI] [PMC free article] [PubMed] [Google Scholar]

RESOURCES