Published in final edited form as: Ophthalmology. 2021 Aug 31;129(2):e14–e32. doi: 10.1016/j.ophtha.2021.08.023

Foundational Considerations for Artificial Intelligence Using Ophthalmic Images

Michael D Abràmoff 1,16,17, Brad Cunningham 2, Bakul Patel 3, Malvina B Eydelman 2, Theodore Leng 4, Taiji Sakamoto 5,6, Barbara Blodi 7, S Marlene Grenon 8,18, Risa M Wolf 9, Arjun K Manrai 10,11, Justin M Ko 12, Michael F Chiang 13, Danton Char 14,15, Collaborative Community on Ophthalmic Imaging Executive Committee and Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group

Abstract

Importance:

The development of artificial intelligence (AI) and other machine diagnostic systems, also known as software as a medical device, and their recent introduction into clinical practice require a deeply rooted foundation in bioethics for consideration by regulatory agencies and other stakeholders around the globe.

Objectives:

To initiate a dialogue on the issues to consider when developing a bioethically sound foundation for AI in medicine, based on images of eye structures, for discussion with all stakeholders.

Evidence Review:

The scope of the issues and summaries of the discussions under consideration by the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group, as first presented during the Collaborative Community on Ophthalmic Imaging inaugural meeting on September 7, 2020, and discussed afterward within the working group.

Findings:

Artificial intelligence has the potential to improve health care access and patient outcomes fundamentally while decreasing disparities, lowering cost, and enhancing the care team. Nevertheless, substantial concerns exist. To attain this potential, it is essential that bioethicists, AI algorithm experts, the Food and Drug Administration and other regulatory agencies, industry, patient advocacy groups, clinicians and their professional societies, other provider groups, and payors (i.e., stakeholders) work together in collaborative communities to resolve the fundamental ethical issues of nonmaleficence, autonomy, and equity. Resolution of these issues affects all levels of the design, validation, and implementation of AI in medicine, each of which warrants meticulous attention.

Conclusions and Relevance:

A bioethically sound foundation can be developed if it is based in the fundamental ethical principles of nonmaleficence, autonomy, and equity as they apply to the design, validation, and implementation of AI systems. Such a foundation will support the continued successful introduction of AI into medicine and its consideration by regulatory agencies, thereby enabling important improvements in the accessibility and quality of health care, decreases in health disparities, and lower cost. These considerations should be discussed with all stakeholders and expanded on as a useful initiation of this dialogue.

Keywords: Artificial intelligence, Augmented intelligence, Clinical standards, Clinical trial, Cornea, Ethics, FDA, Glaucoma, Oculoplastics, Regulation, Retina, Safety, imaging, non-maleficence, equity, autonomy, patient benefit, health disparities, population health, clinical outcome, validation, explainability, validability, transparency, population achieved sensitivity, vernacular medicine, scalability


The Collaborative Community on Ophthalmic Imaging (CCOI) formed in 2019 to advance the innovation of ophthalmic imaging with a focus on medical devices using artificial intelligence (AI).1,2 The CCOI’s Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation (FPOAI) Working Group was established in March 2020 to generate consensus on a bioethical foundation for AI of ophthalmic imaging for consideration by all stakeholders in the health care system, including, but not limited to, the United States Food and Drug Administration (FDA) and other regulatory agencies. Its processes draw on the expertise of bioethicists,3,4 AI algorithm experts, FDA and other regulatory agencies, as well as industry, patients, patient advocacy groups, clinicians and their professional societies, and payors,5 to identify best practices for addressing novel issues emerging with AI conception, evaluation, and implementation, including validation, reference standards, performance metrics, accountability for output, bias, and impacts on workflow.

The terms artificial intelligence and augmented intelligence are used interchangeably for systems that perform tasks that mimic human cognitive capabilities.1 The authors use artificial intelligence to refer to the concept of programming computer systems to perform tasks to mimic human cognitive capabilities—such as understanding language, recognizing objects and sounds, learning, and problem solving—by using logic, decision trees, machine learning, or deep learning. Such anthropomorphic AI systems, which are becoming more common, are not programmed explicitly and instead learn from data that reflect highly cognitive tasks, typically performed by trained health care professionals. In some cases, these AI systems are used to aid health care professionals.6 The introduction of AI in medicine has the potential to improve quality, reduce costs, diminish health disparities, and increase accessibility, as well as enhance the care team, at both the individual and population levels.7,8 Thus, its introduction aligns with the American Medical Association’s principle of quadruple aim of improved outcomes, lower cost, improved patient experience, and improved clinician experience.9 After the first FDA de novo clearance for an autonomous AI,10 that is, an AI system that makes a clinical decision without human oversight,10 AI has entered mainstream health care, including standards of care.11 The use of AI in the ophthalmic setting has been studied for many applications,12 including in diseases such as diabetic retinopathy,13 retinopathy of prematurity,14 macular degeneration,15 glaucoma,16 and cancer,17 as well as many other ocular conditions, such as those of the cornea18 and other parts of the anterior segment.19

To maximize AI’s benefits, many ethical, economic, and scientific issues, including algorithmic bias, safety, efficacy, and equity—terms that are explained in the next section—need to be addressed in a transparent fashion for acceptance by all stakeholders. So far, studies establishing scientific evidence for the safety and other criteria of AI in general are quite limited, with few exceptions.20 In a meta-analysis of 81 AI clinical trials, only 9 were prospective, and just 6 were tested in a nonresearch, clinical setting.21 The relationship of the AI’s diagnostic accuracy to clinical outcomes was not addressed in this widely cited study; more generally, in an analysis of 126 published diagnostic accuracy studies, only 12% reported any statistical test of a hypothesis related to the study objectives.22

Reporting standards for AI studies have been published recently, such as the Consolidated Standards of Reporting Trials-Artificial Intelligence (CONSORT-AI),23 and an AI extension to the Standards for Reporting of Diagnostic Accuracy Studies24 is under development. Although such reporting standards may improve consistency, they may not provide sufficient information to inform regulatory evaluation and have not been recognized by the FDA (see also the FDA’s Recognized Consensus Standards25). Additional considerations beyond these recommendations therefore may be needed for regulatory evaluation, many of which are the subject of this analysis (the “Considerations”).

These first Considerations to come from our FPOAI Working Group present the scope of the issues and concepts and briefly summarize the discussion on diagnostic AI and other software as a medical device (SaMD) systems that use images of the eye, as first presented during the Collaborative Community on Ophthalmic Imaging inaugural meeting on September 7, 2020, and later discussed within the FPOAI Working Group.26 Specifically, they describe both clinical constraints for AI systems and bioethically founded constraints derived from the 3 major bioethical principles of nonmaleficence, equity, and autonomy. Although, as FPOAI Working Group stakeholders, we realize the tremendous potential advantages of AI systems, we also realize that substantial concerns exist within the scientific and clinical communities, as well as society at large. Therefore, involvement of all stakeholders35 to resolve ethical issues, including nonmaleficence, autonomy, and equity, is key.

Design, validation, and implementation of diagnostic AI systems warrant meticulous attention. We limit the scope of these Considerations, for the time being, to AI intended for diagnosis. Although therapeutic AI, including autonomous AI for prescribing and autonomous AI for surgery, is on the horizon, we decided that it currently is beyond the scope of these Considerations, given the multiple ethical and even theoretical problems that need to be resolved. Furthermore, no regulatory guidance exists for therapeutic AI systems using images of the eye.

Obviously, the Considerations will be commensurate with the risk of harm to the patient, which varies with the indications for use, the conditions diagnosed, the autonomy of the AI, the consequences of a missed diagnosis, the population at risk, and other factors. Thus, the right balance needs to be considered between resource requirements and burden on AI creators27,28 to align with proposed ethical principles, on the one hand, and the risk of patient harm from lack of access to AI systems, on the other hand, in order for patients, patient populations, and the wider health care system to benefit. In addition, although some AI systems are marketed medical devices and are under regulatory oversight, other AI systems are never marketed. Such so-called homebrew AI is used—by the clinicians who developed it or by others—in patient care, and its safety and equity can be of concern.29

There are many useful resources, such as the reporting guidelines mentioned (e.g., Clinical Evaluation of SaMD,26 Standards for Reporting of Diagnostic Accuracy Studies,24 and CONSORT-AI23), clinical practice guidelines (e.g., the American Telemedicine Association Telehealth Practice Guidelines for Diabetic Retinopathy30,31), standards (e.g., Digital Imaging and Communications in Medicine32), and FDA guidance26 that can be referenced to help mitigate the aforementioned concerns. Ultimately, we incorporated these useful resources as initial steps in developing best practices, together with AI-tailored regulatory frameworks, including Good Machine Learning Practice and other equivalents to the more familiar good manufacturing practices, as called for by the United States Government Accountability Office in its recent report33 as well as by regulatory agencies such as the FDA.1

Clinical Considerations for Artificial Intelligence Systems

These Considerations divide the requirements for AI systems into 2 categories: the clinical requirements, covered in this section, and the ethical requirements, derived from a bioethical foundation, covered in the next section. Thus, this section discusses the various clinical aspects of AI systems that use images of the eye in some form, conforming to the scope of the Collaborative Community on Ophthalmic Imaging.2 We define images of the eye as topologically ordered sets of intensities that represent physical and pathophysiologic processes occurring in the eye and that may reflect conditions of the eye as well as of other parts of the patient’s body. Specifically, we cover intended use, impact, inputs and outputs, and human factor design aspects of the AI system.

Intended Use of the Diagnostic Artificial Intelligence System

The rationale for designing, developing, validating, and deploying AI systems includes improving individual patient care, population health, and scientific research. Specifically, for individual patients, the rationale includes improving their quality of care, lowering cost, increasing access, decreasing health disparities, and improving efficiency. For scientific research, the rationale includes discovering new disease mechanisms and gaining a better understanding of a disease.

Impact of the Diagnostic Artificial Intelligence System

After the use is identified, the impact of the AI system can be assessed. Artificial intelligence systems span a wide range of impact, from having no direct impact on an individual patient or group of patients (e.g., inform a provider) to having an important decision-making impact on an individual patient (i.e., drive or treat).34 From a regulatory perspective, many AI systems are considered medical devices—SaMD—whereas other systems may not meet the definition of a medical device because definitions differ across regulatory agencies.35,36 We refer to the FDA’s narrower definition of medical devices under section 201(h),35 as modified under section 3060 of the 21st Century Cures Act,37 as well as the broader definition used by the International Medical Device Regulators Forum.34 Based on those definitions, AI systems can be subdivided by impact, as shown in Table 1.38

Table 1.

Artificial Intelligence System Impact

Use Case | Description | Examples | Food and Drug Administration Oversight

Population care | Prioritization and triage with potential impact on groups of patients and individual patients | Care pathway assignment | Likely35
Individual patient care | | |
 Assistive AI | Assists a clinician who determines the patient’s management | Provides a probability or likelihood of a disease or condition or may highlight potential lesions that should be reviewed by a specialist | Likely35
 Autonomous AI | Makes a medical decision without input from a clinician | For example, an autonomous AI system may evaluate for the presence of a disease, such as diabetic retinopathy and macular edema, or condition and notify the user whether the disease or condition is present | Likely35
Scientific research | Not used for individual patient or population care, although the results of the research may impact populations or patients downstream | Health care analytics | Unlikely
Operations and data management | Where this does not impact individual patient or population care; these often exist within the realm of health information technology systems as they relate to administrative purposes | VIM Referral Guidance, a triage system from EHRs (https://getvim.com/solution/referral-guidance) | Unlikely
Clinical decision support | Informs the clinician by aggregating, reformatting, or visualizing data, without providing analytical insights of the data, in a manner that allows the clinician to review the basis of the information provided by the software independently | AI system that suggests a G6PD test before prescribing an antimalarial therapy38 | Depends*
General wellness | Collects physiologic information from devices and sensors, including wearables | Smart watch that captures heart rate | Depends*

AI = artificial intelligence; EHR = electronic health record.

*See Center for Devices and Radiological Health, United States Food and Drug Administration.38 This explains when a software function qualifies as nondevice clinical decision support (CDS) as opposed to device CDS, and which of these are regulated actively or for which compliance with applicable regulation would not be enforced.

An important aspect of these AI systems is their theoretically unlimited scalability. After being designed and validated, the algorithms of a single AI system can be used on hundreds of millions of patients. Although the number of patients a human clinician may encounter varies greatly based on the health setting and geography (e.g., 800–1000 unique patients per year, or no more than approximately 30 000–40 000 unique patients during an entire career8), the scale is significantly different from that of an AI system. Thus, the impact of any benefits or risks stemming from the use of the AI system is massively scaled and, in just 1 year of implementation, possibly 1000-fold or more than the impact any individual clinician can have in their lifetime.
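To give a rough sense of this scaling under the figures above (treating “hundreds of millions” as roughly $10^{8}$ patients reached in 1 year, an assumption made purely for illustration):

$$\frac{10^{8}\ \text{patients reached per year by a single AI system}}{4\times10^{4}\ \text{patients per clinician career}} \approx 2.5\times10^{3},$$

which is consistent with the “1000-fold or more” estimate given here.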

The training and practice of an individual clinician may be optimal for a specific (sub-)population, based on demographics, geographic proximity, and other factors39; we define this as vernacular medicine. Such vernacular medicine may be less generalizable than is often acknowledged. For an AI system at scale, such optimality may not necessarily be present, depending on training data as well as other factors. Although this may increase its value for multiple, but geographically or demographically different, groups, it may be less optimized for specific groups, and thus this needs to be considered carefully. We cover this in more detail in the “Ethical Considerations” section. Privacy, confidentiality, and other clinical data security aspects may differ across regions as well. Recently, the concept of federated machine learning was introduced, which allows an aggregated, scalable AI system to fine-tune from independent training datasets.40 A more recent form of federated machine learning enables remote devices (e.g., mobile phones) to engage collaboratively in model learning and improvement that can take place at a more local level. Such an approach decouples the machine learning from any global training data that ordinarily would be derived from a single discrete storage system. Rather, model training draws on multiple, different, localized, and vernacular datasets. For deployment, the trained AI model contains no reference to the local training data that were used to refine and tune the model. This technique, similar to edge computing, may seem to have benefits. However, novel risk considerations also may be relevant relating to algorithm or model iteration that would need to be captured for accurate documentation. These include training data characterization, good machine learning practice, model version and updates, as well as the assumption that multiple vernacular datasets that are distributed normally can be reduced to a simple distribution function. Probable risks of patient harm and benefits of such a federated approach have not been studied sufficiently.
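As an illustration of the federated pattern described above, the following is a minimal sketch of federated averaging with NumPy. The model (a logistic regression), the 3 sites, and the weighting scheme are hypothetical simplifications for illustration, not a prescription for any particular deployment.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's logistic-regression update on its own (vernacular) data;
    raw images or records never leave the site."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient step on local data only
    return w

def federated_average(weights, site_data):
    """Aggregate per-site updates, weighted by site sample size (FedAvg-style)."""
    updates, sizes = [], []
    for X, y in site_data:
        updates.append(local_update(weights, X, y))
        sizes.append(len(y))
    sizes = np.array(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())

# Hypothetical example: three sites with differently distributed ("vernacular") data.
rng = np.random.default_rng(0)
sites = []
for shift in (-1.0, 0.0, 1.0):
    X = rng.normal(shift, 1.0, size=(200, 4))
    y = (X[:, 0] + X[:, 1] > shift).astype(float)
    sites.append((X, y))

w = np.zeros(4)
for _ in range(20):                            # communication rounds
    w = federated_average(w, sites)
```

Only model parameters travel between sites and the aggregator, which is the property that makes the approach attractive for privacy but also introduces the documentation and risk-characterization questions noted above.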

Artificial Intelligence System Outputs

The intended use and impact of an AI system constrain its outputs. According to the International Medical Device Regulators Forum’s definitions of the type of output (inform, drive, diagnose, and treat), as well as the significance of the condition (nonserious, serious, and critical), outputs can be categorized as shown in Table 2.26 Artificial intelligence system outputs may be aligned with preferred practice patterns or other standards of care to maximize the potential of the AI system to impact clinical outcomes positively. This is discussed in more detail in the “Nonmaleficence” section.

Table 2.

Artificial Intelligence System Outputs

Type of Output | Significance of the Condition | Category | Clinical Context

Inform | Nonserious, serious, or critical | Risk prediction | Suggest specific test types that may be implemented as part of a diagnostic workup of a patient based on clinician suspicion
Drive | Nonserious, serious, or critical | Likelihood, probability, or prediction of disease | Used by clinician who understands how to interpret the input image (e.g., ophthalmic clinician)
Drive | Nonserious, serious, or critical | Saliency, such as highlighting regions of interest or specific lesions in an image | Used by clinician who understands how to interpret the input image (e.g., ophthalmic clinician)
Diagnose or treat | Nonserious, serious, or critical | Disease staging | Assistive use case: clinician receives specific aspects of the inputs that indicate the disease stage and decides the stage
Diagnose or treat | Nonserious, serious, or critical | Disease staging | Autonomous use case: the user receives the disease stage
Diagnose or treat | Nonserious, serious, or critical | Screening | Assistive use case: clinician receives specific aspects of the inputs that indicate abnormalities and decides whether disease may be present
Diagnose or treat | Nonserious, serious, or critical | Screening | Autonomous use case: the user receives output on whether the disease may be present
Diagnose or treat | Nonserious, serious, or critical | Diagnosis | Assistive use case: a clinician receives specific aspects of the inputs that indicate disease-specific abnormalities and the absence of disease-specific abnormalities and decides the diagnosis by excluding other disease
Diagnose or treat | Nonserious, serious, or critical | Diagnosis | Autonomous use case: the user receives a diagnosis; an autonomous AI system may evaluate for the presence of a disease or condition and notify the user whether the disease or condition is present without showing how the AI system arrived at the decision

AI = artificial intelligence.
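Table 2 lists saliency (highlighting regions of interest or specific lesions) as one drive-level output. Below is a minimal sketch of one common way such a map can be produced, a gradient-based saliency computation, assuming a PyTorch image classifier; the model, tensor shapes, and usage are placeholders, and production systems typically use more elaborate attribution methods.

```python
import torch

def gradient_saliency(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Per-pixel saliency as the magnitude of d(top class score)/d(input).

    Assumes `model` maps a (1, C, H, W) float tensor to (1, num_classes) scores.
    """
    model.eval()
    image = image.detach().clone().requires_grad_(True)
    scores = model(image)
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()                 # gradient of the winning score
    # Max over channels gives an (H, W) map suitable for overlay on the input image.
    return image.grad.detach().abs().max(dim=1).values[0]

# Hypothetical usage with a fundus-image classifier `net` and input tensor `img`:
# saliency = gradient_saliency(net, img)  # overlay on the image for clinician review
```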

The term assistive usually is used for systems in which the clinician makes the ultimate medical decision and carries liability for the AI performance, whereas the term autonomous is reserved for systems in which the AI system makes the ultimate medical decision and the AI creator carries the liability for the AI performance.6 This distinction, assistive versus autonomous, coupled with intended use, including the significance of the condition, has important bearings on the interpretation of risk as well as other regulatory implications (e.g., clinical study design). The interaction between AI and physicians, who risk becoming, as it were, physicians of the magenta41 and too dependent on monitoring diagnostic AI devices, is of crucial importance here. Potentially, assistive systems may need subdivision into additional categories that more specifically delineate the roles of humans versus AI.

Artificial Intelligence System Use Environment

The AI outputs, including for whom they are meant and the information provided, may de facto dictate the use environment, including the operator, for the AI system (Table 3).

Table 3.

Artificial Intelligence System Use Environment

Use Case Setting | Description

Home | AI system is used by the patient, and the patient images himself or herself without clinician or other operator assistance, or imaging is carried out by the general home health care provider. The output may be provided to the user (patient or home health care provider) or may be provided to a remote clinician.
Nonspecialist (primary care or other nonophthalmologist) | AI system is used by clinicians and operators who have minimal experience with imaging the eye or the evaluation of ocular images or other input. The specific interpretation of the image may be important for that clinician to manage the patient in the context of a disease—e.g., evaluation of fundus photographs for presence of diabetic retinopathy while managing diabetes—or to determine the presence or severity of a systemic disease or disease in another organ system than that being managed, such as determining neurologic disease from retinal images.
Specialist (ophthalmologist or other eye care provider) | AI system is used by clinicians and operators who have experience with ocular imaging and with evaluation of ocular images, but not necessarily with the specific AI output. An example is an AI system for retinal vessel analysis that outputs vascular beading or caliber metrics.

AI = artificial intelligence.

Artificial Intelligence System Human Factor Considerations

Considering the use environment leads to consideration of human factors, as well as the impact and outputs of the AI system (Table 4).

Table 4.

Artificial Intelligence System Human Factors

Operator expertise level | Patient operated; untrained operator; ophthalmic photographer; certified ophthalmic photographer
Operator AI assistance level | Differing levels of assistance during the imaging process and protocol, which may include evaluation of image quality, field, and sequence order

AI = artificial intelligence.

Artificial Intelligence System Inputs

An AI system achieves its intended use through sampling inputs that are analyzed via the algorithm. One goal of an AI system is to obtain a reliable, consistent output while minimizing the number of inputs (samples and types) to help improve robustness of an algorithm to changes in input signal quality and environment, among other factors. For ophthalmic images, inputs can range from image sets from an entire population with multiple images for each member of that population (for population risk assessment) to multiple images from a single patient (for diagnosis). The number and extent of these images are typically dictated by the intended use of the algorithm and the use environment. A nonexhaustive list of input types (image and nonimage input types) is shown in Table 5.

Table 5.

Artificial Intelligence System Inputs

Input | Characteristic | Examples

Image based | Image method | Fundus imaging; slit-lamp photography; OCT; ultrasound; scanning laser ophthalmoscope, topography; aberrometry; perimetry (functional); multifocal electroretinography (functional); CT, including orbital CT; MRI, including orbital MRI
Image based | Image characteristics (although currently no required standard exists, standardization of image metadata such as defined by Digital Imaging and Communications in Medicine [DICOM] standard 91,32,42 will benefit these considerations) | Sample area; x, y, or en face resolution; x, y, or depth or axial resolution; field of view or area of retina covered; number of fields; stereo vs mono images; depth penetration limit; center wavelength(s); momentary pupil diameter; compression characteristics; ambient light level and other environmental conditions
Nonimage | Input from methods that do not meet the definition of a medical device (i.e., that are not FDA regulated as a medical device) | Patient history; medication history; systemic comorbidities
Nonimage | Input from methods that do meet the definition of a medical device (i.e., that are FDA regulated) | Axial eye length; intraocular pressure; pachymetry; keratometry; visual acuity; heart rate; blood pressure; hemoglobin A1C
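Because image characteristics such as those in Table 5 ultimately have to be checked programmatically before images are fed to an AI system, a minimal sketch using the pydicom library is shown below. The DICOM attribute names are standard tags, but the file path and the acceptance rule are hypothetical and intended only to illustrate the idea of validating input metadata.

```python
import pydicom

def check_input_metadata(path, min_rows=1000, min_cols=1000):
    """Read a DICOM ophthalmic image and verify a few input characteristics
    before passing the pixel data to an AI system. Thresholds are illustrative."""
    ds = pydicom.dcmread(path)
    rows, cols = int(ds.Rows), int(ds.Columns)
    # PixelSpacing (mm per pixel) and laterality are optional in some objects.
    spacing = getattr(ds, "PixelSpacing", None)
    laterality = getattr(ds, "ImageLaterality", getattr(ds, "Laterality", "unknown"))
    return {
        "rows": rows,
        "columns": cols,
        "pixel_spacing_mm": spacing,
        "laterality": laterality,
        "meets_resolution_requirement": rows >= min_rows and cols >= min_cols,
    }

# Hypothetical usage:
# info = check_input_metadata("fundus_od.dcm")
```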

Ethical Considerations for Artificial Intelligence Systems

Bioethical Foundation

In addition to their clinical requirements, such as intended use, human factors, and input and output requirements, as set forward in the previous section, AI systems will have to meet ethical requirements to function. This has both practical and philosophical importance: AI systems should follow ethical standards because the field of medicine has defined these standards as guiding principles for the appropriate delivery of health care; if AI systems are perceived as unethical or not bound by ethical constraints, stakeholders will not trust these systems and may refuse to engage with them, and this promising technology will fail to reach the populations it is designed to impact. Consequently, this section introduces the relevant bioethical foundation,43 and then derives operational ethical dimensions or principles that can be used to create and evaluate ethical requirements for AI systems.

All health care stakeholders, as well as society at large, are already concerned with the use of AI in health care, even when they understand the potential efficiency gains. Their concerns include AI systems’ safety44; actual patient outcome benefit45; mitigation of health care disparities, rather than worsening them; potential for racial, ethnic, or other inappropriate biases46; (mis)use of patient data, including personal health information, during training and implementation46; (mis)use, including off-label use47; and liability, that is, who can be held accountable or liable for any patient harm.6

To address these concerns, an ethical framework to identify ethical concerns before they become consequential is considered essential. Several such ethical frameworks for AI4 and autonomous AI3 have been proposed and discussed. We focus on the primary bioethical principles of nonmaleficence (or patient benefit), autonomy, and justice, per Beauchamp and Childress.48 Instead of the term justice, which is widely used in the ethics literature but may have legal connotations and thus lead to confusion, herein we use the more familiar term equity to describe freedom from bias or favoritism. Accountability, although strictly speaking not an ethical concern, leads to requirements primarily related to autonomy and is discussed as well.

Such an ethical framework, as developed in the cited publications,3,4 leads to the following: (1) ethical metrics to be created, derived from each of the ethical principles (e.g., population achieved sensitivity, which is derived from equity; see the next section); (2) the insight that the framework will be nonorthogonal, because most ethical metrics are not independent axes but instead partially overlap (if they formed independent metrics or axes, this would allow an orthogonal framework); and (3) the requirement for a balance to be found or defined among the 3 ethical axes we focus on (nonmaleficence, equity, and autonomy). Thus, a so-called Pareto optimum needs to be defined, because it is impossible to meet all 3 ethical principles perfectly.

In effect, we use the 3 bioethical principles as (nonorthogonal) axes along which to analyze and constrain AI systems and to define their ethical requirements. We emphasize that they exist in tension with each other, such that improving one of them for a particular AI system may decrease another. For example, for an autonomous AI for the diabetic eye examination, an acceptable balance needs to be found between (1) improving access to a disadvantaged population (equity), (2) ensuring that increasing diabetic eye examination compliance leads to an overall net improvement in care, rather than just increasing diagnoses without access to treatment (nonmaleficence), and (3) maintaining sufficient transparency about the use of AI, training data limitations, and data use, so that patients can decide about their own participation, even if opting out means losing access to AI benefits (autonomy).49 Theoretically, health disparities can be mitigated by adjusting the output of the AI system for those patients who are considered advantaged according to some metric. Although potentially increasing the equity of the AI system, such an approach likely will conflict with nonmaleficence and autonomy.
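To make the notion of a Pareto optimum concrete, the following is a minimal sketch that identifies the nondominated (Pareto-efficient) configurations among candidate AI operating configurations scored on the 3 ethical axes. The configurations and scores are entirely hypothetical; in practice, the scoring itself and the choice among nondominated options would be made by stakeholders.

```python
from typing import Dict, List

# Hypothetical candidate configurations of an AI system, each scored 0-1 on the
# 3 ethical axes (higher is better). Real scores would come from stakeholders.
candidates: List[Dict[str, float]] = [
    {"name": "A", "nonmaleficence": 0.90, "equity": 0.60, "autonomy": 0.70},
    {"name": "B", "nonmaleficence": 0.85, "equity": 0.80, "autonomy": 0.65},
    {"name": "C", "nonmaleficence": 0.80, "equity": 0.75, "autonomy": 0.60},  # dominated by B
    {"name": "D", "nonmaleficence": 0.70, "equity": 0.85, "autonomy": 0.90},
]

AXES = ("nonmaleficence", "equity", "autonomy")

def dominates(x: Dict[str, float], y: Dict[str, float]) -> bool:
    """x dominates y if it is at least as good on every axis and better on at least one."""
    return all(x[a] >= y[a] for a in AXES) and any(x[a] > y[a] for a in AXES)

pareto_front = [c for c in candidates
                if not any(dominates(other, c) for other in candidates if other is not c)]
print([c["name"] for c in pareto_front])   # ['A', 'B', 'D'] under these made-up scores
```

The nondominated set shows which trade-offs are still on the table; choosing among them is the stakeholder decision that bioethical analysis guides but cannot make by itself.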

In addition, complicating ethical analyses, AI output itself impacts clinical workflows and clinical decisions, both of which may increase tensions between these bioethical axes. Much as intention to treat is standard for randomized clinical trial evaluation, the downstream consequences from AI output will need to be part of any ethical evaluation of a medical AI application. That is, bioethical analysis per se, along these dimensions, cannot prescribe the right balance. Rather, it offers a framework to guide and evaluate such decisions. The Pareto-optimal balance among nonmaleficence, autonomy, and equity has to be determined by all stakeholders (Fig 1). After being determined, such a balance results in (ethical) constraints on the design, validation, and implementation of AI systems. Thus, we next examine the different bioethical principles and show how these principles affect AI system requirements, design, development, and validation. Ultimately, the goal of such ethical requirements is to address and answer the valid existing concerns about AI systems in health care that were introduced in the previous section.

Figure 1. Diagram showing the balance and tension among the 3 bioethical principles: nonmaleficence, autonomy, and equity (justice).

Nonmaleficence

The principle of nonmaleficence, or patient benefit, often described as “first, do no harm,” commonly is interpreted as including safety for the individual patient. It affects all aspects of autonomous AI systems, including design, validation, and implementation. An AI system’s risk of harm is affected by intended use, impact, inputs and outputs, and use context, as explained in the previous section. However, additional considerations unique to AI systems affect probable risk of harm and are specific to AI and machine learning: design and development, validation, and postmarket validation and monitoring. We explain how these considerations are related to nonmaleficence and how they may lead to more detailed ethical requirements. Many other considerations are not unique to AI systems but are common to all software-based systems and are not discussed herein.

Design of the Artificial Intelligence System and Nonmaleficence.

In general, AI system design and development share many characteristics with non-AI software systems, and the requirements are laid down in standards such as International Organization for Standardization (ISO) 90003,50 and, for medical devices, in ISO 13485.51 In addition, AI-specific design considerations exist that relate to insight into the AI, derive from nonmaleficence, and affect the risk of harm to the patient. We differentiate 3 forms of considerations that can be assessed for the design: (1) explainability, the amount of insight the user (typically the physician) has into the clinical logic that determined the AI output for a specific patient; (2) transparency, the amount of insight the user has into the clinical usefulness of the AI system for all patients; and (3) validability, the amount of insight that exists into the nonclinical validity (analytical validity) of the AI system and that can be determined without clinical validation studies. The following are examples of relevant aspects that were discussed by FPOAI and that need further consideration52:

  1. Transparency is defined as the degree to which the user or clinician of the AI system has insight into the requirements and limitations for the AI system inputs, its training data characteristics, and how the AI outputs are derived from the inputs for the intended use (i.e., for the specific disease or condition).1,23 Transparency also may include how the AI system creator uses patient-derived data outside this AI system’s intended use, for example, whether patient-derived data can be monetized after the AI system output has been derived. This aspect of transparency also serves autonomy (see next section).

  2. Explainability, while fundamentally related to transparency, refers more to how the output is related to clinical practice and scientific literature. For example, is the output clinically meaningful (e.g., diagnosis of a known condition, presence of a particular lesion), rather than something not well understood (e.g., disease severity on a scale that has not been validated clinically or recognized widely)? Other aspects of transparency beyond algorithmic functionality in the clinic, such as aspects relating to validation efforts (including analytical validation), should be transparent to the user to help replicate the measured performance in real-world use. In fact, per the main principles of Enhancing the QUAlity and Transparency Of health Research (EQUATOR) (which includes the CONSORT-AI extension),23,24 complete, accurate, and transparent reporting is an integral part of responsible research conduct. Thus, trial reporting should include a thorough description of the input-data handling, including image acquisition, selection, and any preprocessing before feeding into an AI system for analysis. This transparency is integral to the replicability of the intervention beyond the clinical trial in real-world use.

  3. Validability is defined as the degree to which the validity of the AI system can be assessed without clinical validation studies. That is, to what extent is it possible to self-validate an AI system without going through formal bench or clinical performance validation? This includes aspects like algorithmic bugs, unresolved anomalies, open loops, and so forth, that would be found on inspecting algorithm coding. For cases of black-box systems, not as much can be inspected, which decreases the overall validability of the system. Thus, validability qualifies our understanding of the analytical performance of the AI system and the impact of other systems on its performance. Examples of this may include the following:

    1. AI algorithm structure and infrastructure, including unit level and code analysis, hardware, firmware, and operating system.

    2. Use of federated hardware—dynamically allocated hardware—such as cloud-based systems. As more and more AI algorithms move to such environments, execution of the code may be removed from the original computational infrastructure (i.e., hardware, firmware, and software) where it was validated. On one hand, federated, or cloud-based, execution environments, for example, Amazon Web Services or Microsoft Azure, make it easier to have only a single version of the codebase, rather than multiple different versions, thus enhancing determinism. On the other hand, the same code may now be executed on a diversity of computational infrastructures. Executing a code fragment then may produce a variety of floating point and other computational results, lowering the determinism of the code fragment (see the sketch after this list). Mitigation may require prevalidation of the computational infrastructure for a specific code fragment or, instead, may require constraining the range of computational infrastructure on which such a code fragment may be executed. Prevalidation may maximize a computational infrastructure-agnostic approach and thus may allow a single codebase globally, as well as lower maintenance costs and higher redundancy.

    3. Inspection of intellectual property that includes source code and patented and copyrighted components. Determining who has authority and expertise to evaluate validability, as well as what can be shared at which level, has implications for AI creators. Such inspection may include algorithmic correctness verification.53

    4. The AI system’s use of priors. This may include analysis of whether the AI system is designed as a black box (minimal validability), a gray box (limited validability), or detector based (enumerated validability).3 Here, validability is primarily concerned with whether analysis of catastrophic and graceful failures of the AI system shows unanticipated risks, which have been shown to occur more often in black-box than detector-based AI systems.54,55

    5. Full characterization of the training datasets at the patient level, which may include partial or full traceability to individual patients, as well as patient demographics and other patient-specific characteristics. Compare the amount of information needed for validability with that needed for transparency, which could require only aggregate characteristics to be identified; for validability, the requirements could be more strict.
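As a minimal illustration of the determinism concern in item 2, the snippet below shows how the same reduction over the same data can yield slightly different floating point results purely from the order of operations—the kind of infrastructure-dependent variation that prevalidation would have to bound. The data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1_000_000).astype(np.float32)

# The mathematically identical sum, accumulated in different orders:
sequential = np.float32(0.0)
for v in x:                       # strict left-to-right accumulation
    sequential += v

pairwise = x.sum()                # NumPy's pairwise (tree) summation
reversed_order = x[::-1].sum()    # same data, different traversal order

print(sequential, pairwise, reversed_order)
# The three results typically differ in the last bits; a parallel or GPU backend
# may introduce yet another ordering, and hence yet another result.
```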

As shown, both explainability and transparency primarily involve the clinically oriented AI system user, whereas validability primarily involves AI creators, regulators, and nonclinical (technical) AI system users.

Validation of the Artificial Intelligence System and Nonmaleficence.

The ethical principle of nonmaleficence also leads to requirements for nonclinical and clinical validation or testing of the AI system. Nonclinical testing may include: input data compatibility, discussed in the next section; software verification, including software or firmware description, hazard analysis, software requirements specifications, architecture design specifications, and code traceability.51

For clinical validation, common reporting standards,24 CONSORT-AI,23 preregistration of study and analysis protocols,10,56,57 and validated relationship to patient management3 are important factors to enhance reproducibility. Given the many concerns about replicability, preregistration of the study protocol, inclusion and exclusion criteria, and statistical analysis, according to good clinical practice58 or other standards, should be considered. Although potentially beneficial, such standards may not provide sufficient information to help inform regulatory evaluation and have not been recognized by the FDA. See also the FDA’s Recognized Consensus Standards.25 An important decision is whether the AI system is locked before validation, because this affects the external validity and power of any validation study.

The requirements for clinical validation should be commensurate with the risk of harm to the patient. Determining the right balance between resource requirements and burden on AI creators for validation on the one hand, and risk of patient harm from AI system use on the other hand, is essential for patients, patient populations, and the wider health care system to benefit from health care AI carried out the right way.

Validation Study Design.

For AI validation study design, prospective longitudinal or cross-sectional designs may be most appropriate for diagnostic AI, and incorporating as much of the real-world workflow as possible should be considered. Consider the importance of incorporating the actual workflow59 into AI system validation, and the risk of leaving workflow out in a purely observational validation study, as first shown by Fenton et al.3,60 In this pivotal retrospective cohort study, the outcomes of women undergoing breast cancer screening by a radiologist assisted by a previously FDA-cleared (based on a study showing high accuracy of the AI compared with radiologists) assistive AI system were compared with those of women who underwent breast cancer screening by a radiologist without an assistive AI.60 When this assistive AI system was evaluated in the setting of actual workflow—where it assists a radiologist who makes the final clinical decision—outcomes were worse for the women who underwent breast cancer screening with AI assistance. This finding and its implications highlight the importance of evaluating such technologies within the intended workflow, both in the validation clinical trial design and through continuing evaluation after actual deployment, as discussed in the next section. It also aligns well with the trend toward the use of real-world data and the increasing emphasis by the FDA and other regulatory agencies on continuous efficacy assessment in the postmarket phase.

As far as study design is concerned, for diagnostic AI, prospective longitudinal or cross-sectional designs may be appropriate. Such study designs allow hypothesis testing of the effect of the AI diagnostic on patient outcomes or, where diagnosis already has been linked to (untreated) clinical outcome, of the diagnostic accuracy of the AI. For example, diagnostic accuracy hypothesis testing may allow a prospective cohort study design, whereas outcome hypothesis testing likely will require a randomized clinical trial design. Although a null hypothesis of no effect works well in interventional validation studies, a null hypothesis of not informative in a randomized clinical trial may be less desirable for validation of diagnostic AI systems, especially for validation of autonomous AI systems.61,62 Consider that such a randomized clinical trial needs an arm in which patient management is based on the autonomous AI output, including the need for intervention. To emphasize, in this arm the patient management can be based only on the diagnostic output of the autonomous AI, without the possibility of overruling by a clinician. (If clinician overruling is not ruled out, the effect measured would be that of the clinician and the AI combined, rather than of the autonomous AI only.) The autonomous AI may output a diagnosis incorrectly, leading to no treatment and leaving a patient untreated when treatment would have been beneficial. Whether the AI made the incorrect call can be known only when the study is complete.63,64 As mentioned, where diagnosis can be linked to outcome, such a design is not necessary and a cohort design is appropriate.
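Where diagnostic accuracy hypothesis testing against a prespecified sensitivity endpoint is used, the number of participants needed can be approximated with a standard precision-based sample size calculation (often attributed to Buderer). The sketch below is illustrative only; the expected sensitivity, precision, and prevalence values are placeholders, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def subjects_for_sensitivity(expected_se: float, precision: float,
                             prevalence: float, alpha: float = 0.05) -> int:
    """Approximate total enrollment needed to estimate sensitivity to within
    +/- `precision` (normal approximation), given disease prevalence."""
    z = norm.ppf(1 - alpha / 2)
    n_diseased = (z ** 2) * expected_se * (1 - expected_se) / precision ** 2
    return ceil(n_diseased / prevalence)

# Hypothetical example: expected sensitivity 0.87, +/-0.05 precision,
# 25% disease prevalence in the enrolled population.
print(subjects_for_sensitivity(0.87, 0.05, 0.25))
```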

Validation Study Reference Standards.

Consideration should be given to how AI outputs are validated, that is, what these outputs are compared against. For a diagnostic AI system, such a comparison typically is made against an appropriate reference standard, based on its diagnostic indication: from informing a health care provider or patient to driving treatment decisions and making a definitive diagnosis. These reference standards can be categorical or continuous.65

From the principle of nonmaleficence, the effects of the AI system on clinical outcomes are most relevant. These may be indirect, because clinical outcomes may depend on medical decisions that are neither visible to, nor affected by, the AI system. Such clinical outcomes include events of which the patient is aware and wants to avoid, including death, loss of vision, visual field loss, and other events causing a reduction in the patient’s quality of life.66 The resources required to quantify such clinical outcomes objectively can be immense, particularly for chronic disease. In contrast, for acute diseases or interventions, clinical outcomes can be immediate and therefore relatively easier to obtain, such as visual acuity improvement in response to an AI that assists in refraction. For the many chronic diseases to which an AI may be applied, such as diabetic retinopathy, glaucoma, or macular degeneration, clinical outcomes may take years to manifest. Great interest has arisen in the development of alternative outcomes, or surrogate end points,67 in the evaluation of investigational medical products to reduce the cost and shorten the duration of trials.

For diagnostic AI, interest in surrogate end points has focused on prognostic standards, where a patient’s disease state has been related to a future clinical outcome. Obviously, these should be validated and correlated directly to clinical outcome.68 The advantage of a prognostic standard over a surrogate outcome as an end point is that it is not dependent on clinical decisions outside the intended use of the AI system, in other words, its output. For example, within ophthalmology, a prognostic standard is available for diabetic retinopathy, as well as diabetic macular edema, and can be determined by an autonomous diagnostic AI system. However, an expert will make clinical decisions after the diagnosis is determined, such as whether to administer laser treatment or to deliver anti–vascular endothelial growth factor treatment. Such clinical decisions impact the ultimate clinical outcome, but are not made or influenced directly by the AI system. Thus, using a prognostic standard, rather than outcome, to evaluate an AI system has the advantage of not inadvertently diminishing or underestimating the benefits of the AI for decisions outside its control in the context of the clinical outcome.

The Early Treatment Diabetic Retinopathy Study severity scale and the Diabetic Retinopathy Clinical Research Network macular edema scale, as well as the Age-Related Eye Diseases Study macular degeneration scale, are representative of such prognostic standards.69,70 Ideally, the strength of a prognostic standard is determined by the evidence available to support its capacity to predict progression—or manifestation of a condition or disease—or the benefit of a treatment or management. Its strength is also determined by any evidence that shows that treatments based on the prognostic standard correspond to effects on clinical outcome.71,72 Because the Early Treatment Diabetic Retinopathy Study, Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Study, and Diabetic Retinopathy Clinical Research Network studies have established such evidence extensively, this applies to these prognostic standards.

Although requiring less time and fewer resources than developing and validating clinical outcomes, quantifying prognostic standards may still require considerable effort. Although dependent on the intended use, for autonomous diagnostic AI studies, this is likely an important reason why clinician-derived reference standards, instead of prognostic reference standards, are used widely in AI validation.73 A widely cited meta-analysis of the quality of evidence of AI accuracy takes as a given the comparison with clinician-derived ground truth, but the relationship to prognostic standards or clinical outcome is not considered.22 Indeed, it is a major strength of the CCOI and its disease-specific subgroups that they have started discussing the development of such prognostic standards for disease areas of interest.

Other factors that should be considered when evaluating potential reference standards, in addition to their validity or lack thereof against outcome, include: (1) reproducibility of the reference standard (many studies have shown that multiple clinicians evaluate the same patient differently in 30%–50% of cases74–76); (2) repeatability (many studies have shown that the same clinician evaluates the same patient differently in 20%–30% of cases74–76); (3) diagnostic drift (studies have shown that clinicians from different regions, countries, or continents evaluate the same patient differently in up to 50% of cases, leading to vernacular medicine, as explained in the next section39); and (4) temporal diagnostic drift (studies have shown clinicians systematically evaluating the same hypothetical patient differently over generations of clinicians77).
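Reproducibility and repeatability of a reference standard are commonly quantified with agreement statistics such as Cohen’s kappa. The minimal sketch below uses scikit-learn, with made-up grades standing in for two graders’ readings of the same images.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical referable/nonreferable grades assigned by two independent graders
# to the same 12 images (1 = referable, 0 = not referable).
grader_a = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
grader_b = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(grader_a, grader_b)   # chance-corrected agreement
raw_agreement = sum(a == b for a, b in zip(grader_a, grader_b)) / len(grader_a)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```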

Because the evidence for a given treatment based on a given evaluation may have been derived decades ago, temporal drift in a prognostic standard may be hard to determine and difficult to correct for. We want to clarify that, although temporal drift typically is pernicious and undesirable, temporal diagnostic shift, where new and better treatments lead to a new prognostic standard, often is desirable. An example of temporal shift is the shift from the prognostic standard of clinically significant macular edema, as defined by the Early Treatment Diabetic Retinopathy Study, to the new prognostic standard of center-involved macular edema, which is derived from OCT rather than fundus photography and was developed in conjunction with the evaluation of novel anti–vascular endothelial growth factor treatments for macular edema.70,78 Optimally, correction for reproducibility and repeatability with strict evaluation protocols and independent verification, where possible, is indicated.66

Given these considerations, and depending on an AI system’s role, output type, SaMD risk categorization, and risk of harm to the patient, certain types of reference standards may be differentiated based on the rigor or validity of the reference standard (Table 6). Although such a hierarchy, as shown, may be useful for consideration of reference standard differences, no level is tied to a specific intended use. Generally, these levels I through IV can be related to their rigor, with level I having the most rigor. Typically, an AI system that carries more risk of harm, such as one used for personalized treatment (e.g., an artificial pancreas), stand-alone diagnosis, or determination of disease level used in treatment decisions, would be compared with a more rigorous standard. Therefore, it remains up to regulatory agencies around the world to balance the intended use and risk category of the AI system and potentially to include the reference standard level in this balance.

Table 6.

Reference Standard Levels

Level | Description

I | A reference standard that is either a prognostic standard, a clinical outcome, or a biomarker standard. If a prognostic standard, it is determined by an independent reading center. If either a prognostic standard or a biomarker, it is validated against clinical outcome, and temporal drift, reproducibility, and repeatability metrics are published.
II | A reference standard established by an independent reading center. Temporal drift, reproducibility, and repeatability metrics are published. A level II reference standard has not been validated against clinical outcome or a prognostic standard.
III | A reference standard created from the same method as used by the AI, by adjudication or voting of multiple independent expert readers. The readers are documented to be masked, and reproducibility and repeatability metrics are published. A level III reference standard has not been validated against clinical outcome or a prognostic standard and does not have known temporal drift, reproducibility, or repeatability metrics.
IV | All other reference standards, created by single readers or nonexpert readers, possibly without an established protocol. A level IV reference standard has not been validated against clinical outcome or a prognostic standard and does not have known temporal drift, reproducibility, or repeatability metrics, and the readers may not have been masked.

AI = artificial intelligence.

For level I and II reference standards, no reference to methods exists because the methods are determined entirely by the requirements for outcome, prognostic standard, or reading center. Although a higher-level reference standard at first glance may always seem more desirable, in many cases this may not be the preferred choice. Such a higher level may not be available or may even be unachievable, and the requirement for a higher level needs to be balanced with the burden to obtain it. An example is retinopathy of prematurity, where only prognostic standards derived from expert clinicians—that is, level II—are available. At this point, it is ethically impossible to determine a level I standard for retinopathy of prematurity, and in fact level II is the accepted reference standard in the clinical community. Creating a level I standard would require a study that may leave some treatable patients untreated, and thereby harm patients, depending on how accurate the AI under study actually is; thus, requiring level I of an AI creator would be an undue burden and, frankly, an impossible hurdle to overcome.

It is worth reemphasizing that (1) the level of the reference standard is entirely independent of the AI system or its intended use, (2) different intended use cases may require different levels of reference standard, and (3) the level of the reference standard is evaluated entirely separately from the minimally acceptable criteria for performance of the AI. The minimally acceptable criteria can be understood only for a given reference standard level.

Minimal Acceptable Criteria for Validation.

The minimal acceptable criteria for the AI system are the decision cutoffs for determining the safety and efficacy of the AI in hypothesis testing clinical trials to estimate nonmaleficence. Such minimal acceptable criteria include combinations of sensitivity, specificity, and area under the receiver operating characteristic curve. Although the concept of decision cutoffs for safety and efficacy of an AI system may be broadly accepted, it is also a major factor in the review processes by regulatory agencies. As an example, for the first autonomous de novo AI authorized by the FDA, 2 hypotheses had to be confirmed in a preregistered clinical trial, with sensitivity and specificity characteristics exceeding 80% at the population level. This corresponded to study-based end points of 85% for sensitivity and 82.5% for specificity.13
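As an illustration of hypothesis testing against such prespecified endpoints, the sketch below applies an exact one-sided binomial test of whether observed sensitivity exceeds a superiority margin. The counts are invented, and the 80% margin simply mirrors the example above; this is not a reconstruction of the cited trial’s statistical analysis.

```python
from scipy.stats import binomtest

# Hypothetical results: 180 of 200 disease-positive participants were
# correctly identified by the AI (observed sensitivity 0.90).
true_positives, diseased = 180, 200

# H0: sensitivity <= 0.80 vs. H1: sensitivity > 0.80 (one-sided exact test).
result = binomtest(true_positives, diseased, p=0.80, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95, method="exact")  # one-sided: (low, 1.0)

print(f"observed sensitivity = {true_positives / diseased:.3f}")
print(f"one-sided p-value    = {result.pvalue:.4f}")
print(f"95% lower bound      = {ci.low:.3f}")
```

An analogous test against the specificity margin would be run on the disease-negative participants; both must succeed for the trial’s coprimary endpoints to be met.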

Theoretically, such minimal acceptable criteria can be derived analytically, with the goal of minimizing subjectivity and maximizing external validity. Thus, approaches have been developed to arrive at analytical solutions for diagnostic algorithm end points, including Pareto optimization, Youden and Euclidean indices for sensitivity–specificity combinations,79–82 quantitative cost-benefit derivative analysis, as well as (modified) Angoff approaches.83,84 Specifically, the (modified) Angoff approach has been validated for setting testing thresholds in educational settings. These analytical approaches are helpful in informing the choices to be made by improving the understanding of the risks and benefits of any choices made.
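For instance, the Youden index mentioned above (J = sensitivity + specificity − 1) can be computed across the candidate operating points of a trained model. The sketch below uses scikit-learn’s ROC utilities on made-up labels and scores, purely to show the mechanics.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels (1 = disease) and AI output scores for a validation set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, size=500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden = tpr - fpr                        # J = sensitivity + specificity - 1
best = np.argmax(youden)

print(f"threshold = {thresholds[best]:.3f}, "
      f"sensitivity = {tpr[best]:.3f}, specificity = {1 - fpr[best]:.3f}, "
      f"J = {youden[best]:.3f}")
```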

Alternatively, minimal acceptable criteria for a diagnostic AI can be set to conform to existing diagnostic procedures. For example, when excluding pulmonary embolism, negative test results should have a 3-month thromboembolic risk of less than 3%, which is derived from the equivalent risk after negative pulmonary angiography findings, the gold standard.85 An understanding of the accuracy of comparable diagnostic processes performed by human clinicians and other human experts should be a requirement (note that the current diagnostic standard of care may not necessarily involve a clinician in the future because AI systems at some point may be considered as standard of care). In contrast, the existing literature does not offer guidance on these minimal acceptable criteria for an autonomous AI performing the diabetic eye examination because the standard of care by ophthalmologists reaches sensitivity of only 33% or 34%.75,76

Given such widespread lack of scientific evidence for specific minimal acceptable criteria, deciding on them involves ethical and cost-effectiveness analyses and other risk-benefit trade-offs by patients, clinicians, and payors. Such decisions typically require the involvement of domain experts. As examples, minimally acceptable criteria for screening mammography were determined by a set of domain experts using a modified Angoff approach,84 and a sampled survey of pediatricians was used to estimate the minimally acceptable sensitivity threshold for a streptococcal pharyngitis test in children.86 For such approaches to work, it is important that the experts involved fully grasp the spectrum of risks and benefits for patients of each alternative set of criteria. This may not always be the case: in the latter study, 80% of pediatricians proposed a sensitivity of at least 95%, which was not achievable by any feasible test under consideration.86 The structured collection of patient preferences, also known as patient preference information, could also be included in shaping these decisions.87 Thus, the following stages can be considered in isolation or in the aggregate for setting minimal acceptable criteria.

  1. Literature or meta-analysis review of existing minimal acceptable criteria and assignment of weights to the consequences of test misclassifications, according to 1 or more metrics such as cost or quality-adjusted life years. An example is estimating whether the consequences of missing a case, such as increased morbidity or cost at a later stage when the disease manifests more clearly, outweigh the consequences of misclassifying a noncase as a case, such as unnecessary radical diagnostic or treatment decisions with major side effects. Scientific evidence of comparable diagnostic processes, performed by human clinicians and other human experts, should be included, if available, or may need to be collected, if not available.

  2. Analysis of a representative spectrum of sensitivity and specificity combinations and determination of the downstream cumulative weight of consequences for patients88,89 and other stakeholders in the health care system, including patient preference information (see the sketch after this list).

  3. A consensus process among domain experts (e.g., a network of experts)90 that can generate agreement on minimal acceptable criteria, for example, using vignettes that condense the analytical evidence to minimize bias among the domain experts.
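
As a minimal sketch of stages 1 and 2, the following Python example assigns hypothetical weights to the downstream consequences of false-negative and false-positive results and computes the expected misclassification burden per 10,000 screened patients for a spectrum of candidate sensitivity-specificity combinations; the prevalence, weights, and operating points are assumptions for illustration, not recommendations.

```python
# Minimal sketch of stages 1 and 2: weighting the downstream consequences of
# misclassification across a spectrum of sensitivity-specificity combinations.
# All weights, prevalence, and operating points are hypothetical.

prevalence = 0.08                 # hypothetical disease prevalence
n_screened = 10_000               # population size used for reporting
cost_false_negative = 20_000.0    # hypothetical downstream cost of a missed case
cost_false_positive = 500.0       # hypothetical cost of an unnecessary workup

# Hypothetical candidate operating points (sensitivity, specificity).
candidates = [(0.98, 0.70), (0.92, 0.82), (0.87, 0.90), (0.80, 0.96)]

def expected_burden(sens, spec):
    """Expected misclassification cost per n_screened patients."""
    cases = prevalence * n_screened
    noncases = (1.0 - prevalence) * n_screened
    fn = (1.0 - sens) * cases
    fp = (1.0 - spec) * noncases
    return fn * cost_false_negative + fp * cost_false_positive

for sens, spec in candidates:
    print(f"sens={sens:.2f} spec={spec:.2f} "
          f"expected burden ${expected_burden(sens, spec):,.0f}")
```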

Postmarket Monitoring of Artificial Intelligence Systems and Nonmaleficence.

Monitoring of the safety and efficacy of an AI system is important because it affects nonmaleficence. Real-world performance monitoring after implementation can be achieved by putting a prospective monitoring protocol in place. Such a prospective monitoring protocol may be agreed on by a regulatory agency—for example, implemented as part of a comprehensive Quality Management System following 21 CFR 820—and may accommodate user feedback, complaints, and reportable events. In addition, other AI system characteristics that are within creators’ control such as usability, user experience, product performance, and necessary safety controls, including a comprehensive framework for cyber security, data protection, and data privacy, also may be monitored.

To ensure continued acceptable performance of an autonomous AI system, a prospective monitoring protocol may require, for example, that the AI system output be compared with the same reference standard that was used in (premarket) validation, to determine whether the system still meets safety and efficacy standards in the post-implementation real world. As discussed in the previous section, more rigorous, higher-level reference standards often require substantial resources from patients and creators. Real-world monitoring may require the collection of this reference standard for each monitored patient, which thereby diminishes the very reasons the AI system was implemented in the first place, such as improved access, lower cost, and patient friendliness. Thus, prospective monitoring protocols will have to balance the burden on AI creators and patients, on the one hand, against nonmaleficence, on the other.
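
One possible realization of such a protocol is sketched below (in Python, with simulated monitoring counts): the outputs of the deployed AI for a monitored subsample are compared against the same reference standard used in premarket validation, and the system is flagged when the lower confidence bound of cumulative sensitivity falls below the premarket threshold. The threshold, batch sizes, and counts are hypothetical assumptions.

```python
# Minimal sketch: prospective real-world performance monitoring.
# AI outputs for a monitored subsample are compared with the same reference
# standard used in premarket validation; an alert is raised when the lower
# confidence bound of sensitivity drops below the premarket threshold.
# All data and thresholds are hypothetical.
from scipy.stats import beta

PREMARKET_SENSITIVITY_THRESHOLD = 0.85   # hypothetical premarket end point

def lower_bound(successes, n, alpha=0.05):
    """One-sided lower Clopper-Pearson confidence bound."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, n - successes + 1)

def monitor(batches):
    """batches: list of (true_positives, reference_positive_total) per period."""
    tp_total, pos_total = 0, 0
    for period, (tp, n_pos) in enumerate(batches, start=1):
        tp_total += tp
        pos_total += n_pos
        lb = lower_bound(tp_total, pos_total)
        status = "ALERT" if lb < PREMARKET_SENSITIVITY_THRESHOLD else "ok"
        print(f"period {period}: cumulative sensitivity "
              f"{tp_total / pos_total:.3f}, lower bound {lb:.3f} -> {status}")

# Hypothetical monitored counts of reference-standard-positive patients.
monitor([(46, 50), (44, 50), (41, 50), (47, 50)])
```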

Changing an Artificial Intelligence System after Validation.

As an AI system is used on patients and continuous efficacy monitoring is in effect, opportunities exist to improve the AI system's technical specifications in terms of safety, efficacy, equity (see “Equity” section), or a combination thereof. Artificial intelligence systems, that is, SaMDs that use AI or machine learning, have the unique capacity to be updated after implementation. In fact, if an AI system is not locked after validation, potentially unlimited configurability exists.

It is important to determine that changes to the technical specifications, while intended to improve the AI system, do not negatively affect the ethical principles of nonmaleficence and equity. Traditionally, from a regulatory perspective, almost all changes to the technical specifications of an SaMD that affect safety or effectiveness may require a new validation; cybersecurity changes may be the only ones currently possible without such full validation, depending on how one interprets current FDA guidance.91 Thus, safely updating the AI system requires that appropriate controls and validation methodologies be in place. These controls and methodologies depend both on the type of change and on the risk of patient harm, and we differentiate the following types of changes:

  1. Changes to AI system computations, including (1) changes to preprocessing and postprocessing algorithms; (2) changes to the algorithmic infrastructure, including hardware and software; and (3) changes to the AI algorithm architecture intended to improve performance, such as changes to the type of classifier, to hyperparameter and parameter values (including model weights), and to the training data.

  2. Changes to AI system input, while keeping the output type constant, including (1) a change in the imaging system, such as the optics, sensor, image compression, or imaging protocol; and (2) the addition of other patient information, such as pulse, visual acuity, and intraocular pressure, used along with the original inputs by the AI to determine its output.

  3. Changes to the AI system output, while keeping the input types unchanged, include marking regions of interest when previously only a normal or abnormal output was validated.

  4. Changes to AI system indications and intended use. An example is accumulating scientific evidence that an AI system that was validated as a referral tool, and authorized as such by a regulatory agency, is actually being used as a diagnostic tool as it becomes more accepted in the clinical community, so that its performance thresholds need to be adjusted to support such use. Other examples include changes to (1) the inclusion or exclusion criteria, such as expansion to people with a different risk of having the disease or to age groups, ancestries, races, or ethnicities that were not accounted for in the design or validation of the AI system to be improved; (2) the disease level or threshold; and (3) the disease type, for example, macular degeneration when the system previously was validated for diabetic retinopathy.

An important component of AI system changes is the method of change validation that is used to establish safety, efficacy, and equity of the changed AI system. Artificial intelligence systems may differ in the data that were collected for their validation. At one end of this spectrum, a recent autonomous AI system required a full preregistered clinical trial—a pivotal trial—comparing AI output against a level I prognostic standard.13 Depending on the patient risk of harm and the type of change, as set forth in the previous section, the following categories of such methods can be discerned (as an aside, many of these methods require the pivotal trial data of the index AI system to have been escrowed under a so-called algorithmic integrity protocol13):

  1. Regression identity testing to establish nonprobabilistically that, for any input data, changes to the AI system do not result in any change whatsoever in the diagnostic output (see the sketch following this list).

  2. Bench validation to test formally the statistical hypothesis that a change that can impact the AI algorithm, for example, a change in graphics processing unit (GPU), has no impact on the diagnostic output for any input from a given group of participants.

  3. Recursive validation to test formally the statistical hypothesis that a change in input type, such as a change in imaging system, has no impact on the diagnostic output compared with the index AI system output. Recursive validation uses the index AI system output as the reference standard.92 It is similar to a reproducibility study,92 in which the output of the index AI system is compared with that of a modified AI system with the inputs slightly perturbed.

  4. Performance (safety, effectiveness, and equity) bracketing. Analytically, the maximum change in performance metrics caused by a specific change in the algorithm can be calculated and bounded quantitatively, and these brackets can be used to ensure that the changed system's performance remains within expectations and continues to exceed the minimally acceptable criteria that were determined for the index system's pivotal trial.

  5. Escrowed validation study iteration to test the hypothesis statistically that an AI system is not inferior, or possibly superior, to the index AI system. This can be achieved by reusing the inputs of the index AI system validation dataset that were escrowed previously and comparing the outputs of the changed system with that escrowed, established reference standard. Limits exist on the number of iterations that can be achieved, as explored by Ioannidis,93 because each dataset reuse increases the potential for overfitting to the escrowed validation data.94 The degree to which escrowed dataset reuse leads to false-positive claims and overfitting can be quantified through systematic frameworks, including the dataset positive predictive value framework. The success of this approach depends on parameters including the number of available escrowed validation subjects, type 1 and type 2 error rates, and the degree of dependence between outputs of the index AI system and the modified AI system. The validation study needs to have been escrowed as part of the preregistration algorithm integrity process for this to be a valid methodology.13,95

  6. Escrowed validation study expansion to test statistically the hypothesis that the AI system is not inferior or possibly superior because of a change in target patient population. Escrowed validation study expansion reuses the inputs of the index AI system validation dataset that has been escrowed, expands this dataset with participants from the new target patient population, and then compares the outputs of the changed system with the reference standard. Either the identical workflow can be used, or a secondary analysis on the effect, if any, of a change in workflow is required. As new participants are added to the original study for this expansion, information is gained, and this may compensate for the information loss and risk of overfitting from dataset reuse.95 As with escrowed validation study iteration, it is critical to monitor the overall degree of dataset reuse.

Here, “index AI system” refers to the AI system that was validated in a pivotal trial. The term escrowed, under an algorithm integrity protocol, signifies that the human participant input data (including the corresponding reference standard) collected in the pivotal trial are kept inaccessible to the creators by an independent third party. Thus, a complete arm's-length chain of custody exists for any access to or use of these data by the index AI system developer, for example, for retraining a modified AI system, somewhat analogous to the concept of clinical trial preregistration.
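
As an illustration of the first and simplest of these methods, regression identity testing, the following minimal sketch (in Python) runs the index and the changed AI system on the same escrowed inputs and verifies nonprobabilistically that every diagnostic output is identical, recording a cryptographic hash of the serialized outputs as a compact audit artifact; the predict functions and escrowed inputs shown are hypothetical placeholders, not a real device interface.

```python
# Minimal sketch: regression identity testing after a change to an AI system.
# Both systems are run on the same (escrowed) inputs and every diagnostic
# output must match exactly; hashing the serialized outputs gives a compact,
# auditable record of the comparison. The predict functions and input list
# are hypothetical placeholders, not a real device interface.
import hashlib
import json

def output_fingerprint(predict_fn, inputs):
    """Run the system on every input and hash the ordered list of outputs."""
    outputs = [predict_fn(x) for x in inputs]
    serialized = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest(), outputs

def regression_identity_test(index_predict, changed_predict, inputs):
    index_hash, index_out = output_fingerprint(index_predict, inputs)
    changed_hash, changed_out = output_fingerprint(changed_predict, inputs)
    mismatches = [i for i, (a, b) in enumerate(zip(index_out, changed_out))
                  if a != b]
    return index_hash == changed_hash, mismatches

# Hypothetical stand-ins for the index and changed systems and escrowed inputs.
index_predict = lambda x: "refer" if sum(x) > 1.0 else "no refer"
changed_predict = lambda x: "refer" if sum(x) > 1.0 else "no refer"
escrowed_inputs = [[0.2, 0.3], [0.9, 0.4], [0.1, 0.1]]

identical, mismatches = regression_identity_test(index_predict,
                                                 changed_predict,
                                                 escrowed_inputs)
print("outputs identical:", identical, "| mismatching indices:", mismatches)
```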

The above studies of course can be performed by the AI creator or by independent research groups.

Autonomy

Analysis of autonomy of the patient with respect to AI leads to at least 2 important considerations.

The first is the use of patient-derived data, which applies both to training data for the AI system algorithms and to implementation, where the AI system collects input data to determine its outputs. Transparency may include how the AI system creator uses patient-derived data beyond this AI system's intended use. An example is insight into whether patient-derived data are monetized for purposes other than diagnosis by the AI. Autonomy is greater when the collection of patient-derived data is lawful and complies with regulations and best practices. This may include compliance with the Health Insurance Portability and Accountability Act, the Health Information Technology for Economic and Clinical Health Act, other data security aspects of 21 CFR 50, the Declaration of Helsinki, and other statutory and regulatory rules in place, in a manner that is transparent about the purpose and scope for which the data will be used.96 Ideally, patient-derived data used by AI creators are traceable to the patient's authorization to use those data. Those involved in the design of AI systems should be accountable, as stewards of patient-derived data, for protecting patient rights. Auditable processes and security controls help ensure that patient data are used in accordance with the scope for which they were authorized and protect the data from unauthorized use or access.

A current controversy concerns the reward or recognition of clinicians who contribute a reference standard to the patient-derived data incorporated in the intellectual property of an AI system. Such contributions may include their diagnostic work recorded in medical records, subsequently used to train or evaluate an AI system.97 Such claims of ownership collide with a rising public desire for increased control over, and privacy with respect to, electronic data; with emerging regulations that address these concerns (the General Data Protection Regulation [European Union] 2016/679 and the California Consumer Privacy Act, Cal. Civ. Code § 1798.100 et seq.); and with increasing patient activism seeking recognition for contributions to scientific advances.

The second consideration is that liability for AI system malfunction is related to autonomy. Abramoff et al3 previously proposed that creators of autonomous AI systems assume liability for harm caused by the diagnostic output of the device when it is used properly and on label. In their article, they state that this is essential for adoption: it may be inappropriate for clinicians who use an autonomous AI to make a clinical decision they are not comfortable making themselves to nevertheless carry full medical liability for harm caused by that autonomous AI. This view was recently endorsed by the American Medical Association in its 2019 AI policy.6 Such a paradigm for responsibility is more complex for assistive AI, where medical liability may fall only on the provider using it, because the provider is ultimately responsible for the medical decision, or on a combination of both, where even the relative balance of liability between the AI user and the AI creator comes into play.

Meanwhile, as Abramoff et al3 proposed elsewhere, medical decisions by autonomous AI for an individual patient typically cannot be labeled unequivocally as correct or incorrect, especially in chronic diseases, where outcomes may emerge years later. However, for populations of patients, the medical decisions can be compared statistically with the desired decisions, for example, with the claimed correctness, and thus that is where liability should be focused. Another issue is that, although autonomous AI is preferably validated against patient outcome or prognostic standards, these comparisons require enormous resources that are not available for an individual patient when liability is at stake. Instead, the autonomous AI decision may be compared with that of an individual physician or group of physicians, lacking validation, and thus with unknown correspondence to outcome or surrogate outcome. As an aside, this can also be an issue for so-called continuous learning AI systems.

These distinctions will need to be resolved as various AI applications move forward. The legal responsibility for an AI system built in partnership with a large health care system and intended to be used on its patient population is by definition more diffuse and is likely to vest in the sponsoring health care system, or to be apportioned through some comparative or contributory analysis of fault. A privately designed system, sold as a finished product, may need to bear its own responsibility for autonomous output, absent superseding or intervening causation. Responsibility for proper use and maintenance of the AI system, consistent with the terms of service and FDA or other regulatory agency labeling, remains with the provider: the practice of medicine.

Finally, the output of the autonomous AI system, although valid as a diagnostic record from a regulatory perspective, currently is not defined as a medical record when it is not signed off on by a physician. What is and is not, and who can and cannot create, a medical record is determined in the United States primarily by the State medical boards or their equivalent. At present, such boards do not consider an autonomous AI output to have the same medicolegal status as physician documentation, and the legal status of reports generated by AI has been brought to the attention of the United States Federation of State Medical Boards.

Equity

The third bioethical principle is equity. We mentioned previously that we use this term rather than the traditional bioethical term justice for the same concept. Equity primarily concerns the impact at the population level, beyond the impact on an individual patient. In the context of AI, this translates to estimating the differential impact on safety, or on any other characteristic of the AI system, for members of one group with respect to members of other groups. Any differences are referred to as health disparities. For example, inappropriate bias of the AI system may result in the AI system being less safe for one group, characterized according to race,98,99 ethnicity, sex,100 age, income, or other categories, than for another, even though on average it was found to be safe. Any medical process has the potential either to increase or to decrease health disparities, depending on how it is used. Because of the scale at which AI systems operate, their potential to increase or decrease disparities also is magnified tremendously.

Inappropriate bias, an increase in health disparities, and thus decreased equity can manifest across the entire AI pipeline, as Char et al4 outlined, including in the choice of intended use of the AI, its design, its validability, its validation, the choice of reference standards, and how and where it is implemented. For example, with respect to design, the lower validability of a black-box algorithmic approach, with implicit priors and models that cannot be analyzed and evaluated, makes bias harder to anticipate, to detect, and to mitigate. Another design example is how incomplete or unrepresentative training data, or a reliance on complete and representative data that nevertheless reflect and reproduce (at scale) pre-existing health care bias, increases the risk of worsening health disparities. As far as validation is concerned, the selection of study sites and biased inclusion and exclusion criteria can decrease validity for certain subgroups and thereby exacerbate health disparities. Finally, implementing the AI system preferentially in some populations over others may limit access for disadvantaged groups and thereby increase health disparities.

Validation can be used to measure equity by testing for the presence or absence of an effect of predefined subgroup characteristics, typically race, ethnicity, age, and gender, on the performance of the AI system, such as its sensitivity and specificity.101 In addition, differential use across subgroups will affect equity, and such effects can be compared using metrics like population-achieved sensitivity (see the next section).98
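
A minimal sketch of such a subgroup analysis follows (in Python, with hypothetical per-subgroup counts): sensitivity and specificity are estimated separately for each predefined subgroup, with Clopper-Pearson confidence intervals, so that differential performance can be examined; the subgroup labels and counts are assumptions for illustration.

```python
# Minimal sketch: subgroup performance analysis for equity assessment.
# Sensitivity and specificity are estimated per predefined subgroup, with
# Clopper-Pearson confidence intervals, so that differential performance
# can be examined. Counts below are hypothetical.
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    hi = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lo, hi

# Per-subgroup confusion-matrix counts: (tp, fn, tn, fp), hypothetical.
subgroups = {
    "subgroup A": (88, 12, 430, 70),
    "subgroup B": (80, 20, 390, 110),
    "subgroup C": (45, 5, 220, 30),
}

for name, (tp, fn, tn, fp) in subgroups.items():
    sens, sens_ci = tp / (tp + fn), clopper_pearson(tp, tp + fn)
    spec, spec_ci = tn / (tn + fp), clopper_pearson(tn, tn + fp)
    print(f"{name}: sensitivity {sens:.2f} "
          f"(95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f}), "
          f"specificity {spec:.2f} "
          f"(95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```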

As mentioned, when analyzing the equity of an AI system, particularly in the context of health disparities, it is useful to consider the implementation context. Different diagnostic processes, including AI systems, may differ in patient friendliness, availability, access, and direct and indirect cost, even with equal sensitivity and specificity (i.e., equally high nonmaleficence).

With respect to intended use and implementation in the context of equity, the goal of the diagnostic process at the population level is to identify the maximum number of true cases of disease in that population. A given diagnostic process, like a high-performing AI system, may have a high sensitivity; that is, nonmaleficence is maximized for those patients who have access. However, if, for example, this AI system is available in only one place, the number of cases identified will not be maximized because many in that population simply never undergo its diagnostic process.

Population-achieved sensitivity, or access-corrected sensitivity, is used to analyze such effects on equity. That is, although an AI system (or any diagnostic process) with very high sensitivity is attractive from an individual (nonmaleficence) perspective, if only a few people have access to the diagnostic AI, the population-achieved sensitivity (PAS), or effective sensitivity at the population level, will be much lower, and so, concomitantly, will its equity:

$\mathrm{PAS} = \dfrac{s_c \, c \, p_c}{c \, p_c + (1 - c)\, \hat{p}_{nc}} \approx s_c \, c,$

where $s_c$ is the sensitivity (as determined in the adherent population), $c$ is compliance (or adherence), $p_c$ is the measured prevalence in the adherent population, and $\hat{p}_{nc}$ is the estimated prevalence in the nonadherent population. When we assume $p_c \approx \hat{p}_{nc}$, that is, that the prevalence of the disease is the same in the nonadherent population as in the adherent population, we can use the simplified estimate $s_c \, c$.102 For example, if compliance $c$ with the diabetic eye examination is 15%102 and the minimum acceptable sensitivity is 85%,13 then the PAS is 0.13. That is, only 13% of cases in the population will be identified correctly with this diagnostic system. In many cases, the prevalence in the part of the population that does not undergo the AI system is actually even higher than in the adherent population, so this estimate of PAS forms an upper bound. It is useful to consider PAS when determining the minimum acceptable sensitivity: a more accessible AI system may have a lower $s_c$ but still result in a higher PAS because compliance is higher.
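
The following minimal sketch (in Python) implements the PAS expression above, both the full form and the simplified estimate $s_c \, c$, using the 85% sensitivity and 15% compliance figures from the worked example; the prevalence values are hypothetical assumptions.

```python
# Minimal sketch: population-achieved (access-corrected) sensitivity.
# Implements PAS = (s_c * c * p_c) / (c * p_c + (1 - c) * p_nc_hat)
# and the simplified estimate s_c * c when p_c is close to p_nc_hat.

def population_achieved_sensitivity(sens, compliance, prev_adherent,
                                    prev_nonadherent=None):
    if prev_nonadherent is None:          # assume equal prevalence
        prev_nonadherent = prev_adherent
    cases_identified = sens * compliance * prev_adherent
    cases_in_population = (compliance * prev_adherent
                           + (1.0 - compliance) * prev_nonadherent)
    return cases_identified / cases_in_population

sens = 0.85        # minimum acceptable sensitivity from the worked example
compliance = 0.15  # compliance with the diabetic eye examination

print("simplified PAS (s_c * c):", round(sens * compliance, 3))
# Hypothetical prevalences: higher prevalence in the nonadherent population
# lowers PAS further, so the simplified estimate is an upper bound.
print("PAS with p_c = 0.08, p_nc = 0.12:",
      round(population_achieved_sensitivity(sens, compliance, 0.08, 0.12), 3))
```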

Conclusions

The considerations in this article are a useful first step in the development of a bioethically sound foundation, based in nonmaleficence, autonomy, and equity, for the design, validation, and implementation of AI systems. The exceptional and diverse experience within the CCOI's FPOAI working group means it is well placed to develop and evaluate such a foundation. Future FPOAI consensus statements and cooperation among AI creators, industry, ethicists,3,4 clinicians, patients, and regulatory agencies are key to facilitating rapid innovation of AI technologies and their successful implementation in clinical medicine. Such global collaboration will adhere to bioethical principles and will guide the development and use of clinical AI, helping to make fundamental improvements in the accessibility and quality of health care, to decrease disparities, and to lower the overall cost of health care.

Acknowledgments

Supported in part by The Robert C. Watzke, MD, Professorship (M.D.A.) and by Research to Prevent Blindness, Inc, New York, New York (unrestricted grants to the Department of Ophthalmology and Visual Sciences, University of Iowa [M.D.A.]; the Department of Ophthalmology, University of Wisconsin [B.B.]; and the Department of Ophthalmology, Stanford University [T.L.]).

Members of the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group as of writing: Michael D. Abràmoff, MD, PhD (Chair, University of Iowa); Malvina B. Eydelman, MD (Center for Devices and Radiological Health, Office of Health Technology 1, United States Food and Drug Administration); Brad Cunningham, MSE (Center for Devices and Radiological Health, Office of Health Technology 1, United States Food and Drug Administration); Bakul Patel, MBA (Center for Devices and Radiological Health, Digital Health Center of Excellence, United States Food and Drug Administration); Karen A. Goldman, PhD, JD (OPP, United States Federal Trade Commission); Danton Char, MD, MS (Stanford University); Taiji Sakamoto, MD (Kagoshima University, Japanese Ophthalmological Society); Barbara Blodi, MD (Department of Ophthalmology, University of Wisconsin); Risa Wolf, MD (Department of Pediatrics, Johns Hopkins University); Jean-Louis Gassee (Apple); Theodore Leng, MD, MS (Department of Ophthalmology, Stanford University School of Medicine); Dan Roman (Director Diabetes Measures, National Committee of Quality Assurance); Sally Satel (Yale, AEI, data usage ethics); Donald Fong (Kaiser Permanente); David Rhew (Chief Medical Officer, Microsoft); Henry Wei (Google Health); Michael Willingham (Google Health); Michael Chiang, MD, PhD (Director, National Eye Institute); Mark Blumenkranz, MD (Facilitator, Stanford University). Although the members’ main affiliations are stated, they do not in every case represent their institution or company. Members of the Collaborative Community on Ophthalmic Imaging Executive Committee: Michael Abramoff, MD, PhD; Mark Blumenkranz, MD; Emily Chew, MD; Michael Chiang, MD; Malvina Eydelman, MD; David Myung, MD, PhD; Joel S. Schuman, MD; and Carol Shields, MD.

The FDA participates in the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group as a member of the Collaborative Community on Ophthalmic Imaging Foundation. This manuscript reflects the views of the authors and should not be construed to represent FDA’s views or policies.

Abbreviations and Acronyms:

AI = artificial intelligence

CCOI = Collaborative Community on Ophthalmic Imaging

CT = computed tomography

DICOM = Digital Imaging and Communications in Medicine

FDA = U.S. Food and Drug Administration

FPOAI = Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group of the Collaborative Community on Ophthalmic Imaging, Washington, DC

GPU = graphics processing unit

ISO = International Organization for Standardization

MRI = magnetic resonance imaging

PAS = population-achieved sensitivity

SaMD = Software as a Medical Device

Footnotes

Disclosure(s):

All authors have completed and submitted the ICMJE disclosures form.

Michael Chiang, an editor of this journal, was recused from the peer-review process of this article and had no access to information regarding its peer review.

The author(s) have made the following disclosure(s): M.D.A.: Executive Chairman, Equity Owner, Founder, Patents and other Intellectual Property, Royalties, Consultant – Digital Diagnostics Inc, Coralville, Iowa.

J.M.K.: Chair – American Academy of Dermatology Committee on Augmented Intelligence; Equity owner – Skin Analytics

HUMAN SUBJECTS: No human subjects were included in this study. No animal subjects were included in this study.

References
