Published in final edited form as: Ophthalmology. 2021 Aug 31;129(2):e14–e32. doi: 10.1016/j.ophtha.2021.08.023

Foundational Considerations for Artificial Intelligence Using Ophthalmic Images

Michael D Abràmoff 1,16,17, Brad Cunningham 2, Bakul Patel 3, Malvina B Eydelman 2, Theodore Leng 4, Taiji Sakamoto 5,6, Barbara Blodi 7, S Marlene Grenon 8,18, Risa M Wolf 9, Arjun K Manrai 10,11, Justin M Ko 12, Michael F Chiang 13, Danton Char 14,15, Collaborative Community on Ophthalmic Imaging Executive Committee and Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group

Abstract

Importance:

The development of artificial intelligence (AI) and other machine diagnostic systems, also known as software as a medical device, and their recent introduction into clinical practice require a deeply rooted foundation in bioethics for consideration by regulatory agencies and other stakeholders around the globe.

Objectives:

To initiate a dialogue on the issues to consider when developing a bioethically sound foundation for AI in medicine, based on images of eye structures, for discussion with all stakeholders.

Evidence Review:

The scope of the issues and summaries of the discussions under consideration by the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group, as first presented during the Collaborative Community on Ophthalmic Imaging inaugural meeting on September 7, 2020, and discussed afterward within the working group.

Findings:

Artificial intelligence has the potential to improve health care access and patient outcomes fundamentally while decreasing disparities, lowering cost, and enhancing the care team. Nevertheless, substantial concerns exist. To attain this potential, it is essential that bioethicists, AI algorithm experts, the Food and Drug Administration and other regulatory agencies, industry, patient advocacy groups, clinicians and their professional societies, other provider groups, and payors (i.e., stakeholders) work together in collaborative communities to resolve the fundamental ethical issues of nonmaleficence, autonomy, and equity. Resolution of these issues affects all levels of the design, validation, and implementation of AI in medicine, each of which warrants meticulous attention.

Conclusions and Relevance:

A bioethically sound foundation can be developed if it is based in the fundamental ethical principles of nonmaleficence, autonomy, and equity as they apply to the design, validation, and implementation of AI systems. Such a foundation will support the continued successful introduction of AI into medicine and its consideration by regulatory agencies, thereby enabling important improvements in the accessibility and quality of health care, decreases in health disparities, and lower cost. These considerations should be discussed with all stakeholders and expanded on as a useful initiation of this dialogue.

Keywords: Artificial intelligence, Augmented intelligence, Clinical standards, Clinical trial, Cornea, Ethics, FDA, Glaucoma, Oculoplastics, Regulation, Retina, Safety, imaging, non-maleficence, equity, autonomy, patient benefit, health disparities, population health, clinical outcome, validation, explainability, validability, transparency, population achieved sensitivity, vernacular medicine, scalability


The Collaborative Community on Ophthalmic Imaging (CCOI) formed in 2019 to advance the innovation of ophthalmic imaging with a focus on medical devices using artificial intelligence (AI).1,2 The CCOI’s Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation (FPOAI) Working Group was established in March 2020 to generate consensus on a bioethical foundation for AI of ophthalmic imaging for consideration by all stakeholders in the health care system, including, but not limited to, the United States Food and Drug Administration (FDA) and other regulatory agencies. Its processes draw on the expertise of bioethicists,3,4 AI algorithm experts, FDA and other regulatory agencies, as well as industry, patients, patient advocacy groups, clinicians and their professional societies, and payors,5 to identify best practices for addressing novel issues emerging with AI conception, evaluation, and implementation, including validation, reference standards, performance metrics, accountability for output, bias, and impacts on workflow.

The terms artificial intelligence and augmented intelligence are used interchangeably for systems that perform tasks that mimic human cognitive capabilities.1 The authors use artificial intelligence to refer to the concept of programming computer systems to perform tasks to mimic human cognitive capabilities—such as understanding language, recognizing objects and sounds, learning, and problem solving—by using logic, decision trees, machine learning, or deep learning. Such anthropomorphic AI systems, which are becoming more common, are not programmed explicitly and instead learn from data that reflect highly cognitive tasks, typically performed by trained health care professionals. In some cases, these AI systems are used to aid health care professionals.6 The introduction of AI in medicine has the potential to improve quality, reduce costs, diminish health disparities, and increase accessibility, as well as enhance the care team, at both the individual and population levels.7,8 Thus, its introduction aligns with the American Medical Association’s principle of quadruple aim of improved outcomes, lower cost, improved patient experience, and improved clinician experience.9 After the first FDA de novo clearance for an autonomous AI,10 that is, an AI system that makes a clinical decision without human oversight,10 AI has entered mainstream health care, including standards of care.11 The use of AI in the ophthalmic setting has been studied for many applications,12 including in diseases such as diabetic retinopathy,13 retinopathy of prematurity,14 macular degeneration,15 glaucoma,16 and cancer,17 as well as many other ocular conditions, such as those of the cornea18 and other parts of the anterior segment.19

To maximize AI’s benefits, many ethical, economic, and scientific issues, including algorithmic bias, safety, efficacy, and equity—terms that are explained in the next section—need to be addressed in a transparent fashion for acceptance by all stakeholders. So far, studies establishing scientific evidence for the safety and other criteria of AI in general are quite limited, with few exceptions.20 In a meta-analysis of 81 AI clinical trials, only 9 were prospective, and just 6 were tested in a nonresearch, clinical setting.21 The relationship of the AI’s diagnostic accuracy to clinical outcomes was not addressed in this widely cited study; more generally, in an analysis of 126 published diagnostic accuracy studies, only 12% reported any statistical test of a hypothesis related to the study objectives.22

Reporting standards for AI studies have been published recently, such as the Consolidated Standards of Reporting Trials-Artificial Intelligence (CONSORT-AI),23 and an AI extension to the Standards for Reporting of Diagnostic Accuracy Studies24 is under development. Although such reporting standards may improve consistency, they may not provide sufficient information to inform regulatory evaluation and have not been recognized by the FDA (see also the FDA’s Recognized Consensus Standards25). Additional considerations beyond these recommendations therefore may be needed for regulatory evaluation, many of which are the subject of this analysis (the “Considerations”).

These first Considerations to come from our FPOAI Working Group present the scope of the issues and concepts and briefly summarize the discussion on diagnostic AI and other software as a medical device (SaMD) systems that use images of the eye, as first presented during the Collaborative Community on Ophthalmic Imaging inaugural meeting on September 7, 2020, and later discussed within the FPOAI Working Group.26 Specifically, they describe both clinical constraints for AI systems and bioethically founded constraints derived from the 3 major bioethical principles of nonmaleficence, equity, and autonomy. Although, as FPOAI Working Group stakeholders, we realize the tremendous potential advantages of AI systems, we also realize that substantial concerns exist within the scientific and clinical communities, as well as society at large. Therefore, involvement of all stakeholders35 to resolve ethical issues, including nonmaleficence, autonomy, and equity, is key.

Design, validation, and implementation of diagnostic AI systems warrant meticulous attention. We limit the scope of these Considerations, for the time being, to AI intended for diagnosis. Although therapeutic AI, including autonomous AI for prescribing and autonomous AI for surgery, is on the horizon, we decided that it currently is beyond the scope of these Considerations, given the multiple ethical and even theoretical problems that need to be resolved. Furthermore, no regulatory guidance exists for therapeutic AI systems using images of the eye.

Obviously, the Considerations will be commensurate with the risk of harm to the patient, which varies with the indications for use, the conditions diagnosed, the autonomy of the AI, the consequences of a missed diagnosis, the population at risk, and other factors. Thus, the right balance needs to be considered between resource requirements and burden on AI creators27,28 to align with proposed ethical principles, on the one hand, and the risk of patient harm from lack of access to AI systems, on the other hand, in order for patients, patient populations, and the wider health care system to benefit. In addition, although some AI systems are marketed medical devices and are under regulatory oversight, other AI systems are never marketed. Such so-called homebrew AI is used—by the clinicians who developed it or by others—in patient care, and its safety and equity can be of concern.29

There are many useful resources, such as the reporting guidelines mentioned (e.g., Clinical Evaluation of SaMD,26 Standards for Reporting of Diagnostic Accuracy Studies,24 and CONSORT-AI23), clinical practice guidelines (e.g., the American Telemedicine Association Telehealth Practice Guidelines for Diabetic Retinopathy30,31), standards (e.g., Digital Imaging and Communications in Medicine32), and FDA guidance26 that can be referenced to help mitigate the aforementioned concerns. Ultimately, we incorporated these useful resources as initial steps in developing best practices, together with AI-tailored regulatory frameworks, including Good Machine Learning Practice and other equivalents to the more familiar good manufacturing practices, as called for by the United States Government Accountability Office in its recent report33 as well as by regulatory agencies such as the FDA.1

Clinical Considerations for Artificial Intelligence Systems

These Considerations divide the requirements for AI systems into 2 categories: the clinical requirements, covered in this section, and the ethical requirements, derived from a bioethical foundation, covered in the next section. Thus, this section discusses the various clinical aspects of AI systems that use images of the eye in some form, conforming to the scope of the Collaborative Community on Ophthalmic Imaging.2 We define images of the eye as topologically ordered sets of intensities that represent physical and pathophysiologic processes occurring in the eye and that may reflect conditions of the eye as well as of other parts of the patient’s body. Specifically, we cover intended use, impact, inputs and outputs, and human factor design aspects of the AI system.

Intended Use of the Diagnostic Artificial Intelligence System

The rationale for designing, developing, validating, and deploying AI systems includes improving individual patient care, population health, and scientific research. Specifically, for individual patients, the rationale includes improving their quality of care, lowering cost, increasing access, decreasing health disparities, and improving efficiency. For scientific research, the rationale includes discovering new disease mechanisms and gaining a better understanding of a disease.

Impact of the Diagnostic Artificial Intelligence System

After the use is identified, the impact of the AI system can be assessed. Artificial intelligence systems span a wide range of impact, from having no direct impact on an individual patient or group of patients (e.g., inform a provider) to having an important decision-making impact on an individual patient (i.e., drive or treat).34 From a regulatory perspective, many AI systems are considered medical devices—SaMD—whereas other systems may not meet the definition of a medical device because definitions differ across regulatory agencies.35,36 We refer to the FDA’s narrower definition of medical devices under section 201(h),35 as modified under section 3060 of the 21st Century Cures Act,37 as well as the broader definition used by the International Medical Device Regulators Forum.34 Based on those definitions, AI systems can be subdivided by impact, as shown in Table 1.38

Table 1.

Artificial Intelligence System Impact

Use Case | Description | Examples | Food and Drug Administration Oversight

Population care | Prioritization and triage with potential impact on groups of patients and individual patients | Care pathway assignment | Likely35
Individual patient care | | |
 Assistive AI | Assists a clinician who determines the patient’s management | Provides a probability or likelihood of a disease or condition or may highlight potential lesions that should be reviewed by a specialist | Likely35
 Autonomous AI | Makes a medical decision without input from a clinician | For example, an autonomous AI system may evaluate for the presence of a disease, such as diabetic retinopathy and macular edema, or condition and notify the user whether the disease or condition is present | Likely35
Scientific research | Not used for individual patient or population care, although the results of the research may impact populations or patients downstream | Health care analytics | Unlikely
Operations and data management | Where this does not impact individual patient or population care; these often exist within the realm of health information technology systems as they relate to administrative purposes | VIM Referral Guidance, a triage system from EHRs (https://getvim.com/solution/referral-guidance) | Unlikely
Clinical decision support | Informs the clinician by aggregating, reformatting, or visualizing data, without providing analytical insights of the data, in a manner that allows the clinician to review the basis of the information provided by the software independently | AI system that suggests a G6PD test before prescribing an antimalarial therapy38 | Depends*
General wellness | Collects physiologic information from devices and sensors, including wearables | Smart watch that captures heart rate | Depends*

AI = artificial intelligence; EHR = electronic health record.

*See Center for Devices and Radiological Health, United States Food and Drug Administration.38 This explains when a software function qualifies as nondevice clinical decision support (CDS) as opposed to device CDS, and which of these are regulated actively or for which compliance with applicable regulation would not be enforced.

An important aspect of these AI systems is their theoretically unlimited scalability. After being designed and validated, the algorithms of a single AI system can be used on hundreds of millions of patients. Although the number of patients a human clinician may encounter varies greatly based on the health setting and geography (e.g., 800–1000 unique patients per year, or no more than approximately 30 000–40 000 unique patients during an entire career8), the scale is significantly different from that of an AI system. Thus, the impact of any benefits or risks stemming from the use of the AI system is massively scaled and, in just 1 year of implementation, possibly 1000-fold or more than the impact any individual clinician can have in their lifetime.
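To give a rough sense of this scaling under the figures above (treating “hundreds of millions” as roughly $10^{8}$ patients reached in 1 year, an assumption made purely for illustration):

$$\frac{10^{8}\ \text{patients reached per year by a single AI system}}{4\times10^{4}\ \text{patients per clinician career}} \approx 2.5\times10^{3},$$

which is consistent with the “1000-fold or more” estimate given here.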

The training and practice of an individual clinician may be optimal for a specific (sub-)population, based on demographics, geographic proximity, and other factors39; we define this as vernacular medicine. Such vernacular medicine may be less generalizable than is often acknowledged. For an AI system at scale, such optimality may not necessarily be present, depending on training data as well as other factors. Although this may increase its value for multiple, but geographically or demographically different, groups, it may be less optimized for specific groups, and thus this needs to be considered carefully. We cover this in more detail in the “Ethical Considerations” section. Privacy, confidentiality, and other clinical data security aspects may differ across regions as well. Recently, the concept of federated machine learning was introduced, which allows an aggregated, scalable AI system to fine-tune from independent training datasets.40 A more recent form of federated machine learning enables remote devices (e.g., mobile phones) to engage collaboratively in model learning and improvement that can take place at a more local level. Such an approach decouples the machine learning from any global training data that ordinarily would be derived from a single discrete storage system. Rather, model training draws on multiple, different, localized, and vernacular datasets. For deployment, the trained AI model contains no reference to the local training data that were used to refine and tune the model. This technique, similar to edge computing, may seem to have benefits. However, novel risk considerations also may be relevant relating to algorithm or model iteration that would need to be captured for accurate documentation. These include training data characterization, good machine learning practice, model version and updates, as well as the assumption that multiple vernacular datasets that are distributed normally can be reduced to a simple distribution function. Probable risks of patient harm and benefits of such a federated approach have not been studied sufficiently.
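As an illustration of the federated pattern described above, the following is a minimal sketch of federated averaging with NumPy. The model (a logistic regression), the 3 sites, and the weighting scheme are hypothetical simplifications for illustration, not a prescription for any particular deployment.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One site's logistic-regression update on its own (vernacular) data;
    raw images or records never leave the site."""
    w = weights.copy()
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))      # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)      # gradient step on local data only
    return w

def federated_average(weights, site_data):
    """Aggregate per-site updates, weighted by site sample size (FedAvg-style)."""
    updates, sizes = [], []
    for X, y in site_data:
        updates.append(local_update(weights, X, y))
        sizes.append(len(y))
    sizes = np.array(sizes, dtype=float)
    return np.average(np.stack(updates), axis=0, weights=sizes / sizes.sum())

# Hypothetical example: three sites with differently distributed ("vernacular") data.
rng = np.random.default_rng(0)
sites = []
for shift in (-1.0, 0.0, 1.0):
    X = rng.normal(shift, 1.0, size=(200, 4))
    y = (X[:, 0] + X[:, 1] > shift).astype(float)
    sites.append((X, y))

w = np.zeros(4)
for _ in range(20):                            # communication rounds
    w = federated_average(w, sites)
```

Only model parameters travel between sites and the aggregator, which is the property that makes the approach attractive for privacy but also introduces the documentation and risk-characterization questions noted above.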

Artificial Intelligence System Outputs

The intended use and impact of an AI system constrain its outputs. According to the International Medical Device Regulators Forum’s definitions of the type of output (inform, drive, diagnose, and treat), as well as the significance of the condition (nonserious, serious, and critical), outputs can be categorized as shown in Table 2.26 Artificial intelligence system outputs may be aligned with preferred practice patterns or other standards of care to maximize the potential of the AI system to impact clinical outcomes positively. This is discussed in more detail in the “Nonmaleficence” section.

Table 2.

Artificial Intelligence System Outputs

Type of Output | Significance of the Condition | Category | Clinical Context

Inform | Nonserious, serious, or critical | Risk prediction | Suggest specific test types that may be implemented as part of a diagnostic workup of a patient based on clinician suspicion
Drive | Nonserious, serious, or critical | Likelihood, probability, or prediction of disease | Used by clinician who understands how to interpret the input image (e.g., ophthalmic clinician)
Drive | Nonserious, serious, or critical | Saliency, such as highlighting regions of interest or specific lesions in an image | Used by clinician who understands how to interpret the input image (e.g., ophthalmic clinician)
Diagnose or treat | Nonserious, serious, or critical | Disease staging | Assistive use case: clinician receives specific aspects of the inputs that indicate the disease stage and decides the stage
Diagnose or treat | Nonserious, serious, or critical | Disease staging | Autonomous use case: the user receives the disease stage
Diagnose or treat | Nonserious, serious, or critical | Screening | Assistive use case: clinician receives specific aspects of the inputs that indicate abnormalities and decides whether disease may be present
Diagnose or treat | Nonserious, serious, or critical | Screening | Autonomous use case: the user receives output on whether the disease may be present
Diagnose or treat | Nonserious, serious, or critical | Diagnosis | Assistive use case: a clinician receives specific aspects of the inputs that indicate disease-specific abnormalities and the absence of disease-specific abnormalities and decides the diagnosis by excluding other disease
Diagnose or treat | Nonserious, serious, or critical | Diagnosis | Autonomous use case: the user receives a diagnosis; an autonomous AI system may evaluate for the presence of a disease or condition and notify the user whether the disease or condition is present without showing how the AI system arrived at the decision

AI = artificial intelligence.
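Table 2 lists saliency (highlighting regions of interest or specific lesions) as one drive-level output. Below is a minimal sketch of one common way such a map can be produced, a gradient-based saliency computation, assuming a PyTorch image classifier; the model, tensor shapes, and usage are placeholders, and production systems typically use more elaborate attribution methods.

```python
import torch

def gradient_saliency(model: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Per-pixel saliency as the magnitude of d(top class score)/d(input).

    Assumes `model` maps a (1, C, H, W) float tensor to (1, num_classes) scores.
    """
    model.eval()
    image = image.detach().clone().requires_grad_(True)
    scores = model(image)
    top_class = scores.argmax(dim=1).item()
    scores[0, top_class].backward()                 # gradient of the winning score
    # Max over channels gives an (H, W) map suitable for overlay on the input image.
    return image.grad.detach().abs().max(dim=1).values[0]

# Hypothetical usage with a fundus-image classifier `net` and input tensor `img`:
# saliency = gradient_saliency(net, img)  # overlay on the image for clinician review
```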

The term assistive usually is used for systems in which the clinician makes the ultimate medical decision and carries liability for the AI performance, whereas the term autonomous is reserved for systems in which the AI system makes the ultimate medical decision and the AI creator carries the liability for the AI performance.6 This distinction, assistive versus autonomous, coupled with intended use, including the significance of the condition, has important bearings on the interpretation of risk as well as other regulatory implications (e.g., clinical study design). The interaction between AI and physicians, who risk becoming, as it were, physicians of the magenta41 and too dependent on monitoring diagnostic AI devices, is of crucial importance here. Potentially, assistive systems may need subdivision into additional categories that more specifically delineate the roles of humans versus AI.

Artificial Intelligence System Use Environment

The AI outputs, including for whom they are meant and the information provided, may de facto dictate the use environment, including the operator, for the AI system (Table 3).

Table 3.

Artificial Intelligence System Use Environment

Use Case Setting | Description

Home | AI system is used by the patient, and the patient images himself or herself without clinician or other operator assistance, or imaging is carried out by the general home health care provider. The output may be provided to the user (patient or home health care provider) or may be provided to a remote clinician.
Nonspecialist (primary care or other nonophthalmologist) | AI system is used by clinicians and operators who have minimal experience with imaging the eye or the evaluation of ocular images or other input. The specific interpretation of the image may be important for that clinician to manage the patient in the context of a disease—e.g., evaluation of fundus photographs for presence of diabetic retinopathy while managing diabetes—or to determine the presence or severity of a systemic disease or disease in another organ system than that being managed, such as determining neurologic disease from retinal images.
Specialist (ophthalmologist or other eye care provider) | AI system is used by clinicians and operators who have experience with ocular imaging and with evaluation of ocular images, but not necessarily with the specific AI output. An example is an AI system for retinal vessel analysis that outputs vascular beading or caliber metrics.

AI = artificial intelligence.

Artificial Intelligence System Human Factor Considerations

Considering the use environment leads to consideration of human factors, as well as the impact and outputs of the AI system (Table 4).

Table 4.

Artificial Intelligence System Human Factors

Operator expertise level | Patient operated; untrained operator; ophthalmic photographer; certified ophthalmic photographer
Operator AI assistance level | Differing levels of assistance during the imaging process and protocol, which may include evaluation of image quality, field, and sequence order

AI = artificial intelligence.

Artificial Intelligence System Inputs

An AI system achieves its intended use through sampling inputs that are analyzed via the algorithm. One goal of an AI system is to obtain a reliable, consistent output while minimizing the number of inputs (samples and types) to help improve robustness of an algorithm to changes in input signal quality and environment, among other factors. For ophthalmic images, inputs can range from image sets from an entire population with multiple images for each member of that population (for population risk assessment) to multiple images from a single patient (for diagnosis). The number and extent of these images are typically dictated by the intended use of the algorithm and the use environment. A nonexhaustive list of input types (image and nonimage input types) is shown in Table 5.

Table 5.

Artificial Intelligence System Inputs

Input | Characteristic | Examples

Image based | Image method | Fundus imaging; slit-lamp photography; OCT; ultrasound; scanning laser ophthalmoscope, topography; aberrometry; perimetry (functional); multifocal electroretinography (functional); CT, including orbital CT; MRI, including orbital MRI
Image based | Image characteristics (although currently no required standard exists, standardization of image metadata such as defined by Digital Imaging and Communications in Medicine [DICOM] standard 91,32,42 will benefit these considerations) | Sample area; x, y, or en face resolution; x, y, or depth or axial resolution; field of view or area of retina covered; number of fields; stereo vs mono images; depth penetration limit; center wavelength(s); momentary pupil diameter; compression characteristics; ambient light level and other environmental conditions
Nonimage | Input from methods that do not meet the definition of a medical device (i.e., that are not FDA regulated as a medical device) | Patient history; medication history; systemic comorbidities
Nonimage | Input from methods that do meet the definition of a medical device (i.e., that are FDA regulated) | Axial eye length; intraocular pressure; pachymetry; keratometry; visual acuity; heart rate; blood pressure; hemoglobin A1C
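Because image characteristics such as those in Table 5 ultimately have to be checked programmatically before images are fed to an AI system, a minimal sketch using the pydicom library is shown below. The DICOM attribute names are standard tags, but the file path and the acceptance rule are hypothetical and intended only to illustrate the idea of validating input metadata.

```python
import pydicom

def check_input_metadata(path, min_rows=1000, min_cols=1000):
    """Read a DICOM ophthalmic image and verify a few input characteristics
    before passing the pixel data to an AI system. Thresholds are illustrative."""
    ds = pydicom.dcmread(path)
    rows, cols = int(ds.Rows), int(ds.Columns)
    # PixelSpacing (mm per pixel) and laterality are optional in some objects.
    spacing = getattr(ds, "PixelSpacing", None)
    laterality = getattr(ds, "ImageLaterality", getattr(ds, "Laterality", "unknown"))
    return {
        "rows": rows,
        "columns": cols,
        "pixel_spacing_mm": spacing,
        "laterality": laterality,
        "meets_resolution_requirement": rows >= min_rows and cols >= min_cols,
    }

# Hypothetical usage:
# info = check_input_metadata("fundus_od.dcm")
```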

Ethical Considerations for Artificial Intelligence Systems

Bioethical Foundation

In addition to their clinical requirements, such as intended use, human factors, and input and output requirements, as set forward in the previous section, AI systems will have to meet ethical requirements to function. This has both practical and philosophical importance: AI systems should follow ethical standards because the field of medicine has defined these standards as guiding principles for the appropriate delivery of health care; if AI systems are perceived as unethical or not bound by ethical constraints, stakeholders will not trust these systems and may refuse to engage with them, and this promising technology will fail to reach the populations it is designed to impact. Consequently, this section introduces the relevant bioethical foundation,43 and then derives operational ethical dimensions or principles that can be used to create and evaluate ethical requirements for AI systems.

All health care stakeholders, as well as society at large, are already concerned with the use of AI in health care, even when they understand the potential efficiency gains. Their concerns include AI systems’ safety44; actual patient outcome benefit45; mitigation of health care disparities, rather than worsening them; potential for racial, ethnic, or other inappropriate biases46; (mis)use of patient data, including personal health information, during training and implementation46; (mis)use, including off-label use47; and liability, that is, who can be held accountable or liable for any patient harm.6

To address these concerns, an ethical framework to identify ethical concerns before they become consequential is considered essential. Several such ethical frameworks for AI4 and autonomous AI3 have been proposed and discussed. We focus on the primary bioethical principles of nonmaleficence (or patient benefit), autonomy, and justice, per Beauchamp and Childress.48 Instead of the term justice, which is widely used in the ethics literature but may have legal connotations and thus lead to confusion, herein we use the more familiar term equity to describe freedom from bias or favoritism. Accountability, although strictly speaking not an ethical concern, leads to requirements primarily related to autonomy and is discussed as well.

Such an ethical framework, as developed in the cited publications,3,4 leads to the following: (1) ethical metrics to be created, derived from each of the ethical principles (e.g., population achieved sensitivity, which is derived from equity; see the next section); (2) the insight that the framework will be nonorthogonal, because most ethical metrics are not independent axes but instead partially overlap (if they formed independent metrics or axes, this would allow an orthogonal framework); and (3) the requirement for a balance to be found or defined among the 3 ethical axes we focus on (nonmaleficence, equity, and autonomy). Thus, a so-called Pareto optimum needs to be defined, because it is impossible to meet all 3 ethical principles perfectly.

In effect, we use the 3 bioethical principles as (nonorthogonal) axes along which to analyze and constrain AI systems and to define their ethical requirements. We emphasize that they exist in tension with each other, such that improving one of them for a particular AI system may decrease another. For example, for an autonomous AI for the diabetic eye examination, an acceptable balance needs to be found between (1) improving access to a disadvantaged population (equity), (2) ensuring that increasing diabetic eye examination compliance leads to an overall net improvement in care, rather than just increasing diagnoses without access to treatment (nonmaleficence), and (3) maintaining sufficient transparency about the use of AI, training data limitations, and data use, so that patients can decide about their own participation, even if opting out means losing access to AI benefits (autonomy).49 Theoretically, health disparities can be mitigated by adjusting the output of the AI system for those patients who are considered advantaged according to some metric. Although potentially increasing the equity of the AI system, such an approach likely will conflict with nonmaleficence and autonomy.
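To make the notion of a Pareto optimum concrete, the following is a minimal sketch that identifies the nondominated (Pareto-efficient) configurations among candidate AI operating configurations scored on the 3 ethical axes. The configurations and scores are entirely hypothetical; in practice, the scoring itself and the choice among nondominated options would be made by stakeholders.

```python
from typing import Dict, List

# Hypothetical candidate configurations of an AI system, each scored 0-1 on the
# 3 ethical axes (higher is better). Real scores would come from stakeholders.
candidates: List[Dict[str, float]] = [
    {"name": "A", "nonmaleficence": 0.90, "equity": 0.60, "autonomy": 0.70},
    {"name": "B", "nonmaleficence": 0.85, "equity": 0.80, "autonomy": 0.65},
    {"name": "C", "nonmaleficence": 0.80, "equity": 0.75, "autonomy": 0.60},  # dominated by B
    {"name": "D", "nonmaleficence": 0.70, "equity": 0.85, "autonomy": 0.90},
]

AXES = ("nonmaleficence", "equity", "autonomy")

def dominates(x: Dict[str, float], y: Dict[str, float]) -> bool:
    """x dominates y if it is at least as good on every axis and better on at least one."""
    return all(x[a] >= y[a] for a in AXES) and any(x[a] > y[a] for a in AXES)

pareto_front = [c for c in candidates
                if not any(dominates(other, c) for other in candidates if other is not c)]
print([c["name"] for c in pareto_front])   # ['A', 'B', 'D'] under these made-up scores
```

The nondominated set shows which trade-offs are still on the table; choosing among them is the stakeholder decision that bioethical analysis guides but cannot make by itself.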

In addition, complicating ethical analyses, AI output itself impacts clinical workflows and clinical decisions, both of which may increase tensions between these bioethical axes. Much as intention to treat is standard for randomized clinical trial evaluation, the downstream consequences from AI output will need to be part of any ethical evaluation of a medical AI application. That is, bioethical analysis per se, along these dimensions, cannot prescribe the right balance. Rather, it offers a framework to guide and evaluate such decisions. The Pareto-optimal balance among nonmaleficence, autonomy, and equity has to be determined by all stakeholders (Fig 1). After being determined, such a balance results in (ethical) constraints on the design, validation, and implementation of AI systems. Thus, we next examine the different bioethical principles and show how these principles affect AI system requirements, design, development, and validation. Ultimately, the goal of such ethical requirements is to address and answer the valid existing concerns about AI systems in health care that were introduced in the previous section.

Figure 1. Diagram showing the balance and tension among the 3 bioethical principles: nonmaleficence, autonomy, and equity (justice).

Nonmaleficence

The principle of nonmaleficence, or patient benefit, often described as “first, do no harm,” commonly is interpreted as including safety for the individual patient. It affects all aspects of autonomous AI systems, including design, validation, and implementation. An AI system’s risk of harm is affected by intended use, impact, inputs and outputs, and use context, as explained in the previous section. However, additional considerations unique to AI systems affect probable risk of harm and are specific to AI and machine learning: design and development, validation, and postmarket validation and monitoring. We explain how these considerations are related to nonmaleficence and how they may lead to more detailed ethical requirements. Many other considerations are not unique to AI systems but are common to all software-based systems and are not discussed herein.

Design of the Artificial Intelligence System and Nonmaleficence.

In general, AI system design and development share many characteristics with non-AI software systems, and the requirements are laid down in standards such as International Organization for Standardization (ISO) 90003,50 and, for medical devices, in ISO 13485.51 In addition, AI-specific design considerations exist that relate to insight into the AI, derive from nonmaleficence, and affect the risk of harm to the patient. We differentiate 3 forms of considerations that can be assessed for the design: (1) explainability, the amount of insight the user (typically the physician) has into the clinical logic that determined the AI output for a specific patient; (2) transparency, the amount of insight the user has into the clinical usefulness of the AI system for all patients; and (3) validability, the amount of insight that exists into the nonclinical validity (analytical validity) of the AI system and that can be determined without clinical validation studies. The following are examples of relevant aspects that were discussed by FPOAI and that need further consideration52:

  1. Transparency is defined as the degree to which the user or clinician of the AI system has insight into the requirements and limitations for the AI system inputs, its training data characteristics, and how the AI outputs are derived from the inputs for the intended use (i.e., for the specific disease or condition).1,23 Transparency also may include how the AI system creator uses patient-derived data outside this AI system’s intended use, for example, whether patient-derived data can be monetized after the AI system output has been derived. This aspect of transparency also serves autonomy (see next section).

  2. Explainability, while fundamentally related to transparency, refers more to how the output is related to clinical practice and scientific literature. For example, is the output clinically meaningful (e.g., diagnosis of a known condition, presence of a particular lesion), rather than something not well understood (e.g., disease severity on a scale that has not been validated clinically or recognized widely)? Other aspects of transparency beyond algorithmic functionality in the clinic, such as aspects relating to validation efforts (including analytical validation), should be transparent to the user to help replicate the measured performance in real-world use. In fact, per the main principles of Enhancing the QUAlity and Transparency Of health Research (EQUATOR) (which includes the CONSORT-AI extension),23,24 complete, accurate, and transparent reporting is an integral part of responsible research conduct. Thus, trial reporting should include a thorough description of the input-data handling, including image acquisition, selection, and any preprocessing before feeding into an AI system for analysis. This transparency is integral to the replicability of the intervention beyond the clinical trial in real-world use.

  3. Validability is defined as the degree to which the validity of the AI system can be assessed without clinical validation studies. That is, to what extent is it possible to self-validate an AI system without going through formal bench or clinical performance validation? This includes aspects like algorithmic bugs, unresolved anomalies, open loops, and so forth, that would be found on inspecting algorithm coding. For cases of black-box systems, not as much can be inspected, which decreases the overall validability of the system. Thus, validability qualifies our understanding of the analytical performance of the AI system and the impact of other systems on its performance. Examples of this may include the following:

    1. AI algorithm structure and infrastructure, including unit level and code analysis, hardware, firmware, and operating system.

    2. Use of federated hardware—dynamically allocated hardware—such as cloud-based systems. As more and more AI algorithms move to such environments, execution of the code may be removed from the original computational infrastructure (i.e., hardware, firmware, and software) where it was validated. On one hand, federated, or cloud-based, execution environments, for example, Amazon Web Services or Microsoft Azure, make it easier to have only a single version of the codebase, rather than multiple different versions, thus enhancing determinism. On the other hand, the same code may now be executed on a diversity of computational infrastructures. Executing a code fragment then may produce a variety of floating point and other computational results, lowering the determinism of the code fragment (see the sketch after this list). Mitigation may require prevalidation of the computational infrastructure for a specific code fragment or, instead, may require constraining the range of computational infrastructure on which such a code fragment may be executed. Prevalidation may maximize a computational infrastructure-agnostic approach and thus may allow a single codebase globally, as well as lower maintenance costs and higher redundancy.

    3. Inspection of intellectual property that includes source code and patented and copyrighted components. Determining who has authority and expertise to evaluate validability, as well as what can be shared at which level, has implications for AI creators. Such inspection may include algorithmic correctness verification.53

    4. The AI system’s use of priors. This may include analysis of whether the AI system is designed as a black box (minimal validability), a gray box (limited validability), or detector based (enumerated validability).3 Here, validability is primarily concerned with whether analysis of catastrophic and graceful failures of the AI system shows unanticipated risks, which have been shown to occur more often in black-box than detector-based AI systems.54,55

    5. Full characterization of the training datasets at the patient level, which may include partial or full traceability to individual patients, as well as patient demographics and other patient-specific characteristics. Compare the amount of information needed for validability with that needed for transparency, which could require only aggregate characteristics to be identified; for validability, the requirements could be more strict.
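As a minimal illustration of the determinism concern in item 2, the snippet below shows how the same reduction over the same data can yield slightly different floating point results purely from the order of operations—the kind of infrastructure-dependent variation that prevalidation would have to bound. The data are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=1_000_000).astype(np.float32)

# The mathematically identical sum, accumulated in different orders:
sequential = np.float32(0.0)
for v in x:                       # strict left-to-right accumulation
    sequential += v

pairwise = x.sum()                # NumPy's pairwise (tree) summation
reversed_order = x[::-1].sum()    # same data, different traversal order

print(sequential, pairwise, reversed_order)
# The three results typically differ in the last bits; a parallel or GPU backend
# may introduce yet another ordering, and hence yet another result.
```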

As shown, both explainability and transparency primarily involve the clinically oriented AI system user, whereas validability primarily involves AI creators, regulators, and nonclinical (technical) AI system users.

Validation of the Artificial Intelligence System and Nonmaleficence.

The ethical principle of nonmaleficence also leads to requirements for nonclinical and clinical validation or testing of the AI system. Nonclinical testing may include: input data compatibility, discussed in the next section; software verification, including software or firmware description, hazard analysis, software requirements specifications, architecture design specifications, and code traceability.51

For clinical validation, common reporting standards,24 CONSORT-AI,23 preregistration of study and analysis protocols,10,56,57 and validated relationship to patient management3 are important factors to enhance reproducibility. Given the many concerns about replicability, preregistration of the study protocol, inclusion and exclusion criteria, and statistical analysis, according to good clinical practice58 or other standards, should be considered. Although potentially beneficial, such standards may not provide sufficient information to help inform regulatory evaluation and have not been recognized by the FDA. See also the FDA’s Recognized Consensus Standards.25 An important decision is whether the AI system is locked before validation, because this affects the external validity and power of any validation study.

The requirements for clinical validation should be commensurate with the risk of harm to the patient. Determining the right balance between resource requirements and burden on AI creators for validation on the one hand, and risk of patient harm from AI system use on the other hand, is essential for patients, patient populations, and the wider health care system to benefit from health care AI carried out the right way.

Validation Study Design.

For AI validation study design, prospective longitudinal or cross-sectional designs may be most appropriate for diagnostic AI, and incorporating as much of the real-world workflow as possible should be considered. Consider the importance of incorporating the actual workflow59 into AI system validation, and the risk of leaving workflow out in a purely observational validation study, as first shown by Fenton et al.3,60 In this pivotal retrospective cohort study, the outcomes of women undergoing breast cancer screening by a radiologist assisted by a previously FDA-cleared (based on a study showing high accuracy of the AI compared with radiologists) assistive AI system were compared with those of women who underwent breast cancer screening by a radiologist without an assistive AI.60 When this assistive AI system was evaluated in the setting of actual workflow—where it assists a radiologist who makes the final clinical decision—outcomes were worse for the women who underwent breast cancer screening with AI assistance. This finding and its implications highlight the importance of evaluating such technologies within the intended workflow, both in the validation clinical trial design and through continuing evaluation after actual deployment, as discussed in the next section. It also aligns well with the trend toward the use of real-world data and the increasing emphasis by the FDA and other regulatory agencies on continuous efficacy assessment in the postmarket phase.

As far as study design is concerned, for diagnostic AI, prospective longitudinal or cross-sectional designs may be appropriate. Such study designs allow hypothesis testing of the effect of the AI diagnostic on patient outcomes or, where diagnosis already has been linked to (untreated) clinical outcome, of the diagnostic accuracy of the AI. For example, diagnostic accuracy hypothesis testing may allow a prospective cohort study design, whereas outcome hypothesis testing likely will require a randomized clinical trial design. Although a null hypothesis of no effect works well in interventional validation studies, a null hypothesis of not informative in a randomized clinical trial may be less desirable for validation of diagnostic AI systems, especially for validation of autonomous AI systems.61,62 Consider that such a randomized clinical trial needs an arm in which patient management is based on the autonomous AI output, including the need for intervention. To emphasize, in this arm the patient management can be based only on the diagnostic output of the autonomous AI, without the possibility of overruling by a clinician. (If clinician overruling is not ruled out, the effect measured would be that of the clinician and the AI combined, rather than of the autonomous AI only.) The autonomous AI may output a diagnosis incorrectly, leading to no treatment and leaving a patient untreated when treatment would have been beneficial. Whether the AI made the incorrect call can be known only when the study is complete.63,64 As mentioned, where diagnosis can be linked to outcome, such a design is not necessary and a cohort design is appropriate.
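Where diagnostic accuracy hypothesis testing against a prespecified sensitivity endpoint is used, the number of participants needed can be approximated with a standard precision-based sample size calculation (often attributed to Buderer). The sketch below is illustrative only; the expected sensitivity, precision, and prevalence values are placeholders, not recommendations.

```python
from math import ceil
from scipy.stats import norm

def subjects_for_sensitivity(expected_se: float, precision: float,
                             prevalence: float, alpha: float = 0.05) -> int:
    """Approximate total enrollment needed to estimate sensitivity to within
    +/- `precision` (normal approximation), given disease prevalence."""
    z = norm.ppf(1 - alpha / 2)
    n_diseased = (z ** 2) * expected_se * (1 - expected_se) / precision ** 2
    return ceil(n_diseased / prevalence)

# Hypothetical example: expected sensitivity 0.87, +/-0.05 precision,
# 25% disease prevalence in the enrolled population.
print(subjects_for_sensitivity(0.87, 0.05, 0.25))
```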

Validation Study Reference Standards.

Consideration should be given to how AI outputs are validated, that is, what these outputs are compared against. For a diagnostic AI system, such a comparison typically is made against an appropriate reference standard, based on its diagnostic indication: from informing a health care provider or patient to driving treatment decisions and making a definitive diagnosis. These reference standards can be categorical or continuous.65

From the principle of nonmaleficence, the effects of the AI system on clinical outcomes are most relevant. These may be indirect, because clinical outcomes may depend on medical decisions that are neither visible to, nor affected by, the AI system. Such clinical outcomes include events of which the patient is aware and wants to avoid, including death, loss of vision, visual field loss, and other events causing a reduction in the patient’s quality of life.66 The resources required to quantify such clinical outcomes objectively can be immense, particularly for chronic disease. In contrast, for acute diseases or interventions, clinical outcomes can be immediate and therefore relatively easier to obtain, such as visual acuity improvement in response to an AI that assists in refraction. For the many chronic diseases to which an AI may be applied, such as diabetic retinopathy, glaucoma, or macular degeneration, clinical outcomes may take years to manifest. Great interest has arisen in the development of alternative outcomes, or surrogate end points,67 in the evaluation of investigational medical products to reduce the cost and shorten the duration of trials.

For diagnostic AI, interest in surrogate end points has focused on prognostic standards, where a patient’s disease state has been related to a future clinical outcome. Obviously, these should be validated and correlated directly to clinical outcome.68 The advantage of a prognostic standard over a surrogate outcome as an end point is that it is not dependent on clinical decisions outside the intended use of the AI system, in other words, its output. For example, within ophthalmology, a prognostic standard is available for diabetic retinopathy, as well as diabetic macular edema, and can be determined by an autonomous diagnostic AI system. However, an expert will make clinical decisions after the diagnosis is determined, such as whether to administer laser treatment or to deliver anti–vascular endothelial growth factor treatment. Such clinical decisions impact the ultimate clinical outcome, but are not made or influenced directly by the AI system. Thus, using a prognostic standard, rather than outcome, to evaluate an AI system has the advantage of not inadvertently diminishing or underestimating the benefits of the AI for decisions outside its control in the context of the clinical outcome.

The Early Treatment Diabetic Retinopathy Study severity scale and the Diabetic Retinopathy Clinical Research Network macular edema scale, as well as the Age-Related Eye Diseases Study macular degeneration scale, are representative of such prognostic standards.69,70 Ideally, the strength of a prognostic standard is determined by the evidence available to support its capacity to predict progression—or manifestation of a condition or disease—or the benefit of a treatment or management. Its strength is also determined by any evidence that shows that treatments based on the prognostic standard correspond to effects on clinical outcome.71,72 Because the Early Treatment Diabetic Retinopathy Study, Diabetes Control and Complications Trial/Epidemiology of Diabetes Interventions and Complications Study, and Diabetic Retinopathy Clinical Research Network studies have established such evidence extensively, this applies to these prognostic standards.

Although requiring less time and fewer resources than developing and validating clinical outcomes, quantifying prognostic standards may still require considerable effort. Although dependent on the intended use, for autonomous diagnostic AI studies, this is likely an important reason why clinician-derived reference standards, instead of prognostic reference standards, are used widely in AI validation.73 A widely cited meta-analysis of the quality of evidence of AI accuracy takes as a given the comparison with clinician-derived ground truth, but the relationship to prognostic standards or clinical outcome is not considered.22 Indeed, it is a major strength of the CCOI and its disease-specific subgroups that they have started discussing the development of such prognostic standards for disease areas of interest.

Other factors that should be considered when evaluating potential reference standards, in addition to their validity or lack thereof against outcome, include: (1) reproducibility of the reference standard (many studies have shown that multiple clinicians evaluate the same patient differently in 30%–50% of cases74–76); (2) repeatability (many studies have shown that the same clinician evaluates the same patient differently in 20%–30% of cases74–76); (3) diagnostic drift (studies have shown that clinicians from different regions, countries, or continents evaluate the same patient differently in up to 50% of cases, leading to vernacular medicine, as explained in the next section39); and (4) temporal diagnostic drift (studies have shown clinicians systematically evaluating the same hypothetical patient differently over generations of clinicians77).
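Reproducibility and repeatability of a reference standard are commonly quantified with agreement statistics such as Cohen’s kappa. The minimal sketch below uses scikit-learn, with made-up grades standing in for two graders’ readings of the same images.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical referable/nonreferable grades assigned by two independent graders
# to the same 12 images (1 = referable, 0 = not referable).
grader_a = [0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
grader_b = [0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0]

kappa = cohen_kappa_score(grader_a, grader_b)   # chance-corrected agreement
raw_agreement = sum(a == b for a, b in zip(grader_a, grader_b)) / len(grader_a)
print(f"raw agreement = {raw_agreement:.2f}, kappa = {kappa:.2f}")
```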

Because the evidence for a given treatment based on a given evaluation may have been derived decades ago, temporal drift in a prognostic standard may be hard to determine and difficult to correct for. We want to clarify that, although temporal drift typically is pernicious and undesirable, temporal diagnostic shift, where new and better treatments lead to a new prognostic standard, often is desirable. An example of temporal shift is the shift from the prognostic standard of clinically significant macular edema, as defined by the Early Treatment Diabetic Retinopathy Study, to the new prognostic standard of center-involved macular edema, which is derived from OCT rather than fundus photography and was developed in conjunction with the evaluation of novel anti–vascular endothelial growth factor treatments for macular edema.70,78 Optimally, correction for reproducibility and repeatability with strict evaluation protocols and independent verification, where possible, is indicated.66

Given these considerations, and depending on an AI system’s role, output type, SaMD risk categorization, and risk of harm to the patient, certain types of reference standards may be differentiated based on the rigor or validity of the reference standard (Table 6). Although such a hierarchy, as shown, may be useful for consideration of reference standard differences, no level is tied to a specific intended use. Generally, these levels I through IV can be related to their rigor, with level I having the most rigor. Typically, an AI system that carries more risk of harm, such as one used for personalized treatment (e.g., an artificial pancreas), stand-alone diagnosis, or determination of disease level used in treatment decisions, would be compared with a more rigorous standard. Therefore, it remains up to regulatory agencies around the world to balance the intended use and risk category of the AI system and potentially to include the reference standard level in this balance.

Table 6.

Reference Standard Levels

Level | Description

I | A reference standard that is either a prognostic standard, a clinical outcome, or a biomarker standard. If a prognostic standard, it is determined by an independent reading center. If either a prognostic standard or a biomarker, it is validated against clinical outcome, and temporal drift, reproducibility, and repeatability metrics are published.
II | A reference standard established by an independent reading center. Temporal drift, reproducibility, and repeatability metrics are published. A level II reference standard has not been validated against clinical outcome or a prognostic standard.
III | A reference standard created from the same method as used by the AI, by adjudication or voting of multiple independent expert readers. The readers are documented to be masked, and reproducibility and repeatability metrics are published. A level III reference standard has not been validated against clinical outcome or a prognostic standard and does not have known temporal drift, reproducibility, or repeatability metrics.
IV | All other reference standards, created by single readers or nonexpert readers, possibly without an established protocol. A level IV reference standard has not been validated against clinical outcome or a prognostic standard and does not have known temporal drift, reproducibility, or repeatability metrics, and the readers may not have been masked.

AI = artificial intelligence.

For level I and II reference standards, no reference to methods exists because the methods are determined entirely by the requirements for outcome, prognostic standard, or reading center. Although a higher-level reference standard at first glance may always seem more desirable, in many cases this may not be the preferred choice. Such a higher level may not be available or may even be unachievable, and the requirement for a higher level needs to be balanced with the burden to obtain it. An example is retinopathy of prematurity, where only prognostic standards derived from expert clinicians—that is, level II—are available. At this point, it is ethically impossible to determine a level I standard for retinopathy of prematurity, and in fact level II is the accepted reference standard in the clinical community. Creating a level I standard would require a study that may leave some treatable patients untreated, and thereby harm patients, depending on how accurate the AI under study actually is; thus, requiring level I of an AI creator would be an undue burden and, frankly, an impossible hurdle to overcome.

It is worth reemphasizing that (1) the level of the reference standard is entirely independent of the AI system or its intended use, (2) different intended use cases may require different levels of reference standard, and (3) the level of the reference standard is evaluated entirely separately from the minimally acceptable criteria for performance of the AI. The minimally acceptable criteria can be understood only for a given reference standard level.

Minimal Acceptable Criteria for Validation.

The minimal acceptable criteria for the AI system are the decision cutoffs for determining the safety and efficacy of the AI in hypothesis testing clinical trials to estimate nonmaleficence. Such minimal acceptable criteria include combinations of sensitivity, specificity, and area under the receiver operating characteristic curve. Although the concept of decision cutoffs for safety and efficacy of an AI system may be broadly accepted, it is also a major factor in the review processes by regulatory agencies. As an example, for the first autonomous de novo AI authorized by the FDA, 2 hypotheses had to be confirmed in a preregistered clinical trial, with sensitivity and specificity characteristics exceeding 80% at the population level. This corresponded to study-based end points of 85% for sensitivity and 82.5% for specificity.13
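As an illustration of hypothesis testing against such prespecified endpoints, the sketch below applies an exact one-sided binomial test of whether observed sensitivity exceeds a superiority margin. The counts are invented, and the 80% margin simply mirrors the example above; this is not a reconstruction of the cited trial’s statistical analysis.

```python
from scipy.stats import binomtest

# Hypothetical results: 180 of 200 disease-positive participants were
# correctly identified by the AI (observed sensitivity 0.90).
true_positives, diseased = 180, 200

# H0: sensitivity <= 0.80 vs. H1: sensitivity > 0.80 (one-sided exact test).
result = binomtest(true_positives, diseased, p=0.80, alternative="greater")
ci = result.proportion_ci(confidence_level=0.95, method="exact")  # one-sided: (low, 1.0)

print(f"observed sensitivity = {true_positives / diseased:.3f}")
print(f"one-sided p-value    = {result.pvalue:.4f}")
print(f"95% lower bound      = {ci.low:.3f}")
```

An analogous test against the specificity margin would be run on the disease-negative participants; both must succeed for the trial’s coprimary endpoints to be met.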

Theoretically, such minimal acceptable criteria can be derived analytically, with the goal of minimizing subjectivity and maximizing external validity. Thus, approaches have been developed to arrive at analytical solutions for diagnostic algorithm end points, including Pareto optimization, Youden and Euclidean indices for sensitivity–specificity combinations,79–82 quantitative cost-benefit derivative analysis, as well as (modified) Angoff approaches.83,84 Specifically, the (modified) Angoff approach has been validated for setting testing thresholds in educational settings. These analytical approaches are helpful in informing the choices to be made by improving the understanding of the risks and benefits of any choices made.
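For instance, the Youden index mentioned above (J = sensitivity + specificity − 1) can be computed across the candidate operating points of a trained model. The sketch below uses scikit-learn’s ROC utilities on made-up labels and scores, purely to show the mechanics.

```python
import numpy as np
from sklearn.metrics import roc_curve

# Hypothetical labels (1 = disease) and AI output scores for a validation set.
rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, size=500)
scores = np.clip(0.6 * y_true + rng.normal(0.3, 0.25, size=500), 0, 1)

fpr, tpr, thresholds = roc_curve(y_true, scores)
youden = tpr - fpr                        # J = sensitivity + specificity - 1
best = np.argmax(youden)

print(f"threshold = {thresholds[best]:.3f}, "
      f"sensitivity = {tpr[best]:.3f}, specificity = {1 - fpr[best]:.3f}, "
      f"J = {youden[best]:.3f}")
```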

Alternatively, minimal acceptable criteria for a diagnostic AI can be set to conform to existing diagnostic procedures. For example, when excluding pulmonary embolism, negative test results should have a 3-month thromboembolic risk of less than 3%, which is derived from the equivalent risk after negative pulmonary angiography findings, the gold standard.85 An understanding of the accuracy of comparable diagnostic processes performed by human clinicians and other human experts should be a requirement (note that the current diagnostic standard of care may not necessarily involve a clinician in the future because AI systems at some point may be considered as standard of care). In contrast, the existing literature does not offer guidance on these minimal acceptable criteria for an autonomous AI performing the diabetic eye examination because the standard of care by ophthalmologists reaches sensitivity of only 33% or 34%.75,76

Given such widespread lack of scientific evidence for specific minimal acceptable criteria, deciding on them involves ethical and cost-effectiveness analyses and other risk-benefit trade-offs by patients, clinicians, and payors. Such decisions typically require the involvement of domain experts. As examples, minimally acceptable criteria for screening mammography were determined by a set of domain experts using a modified Angoff approach,84 and a sampled survey of pediatricians was used to estimate the minimally acceptable sensitivity threshold for a streptococcal pharyngitis test in children.86 For such approaches to work, it is important that the experts involved fully grasp the spectrum of risks and benefits for patients of each alternative set of criteria. This may not always be the case: in the latter study, 80% of pediatricians proposed a sensitivity of at least 95%, which was not achievable by any feasible test under consideration.86 The structured collection of patient preferences, also known as patient preference information, could also be included in shaping these decisions.87 Thus, the following stages can be considered in isolation or in the aggregate for setting minimal acceptable criteria.

  1. Literature or meta-analysis review of existing minimal acceptable criteria and assignment of weights to the consequences of test misclassifications, according to 1 or more metrics such as cost or quality-adjusted life years. An example is estimating whether the consequences of missing a case, such as increased morbidity or cost at a later stage when the disease manifests more clearly, outweigh the consequences of misclassifying a noncase as a case, such as unnecessary radical diagnostic or treatment decisions with major side effects. Scientific evidence of comparable diagnostic processes, performed by human clinicians and other human experts, should be included, if available, or may need to be collected, if not available.

  2. Analysis of a representative spectrum of sensitivity and specificity combinations and determination of the downstream cumulative weight of consequences for patients88,89 and other stakeholders in the health care system, including patient preference information (see the sketch after this list).

  3. A consensus process among domain experts (e.g., a network of experts)90 that can generate agreement on minimal acceptable criteria, for example, using vignettes that condense the analytical evidence to minimize bias among the domain experts.
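
As a minimal sketch of stages 1 and 2, the following Python example assigns hypothetical weights to the downstream consequences of false-negative and false-positive results and computes the expected misclassification burden per 10,000 screened patients for a spectrum of candidate sensitivity-specificity combinations; the prevalence, weights, and operating points are assumptions for illustration, not recommendations.

```python
# Minimal sketch of stages 1 and 2: weighting the downstream consequences of
# misclassification across a spectrum of sensitivity-specificity combinations.
# All weights, prevalence, and operating points are hypothetical.

prevalence = 0.08                 # hypothetical disease prevalence
n_screened = 10_000               # population size used for reporting
cost_false_negative = 20_000.0    # hypothetical downstream cost of a missed case
cost_false_positive = 500.0       # hypothetical cost of an unnecessary workup

# Hypothetical candidate operating points (sensitivity, specificity).
candidates = [(0.98, 0.70), (0.92, 0.82), (0.87, 0.90), (0.80, 0.96)]

def expected_burden(sens, spec):
    """Expected misclassification cost per n_screened patients."""
    cases = prevalence * n_screened
    noncases = (1.0 - prevalence) * n_screened
    fn = (1.0 - sens) * cases
    fp = (1.0 - spec) * noncases
    return fn * cost_false_negative + fp * cost_false_positive

for sens, spec in candidates:
    print(f"sens={sens:.2f} spec={spec:.2f} "
          f"expected burden ${expected_burden(sens, spec):,.0f}")
```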

Postmarket Monitoring of Artificial Intelligence Systems and Nonmaleficence.

Monitoring of the safety and efficacy of an AI system is important because it affects nonmaleficence. Real-world performance monitoring after implementation can be achieved by putting a prospective monitoring protocol in place. Such a prospective monitoring protocol may be agreed on by a regulatory agency—for example, implemented as part of a comprehensive Quality Management System following 21 CFR 820—and may accommodate user feedback, complaints, and reportable events. In addition, other AI system characteristics that are within creators’ control such as usability, user experience, product performance, and necessary safety controls, including a comprehensive framework for cyber security, data protection, and data privacy, also may be monitored.

To ensure continued acceptable performance of an autonomous AI system, a prospective monitoring protocol may require, for example, that the AI system output be compared with the same reference standard that was used in (premarket) validation, to determine whether the system still meets safety and efficacy standards in the post-implementation real world. As discussed in the previous section, more rigorous, higher-level reference standards often require substantial resources from patients and creators. Real-world monitoring may require the collection of this reference standard for each monitored patient, which thereby diminishes the very reasons the AI system was implemented in the first place, such as improved access, lower cost, and patient friendliness. Thus, prospective monitoring protocols will have to balance the burden on AI creators and patients, on the one hand, against nonmaleficence, on the other.
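
One possible realization of such a protocol is sketched below (in Python, with simulated monitoring counts): the outputs of the deployed AI for a monitored subsample are compared against the same reference standard used in premarket validation, and the system is flagged when the lower confidence bound of cumulative sensitivity falls below the premarket threshold. The threshold, batch sizes, and counts are hypothetical assumptions.

```python
# Minimal sketch: prospective real-world performance monitoring.
# AI outputs for a monitored subsample are compared with the same reference
# standard used in premarket validation; an alert is raised when the lower
# confidence bound of sensitivity drops below the premarket threshold.
# All data and thresholds are hypothetical.
from scipy.stats import beta

PREMARKET_SENSITIVITY_THRESHOLD = 0.85   # hypothetical premarket end point

def lower_bound(successes, n, alpha=0.05):
    """One-sided lower Clopper-Pearson confidence bound."""
    if successes == 0:
        return 0.0
    return beta.ppf(alpha, successes, n - successes + 1)

def monitor(batches):
    """batches: list of (true_positives, reference_positive_total) per period."""
    tp_total, pos_total = 0, 0
    for period, (tp, n_pos) in enumerate(batches, start=1):
        tp_total += tp
        pos_total += n_pos
        lb = lower_bound(tp_total, pos_total)
        status = "ALERT" if lb < PREMARKET_SENSITIVITY_THRESHOLD else "ok"
        print(f"period {period}: cumulative sensitivity "
              f"{tp_total / pos_total:.3f}, lower bound {lb:.3f} -> {status}")

# Hypothetical monitored counts of reference-standard-positive patients.
monitor([(46, 50), (44, 50), (41, 50), (47, 50)])
```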

Changing an Artificial Intelligence System after Validation.

As an AI system is used on patients and continuous efficacy monitoring is in effect, opportunities exist to improve the AI system's technical specifications in terms of safety, efficacy, equity (see “Equity” section), or a combination thereof. Artificial intelligence systems, that is, SaMDs that use AI or machine learning, have the unique capacity to be updated after implementation. In fact, if an AI system is not locked after validation, potentially unlimited configurability exists.

It is important to determine that changes to the technical specifications, while intended to improve the AI system, do not negatively affect the ethical principles of nonmaleficence and equity. Traditionally, from a regulatory perspective, almost all changes to the technical specifications of an SaMD that affect safety or effectiveness may require a new validation; cybersecurity changes may be the only ones currently possible without such full validation, depending on how one interprets current FDA guidance.91 Thus, safely updating the AI system requires that appropriate controls and validation methodologies be in place. These controls and methodologies depend both on the type of change and on the risk of patient harm, and we differentiate the following types of changes:

  1. Changes to AI system computations, including (1) changes to preprocessing and postprocessing algorithms; (2) changes to the algorithmic infrastructure, including hardware and software; and (3) changes to the AI algorithm architecture intended to improve performance, such as changes to the type of classifier, to hyperparameter and parameter values (including model weights), and to the training data.

  2. Changes to AI system input, while keeping the output type constant, including (1) a change in the imaging system, such as the optics, sensor, image compression, or imaging protocol; and (2) the addition of other patient information, such as pulse, visual acuity, and intraocular pressure, used along with the original inputs by the AI to determine its output.

  3. Changes to the AI system output, while keeping the input types unchanged, include marking regions of interest when previously only a normal or abnormal output was validated.

  4. Changes to AI system indications and intended use. An example is accumulating scientific evidence that an AI system that was validated as a referral tool, and authorized as such by a regulatory agency, is actually being used as a diagnostic tool as it becomes more accepted in the clinical community, so that its performance thresholds need to be adjusted to support such use. Other examples include changes to (1) the inclusion or exclusion criteria, such as expansion to people with a different risk of having the disease or to age groups, ancestries, races, or ethnicities that were not accounted for in the design or validation of the AI system to be improved; (2) the disease level or threshold; and (3) the disease type, for example, macular degeneration when the system previously was validated for diabetic retinopathy.

An important component of AI system changes is the method of change validation that is used to establish safety, efficacy, and equity of the changed AI system. Artificial intelligence systems may differ in the data that were collected for their validation. At one end of this spectrum, a recent autonomous AI system required a full preregistered clinical trial—a pivotal trial—comparing AI output against a level I prognostic standard.13 Depending on the patient risk of harm and the type of change, as set forth in the previous section, the following categories of such methods can be discerned (as an aside, many of these methods require the pivotal trial data of the index AI system to have been escrowed under a so-called algorithmic integrity protocol13):

  1. Regression identity testing to establish nonprobabilistically that, for any input data, changes to the AI system do not result in any change whatsoever in the diagnostic output (see the sketch following this list).

  2. Bench validation to test formally the statistical hypothesis that a change that can impact the AI algorithm, for example, a change in graphics processing unit (GPU), has no impact on the diagnostic output for any input from a given group of participants.

  3. Recursive validation to test formally the statistical hypothesis that a change in input type, such as a change in imaging system, has no impact on the diagnostic output compared with the index AI system output. Recursive validation uses the index AI system output as the reference standard.92 It is similar to a reproducibility study,92 in which the output of the index AI system is compared with that of a modified AI system with the inputs slightly perturbed.

  4. Performance (safety, effectiveness, and equity) bracketing. Analytically, the maximum change in performance metrics caused by a specific change in the algorithm can be calculated and bounded quantitatively, and these brackets can be used to ensure that the changed system's performance remains within expectations and continues to exceed the minimally acceptable criteria that were determined for the index system's pivotal trial.

  5. Escrowed validation study iteration to test the hypothesis statistically that an AI system is not inferior, or possibly superior, to the index AI system. This can be achieved by reusing the inputs of the index AI system validation dataset that were escrowed previously and comparing the outputs of the changed system with that escrowed, established reference standard. Limits exist on the number of iterations that can be achieved, as explored by Ioannidis,93 because each dataset reuse increases the potential for overfitting to the escrowed validation data.94 The degree to which escrowed dataset reuse leads to false-positive claims and overfitting can be quantified through systematic frameworks, including the dataset positive predictive value framework. The success of this approach depends on parameters including the number of available escrowed validation subjects, type 1 and type 2 error rates, and the degree of dependence between outputs of the index AI system and the modified AI system. The validation study needs to have been escrowed as part of the preregistration algorithm integrity process for this to be a valid methodology.13,95

  6. Escrowed validation study expansion to test statistically the hypothesis that the AI system is not inferior or possibly superior because of a change in target patient population. Escrowed validation study expansion reuses the inputs of the index AI system validation dataset that has been escrowed, expands this dataset with participants from the new target patient population, and then compares the outputs of the changed system with the reference standard. Either the identical workflow can be used, or a secondary analysis on the effect, if any, of a change in workflow is required. As new participants are added to the original study for this expansion, information is gained, and this may compensate for the information loss and risk of overfitting from dataset reuse.95 As with escrowed validation study iteration, it is critical to monitor the overall degree of dataset reuse.

Here, “index AI system” refers to the AI system that was validated in a pivotal trial. The term escrowed, under an algorithm integrity protocol, signifies that the human participant input data (including the corresponding reference standard) collected in the pivotal trial are kept inaccessible to the creators by an independent third party. Thus, a complete arm's-length chain of custody exists for any access to or use of these data by the index AI system developer, for example, for retraining a modified AI system, somewhat analogous to the concept of clinical trial preregistration.
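
As an illustration of the first and simplest of these methods, regression identity testing, the following minimal sketch (in Python) runs the index and the changed AI system on the same escrowed inputs and verifies nonprobabilistically that every diagnostic output is identical, recording a cryptographic hash of the serialized outputs as a compact audit artifact; the predict functions and escrowed inputs shown are hypothetical placeholders, not a real device interface.

```python
# Minimal sketch: regression identity testing after a change to an AI system.
# Both systems are run on the same (escrowed) inputs and every diagnostic
# output must match exactly; hashing the serialized outputs gives a compact,
# auditable record of the comparison. The predict functions and input list
# are hypothetical placeholders, not a real device interface.
import hashlib
import json

def output_fingerprint(predict_fn, inputs):
    """Run the system on every input and hash the ordered list of outputs."""
    outputs = [predict_fn(x) for x in inputs]
    serialized = json.dumps(outputs, sort_keys=True).encode("utf-8")
    return hashlib.sha256(serialized).hexdigest(), outputs

def regression_identity_test(index_predict, changed_predict, inputs):
    index_hash, index_out = output_fingerprint(index_predict, inputs)
    changed_hash, changed_out = output_fingerprint(changed_predict, inputs)
    mismatches = [i for i, (a, b) in enumerate(zip(index_out, changed_out))
                  if a != b]
    return index_hash == changed_hash, mismatches

# Hypothetical stand-ins for the index and changed systems and escrowed inputs.
index_predict = lambda x: "refer" if sum(x) > 1.0 else "no refer"
changed_predict = lambda x: "refer" if sum(x) > 1.0 else "no refer"
escrowed_inputs = [[0.2, 0.3], [0.9, 0.4], [0.1, 0.1]]

identical, mismatches = regression_identity_test(index_predict,
                                                 changed_predict,
                                                 escrowed_inputs)
print("outputs identical:", identical, "| mismatching indices:", mismatches)
```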

The above studies of course can be performed by the AI creator or by independent research groups.

Autonomy

Analysis of autonomy of the patient with respect to AI leads to at least 2 important considerations.

The first is the use of patient-derived data, which applies both to training data for the AI system algorithms and to implementation, where the AI system collects input data to determine its outputs. Transparency may include how the AI system creator uses patient-derived data beyond this AI system's intended use. An example is insight into whether patient-derived data are monetized for purposes other than diagnosis by the AI. Autonomy is greater when the collection of patient-derived data is lawful and complies with regulations and best practices. This may include compliance with the Health Insurance Portability and Accountability Act, the Health Information Technology for Economic and Clinical Health Act, other data security aspects of 21 CFR 50, the Declaration of Helsinki, and other statutory and regulatory rules in place, in a manner that is transparent about the purpose and scope for which the data will be used.96 Ideally, patient-derived data used by AI creators are traceable to the patient's authorization to use those data. Those involved in the design of AI systems should be accountable, as stewards of patient-derived data, for protecting patient rights. Auditable processes and security controls help ensure that patient data are used in accordance with the scope for which they were authorized and protect the data from unauthorized use or access.

A current controversy concerns the reward or recognition of clinicians who contribute a reference standard to the patient-derived data incorporated in the intellectual property of an AI system. Such contributions may include their diagnostic work recorded in medical records, subsequently used to train or evaluate an AI system.97 Such claims of ownership collide with a rising public desire for increased control over, and privacy with respect to, electronic data; with emerging regulations that address these concerns (the General Data Protection Regulation [European Union] 2016/679 and the California Consumer Privacy Act, Cal. Civ. Code § 1798.100 et seq.); and with increasing patient activism seeking recognition for contributions to scientific advances.

The second consideration is that liability for AI system malfunction is related to autonomy. Abramoff et al3 previously proposed that creators of autonomous AI systems assume liability for harm caused by the diagnostic output of the device when it is used properly and on label. In their article, they state that this is essential for adoption: it may be inappropriate for clinicians who use an autonomous AI to make a clinical decision they are not comfortable making themselves to nevertheless carry full medical liability for harm caused by that autonomous AI. This view was recently endorsed by the American Medical Association in its 2019 AI policy.6 Such a paradigm for responsibility is more complex for assistive AI, where medical liability may fall only on the provider using it, because the provider is ultimately responsible for the medical decision, or on a combination of both, where even the relative balance of liability between the AI user and the AI creator comes into play.

Meanwhile, as Abramoff et al3 proposed elsewhere, medical decisions by autonomous AI for an individual patient typically cannot be labeled unequivocally as correct or incorrect, especially in chronic diseases, where outcomes may emerge years later. However, for populations of patients, the medical decisions can be compared statistically with the desired decisions, for example, with the claimed correctness, and thus that is where liability should be focused. Another issue is that, although autonomous AI is preferably validated against patient outcome or prognostic standards, these comparisons require enormous resources that are not available for an individual patient when liability is at stake. Instead, the autonomous AI decision may be compared with that of an individual physician or group of physicians, lacking validation, and thus with unknown correspondence to outcome or surrogate outcome. As an aside, this can also be an issue for so-called continuous learning AI systems.

These distinctions will need to be resolved as various AI applications move forward. The legal responsibility for an AI system built in partnership with a large health care system and intended to be used on its patient population is by definition more diffuse and is likely to vest in the sponsoring health care system, or to be apportioned through some comparative or contributory analysis of fault. A privately designed system, sold as a finished product, may need to bear its own responsibility for autonomous output, absent superseding or intervening causation. Responsibility for proper use and maintenance of the AI system, consistent with the terms of service and FDA or other regulatory agency labeling, remains with the provider: the practice of medicine.

Finally, the output of the autonomous AI system, although valid as a diagnostic record from a regulatory perspective, currently is not defined as a medical record when it is not signed off on by a physician. What is and is not, and who can and cannot create, a medical record is determined in the United States primarily by the State medical boards or their equivalent. At present, such boards do not consider an autonomous AI output to have the same medicolegal status as physician documentation, and the legal status of reports generated by AI has been brought to the attention of the United States Federation of State Medical Boards.

Equity

The third bioethical principle is equity. We mentioned previously that we use this term rather than the traditional bioethical term justice for the same concept. Equity primarily concerns the impact at the population level, beyond the impact on an individual patient. In the context of AI, this translates to estimating the differential impact on safety, or on any other characteristic of the AI system, for members of one group with respect to members of other groups. Any differences are referred to as health disparities. For example, inappropriate bias of the AI system may result in the AI system being less safe for one group, characterized according to race,98,99 ethnicity, sex,100 age, income, or other categories, than for another, even though on average it was found to be safe. Any medical process has the potential either to increase or to decrease health disparities, depending on how it is used. Because of the scale at which AI systems operate, their potential to increase or decrease disparities also is magnified tremendously.

Inappropriate bias, an increase in health disparities, and thus decreased equity can manifest across the entire AI pipeline, as Char et al4 outlined, including in the choice of intended use of the AI, its design, its validability, its validation, the choice of reference standards, and how and where it is implemented. For example, with respect to design, the lower validability of a black-box algorithmic approach, with implicit priors and models that cannot be analyzed and evaluated, makes bias harder to anticipate, to detect, and to mitigate. Another design example is how incomplete or unrepresentative training data, or a reliance on complete and representative data that nevertheless reflect and reproduce (at scale) pre-existing health care bias, increases the risk of worsening health disparities. As far as validation is concerned, the selection of study sites and biased inclusion and exclusion criteria can decrease validity for certain subgroups and thereby exacerbate health disparities. Finally, implementing the AI system preferentially in some populations over others may limit access for disadvantaged groups and thereby increase health disparities.

Validation can be used to measure equity by testing for the presence or absence of an effect of predefined subgroup characteristics, typically race, ethnicity, age, and gender, on the performance of the AI system, such as its sensitivity and specificity.101 In addition, differential use across subgroups will affect equity, and such effects can be compared using metrics like population-achieved sensitivity (see the next section).98
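
A minimal sketch of such a subgroup analysis follows (in Python, with hypothetical per-subgroup counts): sensitivity and specificity are estimated separately for each predefined subgroup, with Clopper-Pearson confidence intervals, so that differential performance can be examined; the subgroup labels and counts are assumptions for illustration.

```python
# Minimal sketch: subgroup performance analysis for equity assessment.
# Sensitivity and specificity are estimated per predefined subgroup, with
# Clopper-Pearson confidence intervals, so that differential performance
# can be examined. Counts below are hypothetical.
from scipy.stats import beta

def clopper_pearson(successes, n, alpha=0.05):
    lo = 0.0 if successes == 0 else beta.ppf(alpha / 2, successes, n - successes + 1)
    hi = 1.0 if successes == n else beta.ppf(1 - alpha / 2, successes + 1, n - successes)
    return lo, hi

# Per-subgroup confusion-matrix counts: (tp, fn, tn, fp), hypothetical.
subgroups = {
    "subgroup A": (88, 12, 430, 70),
    "subgroup B": (80, 20, 390, 110),
    "subgroup C": (45, 5, 220, 30),
}

for name, (tp, fn, tn, fp) in subgroups.items():
    sens, sens_ci = tp / (tp + fn), clopper_pearson(tp, tp + fn)
    spec, spec_ci = tn / (tn + fp), clopper_pearson(tn, tn + fp)
    print(f"{name}: sensitivity {sens:.2f} "
          f"(95% CI {sens_ci[0]:.2f}-{sens_ci[1]:.2f}), "
          f"specificity {spec:.2f} "
          f"(95% CI {spec_ci[0]:.2f}-{spec_ci[1]:.2f})")
```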

As mentioned, when analyzing the equity of an AI system, particularly in the context of health disparities, it is useful to consider the implementation context. Different diagnostic processes, including AI systems, may differ in patient friendliness, availability, access, and direct and indirect cost, even with equal sensitivity and specificity (i.e., equally high nonmaleficence).

With respect to intended use and implementation in the context of equity, the goal of the diagnostic process at the population level is to identify the maximum number of true cases of disease in that population. A given diagnostic process, like a high-performing AI system, may have a high sensitivity; that is, nonmaleficence is maximized for those patients who have access. However, if, for example, this AI system is available in only one place, the number of cases identified will not be maximized because many in that population simply never undergo its diagnostic process.

Population-achieved sensitivity, or access-corrected sensitivity, is used to analyze such effects on equity. That is, although an AI system (or any diagnostic process) with very high sensitivity is attractive from an individual (nonmaleficence) perspective, if only a few people have access to the diagnostic AI, the population-achieved sensitivity (PAS), or effective sensitivity at the population level, will be much lower, and so, concomitantly, will its equity:

$\mathrm{PAS} = \dfrac{s_c \, c \, p_c}{c \, p_c + (1 - c)\, \hat{p}_{nc}} \approx s_c \, c,$

where $s_c$ is the sensitivity (as determined in the adherent population), $c$ is compliance (or adherence), $p_c$ is the measured prevalence in the adherent population, and $\hat{p}_{nc}$ is the estimated prevalence in the nonadherent population. When we assume $p_c \approx \hat{p}_{nc}$, that is, that the prevalence of the disease is the same in the nonadherent population as in the adherent population, we can use the simplified estimate $s_c \, c$.102 For example, if compliance $c$ with the diabetic eye examination is 15%102 and the minimum acceptable sensitivity is 85%,13 then the PAS is 0.13. That is, only 13% of cases in the population will be identified correctly with this diagnostic system. In many cases, the prevalence in the part of the population that does not undergo the AI system is actually even higher than in the adherent population, so this estimate of PAS forms an upper bound. It is useful to consider PAS when determining the minimum acceptable sensitivity: a more accessible AI system may have a lower $s_c$ but still result in a higher PAS because compliance is higher.
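
The following minimal sketch (in Python) implements the PAS expression above, both the full form and the simplified estimate $s_c \, c$, using the 85% sensitivity and 15% compliance figures from the worked example; the prevalence values are hypothetical assumptions.

```python
# Minimal sketch: population-achieved (access-corrected) sensitivity.
# Implements PAS = (s_c * c * p_c) / (c * p_c + (1 - c) * p_nc_hat)
# and the simplified estimate s_c * c when p_c is close to p_nc_hat.

def population_achieved_sensitivity(sens, compliance, prev_adherent,
                                    prev_nonadherent=None):
    if prev_nonadherent is None:          # assume equal prevalence
        prev_nonadherent = prev_adherent
    cases_identified = sens * compliance * prev_adherent
    cases_in_population = (compliance * prev_adherent
                           + (1.0 - compliance) * prev_nonadherent)
    return cases_identified / cases_in_population

sens = 0.85        # minimum acceptable sensitivity from the worked example
compliance = 0.15  # compliance with the diabetic eye examination

print("simplified PAS (s_c * c):", round(sens * compliance, 3))
# Hypothetical prevalences: higher prevalence in the nonadherent population
# lowers PAS further, so the simplified estimate is an upper bound.
print("PAS with p_c = 0.08, p_nc = 0.12:",
      round(population_achieved_sensitivity(sens, compliance, 0.08, 0.12), 3))
```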

Conclusions

The considerations in this article are a useful first step in the development of a bioethically sound foundation, based in nonmaleficence, autonomy, and equity, for the design, validation, and implementation of AI systems. The exceptional and diverse experience within the CCOI's FPOAI working group means it is well placed to develop and evaluate such a foundation. Future FPOAI consensus statements and cooperation among AI creators, industry, ethicists,3,4 clinicians, patients, and regulatory agencies are key to facilitating rapid innovation of AI technologies and their successful implementation in clinical medicine. Such global collaboration will adhere to bioethical principles and will guide the development and use of clinical AI, helping to make fundamental improvements in the accessibility and quality of health care, to decrease disparities, and to lower the overall cost of health care.

Acknowledgments

Supported in part by The Robert C. Watzke, MD, Professorship (M.D.A.) and by Research to Prevent Blindness, Inc, New York, New York (unrestricted grants to the Department of Ophthalmology and Visual Sciences, University of Iowa [M.D.A.]; the Department of Ophthalmology, University of Wisconsin [B.B.]; and the Department of Ophthalmology, Stanford University [T.L.]).

Members of the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group as of writing: Michael D. Abràmoff, MD, PhD (Chair, University of Iowa); Malvina B. Eydelman, MD (Center for Devices and Radiological Health, Office of Health Technology 1, United States Food and Drug Administration); Brad Cunningham, MSE (Center for Devices and Radiological Health, Office of Health Technology 1, United States Food and Drug Administration); Bakul Patel, MBA (Center for Devices and Radiological Health, Digital Health Center of Excellence, United States Food and Drug Administration); Karen A. Goldman, PhD, JD (OPP, United States Federal Trade Commission); Danton Char, MD, MS (Stanford University); Taiji Sakamoto, MD (Kagoshima University, Japanese Ophthalmological Society); Barbara Blodi, MD (Department of Ophthalmology, University of Wisconsin); Risa Wolf, MD (Department of Pediatrics, Johns Hopkins University); Jean-Louis Gassee (Apple); Theodore Leng, MD, MS (Department of Ophthalmology, Stanford University School of Medicine); Dan Roman (Director Diabetes Measures, National Committee of Quality Assurance); Sally Satel (Yale, AEI, data usage ethics); Donald Fong (Kaiser Permanente); David Rhew (Chief Medical Officer, Microsoft); Henry Wei (Google Health); Michael Willingham (Google Health); Michael Chiang, MD, PhD (Director, National Eye Institute); Mark Blumenkranz, MD (Facilitator, Stanford University). Although the members’ main affiliations are stated, they do not in every case represent their institution or company. Members of the Collaborative Community on Ophthalmic Imaging Executive Committee: Michael Abramoff, MD, PhD; Mark Blumenkranz, MD; Emily Chew, MD; Michael Chiang, MD; Malvina Eydelman, MD; David Myung, MD, PhD; Joel S. Schuman, MD; and Carol Shields, MD.

The FDA participates in the Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group as a member of the Collaborative Community on Ophthalmic Imaging Foundation. This manuscript reflects the views of the authors and should not be construed to represent FDA’s views or policies.

Abbreviations and Acronyms:

AI = artificial intelligence

CCOI = Collaborative Community on Ophthalmic Imaging

CT = computed tomography

DICOM = Digital Imaging and Communications in Medicine

FDA = U.S. Food and Drug Administration

FPOAI = Foundational Principles of Ophthalmic Imaging and Algorithmic Interpretation Working Group of the Collaborative Community on Ophthalmic Imaging, Washington, DC

GPU = graphics processing unit

ISO = International Organization for Standardization

MRI = magnetic resonance imaging

PAS = population-achieved sensitivity

SaMD = Software as a Medical Device

Footnotes

Disclosure(s):

All authors have completed and submitted the ICMJE disclosures form.

Michael Chiang, an editor of this journal, was recused from the peer-review process of this article and had no access to information regarding its peer review.

The author(s) have made the following disclosure(s): M.D.A.: Executive Chairman, Equity Owner, Founder, Patents and other Intellectual Property, Royalties, Consultant – Digital Diagnostics Inc, Coralville, Iowa.

J.M.K.: Chair – American Academy of Dermatology Committee on Augmented Intelligence; Equity owner – Skin Analytics

HUMAN SUBJECTS: No human subjects were included in this study. No animal subjects were included in this study.

References
