Published in final edited form as: Am J Bioeth. 2022 May;22(5):1–3. doi: 10.1080/15265161.2022.2059199

Research on the Clinical Translation of Health Care Machine Learning: Ethicists’ Experiences on Lessons Learned

Jennifer Blumenthal-Barby 1, Benjamin Lang 1, Natalie Dorfman 1, Holland Kaplan 1, William B Hooper 1, Kristin Kostick-Quenet 1

The application of machine learning (ML) in health care holds great promise for improving care. Indeed, our own team is collaborating with experts in machine learning and statistical modeling to build a model that would predict personalized risks for patients considering left-ventricular assist device (LVAD) therapy for heart failure.

In a target article in this issue, McCradden et al. (2022) offer an ethical framework to guide ML researchers like us. Their framework or “pipeline” involves three phases of research: (1) an exploratory ML research phase, in which researchers obtain and clean the data, develop a model, and train it; (2) a silent period evaluation, in which the model is tested in a live environment without directly impacting patient care (its predictions are assessed against a reference standard; its impact on clinical flow, including usability, is examined; and its outputs are kept invisible, or “masked,” to the clinical team), with the aim of showing enough evidence to reach a state of “equipoise,” or genuine uncertainty about the ML model versus the standard of care, thereby justifying a move toward clinical evaluation; and (3) a prospective clinical evaluation, which tests the model’s impact on patient care and outcomes and can involve observational, quasi-interventional, or interventional clinical trials.
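
To make the masking requirement of phase (2) concrete, the sketch below shows one way a silent-period evaluation loop might be wired up. This is a minimal illustration of the general idea, not McCradden et al.’s protocol: the model interface, log file, and outcomes table are all hypothetical assumptions of ours.

```python
# Minimal sketch of a "silent period" loop: the model scores live clinical data,
# but its outputs go to a research-only log rather than the chart, so clinicians
# never see them. All names (the predict_proba interface, the log file, the
# reference outcomes) are hypothetical.
import csv
from datetime import datetime, timezone

SILENT_LOG = "silent_period_predictions.csv"  # research store, not the EHR

def log_silent_prediction(model, patient_id: str, features: list[float]) -> None:
    """Score a patient and record the prediction without surfacing it to the care team."""
    risk = model.predict_proba([features])[0][1]  # sklearn-style classifier assumed
    with open(SILENT_LOG, "a", newline="") as f:
        csv.writer(f).writerow([patient_id, datetime.now(timezone.utc).isoformat(), risk])
    # Deliberately returns nothing to the calling clinical system: outputs stay masked.

def evaluate_against_reference(log_rows: list, reference_outcomes: dict) -> float:
    """Once follow-up accrues, compare logged predictions with a reference standard
    (e.g., observed survival free of adverse events)."""
    from sklearn.metrics import roc_auc_score
    matched = [r for r in log_rows if r[0] in reference_outcomes]
    y_true = [reference_outcomes[r[0]] for r in matched]
    y_pred = [float(r[2]) for r in matched]
    return roc_auc_score(y_true, y_pred)
```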

Our research team’s experience has taught us important lessons about the practical and ethical challenges involved in these phases of research, and we share and reflect on those lessons here.

CHALLENGES IN THE EXPLORATORY RESEARCH PHASE

The first step in the exploratory phase of ML research is obtaining good clinical data (often large datasets) to work with. While secondary-use provisions technically allow patient data to be used to build ML models that improve health care outcomes, these data often become “owned” by companies (e.g., device companies), professional societies, and institutions. This renders the data largely inaccessible to researchers, even if patients gave (or would give) permission for their data to be used. These groups then sell such data, or, even more alarmingly, decide who may buy them. This “data gatekeeping” raises serious ethical concerns and can serve as an unanticipated obstacle to exploratory research in health care ML, making it difficult even to get ML research off the ground.

THE SILENT PERIOD EVALUATION—A FEW CHALLENGES

The silent period that McCradden et al. propose is critical for demonstrating that the model can perform well in the real world and can potentially add value at the point of clinical care. In our case, part of what this involves is assessing whether the ML-based LVAD risk predictor accurately predicts patient risks (e.g., survival free of adverse events). One of the major challenges here is what we call a “temporal challenge.” In our case, clinical outcomes for LVAD start at 30 days (e.g., survival or risk of stroke at 30 days) but extend out 5–10 years. Within that period, advances in medical and mechanical therapies for heart failure can significantly affect outcomes. Because implementing a silent trial is a time-costly endeavor, those very advances may render the model obsolete by the time it is implemented in clinical practice.
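
One way to make this temporal challenge measurable, rather than only anticipating it, is to validate a frozen model across successive treatment eras and watch for performance decay. The sketch below is our own illustration; the column names and era boundaries are hypothetical.

```python
# Sketch: quantify temporal decay by evaluating a fixed model on successive
# implant "eras". The schema (implant_year, outcome, feature columns) and the
# era cut points are hypothetical placeholders, not a real registry layout.
import pandas as pd
from sklearn.metrics import roc_auc_score

def performance_by_era(model, df: pd.DataFrame, feature_cols: list[str]) -> pd.Series:
    """AUC per implant era; a downward trend across eras suggests that advances
    in therapy are eroding the model's relevance."""
    eras = pd.cut(df["implant_year"], bins=[2009, 2014, 2019, 2024],
                  labels=["2010-14", "2015-19", "2020-24"])

    def era_auc(g: pd.DataFrame) -> float:
        if g["outcome"].nunique() < 2:
            return float("nan")  # AUC undefined for a single-class era
        return roc_auc_score(g["outcome"], model.predict_proba(g[feature_cols])[:, 1])

    return df.groupby(eras, observed=True).apply(era_auc)
```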

A second challenge involves what we call a “feasibility problem.” Implementing a silent trial for a simple diagnostic model (e.g., one that helps detect whether a patient has cancer) is more straightforward than implementing one for a model such as ours, which aims to give personalized risk profiles. For such models, a silent trial increases monetary and personnel costs, which may act as disincentives to implementing McCradden et al.’s framework.

A unique element of the framework that McCradden et al. propose is the silent evaluation’s second requirement: examining impacts on clinical flow and usability while keeping outputs invisible to clinicians and patients so as not to affect care or decision-making. While we agree with the value proposition of examining model performance and utility in real clinical contexts (a bedrock of implementation research), it is hard to know how to assess these factors while simultaneously keeping the model masked or silent. Examining impacts on clinical flow and usability, by definition, requires overt introduction of a model’s outputs into clinical flow. Avoiding resultant impacts on clinical care is not likely to be achieved by urging clinicians to use the model “hypothetically.” We ourselves considered such an approach but ultimately abandoned the idea because we lacked control over how clinicians might utilize even “hypothetical” model outputs.

An additional challenge—raised in our own research—is that many clinicians prefer to have a better understanding of likely impacts of a model on clinical decision making before progressing to the prospective evaluation stage. In order to evaluate these impacts, a model cannot be blinded—at least not for clinician stakeholders. Thus, a silent period might entail silence for some stakeholders (e.g., patients) but not others (clinicians). Depending on the nature of the proposed intervention, this could pose ethical problems, particularly in cases where clinicians are privy to information that patients may not be, potentially tipping the balance of knowledge and agency in clinical decisions.

These challenges lead to difficult ethical tradeoffs. We agree with McCradden et al. that the silent period of research is ethically desirable in that we gain more knowledge about the potential harms and benefits of ML technologies before implementing and studying their use in clinical care. At the same time, treating the silent period as a requirement for all ML research comes with costs, especially in applications such as ours. Running a silent phase would borrow valuable time from other, equally important formative research, including exploring stakeholder perspectives and potential impacts on clinical decision making. One suggestion to consider is that the imposition of a silent trial should be proportionate to the perceived risks and informed by a cost-benefit analysis. COVID-19 reignited debates over the ethics of compressing trial phases and running human trials concurrently, and other health care research may prove similarly urgent or beneficial.

PROSPECTIVE CLINICAL VALIDATION… AND BEYOND

The final phase of the research pipeline that McCradden et al. propose for health care ML-based tools is prospective clinical validation. The point here is to examine the impact on patient outcomes: does the tool improve outcomes, for whom, and under what circumstances? In our case of decision-support ML, we are curious about how having personalized risk/benefit information associated with LVAD therapy impacts decision-making and preparation for dealing with post-implant events. We will also examine impact on clinical workflow, potential over- (or under-) reliance on the ML model, and whether the likely use and impacts of our model are equitable across different patient populations. We appreciate McCradden et al.’s recommendation that best practices should involve researchers reporting on the model’s features, including representativeness of training data, model performance divided by subgroups, stakeholder involvement in algorithm development, data missingness, internal and external validation, and error analysis.
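
As one illustration of how such reporting items can be operationalized, performance divided by subgroups (together with data missingness) can be tabulated mechanically. The sketch below assumes a hypothetical validation dataframe with “outcome,” “predicted_risk,” and subgroup columns; it is our own illustration, not a prescribed reporting format.

```python
# Sketch of subgroup performance reporting, one of the best practices listed
# above. The dataframe schema ("outcome", "predicted_risk", plus a subgroup
# column) is a hypothetical stand-in for a real validation dataset.
import pandas as pd
from sklearn.metrics import brier_score_loss, roc_auc_score

def subgroup_report(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Discrimination (AUC) and calibration (Brier score) per subgroup, with the
    sample size and missingness needed to contextualize those numbers."""
    rows = []
    for group, g in df.groupby(group_col, dropna=False):
        auc = (roc_auc_score(g["outcome"], g["predicted_risk"])
               if g["outcome"].nunique() > 1 else float("nan"))
        rows.append({
            group_col: group,
            "n": len(g),
            "share_rows_with_missing": g.isna().any(axis=1).mean(),
            "auc": auc,
            "brier": brier_score_loss(g["outcome"], g["predicted_risk"]),
        })
    return pd.DataFrame(rows)
```

A NaN AUC for a subgroup with only one outcome class is itself informative: it flags strata too small or too homogeneous to evaluate, which bears directly on the equity questions above.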

The pipeline should not end here, however. We propose extending McCradden et al.’s framework to include a fourth phase, which we call “the continual reevaluation stage.” As the literature on context bias (Egglin 1996) and data/concept drift (Lu et al. 2018) suggests, there should be a permanent, iterative fourth stage that continues the work of searching the ML algorithm for biases (such as racial and gender bias; Kostick-Quenet et al. 2022) and for other technical issues (such as under- or overfitting). The continual reevaluation stage involves ongoing oversight throughout the implementation and utilization of models in health care settings. Within this stage, clinicians and modelers should both play significant roles in ensuring the model’s predictive accuracy and its continued relevance to clinical populations and treatment contexts, as well as to medical discoveries and technological developments.
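
Parts of such a reevaluation stage could be automated while keeping humans in the loop. As a minimal sketch (our own illustration; the window size and alert threshold are hypothetical, unvalidated values), a monitor might track rolling discrimination on recently adjudicated cases and flag degradation for human review:

```python
# Sketch of an automated monitor for the proposed continual reevaluation stage:
# track rolling discrimination on recent cases and flag degradation for human
# review, rather than silently auto-updating the model. The window size and
# tolerance are hypothetical, not validated values.
from collections import deque
from sklearn.metrics import roc_auc_score

class DriftMonitor:
    def __init__(self, baseline_auc: float, window: int = 200, tolerance: float = 0.05):
        self.baseline_auc = baseline_auc
        self.tolerance = tolerance
        self.buffer = deque(maxlen=window)  # (outcome, predicted_risk) pairs

    def record(self, outcome: int, predicted_risk: float) -> bool:
        """Add an adjudicated case; return True if drift should be flagged."""
        self.buffer.append((outcome, predicted_risk))
        if len(self.buffer) < self.buffer.maxlen:
            return False  # not enough recent cases to judge
        y_true, y_pred = zip(*self.buffer)
        if len(set(y_true)) < 2:
            return False  # AUC undefined on a single outcome class
        return roc_auc_score(y_true, y_pred) < self.baseline_auc - self.tolerance
```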

Implementing a fourth stage of continual reevaluation can help to:

  1. Detect potential creep of algorithmic biases into the ML model(s) over time, particularly as standards of care evolve and the populations served by the model change.

  2. Account for potential alterations in relevant clinical inputs for the condition being addressed by the ML model.

  3. Adhere to the ethical principle of maintaining human-in-the-loop (Braun et al. 2021).

  4. Serve as a stopgap for user cognitive and decisional biases, such as over-reliance and automation bias, that may develop over time without periodic reevaluation.

In sum, we are appreciative of the excellent framework developed by McCradden and colleagues to guide researchers in health care ML. Our suggestions for modifying and extending their framework are informed by our experiences as researchers moving through these phases of research that involve complex practical interdependencies and ethical tradeoffs.

FUNDING

Funding for this editorial was provided by the Agency for Healthcare Research and Quality under grant number 1R01HS027784-01.

REFERENCES

  1. Braun M, Hummel P, Beck S, and Dabrock P. 2021. Primer on an ethics of AI-based decision support systems in the clinic. Journal of Medical Ethics 47 (12): e3. doi:10.1136/medethics-2019-105860.
  2. Egglin TK. 1996. Context bias: A problem in diagnostic radiology. JAMA 276 (21): 1752–55. doi:10.1001/jama.276.21.1752.
  3. Kostick-Quenet KM, Cohen IG, Gerke S, Lo B, Antaki J, Movahedi F, Njah H, Schoen L, Estep JE, and Blumenthal-Barby JS. 2022. Mitigating racial bias in machine learning. Journal of Law, Medicine and Ethics 50 (1): 92–100. doi:10.1017/jme.2022.13.
  4. Lu J, Liu A, Dong F, Gu F, Gama J, and Zhang G. 2018. Learning under concept drift: A review. IEEE Transactions on Knowledge and Data Engineering 31 (12): 2346–63. doi:10.1109/TKDE.2018.2876857.
  5. McCradden MD, Anderson JA, Stephenson EA, Drysdale E, Erdman L, Goldenberg A, and Zlotnik Shaul R. 2022. A research ethics framework for the clinical translation of healthcare machine learning. The American Journal of Bioethics 22 (5): 8–12. doi:10.1080/15265161.2021.2013977.
