Abstract
Objective
Identifying ethical concerns with ML applications to healthcare (ML-HCA) before problems arise is now a stated goal of ML design oversight groups and regulatory agencies. The lack of an accepted standard methodology for ethical analysis, however, presents challenges. In this case study, we evaluate use of a stakeholder “values-collision” approach to identify consequential ethical challenges associated with an ML-HCA for advance care planning (ACP). Identification of ethical challenges could guide revision and improvement of the ML-HCA.
Materials and Methods
We conducted semistructured interviews of the designers, clinician-users, affiliated administrators, and patients, and inductive qualitative analysis of transcribed interviews using modified grounded theory.
Results
Seventeen stakeholders were interviewed. Five “values-collisions,” where stakeholders disagreed about decisions with ethical implications, were identified: (1) end-of-life workflow and how model output is introduced; (2) which stakeholders receive predictions; (3) benefit-harm trade-offs; (4) whether the ML design team has a fiduciary relationship to patients and clinicians; and (5) how and whether to protect early deployment research from external pressures, such as news scrutiny, before the research is completed.
Discussion
From these findings, the ML design team prioritized: (1) alternative workflow implementation strategies; (2) clarification that the prediction was evaluated only for ACP need, not for other mortality-related ends; and (3) shielding the research from scrutiny until endpoint-driven studies were completed.
Conclusion
In this case study, our ethical analysis of this ML-HCA for ACP identified multiple sites of stakeholder disagreement that mark areas of ethical and value tension. These findings provided a useful initial ethical screening.
Keywords: machine learning, clinical, artificial intelligence, ethics, palliative care, end-of-life care
INTRODUCTION
With the rapid increase in the availability of healthcare data collected during clinical care and recent advances in machine learning (ML) tools, ML has become a clinical reality.1,2 ML tools promise to improve the quality of, and access to, healthcare. However, significant ethical concerns have emerged with applications of ML tools in nonhealthcare contexts. Identifying ethical concerns with ML applications to healthcare (ML-HCA) before problems arise is now a stated goal of ML design oversight groups and regulatory agencies, such as the US FDA.3,4 A challenge to this goal is the lack of an accepted standard methodology for ethical analysis of ML-HCAs.
Thus far, the few examinations of ethical issues emerging with ML-HCAs have been broad and largely theoretical,5–9,14 extrapolating ethical issues that have emerged in nonhealthcare contexts. Identification of ethical issues arising in actual ML-driven applications for healthcare in situ is needed to expand understanding of the concerns emerging with ML uses and to guide approaches to address these concerns.5,7 There have been broad calls to address the lack of frameworks for operationalizing AI in healthcare, including a lack of guidance on best practices for inclusivity and equity.5 Scoping reviews have identified a dearth of literature in a variety of settings, including public health contexts and low- and middle-income countries.10 There is also limited clinical acceptance of AI decision support tools by providers.11 There have been attempts to create guidelines for ethical development and deployment of AI12,13 as well as to guide policymakers and regulators on important considerations of cost and efficiency.14,15 However, there remains a gap in methodology for identifying emerging ethical problems with a specific ML-HCA.16 Identifying specific ethical problems, grounded in an actual ML-HCA, will help clarify ethical decisions and their interconnected consequences, which will improve ethical decision-making.17
Attempts to demonstrate algorithmic fairness typically focus narrowly on model performance, demonstrating parity (or lack of bias) in model outputs for preidentified populations of interest.18,19 These analyses do not examine the consequences of taking actions based on the model’s output.20 For any ML-HCA, consequential ethical problems are likely to involve 3 interacting elements that determine how useful the ML-HCA is: (1) model characteristics, including the assumptions, underlying data, design, and risk-estimate output of the model; (2) how the model output is implemented into a given workflow, including the decision regarding what risk level warrants intervention; and (3) the intervention’s benefits and harms and the trade-offs posed.21 Key stakeholders might be guided by different values regarding each of these 3 elements, resulting in what might be called a “values-collision.”
We recently published a rigorous, broadly applicable stakeholder “values-collision” approach for identifying emerging ethical concerns with ML-HCAs, conceptually grounded in the pipeline or steps from design to clinical implementation.22 This approach relies on 6 premises: (1) multiple stakeholders are impacted by any ML-HCA and these stakeholders can be identified; (2) stakeholder groups are likely to have different values, and significantly different explicit or implicit goals for the ML-HCA can be ascertained through interviews; (3) the process of design and development of an ML-HCA involves making a series of decisions, from initially conceiving of the problem to be addressed by an ML-HCA to actually deploying and maintaining it; (4) how a stakeholder makes these decisions, or would want these decisions to be made, reflects their underlying values, as such decisions often are based on value judgments;23 (5) where stakeholder groups disagree or their values are at odds about resolving these decisions—where values collide—are where ethical problems are most likely to emerge;23,24 (6) while some of these “values-collisions” may mark novel ethical concerns, many may be resolvable by drawing on prior ethical scholarship on similar or related problems, allowing designers and users of an ML-HCA to address ethical challenges before they become consequential.
In this case study, we applied our “values-collision” approach to a machine-learning enabled workflow for identifying patients for advance care planning (ACP). Currently, a significant gap exists between patients’ desires concerning how they wish to spend their final days and how they actually spend them.25 ACP could help close this gap, as it could substantially expand patients’ access to important components of palliative care. Yet ACP remains uncommon: fewer than 7.5% of Americans over 65 years of age have completed ACP.26 ML methods, by efficiently identifying patients likely to benefit most from ACP, could play a crucial role in focusing ACP information interventions and thus in closing the gap. Ethical concerns have been raised surrounding the use of ACP and mortality predictions, specifically questions about reducing the use of services rather than aligning care with patient goals.27,28 The model evaluated here identifies individuals who might benefit from ACP by addressing a proxy problem: predicting the probability of a given patient passing away within the next 12 months.29 ML approaches to predicting need for ACP have the advantage of being scalable to large populations in ways other mortality prediction models are not.30,31
In this study, we demonstrate that we could identify consequential ethical challenges associated with implementing an ML-HCA for ACP, which subsequently guided direct revision and improvement of the ML-HCA.
The mortality prediction model
The design team20 developed a gradient boosted tree model to estimate the probability of 1-year, all-cause mortality upon inpatient admission, using a deidentified, retrospective dataset of EHR records for adult patients seen at Stanford Hospital between 2010 and 2017. This dataset comprised 907 683 admissions and had a prevalence of 17.6% for 1-year all-cause mortality. The performance of the resulting model was evaluated prospectively against clinician chart review. The design team gathered the validation data over 2 months in the first half of 2019; each day in that interval, they presented an experienced palliative-care advanced-practice (AP) nurse with a list of patients newly admitted to General Medicine at Stanford Hospital. These lists were in random order, and no model output was presented. The nurse performed chart reviews to answer the question, “Would you be surprised if this patient passed away in the next twelve months?” This “surprise” question is an established screening approach for identifying palliative care needs.32 There was strong agreement between model assessment and AP nurse response: AUROC of 0.86 (95% confidence interval, 0.80–0.919).
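To make the modeling approach above concrete, the following is a minimal sketch of fitting a gradient boosted tree classifier to admission-level features and reporting AUROC on a held-out split. It is illustrative only and is not the design team's actual pipeline; the file name, column names, and hyperparameters are assumptions.

```python
# Minimal illustrative sketch (not the design team's actual pipeline):
# fit a gradient boosted tree for 1-year all-cause mortality and report AUROC.
# The CSV file and column names below are hypothetical placeholders.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical deidentified admission-level dataset: one row per admission,
# numeric features derived from the EHR, and a 0/1 label for death within
# 12 months of admission.
df = pd.read_csv("admissions_features.csv")
X = df.drop(columns=["died_within_12_months"])
y = df["died_within_12_months"]

# Hold out a test split to estimate discrimination (the published model was
# additionally validated prospectively against clinician chart review).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

model = GradientBoostingClassifier(random_state=0)
model.fit(X_train, y_train)

# Predicted probability of 1-year mortality for each held-out admission.
risk = model.predict_proba(X_test)[:, 1]
print(f"AUROC: {roc_auc_score(y_test, risk):.2f}")
```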
METHODS
We used a qualitative approach, interviewing ML-HCA designers, administrators, clinicians involved in ACP, and patients affected, or potentially affected, by the ML-HCA to investigate how these stakeholders anticipate and envision the impact of the ML-HCA. Our focus was threefold: identifying affected or potentially affected stakeholder groups and their relation to each other; eliciting stakeholders’ views on practical and ethical considerations with the ML-HCA’s design, implementation, and use; and identifying where stakeholder groups disagreed about practical or ethical considerations, that is, where their values collided.
Study setting and recruitment
The General Medicine service at Stanford Hospital is the admitting hospital service for nonintensive-care patients with nonsurgical illness. In the last year, the service admitted 3,143 patients (∼9/day). Currently, the palliative care team sees an average of 137 new patients per month, with approximately 25% of these coming from the General Medicine service. The palliative care clinicians are unable to manually screen all admitted patients; in the General Medicine wards, 5–6% of all admissions receive a palliative care consult. At the time the interviews were conducted, admitted patients were being screened by the ML-HCA for potential ACP needs at admission; these screening results were reviewed by the palliative care service to assess clinician agreement with the ML-HCA, but not yet acted upon. The study site was planning a pilot implementation in which the palliative care service would begin using the ML-HCA results as a screening tool for prioritizing palliative care consultation to discuss ACP with admitted patients, and this study was conducted during that pilot implementation.
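As an illustration of how such a screening step might work in code, the sketch below ranks one day's admissions by predicted 1-year mortality risk and flags the highest-risk patients for palliative care review of ACP need. This is a hypothetical sketch, not the site's implementation; the risk threshold, data structure, and field names are assumptions.

```python
# Illustrative sketch of the screening step (not the site's implementation):
# rank a day's admissions by model risk score and flag the highest-risk
# patients for palliative care review. Threshold and fields are hypothetical.
from dataclasses import dataclass

@dataclass
class Admission:
    patient_id: str
    predicted_1yr_mortality: float  # model output in [0, 1]

def acp_screening_list(admissions: list[Admission],
                       risk_threshold: float = 0.5) -> list[Admission]:
    """Return admissions at or above the threshold, highest risk first,
    for the palliative care service to review (not an automated referral)."""
    flagged = [a for a in admissions if a.predicted_1yr_mortality >= risk_threshold]
    return sorted(flagged, key=lambda a: a.predicted_1yr_mortality, reverse=True)

# Example: roughly 9 admissions/day arrive on the General Medicine service.
todays_admissions = [
    Admission("pt-001", 0.82),
    Admission("pt-002", 0.12),
    Admission("pt-003", 0.57),
]
for a in acp_screening_list(todays_admissions):
    print(a.patient_id, round(a.predicted_1yr_mortality, 2))
```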
Interviewees (the clinicians, administrators, and designers involved with the ML-HCA; clinicians on the palliative care service; and patient volunteers who had previously been admitted to the general medicine service) were initially approached via telephone or email. No participants dropped out of the study. Recruitment stopped after thematic saturation was reached; not all patients or clinicians working on the general medicine or palliative care services were interviewed. Inclusion criteria comprised designers (ie, the bioinformatics programming team who designed the algorithm), hospital administrators, clinicians, and patients or patient representatives impacted or potentially impacted by the mortality-prediction algorithm used to guide ACP. All others were excluded.
Data collection and analysis
We used one-on-one interviews, a technique that has been found productive for discussing sensitive topics and is well suited to exploratory research seeking a range of perspectives.33–39 The study was approved by the IRB of the Stanford University School of Medicine. Informed consent was verbal, given the low-risk nature of the study, and was obtained from all participants.
A semistructured interview guide of open-ended questions, intended to elicit participants’ perspectives on the design, clinical use, and ethical concerns with the ML-HCA and its application to ACP, was piloted with 5 stakeholders (designers and clinicians). Following these pilot interviews, explicit questions about trust in results, data safety, bias, and equity concerns were added to the interview guide (Supplementary Material).
Selection of subsequent interviews was guided by information revealed in the coding process (ie, following leads). Interviews were conducted either in person or over video conferencing and were audio-recorded and transcribed. Though every effort was made to conduct interviews in person, video interviews were used to accommodate constraints from the COVID-19 pandemic. No differences were noted in the nature of responses to video interviews compared with in-person interviews; both formats resulted in discussions of similar length with coverage of all interview questions and prompts. The primary investigators made initial contact with all potential participants and conducted all interviews. Transcripts were uploaded into the qualitative analysis software Dedoose (www.dedoose.com), and interview data were analyzed inductively using grounded theory.40–44 However, we modified our grounded theory approach in 2 important ways. First, we used a semistructured interview guide to ensure we elicited interviewee perspectives on decisions with value implications along the design-to-clinical-implementation pipeline, as well as perspectives on known ethical concerns with AI tools potentially stemming from such decisions (Supplementary Material). Second, codes were both inductive, derived from the interviews, and a priori, derived from the decisions and ethical topics included in the interview guide. Codes were generated through collaborative reading and analysis of a subset of interviews and then finalized through successive iterations into categories and codes. At least one primary and one secondary coder independently coded each transcript, and differences were reconciled through consensus coding. Consensus coding was also used to identify and characterize areas of disagreement between and within stakeholder groups (or “values-collisions”) around design and implementation decisions and their codes. An inter-rater reliability score was not calculated; instead, codes and themes were discussed to consensus. Emerging themes were identified, described, and discussed by the research group. Interviews continued until saturation was achieved.
Coding was also used to create a relationship map between the various stakeholder groups (Figure 1), illustrating the wide range of relationships and pressures that exist between stakeholders in the context of the prediction algorithm and mapping where potential “values-collisions” arise. Finally, we presented and discussed our findings with the pilot ML design team, who prioritized several changes to the pilot ML implementation to address the identified ethical concerns.
RESULTS
Participants
Study participants included designers, administrators, clinicians, and patients (Table 1). The response rate to requests for participation was 100%, likely aided by individual emails and telephone calls. Interviews lasted approximately 20 to 60 minutes.
Table 1.
Role | n (%); N = 18 | Gender (W:M) |
---|---|---|
Clinician | 8 (44%) | 4:4 |
Designer | 3 (17%) | 0:3 |
Administrator | 3 (17%) | 1:2 |
Patient | 4 (22%) | 2:2 |
Seven themes of values conflicts, that is, areas where stakeholders disagreed about decisions with ethical implications, emerged from the interviews. These were identified as major themes based on how frequently they were discussed across interviews. We grouped the themes under the 3 interacting umbrellas affecting the usefulness of a model’s implementation: model characteristics, model implementation, and intervention benefits and harms.
Model Characteristic: Bias and Perpetuation of Bias, Perspectives on Death and End of Life Care, and Transparency and Evaluation of Efficacy
Model Implementation: Who Should Receive ML Output, Patient Consent and Involvement
Intervention Benefits and Harms: External Pressures and Study Integrity, Palliative Care Legitimization
From these 7 sets of “values-collisions” emerged key ethical considerations, which are outlined in Table 2 and further explored in the Discussion.
Table 2.
Ethical concern | Patient value | Clinician value | Designer value |
---|---|---|---|
1: Model characteristics: bias and perpetuation of bias | Patients want the data to include individuals like themselves | Clinicians were concerned that ML could perpetuate further inequalities | Designers were concerned the algorithm better identifies individuals with more complete healthcare records |
2: Model characteristics: perspective on death and end-of-life care | Patients think the model would not impact end-of-life care decisions | Clinicians think the model could improve end-of-life care conversations | Designers feel the model would guide ACP and impact end-of-life care decisions |
3: Model characteristics: transparency and evaluation of efficacy | Patients did not consider details important but would like an overall idea of how the prediction works | Clinicians would like to know how the algorithm works, with an emphasis on prespecified trial endpoints | Designers feel it is more important to demonstrate algorithm validation than methodology |
4: Model implementation: who should receive ML output? | Patients feel it is important to receive predictions from a trusted clinician, such as a PCP | Clinicians’ concerns surround the algorithm further burdening the palliative care team | Designers feel the intervention has a low pretest probability and a nonharmful outcome, making it an ideal initial ML “test case” |
5: Model implementation: patient consent and involvement | Patients would like knowledge of the mortality prediction | Clinicians agree with patient knowledge if accompanied by conversation | Designers feel the algorithm may not be an accurate predictor of mortality and should not be shown to patients, given the risk of misinterpretation |
6: Intervention benefits and harms: external pressures and study integrity | No key value collision | Clinicians had concerns about media coverage | Designers had concerns about PR blowback if the ML tool’s use and intent were misinterpreted |
7: Intervention benefits and harms: palliative care legitimization | No key value collision | Clinicians felt the model could add legitimacy to palliative care consultations, reflecting concerns about the legitimacy of palliative care as an intervention | Designers desired to improve access to palliative care and ACP |
Theme 1 model characteristics: Bias and perpetuation of bias
Patients were concerned with ensuring that the data included individuals like themselves. It was important for patients to feel that the data reflected themselves and their loved ones so that the model would provide accurate outputs.
Patient perspective: “I think my limited understanding of [machine learning]…is that the program learns based on inputs…so my concern would be that the inputs it’s given as it pertains to my life or my loved one’s life are similar.” -Individual 17 (Patient)
From a similar perspective, clinicians were worried about the ML technology’s impact on the system and its potential to perpetuate further inequalities. They felt it important that social justice be addressed in the model inputs.
“You know, social justice in clinical care is really a big issue. I worry about things like this…that it could just perpetuate the injustice and that people of color and people who are marginalized already don’t have access” -Individual 6 (Clinician)
Designers were concerned that the algorithm could further perpetuate an issue that already largely exists: more fortunate individuals are the ones with better and more complete healthcare records.
“The model is better at identifying certain kinds of folks. And so, for instance, we know that the more information we have about people the better,… But for the people who have really scarce data, they might just sort of…start getting pushed into the pile of unknown unknowns…And so like it’s…exacerbating an existing inequity in the system” -Individual 2 (Designer)
Theme 2 model characteristics: Perspective on death and end of life care
Patients said that knowing the algorithm’s output would not impact their end-of-life decisions; they did not think they would substantially change their lives based on these outputs. They felt the prediction might be more helpful for individuals who are already chronically ill with poor prognoses.
“But saying to me you have five years left, is it going to change my life for me? Probably not…I think for people who are very chronically, chronically ill and have very poor prognoses even without anyone to look at it, and if a physician truly knew their patient, [the mortality prediction] might be helpful.” -Individual 15 (Patient)
Clinicians focused on how this algorithm could improve the quality of end-of-life care conversations. They believed it could help relieve some of the pressures surrounding length of patient stay in particular.
“I think that could really help, especially patients who come into the hospital these days are usually really sick, there’s this huge pressure to get them out of the hospital…. if this could be used in a way [for] th[e] patient needing a big picture conversation about where things are going.” -Individual 9 (Clinician)
These views clashed with those of the designers, who believed that most individuals would not know what to do with this information, a belief that contradicts patients who said they would make different financial decisions or have conversations with their families about end-of-life care.
“The reason people perceive it to be different is that that information is used differently. And I mean if someone tells you, you have a 10% risk of getting Alzheimer’s at age 60, most people don’t know what to do with it. Most people don’t know what to do with mortality information” -Individual 4 (Designer)
Theme 3 model characteristics: Transparency and evaluation of efficacy
Here, designers emphasized the need to demonstrate validation rather than the details of the algorithm; they did not believe that a detailed understanding of the model would be helpful.
“I don’t think that doctors are going to understand how the model really works…I think to be able to say…to look back a year these are all the patients that we screened last year, and this is what happened to them, to see the validation” -Individual 9 (Designer)
This contrasted with the patients and the clinicians, who wanted at least an overview or a deeper understanding of how the algorithm works.
“But just understanding the basics…What we know is patients with similar conditions, similar ages, test scores, this is how their disease progressed.” -Individual 15 (Patient)
“I’d want to know what factors it takes into account and a little bit about how they’re weighted, and I’d want to know something about the performance of the algorithm…it would be important to make sure that people were using this algorithm and had access to the numbers from it were given appropriate training in how to contextualize it” -Individual 13 (Clinician)
The endpoints that clinicians and patients asked for (how factors are weighted, disease progression, ages) would need to be prespecified to demonstrate validation. Interpretability from a designer’s perspective, centered on how an AI model produces a result, differs from the level of understanding needed to build trust with clinicians, although there is overlap. What patients and clinicians are asking for is often not a level of interpretability that the designer can provide.
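For a gradient boosted tree model of the kind used here, one crude proxy for which factors the model takes into account and how they are weighted is a feature-importance ranking. The sketch below is illustrative only, using synthetic data and hypothetical feature names; it is not the deployed tool's interpretability output, and importance rankings fall short of the fuller contextualization clinicians asked for.

```python
# Illustrative only: for a gradient boosted tree model, impurity-based
# feature importances are one crude proxy for which factors the model
# uses and how heavily. Data and feature names here are synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "age": rng.integers(18, 95, 500),
    "n_prior_admissions": rng.integers(0, 10, 500),
    "albumin": rng.normal(3.5, 0.6, 500),
})
# Synthetic label loosely tied to age, for demonstration purposes only.
y = (X["age"] + rng.normal(0, 10, 500) > 75).astype(int)

model = GradientBoostingClassifier(random_state=0).fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```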
Theme 4 model implementation: Who should receive ML output?
Clinicians and patients disagreed with the direct-to-general-medicine approach, both because of workflow implications and because of concerns about informed consent and patient wishes.
All patients interviewed expressed they would want this information from a trusted provider, not from a stranger.
“It would best be approached if a PCP or a very trusted clinician referred –ideally with the patient’s prior permission and agreement…my gut says there needs to be a bond of trust with the person that has that first conversation with the patient.” -Individual 14 (Patient)
Palliative care clinicians were concerned about the impact on their workload: if they were acting as the “human screen” for this predictor, is there any added value, or any real difference between an assisted and an autonomous algorithm?
“[Advanced care planning] [is] not an unlimited resource, … to give them an additional sorting function beyond screening each of those two charts that they’re looking at…are we just adding more work and then they’re not going to be able to complete kind of the tasks of advanced care planning for the patients in the time that they’re going to have allotted to spend on this.” -Individual 1 (Clinician)
Theme 5 model implementation: Patient consent and involvement
This theme reflected the greatest value conflicts. Patients wanted to know their prediction: every patient interviewed wanted access to the information available about this prediction. Differences concerned the timing and manner in which this information should be relayed.
“So I would like to be brought into the loop as soon as somebody knows that I will probably die soon” -Individual 14 (Patient)
Clinicians agreed, but with the condition that the information be relayed in conversation. It was important to clinicians not to inform patients without context for what the prediction meant for them. Their end goal was geared more toward quality-of-life concerns than toward the timeline of death.
“I think [patients should have access to their mortality prediction] but not in a vacuum…yes with a conversation…it’s important for us to at least start having a conversation so that we understand the…goals and preferences towards the end of life” -Individual 13 (Clinician)
Designers did not agree with informing patients because, from their perspective, the algorithm output is not an accurate prediction of an individual patient’s mortality.
“In my view, it would be very irresponsible to show this prediction to the patient directly because this tool is just a tool….it is certainly not a well-calibrated probability of the person actually dying. So presenting this as…the probability of the patient’s mortality to the patient would be wrong and also irresponsible” -Individual 8 (Designer)
Furthermore, designers thought that because most individuals could already benefit from palliative care, and these consults were seen as a nonharmful intervention, the low pretest probability made it a good “initial test case” for machine learning.
“So one of the reasons why this is attractive to us as an initial test case for using ML or machine learning in clinical medicine is because we’re really talking about providing extra care to a patient in a sense, right, just like extra attention.” – Individual 2 (Designer)
Theme 6 intervention benefits and harms: External pressures and study integrity
Both clinicians and designers had concerns about external pressures, most notably portrayal in the popular press. These are not typical study pressures (for example, patient enrollment numbers or outcomes); they arise because the ML algorithm, and the study around it, sit in a more public setting.
“So let me tell you the story that you’re going to tell the media when this family comes back and says I got a bill for my dying loved one when she died in the hospital…on observation, because they wouldn’t admit her because they thought she had less than 24 hours to live.” -Individual 1 (Clinician)
“We saw this happen…eight years ago or so, with the whole death panels and how easy it is to spin a positive intervention… to twist and it became this, oh no, doctors are now deciding for themselves who gets to live and who dies.”– Individual 12 (Clinician)
“So let’s say extreme situation…[someone] happens to be a board member and they get offended that somebody talked to be about dying, we’d have a PR nightmare on our hands” -Individual 4 (Designer)
Participants posited that the model may need protections similar to those of a randomized controlled trial to prevent such blowback: prespecified endpoints for auditability and some form of transparency with the clinicians using the information.
Theme 7 intervention benefits and harms: Palliative care legitimization
Finally, an unanticipated benefit of this predictor was what we termed “palliative care legitimization.” Palliative care clinicians and patients both mentioned that the algorithm has the unique ability, through automation, to provide added legitimacy to the profession, especially with regard to more “old school” physicians who would not usually order ACP consults.
“My experience has been the palliative team is not…they’re amazing, but they’re not held as in high esteem as the cardiac surgeons, or the neurosurgeons, or the trauma surgeons, so they don’t necessarily ask for a consult from Palliative Care” -Individual 15 (Patient)
“So maybe able to get access to patients who traditionally wouldn’t get to me because of, again, that still like human element of preconceived ideas around Palliative Care and concerns around Palliative Care…automating like the identification of patient who could benefit, we now get to see that patient who previously we would never get access to.” -Individual 1 (Clinician)
Palliative care has struggled for years as a discipline, often facing perceptions from other medical disciplines that engaging in palliative care is an abdication of care.29 While the algorithm is not responsible for this, its deployment clearly interacts with this latent value system. Theme 7 is a values-collision in that it reflects intraclinician conflict, and it fits with already-known concerns about the field of palliative care and its efforts to achieve parity with other disciplines.
DISCUSSION
Almost certainly, ML-HCAs will have a substantial impact on healthcare processes, quality, cost, and access, and in so doing will raise specific and perhaps unique ethical considerations and concerns in the healthcare context.45–49 This has been the case in nonhealthcare contexts,5,50 where ML implementation has generated mounting scrutiny due to scandals regarding how large repositories of private data have been sold and used,51 how the ML design of algorithmic flight controls resulted in accidents,52 and how computer-assisted prison sentencing guidelines perpetuate racial bias,53 to name but a few of a growing number of examples. Specifically for ML-HCAs, a variety of ethical considerations and concerns have been cited, such as bias arising from training data,8 the privacy of personal data in business arrangements,54 ownership of the data used to train ML-HCAs,55 and accountability for an ML-HCA’s failings.56 Notably, no systematic approach has yet emerged for identifying specific ethical concerns arising from actual ML-HCAs, an emerging, complex, cross-disciplinary technology that potentially affects many aspects of healthcare.
In this case study of an ML-HCA for ACP, the “values-collision” approach identified multiple themes of disagreement between stakeholders, which mark areas of ethical and values conflict. By identifying affected stakeholder groups, eliciting stakeholders’ views, and identifying where stakeholder groups disagreed, we were able to provide actionable guidance to the ML-HCA design team regarding where consequential ethical challenges were likely to emerge. This approach differs from a checklist or EHR analytic in that it centers on the analysis of stakeholder perspectives.57,58 “Values-collision” screening will certainly require revision through subsequent case studies and will need to be made more efficient to screen multiple ML-HCAs in a timely fashion, but it offers an initial approach to identifying ethical challenges with an ML-HCA before such challenges become consequential.
From our findings, the design team was able to prioritize needed efforts focused on: (1) examining alternative strategies for delivering mortality predictions into the workflow (ie, directly to patients or to hospitalist clinicians); (2) explicitly clarifying to clinicians and patients that the mortality prediction was evaluated only as a surrogate for predicting need for ACP, not for other mortality-related decisions, and possibly renaming the prediction “ACP recommended” rather than “mortality prediction”; and (3) shielding their ongoing research into mortality prediction from social media scrutiny until endpoint-driven studies were completed (ie, enacting protections similar to blinded clinical trials).60 These 3 measures were reported back to the design team because they were actionable and drawn from the 5 key “values-collisions” we identified. For the informatics team, these findings demonstrated the need for multiple stakeholder perspectives on tool development, as well as for protecting design development from external pressures in practice. These are takeaways that can be adapted for future stakeholder engagement around larger sets of “values-collisions.”
In this case study, the “values-collision” approach appears to help proactively and systematically identify ethical considerations with a specific ML-HCA and to facilitate interdisciplinary dialogue and collaboration to better understand, and subsequently manage, the ethical implications of an ML-HCA before and during its early deployment. Where ethical concerns arose, we were able to draw on ethical scholarship and work with designers and users to address such challenges before they became consequential. This scholarship includes discussion of traditional bioethical principles such as beneficence and distributive justice. For example, while the ML mortality predictions could be delivered directly to general medicine clinicians, findings from the SUPPORT trial59 suggested nonpalliative care clinicians may not act on mortality predictions for ACP. In addition, incentive pressures to meet quality metric indicators43 were perceived by the design team as having possible unintended ethical consequences, such as guiding treatment options or decisions around admission (ie, choosing not to admit patients with a high likelihood of near-term mortality so as not to count against hospital 30-day mortality rates), rather than being used to guide ACP. Mortality predictions could also be given directly to patients, though that too could give rise to ethical concerns, particularly around patients’ ability to understand such a prediction without contextualization, as with any screening test result,60 as well as the significant emotional effects of receiving a mortality prediction without appropriate support (such as the recent public outcry around physicians delivering poor prognoses via video-link61).
Future studies are needed to clarify when further ethical analyses should be conducted as this (or any) ML-HCA is revised and deployed more broadly. Optimizing the timing of ethical analysis will also need iterative study but should occur in the “sweet spot” of the Collingridge dilemma: late enough that the ML-HCA’s impacts can be predicted, but not so late that the ethical problems have already become entrenched.62 Additionally, research should focus on streamlining the ethical analysis process and making it more efficient (whether questions can be delivered via survey, which questions are of highest yield, and the optimal number of stakeholder assessments needed). Including end-users in the initial development of the mortality prediction ML tool would also be an important takeaway for future studies.
Limitations of this study included a small number of participants and limited demographic information. As with all qualitative studies, there are limits to generalizability. However, this framework methodology provides a standardizable approach for identifying ethical challenges with an ML-HCA, which is needed if machine learning tools are to fulfill their promise to improve care for patients and families.
ACKNOWLEDGMENTS
Thank you to our study participants for their time and thoughtful comments. Thank you to Sarah Wieten for her assistance with participant interviews.
Contributor Information
Diana Cagliero, Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada.
Natalie Deuitch, Department of Genetics, Stanford University School of Medicine, Stanford, California, USA; National Institutes of Health, National Human Genome Research Institute, Bethesda, Maryland, USA.
Nigam Shah, Center for Biomedical Informatics Research, Stanford University School of Medicine, Palo Alto, California, USA.
Chris Feudtner, The Department of Medical Ethics, The Children’s Hospital of Philadelphia, Philadelphia, Pennsylvania, USA; Departments of Pediatrics, Medical Ethics and Healthcare Policy, The Perelman School of Medicine, The University of Pennsylvania, Philadelphia, Pennsylvania, USA.
Danton Char, Division of Pediatric Cardiac Anesthesia, Department of Anesthesiology, Stanford University School of Medicine, Stanford, California, USA; Center for Biomedical Ethics, Stanford University School of Medicine, Stanford, California, USA.
FUNDING
Stanford Human-Centered Artificial Intelligence Seed Grant.
AUTHOR CONTRIBUTIONS
Conception and design: DSC, ND, NS, and CF. Data Acquisition: ND and DSC. Data Analysis: ND, DAC, and DSC. Data Interpretation: all authors. Drafting of Manuscript: DSC, DAC, and ND. Critical revision of manuscript: all authors.
SUPPLEMENTARY MATERIAL
Supplementary material is available at Journal of the American Medical Informatics Association online.
CONFLICT OF INTEREST STATEMENT
None declared.
DATA AVAILABILITY
The data underlying this article are available in the article and will be shared on reasonable request to the corresponding author.
REFERENCES
- 1. Abràmoff MD, Lavin PT, Birch M, et al. Pivotal trial of an autonomous AI-based diagnostic system for detection of diabetic retinopathy in primary care offices. NPJ Digital Med 2018; 1: 39.
- 2. SFR-IA Group. Artificial intelligence and medical imaging 2018: French Radiology Community White Paper. Diagn Intervent Imaging 2018; 99 (11): 727–42.
- 3. Abràmoff MD, Tobey D, Char DS. Autonomous AI: finding a safe, efficacious, and ethical path through the development process. Am J Ophthalmol 2020; 214: 134–42.
- 4. Office of the Commissioner. FDA permits marketing of artificial intelligence-based device to detect certain diabetes-related eye problems. FDA. http://www.fda.gov/news-events/press-announcements/fdapermits-marketing-artificial-intelligence-based-device-detect-certain-diabetesrelated-eye. Accessed February 20, 2020.
- 5. Char DS, Shah NH, Magnus D. Implementing machine learning in health care - addressing ethical challenges. N Engl J Med 2018; 378 (11): 981–3.
- 6. Ho A. Deep ethical learning: taking the interplay of human and artificial intelligence seriously. Hastings Cent Rep 2019; 49 (1): 36–9.
- 7. Rigby MJ. Ethical dimensions of using artificial intelligence in health care. AMA J Ethics 2019; 21 (2): E121–124.
- 8. Challen R, Denny J, Pitt M, et al. Artificial intelligence, bias and clinical safety. BMJ Qual Saf 2019; 28 (3): 231–7.
- 9. Matheny ME, Whicher D, Israni ST. Artificial intelligence in health care: a report from the National Academy of Medicine. JAMA 2020; 323 (6): 509–10.
- 10. Murphy K, Di Ruggiero E, Upshur R, et al. Artificial intelligence for good health: a scoping review of the ethics literature. BMC Med Ethics 2021; 22 (1): 1–7.
- 11. Adler-Milstein, et al. Meeting the Moment: Addressing Barriers and Facilitating Clinical Adoption of Artificial Intelligence in Medical Diagnosis. NAM; September 2022. https://nam.edu/meeting-the-moment-addressing-barriers-and-facilitating-clinical-adoption-of-artificial-intelligence-in-medical-diagnosis/. Accessed October 20, 2022.
- 12. European Commission. Ethics Guidelines for Trustworthy AI. https://ec.europa.eu/futurium/en/ai-alliance-consultation.1.html. Accessed October 20, 2022.
- 13. Solomonides AE, Koski E, Atabaki SM, et al. Defining AMIA’s artificial intelligence principles. J Am Med Inform Assoc 2022; 29 (4): 585–91.
- 14. Morley J, Machado CC, Burr C, et al. The ethics of AI in health care: a mapping review. Soc Sci Med 2020; 260: 113172.
- 15. Vayena E, Blasimme A, Cohen IG. Machine learning in medicine: addressing ethical challenges. PLoS Med 2018; 15 (11): e1002689.
- 16. Bakken S. The imperative of applying ethical perspectives to biomedical and health informatics. J Am Med Inform Assoc 2022; 29 (8): 1317–8.
- 17. Stenmark CK, Antes A, Thiel E, et al. Consequences identification in forecasting and ethical decision-making. J Emp Res Hum Res Ethics 2011; 6 (1): 25–32.
- 18. Obermeyer Z, Powers B, Vogeli C, et al. Dissecting racial bias in an algorithm used to manage the health of populations. Science 2019; 366 (6464): 447–53.
- 19. Rajkomar A, Hardt M, Howell MD, et al. Ensuring fairness in machine learning to advance health equity. Ann Intern Med 2018; 169 (12): 866–72.
- 20. Shah NH, Milstein A, Bagley SC. Making machine learning models clinically useful. JAMA 2019; 322 (14): 1351–8.
- 21. Miller K. How Do We Ensure that Healthcare AI is Useful? Stanford HAI. https://hai.stanford.edu/news/how-do-we-ensure-healthcare-ai-useful. Accessed September 17, 2022.
- 22. Char DS, Abràmoff MD, Feudtner C. Identifying ethical considerations for machine learning healthcare applications. Am J Bioethics 2020; 20 (11): 7–17.
- 23. Shilton K. Values and ethics in human-computer interaction. Found Trends Hum Comput Interact 2018; 12 (2): 107–71.
- 24. Brey PAE. Anticipatory ethics for emerging technologies. Nanoethics 2012; 6: 1–13.
- 25. Dumanovsky T, Augustin R, Rogers M, et al. The growth of palliative care in U.S. hospitals: a status report. J Palliat Med 2016; 19 (1): 8–15.
- 26. Palmer M, Jacobson M, Enguidanos S. Advance care planning for Medicare beneficiaries increased substantially, but prevalence remained low. Health Affairs 2021; 40: 4.
- 27. Lindvall C, Cassel C, Pantilat S, et al. Ethical considerations in the use of AI mortality predictions in the care of people with serious illness. Health Affairs 2020; 20200911.401376.
- 28. Farrell T, Francis L, Brown T, et al. Rationing limited healthcare resources in the COVID-19 era and beyond: ethical considerations regarding older adults. J Am Geriatr Soc 2020; 68: 1143–9.
- 29. Avati A, Jung K, Harman S, et al. Improving palliative care with deep learning. BMC Med Inform Decis Mak 2018; 18: 122.
- 30. Sim I. Two ways of knowing: big data and evidence-based medicine. Ann Intern Med 2016; 164 (8): 562.
- 31. Obermeyer Z, Emanuel EJ. Artificial intelligence and the augmentation of health care decision-making. N Engl J Med 2016; 375: 1216–9.
- 32. White N, Kupeli N, Vickerstaff V, et al. How accurate is the ‘Surprise Question’ at identifying patients at the end of life? A systematic review and meta-analysis. BMC Med 2017; 15: 139.
- 33. Ullström S, Andreen Sachs M, Hansson J, et al. Suffering in silence: a qualitative study of second victims of adverse events. BMJ Qual Saf 2014; 23: 325–31.
- 34. Olson PTJ, Brasel KJ, Redmann AJ, et al. Surgeon-reported conflict with intensivists about postoperative goals of care. JAMA Surg 2013; 148: 29–35.
- 35. Christensen JF, Levinson W, Dunn PM. The heart of darkness: the impact of perceived mistakes on physicians. J Gen Intern Med 1992; 7 (4): 424–31.
- 36. Yoon JD, Rasinski KA, Curlin FA. Conflict and emotional exhaustion in obstetricians-gynaecologists: a national survey. J Med Ethics 2010; 36 (12): 731–5.
- 37. Lemaire JB, Wallace JE. Not all coping strategies are created equal: a mixed methods study exploring physicians' self reported coping strategies. BMC Health Serv Res 2010; 10: 208.
- 38. Poulin M. Reporting on first sexual experience: the importance of interviewer-respondent interaction. Demogr Res 2010; 22 (11): 237–88.
- 39. Feveile H, Olsen O, Hogh A. A randomized trial of mailed questionnaires versus telephone interviews: response patterns in a survey. BMC Med Res Methodol 2007; 7: 27.
- 40. Strauss A, Corbin J. Basics of Qualitative Research. California: Sage Publications; 1990.
- 41. Clarke A. Situational Analysis: Grounded Theory after the Postmodern Turn. New York: Sage Books; 2005.
- 42. Charmaz K. Constructing Grounded Theory. 2nd ed. California: Sage Publications; 2014.
- 43. Ryan GW, Bernard HR. Techniques to identify themes. Field Methods 2003; 15 (1): 85–109.
- 44. Ancker JS, Benda NC, Reddy M, et al. Guidance for publishing qualitative research in informatics. J Am Med Inform Assoc 2021; 28 (12): 2743–8.
- 45. Obermeyer Z, Emanuel EJ. Predicting the future — big data, machine learning, and clinical medicine. N Engl J Med 2016; 375 (13): 1216–9.
- 46. Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med 2019; 380 (14): 1347–58.
- 47. Maddox T, Rumsfeld J, Payne P. Questions for artificial intelligence in health care. JAMA 2019; 321 (1): 31–2.
- 48. Matheny MS, Thadaney I, Ahmed M, et al. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. The Learning Health System Series. Washington, DC: National Academy of Medicine; 2019.
- 49. Matheny ME, Whicher D, Israni ST. Artificial intelligence in health care: a report from the National Academy of Medicine. JAMA 2020; 323 (6): 509–10.
- 50. Bostrom N, Yudkowsky E. The ethics of artificial intelligence. In: Frankish K, Ramsey WM, eds. The Cambridge Handbook of Artificial Intelligence. Cambridge, UK: Cambridge University Press; 2011: 316–34.
- 51. Rosenberg M, Frenkel S. Facebook’s Role in Data Misuse Sets Off Storms on Two Continents. The New York Times. March 18, 2018. https://www.nytimes.com/2018/03/18/us/cambridge-analytica-facebook-privacy-data.html.
- 52. Nicas J, Glanz J, Gelles D. In Test of Boeing Jet, Pilots Had 40 Seconds to Fix Error. The New York Times. March 25, 2019. https://www.nytimes.com/2019/03/25/business/boeing-simulation-error.html.
- 53. Angwin J, Larson J, Mattu S, et al. Machine Bias: There’s Software Used across the Country to Predict Future Criminals. And It’s Biased against Blacks. ProPublica. May 23, 2016. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing.
- 54. Comfort N. The overhyping of precision medicine. The Atlantic. 2016. https://www.theatlantic.com/health/archive/2016/12/the-peril-of-overhyping-precision-medicine/510326/.
- 55. Ornstein C, Thomas K. Sloan Kettering’s Cozy Deal with Start-Up Ignites a New Uproar. The New York Times. September 20, 2018. https://www.nytimes.com/2018/09/20/health/memorial-sloan-kettering-cancer-paige-ai.html.
- 56. Ross C, Swetlitz I. IBM Pitched Watson as a Revolution in Cancer Care. It’s Nowhere Close. Stat. September 5, 2017. https://www.statnews.com/2017/09/05/watson-ibm-cancer/.
- 57. Wang HEE, Landers M, Adams R, et al. A bias evaluation checklist for predictive models and its pilot application for 30-day hospital readmission models. J Am Med Inform Assoc 2022; 29 (8): 1323–33.
- 58. Estiri H, Strasser ZH, Rashidian S, et al. An objective framework for evaluating unrecognized bias in medical AI models predicting COVID-19 outcomes. J Am Med Inform Assoc 2022; 29 (8): 1334–41.
- 59. Connors AF, Dawson NV, Desbiens NA, et al. A controlled trial to improve care for seriously ill hospitalized patients: the study to understand prognoses and preferences for outcomes and risks of treatments (SUPPORT). JAMA 1995; 274 (20): 1591–8.
- 60. Irwig L, McCaffery K, Salkeld G, et al. Informed choice for screening: implications for evaluation. BMJ 2006; 332: 1148.
- 61. Jacobs J. Doctor on Video Screen Told a Man He Was Near Death, Leaving Relatives Aghast. The New York Times. https://www.nytimes.com/2019/03/09/science/telemedicine-ethical-issues.html. Accessed August 15, 2022.
- 62. Collingridge D. The Social Control of Technology. London: St. Martin's Press; 1980.