Abstract
Built-in decision thresholds for AI diagnostics are ethically problematic: patients may differ in how they weigh the risks of false-positive and false-negative results, which will require that clinicians assess patient values.
Recent years have seen a surge of interest in medical applications of machine learning and artificial intelligence (AI) — and in the ethics of these applications1,2. In the near future, it is unlikely that AI technology will replace human decision-makers, but it is likely to assist human decision-makers in many contexts3. For example, machine learning can assist with the diagnosis of cancer from imaging data4, the mapping of the boundaries of tumors5, the treatment of sepsis6, the classification of field-based rapid HIV tests7 and the diagnosis of cognitive-motor dissociation in unresponsive patients with brain injuries8.
Probabilistic classifiers
Many AI methods with strong promise in medical imaging are probabilistic classifiers: their native output is a probability. This includes naive Bayes classifiers, decision trees and multiple neural-network-based approaches9. For example, a probabilistic classifier for diagnosing cancer from oncological image data will generate a probability that cancer is present. This leads to an important design ‘choice point’. The software provided to clinicians may hide the raw probabilities and build in decision thresholds for converting probabilities into recommendations, so that a probability of cancer above 70% (for instance) gives the output ‘recall recommended’. Alternatively, the software may provide the raw probabilities, with the burden falling on clinicians to take the probabilities into account when making decisions. A third possibility is that the software might incorporate information about the patient’s own attitudes about risk and the value (positive or negative) that they assign to different possible outcomes, together with the probabilities estimated by the algorithm, so as to generate a recommendation tailored to this particular patient.
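To make the contrast between these three design options concrete, the sketch below shows how each might surface the same classifier output. It is purely illustrative: the function names, the 70% cut-off and the example probability are hypothetical, and the third option is developed further below.

```python
# Illustrative sketch (not a deployed system): three ways software could
# surface the output of a probabilistic classifier.

def builtin_threshold_recommendation(p_cancer: float, threshold: float = 0.70) -> str:
    """Option 1: hide the probability behind a fixed, built-in threshold."""
    return "recall recommended" if p_cancer >= threshold else "no recall recommended"

def raw_probability_output(p_cancer: float) -> str:
    """Option 2: report the probability itself, leaving the decision to the
    clinician and patient."""
    return f"estimated probability of cancer: {p_cancer:.0%}"

def value_sensitive_recommendation(p_cancer: float, patient_threshold: float) -> str:
    """Option 3: apply a threshold tailored to this patient's own values and
    risk tolerance (how such a threshold might be derived is sketched in a
    later example)."""
    return "recall recommended" if p_cancer >= patient_threshold else "no recall recommended"

print(builtin_threshold_recommendation(0.65))      # 'no recall recommended' at the fixed 70% cut-off
print(raw_probability_output(0.65))                # 'estimated probability of cancer: 65%'
print(value_sensitive_recommendation(0.65, 0.40))  # 'recall recommended' for a patient who fears missed cancers
```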
The first option may seem appealing on the grounds of simplicity and ease of use. However, there are ethical reasons that developers of medical AI should take one of the other two options. Clinical decision-makers should be provided either with probabilistic outputs or with a recommendation that takes both the probabilities and the patient’s values and tolerance for risk into account. This is because in clinical settings, there can be no one-size-fits-all decision threshold. From an ethical point of view, it is appropriate for the decision threshold to be sensitive to the values and attitudes toward risk that this particular patient holds.
Decision thresholds
The case against built-in decision thresholds can be made with the example of breast cancer screening. When a human reads a mammogram as part of an initial screening protocol, their output is a decision: recall the patient (or not) for further investigation, typically a biopsy. By contrast, the output of several AI systems is a “continuous score that represents the likelihood of cancer being present”4. This probability can be converted into a decision in a range of ways, depending on where (between 0% and 100%) the critical threshold that triggers a recall is set.
Different decision thresholds can be imposed for different purposes. A lower threshold could lead to more false-positive diagnoses, whereas a higher threshold could lead to more false-negative diagnoses, in which cancer is missed. Choosing the decision threshold requires weighing the disvalue of each kind of error, which in turn requires context-sensitive value judgements10–12.
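The following sketch, using invented probabilities and case labels, illustrates how moving the threshold trades one kind of error against the other on a small set of hypothetical cases.

```python
# Illustrative sketch with invented data: how moving the decision threshold
# trades false positives against false negatives.

cases = [
    # (classifier probability of cancer, cancer actually present?)
    (0.05, False), (0.20, False), (0.35, True), (0.45, False),
    (0.60, True), (0.75, False), (0.85, True), (0.95, True),
]

for threshold in (0.30, 0.50, 0.70):
    false_positives = sum(1 for p, has_cancer in cases if p >= threshold and not has_cancer)
    false_negatives = sum(1 for p, has_cancer in cases if p < threshold and has_cancer)
    print(f"threshold {threshold:.0%}: "
          f"{false_positives} false positive(s), {false_negatives} false negative(s)")
```

On this invented data, the 30% threshold produces two false positives and no false negatives, while the 70% threshold produces one false positive and two false negatives: neither setting is 'correct' without a judgement about which error matters more.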
The programmer of the algorithm could choose a single threshold, set at the same level for all patients. This would be suboptimal for patient autonomy, as the patient’s own values would be ignored13. Instead, the algorithm could produce a raw continuous score, leaving the threshold judgement to be made by the clinician and the patient together (pathway A in Fig. 1). Alternatively, the algorithm could take account of the patient’s own values and risk tolerances when recommending a threshold, perhaps by calculating a utility measure that takes risk tolerance into account14 (pathway B in Fig. 1).
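One standard way to implement pathway B is to derive a patient-specific threshold from the values (utilities) the patient attaches to the four possible outcomes, recommending recall whenever its expected utility is at least that of no recall. The sketch below uses plain expected utility for simplicity; a risk-weighted measure of the kind discussed in reference 14 would modify the calculation, and all utility numbers shown are hypothetical.

```python
# Illustrative sketch: deriving a patient-specific recall threshold from
# hypothetical utilities assigned to the four possible outcomes.
# Recall is recommended when its expected utility is at least that of no recall:
#   p*u_tp + (1 - p)*u_fp  >=  p*u_fn + (1 - p)*u_tn
# which rearranges to p >= (u_tn - u_fp) / ((u_tn - u_fp) + (u_tp - u_fn)).

def patient_threshold(u_tp: float, u_fp: float, u_fn: float, u_tn: float) -> float:
    """Probability of cancer above which recall maximizes expected utility."""
    return (u_tn - u_fp) / ((u_tn - u_fp) + (u_tp - u_fn))

# A patient who regards a missed cancer (false negative) as far worse than an
# unnecessary biopsy (false positive) gets a low threshold ...
print(patient_threshold(u_tp=0.0, u_fp=-1.0, u_fn=-20.0, u_tn=0.0))   # ~0.048, recall at roughly 5%

# ... whereas a patient who weighs the harms of over-diagnosis more heavily
# gets a higher one.
print(patient_threshold(u_tp=0.0, u_fp=-5.0, u_fn=-10.0, u_tn=0.0))   # ~0.33, recall at roughly 33%
```

On this approach the threshold is not chosen by the programmer at all: it falls out of the patient's own valuations of the possible outcomes.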
Risk-value profiles
There are various ways to acquire information about a patient’s values. One would be for patients, at the time of screening or earlier, to be given a risk-profiling questionnaire that probes their values and attitudes about different outcomes, which would allow the construction of a risk-value profile for the patient. This idea is inspired by the risk-profiling questionnaires commonly used in finance. These questionnaires typically ask investors for their level of agreement with statements about risk, such as “Generally, I prefer investments with little or no fluctuation in value, and I’m willing to accept the lower return associated with these investments”15.
A risk-profiling questionnaire suitable for cancer screening would probe the patient’s attitudes about the risks of over-diagnosis, false-positive and false-negative results, and over-treatment versus under-treatment, as well as the value the patient places on additional years of life of varying quality. The questionnaire might also ask patients to respond to statements such as ‘I would rather risk surgical complications to treat a benign tumor than risk missing a cancerous tumor’.
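Purely as an illustration of how questionnaire responses might be converted into a risk-value profile, the sketch below maps hypothetical Likert-scale answers onto relative disvalues for false-positive and false-negative results. The item names, scoring and mapping are not a validated instrument and would require the consultation and trialing discussed below.

```python
# Illustrative sketch: turning hypothetical questionnaire responses
# (1 = strongly disagree ... 5 = strongly agree) into a simple risk-value
# profile. The items, scoring and mapping are invented for illustration.

from dataclasses import dataclass

@dataclass
class RiskValueProfile:
    disvalue_false_positive: float   # harm the patient assigns to an unnecessary recall/biopsy
    disvalue_false_negative: float   # harm the patient assigns to a missed cancer

def profile_from_responses(responses: dict[str, int]) -> RiskValueProfile:
    """Map Likert responses onto relative disvalues on an arbitrary 0-10 scale."""
    # Agreement with 'I would rather risk surgical complications to treat a
    # benign tumor than risk missing a cancerous tumor' shifts weight toward
    # avoiding false negatives.
    prefers_catching_cancer = responses["rather_treat_benign_than_miss_cancer"]   # 1-5
    worried_about_overdiagnosis = responses["worried_about_overdiagnosis"]        # 1-5
    return RiskValueProfile(
        disvalue_false_positive=2.0 * worried_about_overdiagnosis,
        disvalue_false_negative=2.0 * prefers_catching_cancer,
    )

profile = profile_from_responses({
    "rather_treat_benign_than_miss_cancer": 5,
    "worried_about_overdiagnosis": 2,
})
print(profile)   # RiskValueProfile(disvalue_false_positive=4.0, disvalue_false_negative=10.0)
```

These disvalues could then feed into a threshold calculation of the kind sketched above, for example with u_fp and u_fn set to the negatives of the two disvalues.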
Patient support
A recent large study investigating the Dutch population’s views on the use of AI for the diagnostic interpretation of screening mammograms16 found that the general population does not currently support fully independent use of AI without the involvement of a human radiologist. This suggests that it would not be appropriate, at the present time, to pursue a decision pathway in which there is no role for a human reader, even if the AI does take the patient’s values into account (as in pathways E and F in Fig. 1).
In principle, the attitudes of patients, who have relevant lived experience and a personal stake in the matter, could differ from those of the general population. However, a survey of United States–based patient-advocacy groups, carried out by the authors of this Comment (with institutional review board approval from Washington University in St Louis), also found resistance to clinical decisions being made without human input17. Most respondents were comfortable with some involvement for AI, but only when there was also a human reader. Feedback included such comments as “I see AI as a tool to assist clinicians in medical decisions. I do not see it as being able to make decisions that effectively weigh my personal input or really have the clinical experience and intuition of a good physician.” Another respondent commented that AI “could be a valuable tool, but combined with physicians’ expertise and consultation with patients.” There were many other comments along the same lines.
In addition to their concerns about the potential loss of human input with the advent of AI, most respondents were concerned by the thought that uncertainty about their screening image might be hidden from them, either by a human reader or by an algorithm applying a decision threshold. Moreover, most were concerned by the prospect of the uncertainty being managed (either by a human reader or by an algorithm) without any consideration of their values and preferences (as in pathways C and D in Fig. 1). One of the risks of increased use of AI is that these fears (of uncertainty being hidden, and of one’s values being ignored) will become more acute. But there is also a corresponding opportunity, because AI allows the possibility of varying decision thresholds in a transparent way that demonstrates sensitivity to these concerns.
Implementation challenges
There are a number of implementation challenges with the proposed decision pathways A and B (Fig. 1) that will require further research. First, the design of any survey to elicit a risk-value profile would require consultation and trialing, given the potential for framing effects to influence the patient’s answers. A good design would also build in appropriate over-rides. Returning to the financial analogy, a good risk-profiling questionnaire will allow an investor to communicate that even though they usually have an appetite for high risk, they want to be cautious on this particular occasion (or vice versa). Similarly, a good questionnaire for probing patients’ values would allow a patient to state that even though they would usually assign great disvalue to a false-positive result, their priority on this occasion is to avoid a false-negative result (or vice versa). Patients must also be properly informed about the relevant concepts. Many patients are unfamiliar with the concept of over-diagnosis and therefore may be unable to weigh the relative risk of unnecessary diagnosis and treatment against the risk of failing to discover a cancer18. Furthermore, patients may not always have preferences about such outcomes. There must still be a sensible default decision threshold that can be used in cases in which patients choose to withhold their attitudes or simply have no preferences.
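How over-rides and defaults might fit together can be sketched as follows; the precedence order and the 50% default are illustrative assumptions, not recommendations.

```python
# Illustrative sketch: choosing which threshold to apply, with a per-visit
# over-ride taking precedence over the stored profile, and a default used
# when the patient has withheld preferences or has none.

DEFAULT_THRESHOLD = 0.50   # placeholder population-level default, for illustration only

def threshold_for_visit(profile_threshold: float | None,
                        visit_override: float | None) -> float:
    if visit_override is not None:       # 'on this occasion I want to be cautious'
        return visit_override
    if profile_threshold is not None:    # threshold derived from the stored risk-value profile
        return profile_threshold
    return DEFAULT_THRESHOLD             # patient withheld preferences or has none

print(threshold_for_visit(profile_threshold=0.30, visit_override=None))   # 0.3, from the stored profile
print(threshold_for_visit(profile_threshold=0.30, visit_override=0.70))   # 0.7, the per-visit over-ride wins
print(threshold_for_visit(profile_threshold=None, visit_override=None))   # 0.5, the sensible default
```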
There is also a danger of exaggerating the precision of the probabilities. If the dataset used to train the algorithm was small or non-representative, a probability range may be a more reasonable output than a precise probability. There is also a risk that clinicians will be unwilling or unprepared to take the patient’s risk-value profile into account. These recommendations would create a more complex decision task for clinicians than reliance on a pre-programmed threshold; this emphasizes the importance of training in the use of AI in clinical settings and the co-design of diagnostic devices with physicians and patients.
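One way to avoid exaggerating precision, sketched below with invented numbers, is for the software to report a probability range and to withhold an automatic recommendation whenever that range straddles the patient-specific threshold, deferring instead to the clinician and patient.

```python
# Illustrative sketch: when the classifier's output is better expressed as a
# probability range, defer to the clinician and patient whenever the range
# straddles the decision threshold.

def recommend_from_interval(p_low: float, p_high: float, threshold: float) -> str:
    if p_low >= threshold:
        return "recall recommended"
    if p_high < threshold:
        return "no recall recommended"
    return "uncertain: range straddles the threshold; discuss with clinician and patient"

print(recommend_from_interval(0.55, 0.80, threshold=0.40))   # recall recommended
print(recommend_from_interval(0.25, 0.60, threshold=0.40))   # uncertain: range straddles the threshold ...
```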
These challenges notwithstanding, tailoring decision thresholds to the patient through the use of information about the patient’s values and attitude about risk is vastly preferable to leaving them to be fixed by the software developer in a one-size-fits-all manner.
Table 1 | Comparison of the decision pathways shown in Fig. 1

| Decision pathway from Fig. 1 | Sensitive to patients’ values and preferences? | Likely to have patient support? | Concerns |
|---|---|---|---|
| A | Yes | Yes | Implementation challenges (discussed in main text) |
| B | Yes | Yes | Implementation challenges (discussed in main text) |
| C | No | No | Despite human input, patients’ values and preferences are not considered when managing uncertainty, eroding patient trust |
| D | No | No | Despite human input, patients’ values and preferences are not considered when managing uncertainty, eroding patient trust |
| E | No | No | Loss of human input, combined with failure to consider patients’ values and preferences, is likely to severely undermine trust |
| F | Yes | No | Although patients’ values and preferences are considered, loss of human input is likely to undermine patient trust at the present time |
Acknowledgements
We thank S. Bhalla, L. Kofi Bright, A. Houston, L. Hudetz, R. Short, J. Swamidass, K. Vredenburgh, Z. Ward, K. Wright and patient groups at Washington University in St Louis, Stanford University and Johns Hopkins University for their input and advice. A.K.J. acknowledges support from the National Institute of Biomedical Imaging and Bioengineering of the US National Institutes of Health (R01-EB031051 and R56-EB028287).
Competing interests
The authors declare no competing interests.
References
- 1. McCradden, M. D. et al. Nat. Med. 26, 1325–1326 (2020).
- 2. Grote, T. & Berens, P. J. Med. Ethics 46, 205–211 (2019).
- 3. Tschandl, P. et al. Nat. Med. 26, 1229–1234 (2020).
- 4. McKinney, S. M. et al. Nature 577, 89–94 (2020).
- 5. Liu, Z. et al. Phys. Med. Biol. 66, 124002 (2021).
- 6. Komorowski, M. et al. Nat. Med. 24, 1716–1720 (2018).
- 7. Turbé, V. et al. Nat. Med. 27, 1165–1170 (2021).
- 8. Claassen, J. et al. N. Engl. J. Med. 380, 2497–2505 (2019).
- 9. Hastie, T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).
- 10. Plutynski, A. in Exploring Inductive Risk: Case Studies of Values in Science (eds Elliott, K. C. & Richards, T.) 149–170 (Oxford University Press, 2017).
- 11. Douglas, H. E. Science, Policy, and the Value-Free Ideal (University of Pittsburgh Press, 2009).
- 12. Bright, L. K. Synthese 195, 2227–2245 (2017).
- 13. Wenner, D. M. Int. J. Fem. Approaches Bioeth. 13, 28–48 (2020).
- 14. Buchak, L. J. Med. Ethics 43, 90–95 (2016).
- 15. Pan, C. H. & Statman, M. J. Invest. Consult. 13, 54–63 (2012).
- 16. Ongena, Y. P. et al. J. Am. Coll. Radiol. 18, 79–86 (2021).
- 17. Birch, J., Creel, K., Jha, A. & Plutynski, A. Zenodo https://doi.org/10.5281/zenodo.5589207 (2021).
- 18. Nagler, R. H. et al. Med. Care 55, 879–885 (2017).