
Parrots at the Bedside: Making Surrogate Decisions with Stochastic Strangers

Jonathan Herington, Benzi Kluger

In their recent paper, Earp and co-authors argue for the ethical desirability of personalized patient preference predictors (P4s): large language models (LLMs) fine-tuned on a patient’s “own prior writing and other digitally recorded behaviour” (p5). They claim that a P4 will be a “digital psychological twin” (p4) of the patient, one that “is more likely (than a surrogate) to infer the underlying structure of an individual’s preferences as applied across a range of situations” (p12). Thus, they argue, a P4 can provide surrogates with access to the patient’s treatment preferences, values, and reasoning in “the absence of sufficient advance information to determine first-order treatment preferences directly” (p10).

In this commentary, we argue that P4s are better understood as “stochastic parrots” (Bender et al. 2021) that generate utterances “convincingly similar” (p9) to the patient’s corpus of prior writing without developing underlying representations of their beliefs or values. Because P4s are biased towards the values a patient expressed before recent critical illness events, their input may be an unreliable gauge of the patient’s current wishes, and prior research suggests that simply adding information to these crucial conversations (even if accurate) does little to improve important outcomes (White 2011). Moreover, P4s risk damaging the relational context of surrogate decision-making by appearing as an additional “stranger at the bedside” whose uncanny (but unreliable) mimicry of the patient may downplay the important role of dialogue with healthcare professionals and exacerbate feelings of guilt, shame, and alienation in surrogates.

Can patients predict their future preferences?

Earp and co-authors hinge the ethical desirability of these systems on the claim that they will be better predictors of patient preferences than similarly informed surrogates. While it’s true that many surrogates have difficulty identifying patient preferences, it’s also true that patients themselves are often poor predictors of their future preferences (Morrison, Meier, and Arnold 2021). This is for at least two reasons. First, many surrogate decisions involve choices that the patient has never contemplated. This can occur because patients (and their families) consciously or unconsciously avoid acknowledging potentially distressing future scenarios as a coping strategy (Serrano-Gemes et al. 2021), or because clinicians delay these conversations and fail to provide adequate guidance (Mack and Smith 2012). Second, patients’ preferences often change radically between (and within) admissions. Patients, even those who have undergone advance care planning (ACP), often have serious misconceptions about the nature, necessity, or subjective experience of complex medical interventions prior to experiencing them (Fischer et al. 1998). In fact, it is often a person’s recognition that further interventions will not improve their quality of life, and/or negative experiences with those interventions, that leads them to choose hospice or comfort care when the next crisis arises. When a surrogate steps in, the patient is likely to be sicker, more fragile, more intervention-weary, and older than when they produced most of the corpus upon which a P4 relies. In this respect, preferences expressed outside of an ACP process, and arguably even within one depending on its context and quality, are likely to be unreliable.

To solve these problems, Earp and co-authors propose using more tailored data, including end-of-life questionnaires, discrete choice surveys about medical preferences, and even recordings of advance care planning conversations with physicians (their Table 1). We agree that knowledge of a patient’s explicit medical preferences, expressed as part of an ACP-like process, would likely improve the accuracy of the system. Unfortunately, it is unclear what additional benefit a P4 offers over and above simply providing these ACP-like documents to the surrogate as part of the shared decision-making process.

Digital psychological twins or stochastic parrots?

If previously stated preferences are unreliable, perhaps P4s can give us insight into “the underlying structure of an individual’s preferences” (p12). On this view, P4s are more than predictors of what a patient would say: they are “digital psychological twins” (p4) that possess detailed insight into the implicit, unstated beliefs or values of the patient. Thus, Earp et al. claim, they may be able to guide us when we lack explicit information about patient preferences.

This relies, however, on a highly controversial view of the capacities of LLMs. Recall that LLMs are designed and trained to be “stochastic parrots” (Bender et al. 2021) that identify the (very complex) statistical associations between words (or other “tokens”) in the training data. Evidence is thin for the hypothesis that LLMs internally simulate the mental content (beliefs, values, preferences) implied by the utterances they generate. As London (2024) notes, where studies carefully differentiate syntactic pattern-recognition (i.e., parroting) from the representation of underlying constructs (i.e., digital psychology), LLMs’ apparent capacities for causal reasoning, planning (Valmeekam et al. 2023), and theory of mind (Kim et al. 2023) disappear. Thus, if P4s are simply stochastic parrots that simulate a person’s (past) utterances, then it’s hard to see how they can infer or represent our deeper values.
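To make this concrete, consider what “fine-tuning on a patient’s prior writing” amounts to at the level of the training objective. The sketch below is our own minimal illustration, not a system described by Earp et al.: it uses the open-source transformers library with a small generic model as a stand-in, and the patient corpus is hypothetical. The point it shows is that the quantity being minimized is next-token cross-entropy on the surface form of the patient’s words; no term in the objective refers to the patient’s beliefs, values, or preferences.

```python
# Minimal sketch of causal-LM fine-tuning on a (hypothetical) patient corpus.
# Assumes the Hugging Face transformers library; "gpt2" is a stand-in model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical stand-in for a patient's prior writing.
patient_corpus = [
    "I never want to be kept alive on machines.",
    "Quality of life matters more to me than length of life.",
]

for text in patient_corpus:
    batch = tokenizer(text, return_tensors="pt")
    # labels=input_ids: the loss is next-token cross-entropy, i.e. how well
    # the model reproduces the surface form of the patient's words. Nothing
    # here represents beliefs, values, or preferences as such.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Whatever such a model goes on to generate, it was optimized only to reproduce word distributions like these; any appearance of “underlying values” must be read into its outputs.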

Of course, if given text that explicitly identifies reasons and values, P4s will perform well at re-articulating those reasons, values, and claims in other contexts. Two examples (p9) discussed by Earp et al. involve training on the academic writing of professional philosophers, which typically involves explicit identification of reasons and values. It’s no surprise that when these systems produce outputs “convincingly similar” (p9) to the relevant writer’s, those outputs also include explicit reference to values and reasoning. Earp et al. don’t provide a mechanistic explanation of how such inference could occur for more general writing, or of why we should expect inferences about preferences to be reliable when applied to cases outside those explicitly addressed in the training data.

Another Stranger at the Bedside

Finally, even if these systems are more accurate than (similarly informed) surrogates, they may undermine the relational goals of shared decision-making, which include not only capturing the patient’s voice but also developing trust in healthcare professionals and minimizing regret and distress for surrogates. Prior randomized controlled trials suggest that surrogate decision interventions focused solely on information are unlikely to improve decision-making and may neglect other important outcomes such as surrogate distress (White 2011). Good decision-making involves both cognitive and emotional processes, and interventions that emphasize these dual processes show promise for improving the concordance of surrogate decisions with patient wishes. Ongoing reflection and discussion are crucial as a patient’s experience of medical interventions and current quality of life evolve. Moreover, the process does more than simply identify patient preferences: it aims to help the surrogate (and their supporters) feel that their decision is comprehensible and in line with their understanding of the patient, and to avoid feelings of regret, guilt, alienation, and even trauma (White 2011).

P4s insert themselves into this process in novel ways. They have an uncanny similarity to the patient’s voice, but are previously unknown to the surrogate or family. They present as authorities on the patient’s preferences, but have no interest in the patient’s flourishing. They are “strangers at the bedside” who are not there to facilitate discussion, but instead supersede family and friends as they offer declarations about the patient’s wishes. Thus, even if they are more accurate predictors of patient preferences, there are several ways in which they may distort the relational context of surrogate decision-making.

First, if a P4 is discordant with the surrogate’s interpretation of what the patient would have wanted, the expression of those preferences may alienate the surrogate from the person they (thought they) knew. This should be especially concerning if, like us, you have doubts about the reliability of P4s. Second, contradicting (or even just begrudgingly accepting) a P4’s judgements about the patient’s preferences is likely to turbocharge uncertainty about the decision, and associated feelings of regret. Third, whether or not a P4 is concordant with the surrogate’s understanding of the patient, there is a real risk that it cuts short an exploration of the patient’s values. Lastly, there is evidence that family surrogates pay more attention to the current wellbeing and suffering of the patient than professional guardians do, a much-needed perspective that may be undermined by a P4 (Jox et al. 2012). The hype around these systems, and the fidelity with which they parrot the patient’s voice, risk them taking on the status of oracles rather than stochastic narrators of the patient’s past self. The temptation, perhaps framed as an obligation, will be to defer to these systems without reflection. While Earp et al. are careful to frame the use of these systems as voluntary, it’s easy to see how surrogates and physicians will feel immense pressure to defer to their judgements.

In summary, the use of these systems may undermine the relational process of surrogate decision-making for no clear benefit. While the problem of predicting patient preferences cries out for innovation, we should carefully consider the risks of inviting these uncanny parrots to participate in the delicate cognitive, emotional, and social process of surrogate decision-making.

References

  1. Bender Emily M., Gebru Timnit, McMillan-Major Angelina, and Shmitchell Shmargaret. 2021. “On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?” In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610–23. FAccT ’21. New York, NY, USA: Association for Computing Machinery. 10.1145/3442188.3445922.
  2. Fischer Gary S., Tulsky James A., Rose Mary R., Siminoff Laura A., and Arnold Robert M. 1998. “Patient Knowledge and Physician Predictions of Treatment Preferences after Discussion of Advance Directives.” Journal of General Internal Medicine 13 (7): 447–54. 10.1046/j.1525-1497.1998.00133.x.
  3. Jox Ralf J., Denke Eva, Hamann Johannes, Mendel Rosmarie, Förstl Hans, and Borasio Gian Domenico. 2012. “Surrogate Decision Making for Patients with End-Stage Dementia.” International Journal of Geriatric Psychiatry 27 (10): 1045–52. 10.1002/gps.2820.
  4. Kim Hyunwoo, Sclar Melanie, Zhou Xuhui, Le Bras Ronan, Kim Gunhee, Choi Yejin, and Sap Maarten. 2023. “FANToM: A Benchmark for Stress-Testing Machine Theory of Mind in Interactions.” In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, edited by Bouamor Houda, Pino Juan, and Bali Kalika, 14397–413. Singapore: Association for Computational Linguistics. 10.18653/v1/2023.emnlp-main.890.
  5. London Alex John. 2024. “‘Emergent Abilities,’ AI, and Biosecurity: Conceptual Ambiguity as an Obstacle to Sound Science and Policy.” Manuscript. Carnegie Mellon University.
  6. Mack Jennifer W., and Smith Thomas J. 2012. “Reasons Why Physicians Do Not Have Discussions About Poor Prognosis, Why It Matters, and What Can Be Improved.” Journal of Clinical Oncology 30 (22): 2715–17. 10.1200/JCO.2012.42.4564.
  7. Morrison R. Sean, Meier Diane E., and Arnold Robert M. 2021. “What’s Wrong With Advance Care Planning?” JAMA 326 (16): 1575–76. 10.1001/jama.2021.16430.
  8. Serrano-Gemes Gema, Gil Isabel, Coelho Adriana, and Serrano-del-Rosal Rafael. 2021. “Avoidant Coping of the Decision-Making Process on the Location of Care in Old Age: A Possible Conspiracy of Silence?” International Journal of Environmental Research and Public Health 18 (24). 10.3390/ijerph182412940.
  9. Valmeekam Karthik, Marquez Matthew, Sreedharan Sarath, and Kambhampati Subbarao. 2023. “On the Planning Abilities of Large Language Models - A Critical Investigation.” In Thirty-Seventh Conference on Neural Information Processing Systems. https://openreview.net/forum?id=X6dEqXIsEW.
  10. White Douglas B. 2011. “Rethinking Interventions to Improve Surrogate Decision Making in Intensive Care Units.” American Journal of Critical Care 20 (3): 252–57. 10.4037/ajcc2011106.
