Journal of Medical Internet Research
Letter. 2025 Jan 20;27:e69742. doi: 10.2196/69742

Advantages and Inconveniences of a Multi-Agent Large Language Model System to Mitigate Cognitive Biases in Diagnostic Challenges

Cedric Bousquet 1,2, Divà Beltramin 3
Editors: Tiffany Leung, Libby Beri
PMCID: PMC11791434. PMID: 39832364

We read with great interest a recent article in the Journal of Medical Internet Research entitled “Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study” by Ke et al [1]. Large language models (LLMs) have advanced reasoning skills, but most studies have focused on assessing their ability to answer questions [2]. Ke et al [1] evaluated the ability of LLMs to avoid reproducing certain biases that physicians experience when making a diagnosis. The authors simulated decision-making using a multi-agent system in which each agent models the reasoning of a physician using an LLM. Their experiment shows a remarkable increase in performance over human physicians, with an odds ratio of 3.91. This result is made possible in part by a particularly low human score, which leaves considerable room for improvement by the LLMs.

Previously published studies have often reported disappointing results with LLMs, with performance levels that give cause for concern in a clinical context. Some studies have observed performance similar to that of physicians, and marginally better in a limited number of cases. We were thus surprised to see the multi-agent approach achieve such superiority over physicians. This superiority may be explained by the multi-agent framework itself, but it could also be limited to the specific context of addressing cognitive biases.

To investigate this hypothesis, it would be interesting to compare the performance of the multi-agent system with that of a single LLM. In particular, a prompt engineering method known as tree of thoughts allows several reasoning paths to be explored, enabling the initial choice to be revised in light of diverse alternatives [3].
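As a purely illustrative sketch (not code from the cited papers), the core of the tree-of-thoughts method can be reduced to a beam search over scored reasoning paths: at each depth, candidate next "thoughts" are generated, partial paths are scored, and only the most promising paths are kept. Here a toy scoring function stands in for the LLM's self-evaluation, and all names are hypothetical.

```python
def tree_of_thoughts(expand, score, root, depth=3, beam=2):
    """Beam search over reasoning paths: explore, score, prune.

    expand(path) proposes candidate next thoughts for a partial path;
    score(path) rates a partial path (an LLM self-evaluation in the
    real method); only the `beam` best paths survive at each depth.
    """
    frontier = [[root]]
    for _ in range(depth):
        # Branch: extend every surviving path with each candidate thought.
        candidates = [path + [t] for path in frontier for t in expand(path)]
        if not candidates:
            break
        # Prune: keep only the `beam` highest-scoring partial paths.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam]
    return max(frontier, key=score)

# Toy example: "thoughts" are integers and the scorer prefers larger sums,
# so the search revises early choices that lead to weaker branches.
best = tree_of_thoughts(
    expand=lambda path: [path[-1] + 1, path[-1] + 2],
    score=sum,
    root=0,
)
```

This is only meant to show how a single LLM could reconsider alternatives without a multi-agent conversation; the actual method in [3] uses the model itself both to generate and to evaluate thoughts.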

Furthermore, a recent study found no added value in pairing an LLM with a physician to improve diagnostic performance [4]. Moreover, the quantity of text and the number of interventions produced by multi-agent systems may be overwhelming, and physicians could end up ignoring the suggestions. This was the case in many studies of computerized decision support systems for drug prescriptions, where excessive alerting led physicians to override even relevant alerts [5].

The multi-agent system can also lead to significantly higher costs and requires more computational resources. This depends on the number of agents involved and, above all, on the amount of text generated: the multi-agent system is likely to produce numerous arguments and counterarguments before reaching a sound diagnosis. It would be interesting to know how many tokens appear in the application logs compared with a simpler, single-agent application.

The authors suggest that similar approaches could revolutionize medical practice by making clinical decisions more reliable and more consistent with the levels of evidence available. We subscribe to the idea that LLMs could help physicians, and the authors’ approach seems promising. However, there is still a lack of evidence on how these multi-agent systems would perform if they were generalized to other diagnostic challenges.

Abbreviations

LLM: large language model

Footnotes

Conflicts of Interest: None declared.

Editorial Notice

The corresponding author of “Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study” declined to respond to this letter.

References

1. Ke Y, Yang R, Lie SA, Lim TXY, Ning Y, Li I, Abdullah HR, Ting DSW, Liu N. Mitigating cognitive biases in clinical decision-making through multi-agent conversations using large language models: simulation study. J Med Internet Res. 2024 Nov 19;26:e59439. doi: 10.2196/59439.
2. Bedi S, Liu Y, Orr-Ewing L, Dash D, Koyejo S, Callahan A, Fries JA, Wornow M, Swaminathan A, Lehmann LS, Hong HJ, Kashyap M, Chaurasia AR, Shah NR, Singh K, Tazbaz T, Milstein A, Pfeffer MA, Shah NH. Testing and evaluation of health care applications of large language models: a systematic review. JAMA. 2024 Oct 15:e2421700. doi: 10.1001/jama.2024.21700.
3. Yao S, Yu D, Zhao J, Shafran I, Griffiths T, Cao Y, Narasimhan K. Tree of thoughts: deliberate problem solving with large language models. In: Oh A, Naumann T, Globerson A, Saenko K, Hardt M, Levine S, editors. Advances in Neural Information Processing Systems 36 (NeurIPS 2023). San Diego, CA: Neural Information Processing Systems Foundation, Inc; 2023.
4. Goh E, Gallo R, Hom J, Strong E, Weng Y, Kerman H, Cool JA, Kanjee Z, Parsons AS, Ahuja N, Horvitz E, Yang D, Milstein A, Olson APJ, Rodman A, Chen JH. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open. 2024 Oct 1;7(10):e2440969. doi: 10.1001/jamanetworkopen.2024.40969.
5. Slight SP, Seger DL, Franz C, Wong A, Bates DW. The national cost of adverse drug events resulting from inappropriate medication-related alert overrides in the United States. J Am Med Inform Assoc. 2018 Sep 1;25(9):1183-1188. doi: 10.1093/jamia/ocy066.

