Journal of General Internal Medicine. 2024 Nov 7;40(4):790–795. doi: 10.1007/s11606-024-09177-9

Bias Sensitivity in Diagnostic Decision-Making: Comparing ChatGPT with Residents

Henk G. Schmidt,1 Jerome I. Rotgans,2,3 and Silvia Mamede3
PMCID: PMC11914423  PMID: 39511117

Abstract

Background

Diagnostic errors, often due to biases in clinical reasoning, significantly affect patient care. While artificial intelligence chatbots like ChatGPT could help mitigate such biases, their potential susceptibility to biases is unknown.

Methods

This study evaluated the diagnostic accuracy of ChatGPT against the performance of 265 medical residents in five previously published experiments aimed at inducing bias. The residents worked in several major teaching hospitals in the Netherlands. The biases studied were case-intrinsic (the presence of salient distracting findings in the patient history and the effects of disruptive patient behaviors) and situational (the prior availability of a look-alike patient). ChatGPT’s accuracy in identifying the most likely diagnosis was measured.

Results

Diagnostic accuracy of residents and ChatGPT was equivalent. For clinical cases involving case-intrinsic bias, both ChatGPT and the residents exhibited a decline in diagnostic accuracy: residents’ accuracy decreased by 12% on average, that of ChatGPT 4.0 by 21%, and that of ChatGPT 3.5 by 9%. These findings suggest that, like human diagnosticians, ChatGPT is sensitive to bias when the biasing information is part of the patient history. When the biasing information was extrinsic to the case, in the form of the prior availability of a look-alike case, residents’ accuracy decreased by 15%; by contrast, ChatGPT’s performance was not affected by the biasing information. Chi-square goodness-of-fit tests corroborated these outcomes.

Conclusions

It seems that, while ChatGPT is not sensitive to bias when the biasing information is situational, it is sensitive to bias when the biasing information is part of the patient’s disease history. It shows potential for diagnostic support, but caution is advised. Future research should enhance AI’s bias detection and mitigation to make it truly useful for diagnostic support.

Supplementary Information

The online version contains supplementary material available at 10.1007/s11606-024-09177-9.

INTRODUCTION

An accurate diagnosis is probably the most important step in restoring the health of a patient. At the same time, reaching such an accurate diagnosis is a fragile process, sensitive to all kinds of disturbances. First, the same disease may present itself differently in different patients. Moreover, the complaints, signs, and symptoms produced by different diseases show considerable overlap, making diagnostic decision-making complex. Hence, physicians sometimes make mistakes. About 50% of these mistakes are diagnostic errors outright.1,2 A report from the National Academy of Medicine has underscored the critical nature of diagnostic errors within health care, identifying them as not only prevalent but also potentially among the most harmful patient safety issues.3 An important cause of diagnostic errors is bias. Studies conducted retrospectively have revealed that biases contribute to approximately three-quarters of real-life diagnostic errors.4 Lack of appropriate knowledge and sensitivity to irrelevant but prominent contextual cues are major sources of such biased decision-making, highlighting the need to understand and mitigate biases in diagnostic reasoning processes.5–7

Given that decision support through generative artificial intelligence is being explored as a potential solution to diagnostic errors,8–10 it is important to examine whether AI chatbots like the Chat Generative Pre-Trained Transformer (ChatGPT) are less susceptible to errors caused by biasing information. In this article, we present findings from a study comparing the diagnostic performance of medical residents with that of ChatGPT on the same clinical problems. The resident data are drawn from five previously published experiments that assessed diagnostic accuracy on clinical vignettes designed to induce bias versus vignettes without bias-triggering information. We will focus on comparing how ChatGPT and the residents handle clinical cases that are subject to bias versus those that are not.

ChatGPT, an AI language model developed by OpenAI, is based on the GPT architecture.11 It generates human-like text by analyzing context, language nuances, and patterns from its extensive training data. This model assists with various tasks such as answering questions, writing essays, providing explanations, and engaging in conversations. As a Large Language Model (LLM), it predicts the next word in a sequence by analyzing word patterns and frequencies within a vast textual dataset. The training process involves unsupervised learning, allowing the model to make predictions without explicit human feedback.10 However, two known issues are relevant to the validity of our study. First, the output of generative AI can be influenced by how the chatbot is prompted, leading some to advocate for “prompt engineering” to achieve optimal results.12,13 Second, chatbot outputs may be biased if the training data contains biased information.14 We will explore both issues while discussing our findings.

There have been several attempts to compare the diagnostic accuracy of ChatGPT with the performance of physicians.15–26 These studies have compared the performance of ChatGPT with criterion diagnoses established by experts or with diagnoses from only a small group of physicians. The present study, however, differs in two significant ways. First, it compares ChatGPT’s diagnostic performance with that of 265 internal medicine and general practice residents who participated in five previously published experiments. Second, these experiments were specifically designed to investigate the negative impact of biasing factors on clinical reasoning. All the experiments revealed that biasing interventions negatively affected diagnostic accuracy compared to control groups not exposed to such interventions, indicating that the participating residents were highly susceptible to these contextual biases.

The main objective of the present study was therefore to determine the extent to which ChatGPT is also sensitive to biasing information, which is such an important cause of diagnostic errors committed by human diagnosticians. To that end, we compared the mean diagnostic performance of the physicians involved in the various conditions of these experiments with the performance of ChatGPT 4.0 and its predecessor, ChatGPT 3.5.

METHODS

Participants

Participants were 202 residents in internal medicine and 63 residents in family medicine. These residents were generally junior, in training for 1 or 2 years. They participated in five experiments aimed at studying the influence of biasing information on diagnostic performance.27–31 More detailed information describing the participants, materials, and procedures applied in these experiments can be found in Supplement 1.

Materials

Forty-two clinical vignettes were presented to the participants of these five experiments. The vignettes described a patient’s disease history and presenting complaints, together with results of physical examination and laboratory findings; the text rarely exceeded 300 words. Each vignette was based on a real patient and had a confirmed diagnosis. See Supplement 1.

Procedure

The participants’ task was to provide a most likely diagnosis for these clinical vignettes in an open-ended format. One experiment29 examined the biasing effect of salient distracting features (SDFs), information that strongly suggests a particular diagnosis but ultimately turns out to be irrelevant to the patient’s problem. In a within-subjects design, participants were confronted with vignettes that contained SDFs at the beginning of the text, at the end, or not at all. Clinical vignettes containing an SDF were diagnosed significantly less accurately than cases without an SDF. The number of SDF-related errors increased significantly depending on the condition of the experiment, demonstrating that it was the biasing role of the SDF that caused the mistakes.

A second and a third experiment studied the effect on diagnostic accuracy of disruptive patient behaviors such as aggression, lack of trust, care avoidance, or contempt.27,31 Participants were confronted in a within-subjects design with either disruptive or neutral versions of the same patient. Compared with responses to the neutral version, residents made significantly more diagnostic mistakes when confronted with a patient displaying disruptive behaviors.

Two experiments studied the effect of availability bias on diagnostic accuracy.28,30 In one experiment,28 participants first had to judge the accuracy of a given diagnosis for a particular disease before being confronted with clinical cases that looked very much like the initial case but in fact harbored different diseases. The expectation (and finding) was that the initial judgement task would bias participants into providing a similar diagnosis for the look-alike clinical cases. In the second study, participants first judged the accuracy and comprehensiveness of a Wikipedia page describing the characteristics of a particular disease (e.g., Q-fever) before diagnosing cases that looked like the one on the Wikipedia page but in fact had a different diagnosis.30 It is important to note that in the first three experiments the biasing information was part of the clinical cases presented, either as SDFs or as a description of a neutral versus a disruptive patient; here, the biasing information was intrinsic to the patient history. In the availability bias experiments, the biasing information preceded the actual clinical cases and can therefore be considered extrinsic to these cases. This distinction is reflected in the ways we analyzed the results of ChatGPT.

The clinical vignettes used in these experiments were submitted to ChatGPT 4.0 and 3.5. The latter was included only because some contributors to the OpenAI Developer Forum consider version 4 worse than 3.5 (https://community.openai.com/t/chatgpt-4-is-worse-than-3-5/588078). For the vignettes in which the biasing information was intrinsic to the case, the following prompt was submitted to ChatGPT: “Please generate a most-likely diagnosis for each of the following clinical cases. I require diagnoses only, no need to provide explanations,” followed by the relevant vignettes. The primary outcome was whether ChatGPT provided the correct diagnosis. Its performance was scored as either 1 (correct diagnosis present) or 0 (correct diagnosis not present). Subsequently, a mean proportion of accurate diagnoses was computed, analogous to the mean proportion of accurate diagnoses provided by the human participants.
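For readers who wish to reproduce this scoring pipeline programmatically, the sketch below shows one way it could be set up. It is a minimal illustration, assuming access through the OpenAI Python client rather than the ChatGPT interface; the model name, the example vignette, and the keyword-matching scorer are placeholders, and correctness in the study itself was judged against the confirmed diagnosis rather than by string matching.

```python
# Sketch of the diagnose-and-score procedure (assumption: programmatic access via
# the OpenAI Python client; the study submitted vignettes to ChatGPT directly).
from openai import OpenAI

client = OpenAI()  # requires OPENAI_API_KEY in the environment

PROMPT = ("Please generate a most-likely diagnosis for each of the following "
          "clinical cases. I require diagnoses only, no need to provide explanations.\n\n")

# Hypothetical vignette with its confirmed diagnosis (not one of the study materials).
vignettes = [
    {"text": "A 58-year-old man presents with epigastric pain and weight loss ...",
     "diagnosis": "stomach cancer"},
]

def diagnose(case_text: str, model: str = "gpt-4") -> str:
    """Submit one vignette and return the model's free-text diagnosis."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT + case_text}],
    )
    return response.choices[0].message.content.strip()

def score(answer: str, confirmed: str) -> int:
    """Crude 1/0 scoring; in the study, correctness was judged by the researchers."""
    return int(confirmed.lower() in answer.lower())

scores = [score(diagnose(v["text"]), v["diagnosis"]) for v in vignettes]
print(f"Proportion correct: {sum(scores) / len(scores):.2f}")  # mean accuracy
```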

The vignettes involved in the Mamede et al. (2010) availability bias experiment were preceded by the following prompt: “Today you will first be asked to assess the likelihood of a particular diagnosis for a particular case. It concerns 2 cases. I do not need explanations; I just want you to estimate its likelihood in the form of a percentage. Subsequently you will be asked to diagnose 8 cases. For each case you will be asked to generate the most-likely diagnosis. I require diagnoses only, no need to provide explanations,” followed by the judgement task and the 8 cases to diagnose.

The prompt for the Schmidt et al. (2014) study was: “More and more often, patients consult sites such as Wikipedia, a popular online encyclopedia, to check what kind of disease they might have. It is therefore important that the information relayed via this medium be accurate and comprehensive. You are therefore requested to evaluate the quality of information that laypersons would encounter in a Wikipedia entry. Subsequently you will be asked to diagnose 4 cases. For each case you will be asked to generate the most-likely diagnosis. But first: how accurate and comprehensive is the Wikipedia entry for (depending on treatment group either Q-fever or Legionnaire’s) disease? The entry can be found here: (depending on treatment group a link was provided to the Wikipedia entry for either Q-fever or Legionnaire’s disease),” followed by the 4 cases to diagnose. All these prompts were identical to the prompts given to the residents. (ChatGPT has the tendency to provide an explanation even if you do not ask for it. We therefore instructed it explicitly that we would not need an explanation.)
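To illustrate how the extrinsic (availability-bias) protocol differs from the single-prompt setup, the sketch below keeps the judgement task and the subsequent diagnostic task within one conversation, so the potentially biasing context remains available to the model when it diagnoses the later cases. This is a hedged reconstruction under the same assumptions as the previous sketch: the case texts and the helper function are placeholders, not the study materials.

```python
# Sketch of the two-step availability-bias protocol (Mamede et al., 2010 version),
# assuming both steps are sent within a single conversation so the judgement task
# stays in ChatGPT's context. Case texts below are placeholders.
from openai import OpenAI

client = OpenAI()
history = []

def ask(content: str, model: str = "gpt-4") -> str:
    """Append a user turn, query the model, and keep the reply in the history."""
    history.append({"role": "user", "content": content})
    reply = client.chat.completions.create(model=model, messages=history)
    answer = reply.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

# Step 1: the likelihood-judgement task on two look-alike cases (the biasing context).
ask("Today you will first be asked to assess the likelihood of a particular "
    "diagnosis for a particular case. It concerns 2 cases. I do not need "
    "explanations; I just want you to estimate its likelihood in the form of a "
    "percentage.\n\nCase A: ...\nCase B: ...")

# Step 2: the eight cases to diagnose, asked in the same conversation.
diagnoses = ask("Subsequently you will be asked to diagnose 8 cases. For each case "
                "you will be asked to generate the most-likely diagnosis. I require "
                "diagnoses only, no need to provide explanations.\n\nCase 1: ...")
print(diagnoses)
```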

Statistical Analysis

Since the responses of ChatGPT are equivalent to the responses of only one person or measurement unit, and therefore by definition lack within-group variance, almost all studies comparing diagnostic performance between humans and ChatGPT are strictly descriptive, lacking inferential statistics.16,17,21,22,24–26 A notable exception is a study by Jarou and colleagues.20 Given that ChatGPT uses probability models to generate responses, it does not always answer the same prompt with the same response. Therefore, Jarou et al. presented the same clinical vignettes to ChatGPT 50 times. We have chosen the same approach. This allowed us to compute chi-square goodness-of-fit statistics to test for significance. The resident data were considered the expected data. Under the null hypothesis, the observed ChatGPT data would not differ significantly from the expected data, indicating that ChatGPT is equally sensitive to bias. Rejection of the null hypothesis would be a sign that ChatGPT was differently affected by bias, showing either more or less susceptibility to the biasing information than the residents.
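As an illustration of how such a goodness-of-fit test can be computed, the sketch below compares hypothetical ChatGPT correct-response counts (out of 50 repeated runs per condition) against expected counts derived from the residents’ accuracy proportions reported in Table 1. The counts and the exact aggregation are illustrative assumptions; the paper does not spell out the precise contingency structure it used.

```python
# Illustrative chi-square goodness-of-fit test: the residents' accuracy proportions
# provide the expected distribution, and ChatGPT's repeated runs provide the
# observed counts. Numbers are illustrative, not the study's raw data.
import numpy as np
from scipy.stats import chisquare

REPEATS = 50  # each vignette set was submitted to ChatGPT 50 times

# Accuracy per condition (e.g., no SDF, SDF at the beginning, SDF at the end).
resident_props = np.array([0.54, 0.39, 0.44])          # expected pattern (residents)
chatgpt_props = np.array([0.58, 0.25, 0.34])           # ChatGPT 4.0 proportions (Table 1)
chatgpt_correct = np.round(chatgpt_props * REPEATS)    # observed correct runs per condition

# Scale expected counts so they sum to the observed total, as chisquare requires.
expected = resident_props / resident_props.sum() * chatgpt_correct.sum()

stat, p = chisquare(f_obs=chatgpt_correct, f_exp=expected)
print(f"chi-square({len(resident_props) - 1}) = {stat:.2f}, p = {p:.3f}")
```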

RESULTS

Table 1 presents the results of the analyses involving the experiments in which biasing information intrinsic to the history of the patient was manipulated. Each of the experiments included a condition in which no biasing information was presented.

Table 1.

Mean Diagnostic Accuracy Proportions for Residents, ChatGPT 4.0, and ChatGPT 3.5 in Experiments in Which Biasing Information Intrinsic to the Patient History Was Manipulated

Experiment | Treatment | Number of residents | Residents’ actual performance | ChatGPT 4.0 performance | ChatGPT 3.5 performance
Mamede (2014)29 | Cases containing no salient distracting features | 72 | 0.54 | 0.58 | 0.50
Mamede (2014)29 | Cases containing salient distracting features at the beginning | 72 | 0.39 | 0.25 | 0.33
Mamede (2014)29 | Cases containing salient distracting features at the end | 72 | 0.44 | 0.34 | 0.33
Schmidt (2017)31 | Cases describing a neutral patient | 63 | 0.64 | 0.67 | 0.67
Schmidt (2017)31 | Cases describing a disruptive patient | 63 | 0.54 | 0.50 | 0.67
Mamede (2017)27 | Cases describing a neutral patient | 74 | 0.51 | 0.63 | 0.63
Mamede (2017)27 | Cases describing a disruptive patient | 74 | 0.41 | 0.50 | 0.50

In the experiments summarized in Table 1, the presence of biasing information in a case caused a significant decrease in diagnostic accuracy among residents. The weighted mean drop in accuracy was 12%. Accuracy of ChatGPT 4.0 decreased by 21%, while accuracy of ChatGPT 3.5 decreased by 9%. Chi-square tests comparing the performance of residents with that of ChatGPT led to the following results: for the 2014 Mamede et al. study,29 chi-square(2, N = 72) = 5.47, p > .05; for the Schmidt et al. study,31 chi-square(1, N = 63) = 0.28, p > .05; and for the 2017 Mamede et al. study,27 chi-square(1, N = 74) = 3.55, p > .05. These findings suggest that, like human diagnosticians, ChatGPT is sensitive to bias when the biasing information is part of the patient history.

Table 2 contains the results of the analyses involving the experiments in which biasing information extrinsic to the patient history was manipulated.

Table 2.

Mean Diagnostic Accuracy Proportions for Residents, ChatGPT 4.0, and ChatGPT 3.5 in Experiments in Which Biasing Information Extrinsic to the Patient History Was Manipulated

Experiment | Treatment | Number of residents | Residents’ actual performance | ChatGPT 4.0 performance | ChatGPT 3.5 performance
Mamede (2010)28 | Not subject to availability bias | 18 | 0.55 | 0.75 | 0.25
Mamede (2010)28 | Subject to availability bias | 18 | 0.39 | 0.75 | 0.25
Schmidt (2014)30 | Not subject to availability bias | 19 | 0.70 | 0.50 | --*
Schmidt (2014)30 | Subject to availability bias | 19 | 0.56 | 0.50 | --*

*ChatGPT 3.5 could not access websites, which served as the triggering material in this experiment

When the biasing information was extrinsic to the case, residents’ accuracy decreased by between 14 and 16%, with a weighted mean decrease of 15%. By contrast, ChatGPT’s performance was not affected by the biasing information. For the Mamede et al. study,28 chi-square(1, N = 18) = 7.29, p < .05, indicating that expected and observed data differ significantly. For the Schmidt et al. study,30 chi-square(1, N = 19) = 1.21, p > .05. The latter result indicates that the null hypothesis could not be rejected, despite the fact that ChatGPT was clearly not affected by the biasing information; this was probably due to the small difference between conditions in the resident group.

DISCUSSION

Diagnostic error is a ubiquitous threat to the quality of health care,3,4 and we have argued that many such mistakes are the result of bias: physicians are led astray because they are sensitive to contextual information that is irrelevant to the task at hand.5–7 Artificial intelligence is often perceived as having the potential to improve the accuracy of diagnoses because of its access to large amounts of data and its ability to integrate these data into coherent diagnostic decisions.8–10

In the present study, we have attempted to assess to what extent ChatGPT, like physicians, is sensitive to biasing information. To that end, we compared the performance of ChatGPT with that of residents in five experiments aimed at studying the negative effects of biasing information on diagnostic decision-making.27–31 We made a distinction between information that was intrinsic to the case and information that was extrinsic to it. Intrinsic biasing information is part of the patient’s disease history, such as the presence of salient distracting features or information about the disruptive behaviors of a patient. Extrinsic biasing information is not part of the patient’s disease history; the example studied in this article was availability bias, the earlier confrontation with a look-alike patient.28,30 For a review of experiments testing the effects of intrinsic and extrinsic biases, see Schmidt et al.32

Our findings can be summarized as follows.

First, when biasing information is part of the patient’s disease history or describes disruptive behaviors, ChatGPT, like humans, makes more mistakes than when bias is absent. The question is why this is so. The reader is reminded that the cases used were experimentally altered to trigger the biased diagnoses. For instance, in the Mamede et al. (2014) experiment, the vignette of a patient with stomach cancer was altered by including the following salient distracting features in the two experimental versions: “(the patient is) a smoker with a history of chronic NSAID medication for osteoarthritis and with a father who suffered from a stomach ulcer, as shown by the family history,” leading some residents to conclude that this patient in fact suffered from a peptic ulcer. Since ChatGPT operates upon probabilistic dependencies among concepts, the occurrence of NSAID medication and a stomach ulcer in the family may have increased the probability of peptic ulcer as a likely diagnosis and thereby decreased the probability of stomach cancer. And indeed, such SDF-driven diagnoses appear in the responses of ChatGPT, adding to diagnostic error. Even more surprising is the fact that a mere description of the behaviors of a disruptive patient, in the absence of a hint at a particular disease, is sufficient for ChatGPT to produce diagnostic errors. This suggests that ChatGPT conjectures that these behaviors are also a sign of disease and takes them into account.

Second, when bias is situational, such as a look-alike patient seen before, ChatGPT is not sensitive to the biasing information. Since the vignettes themselves were identical across the conditions of these experiments, ChatGPT evidently responds only to the case as presented, ignoring biasing information in the context. We made several attempts to “nudge” ChatGPT into taking the context into account. In one attempt, regarding the Mamede et al. (2010) availability bias experiment, we added to the prompt displayed in the “METHODS” section: “Please consider the 8 cases with the result of the confirmation task in mind.” This did not change ChatGPT’s response. A second attempt consisted of the following additional prompt: “I would like you to indicate whether or not the 8 cases are similar to the 2 cases you have encountered previously.” Even when explicitly asked whether the criterion cases were similar to those suggesting the biasing diagnosis, ChatGPT suggested the unbiased diagnoses as its most likely choices. It seems that only when ChatGPT considers the biasing information to be an integral part of the case does it compute diagnoses that demonstrate the influence of bias.

Our findings give rise to some further considerations.

ChatGPT’s responses are only as good as the materials on which it was trained. If those materials are biased in some way, for instance regarding gender, race, or culture, ChatGPT’s output will tend to reflect this.14,33 Since it is not known which clinical case data are included in its training, it is difficult to check whether the mistakes made by ChatGPT in our study derive from errors in the training materials rather than from our experimental manipulations. We have checked our clinical vignettes against internet sources, notably Wikipedia and the National Library of Medicine, but could not find cases containing the same biasing information as applied in our experiments. In addition, our vignettes are not in the public domain. Nevertheless, possible bias in some of ChatGPT’s training materials places limits on our findings.

Some authors maintain that ChatGPT’s accuracy is a function of the nature of the prompts used.12,13 It is possible that, had we used prompts different from those described in the “METHODS” section, ChatGPT would have made fewer mistakes; this is another possible limitation. However, to directly compare the performance of the residents with that of ChatGPT, it was mandatory to use the same prompts in both cases. In addition, when we tried to nudge ChatGPT into considering different information while deciding on the most likely diagnoses, its output was not affected.

Our sample consisted solely of junior residents. This is an additional limitation because these residents may not be representative of the wider population of physicians. However, other studies show that more experienced physicians are also influenced by biasing information.34,35 A final limitation is that we presented these residents with textual vignettes rather than more realistic clinical cases. Clinical vignettes are, however, extensively used in the study of diagnostic error, and Peabody and colleagues have demonstrated that vignettes are as valid as standardized patients or chart abstraction if one wishes to study the quality of health care decisions.36

A final consideration is the following. Like humans, ChatGPT is error prone. Even under conditions in which experimental biasing information is absent, it performs only slightly (if consistently) better than the junior residents involved in these experiments. These findings are in general agreement with other studies involving ChatGPT in diagnostic reasoning.15–19,21–26,37 They imply that, for now, ChatGPT should be applied to diagnostic problems only with caution. It is fallible, as we are.

Supplementary Information

Below is the link to the electronic supplementary material.

Author Contributions:

Henk G. Schmidt and Jerome I. Rotgans conducted the analysis using ChatGPT and wrote subsections for the manuscript, together with Silvia Mamede. Henk G. Schmidt wrote the final draft. All authors take responsibility for the manuscript as a whole.

Declarations:

Ethics Approval:

The present study did not involve testing of human participants. Therefore, no ethics approval was sought.

Conflict of Interest:

The authors declare that they do not have a conflict of interest.

Footnotes

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Cabot RC. Diagnostic pitfalls identified during a study of three thousand autopsies. J Am Med Assoc. 1912;59(26):2295-2298.
2. Graber M. Diagnostic errors in medicine: a case of neglect. Jt Comm J Qual Patient Saf. 2005;31(2):106-113.
3. Balogh EP, Miller BT, Ball JR, eds. Committee on Diagnostic Error in Health Care; Board on Health Care Services; Institute of Medicine; The National Academies of Sciences, Engineering, and Medicine. Improving Diagnosis in Health Care. National Academies Press; 2015.
4. Graber ML, Franklin N, Gordon R. Diagnostic error in internal medicine. Arch Intern Med. 2005;165(13):1493-1499. 10.1001/archinte.165.13.1493
5. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med. 2003;78(8):775-780.
6. Mamede S, de Carvalho MA, de Faria RMD, et al. 'Immunising' physicians against availability bias in diagnostic reasoning: a randomised controlled experiment. BMJ Qual Saf. 2020;29(7):550-559. 10.1136/bmjqs-2019-010079
7. Norman GR, Monteiro SD, Sherbino J, Ilgen JS, Schmidt HG, Mamede S. The causes of errors in clinical reasoning: cognitive biases, knowledge deficits, and dual process thinking. Acad Med. 2017;92(1):23-30. 10.1097/acm.0000000000001421
8. Beam AL, Drazen JM, Kohane IS, Leong T-Y, Manrai AK, Rubin EJ. Artificial intelligence in medicine. N Engl J Med. 2023;388(13):1220-1221. 10.1056/NEJMe2206291
9. AI in medicine: creating a safe and equitable future [editorial]. The Lancet. 2023;402:503.
10. Raza MM, Venkatesh KP, Kvedar JC. Generative AI and large language models in health care: pathways to implementation. NPJ Digit Med. 2024;7(1):62. 10.1038/s41746-023-00988-4
11. OpenAI. ChatGPT (Mar 14 version) [Large language model]. 2023.
12. Ekin S. Prompt engineering for ChatGPT: a quick guide to techniques, tips, and best practices. Authorea Preprints. 2023.
13. Giray L. Prompt engineering with ChatGPT: a guide for academic writers. Ann Biomed Eng. 2023;51(12):2629-2633.
14. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet Things Cyber-Phys Syst. 2023;3:121-154.
15. Caruccio L, Cirillo S, Polese G, Solimando G, Sundaramurthy S, Tortora G. Can ChatGPT provide intelligent diagnoses? A comparative study between predictive models and ChatGPT to define a new medical diagnostic bot. Expert Syst Appl. 2024;235:121186.
16. Goh E, Gallo R, Hom J, et al. Influence of a large language model on diagnostic reasoning: a randomized clinical vignette study. medRxiv. 2024:2024-03.
17. Hirosawa T, Kawamura R, Harada Y, et al. ChatGPT-generated differential diagnosis lists for complex case-derived clinical vignettes: diagnostic accuracy evaluation. JMIR Med Inform. 2023;11:e48808. 10.2196/48808
18. Horiuchi D, Tatekawa H, Oura T, et al. Comparison of the diagnostic accuracy among GPT-4 based ChatGPT, GPT-4V based ChatGPT, and radiologists in musculoskeletal radiology. medRxiv. 2023:2023.12.07.23299707. 10.1101/2023.12.07.23299707
19. Horiuchi D, Tatekawa H, Shimono T, et al. Accuracy of ChatGPT generated diagnosis from patient's medical history and imaging findings in neuroradiology cases. Neuroradiology. 2024;66(1):73-79. 10.1007/s00234-023-03252-4
20. Jarou ZJ, Dakka A, McGuire D, Bunting L. ChatGPT versus human performance on emergency medicine board preparation questions. Ann Emerg Med. 2024;83(1):87-88. 10.1016/j.annemergmed.2023.08.010
21. Mehnen L, Gruarin S, Vasileva M, Knapp B. ChatGPT as a medical doctor? A diagnostic accuracy study on common and rare diseases. medRxiv. 2023:2023.04.20.23288859. 10.1101/2023.04.20.23288859
22. Oon ML, Syn NL, Tan CL, Tan K-B, Ng S-B. Bridging bytes and biopsies: a comparative analysis of ChatGPT and histopathologists in pathology diagnosis and collaborative potential. Histopathology. 2024;84(4):601-613. 10.1111/his.15100
23. Rao AS, Pang M, Kim J, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow. medRxiv. 2023:2023.02.21.23285886.
24. Shelmerdine SC, Martin H, Shirodkar K, Shamshuddin S, Weir-McCall JR. Can artificial intelligence pass the Fellowship of the Royal College of Radiologists examination? Multi-reader diagnostic accuracy study. BMJ. 2022;379:e072826.
25. Stoneham S, Livesey A, Cooper H, Mitchell C. ChatGPT versus clinician: challenging the diagnostic capabilities of artificial intelligence in dermatology. Clin Exp Dermatol. 2023. 10.1093/ced/llad402
26. Ueda D, Walston SL, Matsumoto T, Deguchi R, Tatekawa H, Miki Y. Evaluating GPT-4-based ChatGPT's clinical potential on the NEJM quiz. BMC Digital Health. 2024;2(1):4.
27. Mamede S, Van Gog T, Schuit SCE, et al. Why patients' disruptive behaviours impair diagnostic reasoning: a randomised experiment. BMJ Qual Saf. 2017;26(1):13-18. 10.1136/bmjqs-2015-005065
28. Mamede S, van Gog T, van den Berge K, et al. Effect of availability bias and reflective reasoning on diagnostic accuracy among internal medicine residents. JAMA. 2010;304(11):1198-1203.
29. Mamede S, van Gog T, van den Berge K, van Saase JLCM, Schmidt HG. Why do doctors make mistakes? A study of the role of salient distracting clinical features. Acad Med. 2014;89(1):114-120. 10.1097/acm.0000000000000077
30. Schmidt HG, Mamede S, van den Berge K, van Gog T, van Saase JLCM, Rikers RMJP. Exposure to media information about a disease can cause doctors to misdiagnose similar-looking clinical cases. Acad Med. 2014;89(2):285-291. 10.1097/acm.0000000000000107
31. Schmidt HG, van Gog T, Schuit SCE, et al. Do patients' disruptive behaviours influence the accuracy of a doctor's diagnosis? A randomised experiment. BMJ Qual Saf. 2017;26(1):19-23. 10.1136/bmjqs-2015-004109
32. Schmidt HG, Norman GR, Mamede S, Magzoub M. The influence of context on diagnostic reasoning: a narrative synthesis of experimental findings. J Eval Clin Pract. 2024;30(6):1091-1101. 10.1111/jep.14023
33. Caliskan A, Bryson JJ, Narayanan A. Semantics derived automatically from language corpora contain human-like biases. Science. 2017;356(6334):183-186. 10.1126/science.aal4230
34. Frederick PD, Nelson HD, Carney PA, et al. The influence of disease severity of preceding clinical cases on pathologists' medical decision making. Med Decis Making. 2017;37(1):91-100. 10.1177/0272989x16638326
35. Hatala R, Norman GR, Brooks LR. Impact of a clinical scenario on accuracy of electrocardiogram interpretation. J Gen Intern Med. 1999;14(2):126-129. 10.1046/j.1525-1497.1999.00298.x
36. Peabody JW, Luck J, Glassman P, Dresselhaus TR, Lee M. Comparison of vignettes, standardized patients, and chart abstraction: a prospective validation study of 3 methods for measuring quality. JAMA. 2000;283(13):1715-1722.
37. Rizwan A, Sadiq T. The use of AI in diagnosing diseases and providing management plans: a consultation on cardiovascular disorders with ChatGPT. Cureus. 2023;15(8):e43106. 10.7759/cureus.43106
