PLOS Digit Health. 2024 Feb 14;3(2):e0000349. doi: 10.1371/journal.pdig.0000349

Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

Stefan Morreel 1,*, Veronique Verhoeven 1, Danny Mathysen 1,2
Editor: Imon Banerjee
PMCID: PMC10866461  PMID: 38354127

Abstract

Recently developed chatbots based on large language models (further called bots) have promising features which could facilitate medical education. Several bots are freely available, but their proficiency has been insufficiently evaluated. In this study, the authors tested the current performance of six widely used bots on the multiple-choice medical licensing exam of the University of Antwerp (Belgium): ChatGPT (OpenAI), Bard (Google), New Bing (Microsoft), Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI). The primary outcome was the performance on the exam expressed as a proportion of correct answers. Secondary analyses were done for a variety of features in the exam questions: easy versus difficult questions, grammatically positive versus negative questions, and clinical vignettes versus theoretical questions. Reasoning errors and untruthful statements (hallucinations) in the bots’ answers were examined. All bots passed the exam; Bing and GPT-4 (both 76% correct answers) outperformed the other bots (62–67%, p = 0.03) and the students (61%). Bots performed worse on difficult questions (62%, p = 0.06), but outperformed students (32%) on those questions by an even larger margin (p<0.01). Hallucinations were found in 7% of Bing’s and GPT-4’s answers, significantly fewer than Bard (22%, p<0.01) and Claude Instant (19%, p = 0.02). Although the creators of all bots try to some extent to prevent their bots from being used as a medical doctor, none succeeded, as none refused to answer all clinical case questions. Bing was able to detect weak or ambiguous exam questions. Bots could be used as a time-efficient tool to improve the quality of a multiple-choice exam.

Author summary

Artificial chatbots such as ChatGPT have recently gained a lot of attention. They can pass exams for medical doctors; sometimes they even perform better than regular students. In this study, we tested ChatGPT and five other (newer) chatbots on the multiple-choice exam that students in Antwerp (Belgium) must pass to obtain the degree of medical doctor. All bots passed the exam with results similar to or better than the students. Microsoft Bing Chat (its name at the time of writing; at the time of publication called Microsoft Copilot) scored the best of all tested bots but still produced hallucinations (untruthful statements or reasoning errors) in seven percent of its answers. Bots performed worse on difficult questions, but they outperformed students on those questions by an even larger margin. Maybe they are most useful when humans don’t know the answer themselves? The creators of the bots try to some extent to prevent their bots from being used as a medical doctor, but none of the tested bots succeeded, as none refused to answer all clinical case questions. Microsoft Bing also turned out to be useful for finding weak questions and thereby improved the studied exam.

Introduction

The development of AI applications heralds a new era in many fields of society, including medicine and medical education. In particular, artificial intelligence (AI) chatbots based on large language models (further called bots) have promising features which could facilitate education by offering simulation training, by personalizing learning experiences with individualised feedback, or by acting as decision support in clinical training situations. However, before adopting this technology in the medical curriculum, its capabilities have yet to be thoroughly tested [1,2].

Soon after the first bots became publicly available, higher medical education institutes started to report on their performance in medical exam simulations [3]. A scoping review listed their potential uses in medical teaching: automated scoring, teaching assistance, personalized learning, research assistance, quick access to information, generating case scenarios and exam questions, content creation for learning facilitation, and language translation [4].

Whereas bots seem to be informative and logical in many of their responses, in others they answer with obvious, sometimes dangerous, hallucinations (confident responses which nevertheless contain reasoning errors or are unjustified by the current state of the art) [5]. They will reproduce flaws in the datasets they are trained on; they may reflect or even amplify societal inequality or biases, or generate inaccurate or fake information [6].

Mostly, bots perform near the passing mark [6–9], although they outperform students in some reports [10–12]. Performance is in general better on easier questions and when the exam is written in English [13,14]. Notably, their scores are generally worse on exams from more advanced stages of the medical curriculum. However, bots seem to learn rapidly, and new versions do considerably better than their prototypes [15–17]. As bots evolve, their proficiency needs continuous monitoring and updating.

Media articles report that higher education institutes already anticipate the dangers of bots in terms of possible exam fraud, but bots also offer opportunities to assist in developing exams, for example by identifying ambiguous or badly formulated exam questions.

Very few comparisons between different bots have been made, and those that do exist only compare two or three bots and do not report hallucination rates [18,19].

In this study, we use the final theory exam that all medical students need to pass to obtain the degree of Medical Doctor. It is followed by an oral exam, which is not part of this study. The current exam was used in 2021 at the University of Antwerp, Belgium. It is similar to countrywide exams used in other countries, such as the United States Medical Licensing Examination Step 1 and Step 2 CK [20].

In this study we have tested the current performance of six publicly available bots on the University of Antwerp medical licensing exam. The primary outcomes concern the performance of each bot on the exam. Secondary outcomes include performance on subsets of questions, interrater variability, proportion of hallucinations and the detection of possible weak exam questions.

Material and methods

Ethics

This experiment has been approved by the Ethics Committee of the University of Antwerp and the Antwerp University Hospital (reference number MDF 21/03/037, amendment number 5462).

Materials

At the end of the undergraduate medical training at the University of Antwerp, medical students must pass a general medical knowledge examination before being licensed as a medical doctor. Besides an oral (viva) examination, this general medical knowledge examination contains 102 multiple-choice questions covering the entire range of curricular courses. In this study, the exam as it was presented to the students in their second master year (before their final year of clinical training) was used. The scoring system was adapted afterwards, so the students’ scores in this paper do not reflect the actual grades given to the students. The questions were not available online, so they were not used for the training of the studied bots.

Bot selection

Six bots that are publicly available and can currently be used by teachers and students were tested. The most widely used free bots were selected: ChatGPT (OpenAI), Bard (Google), and New Bing (Microsoft, called Bing Chat at the time of writing and Microsoft Copilot at the time of publication). Claude Instant (Anthropic), Claude+ (Anthropic) and GPT-4 (OpenAI) were added to the list because they allow for an evaluation of the difference between a free and a paid version. Even though Bing is based on the GPT-4 large language model, it also uses other sources such as Bing Search, so it is a customized version of the pure GPT-4 bot [21].

Data extraction

The exam was translated using DeepL (DeepL SE), a neural machine translation service. Clear translation errors were corrected by author SM, but the writing style and grammar were not improved, in order to mimic an everyday testing situation. Questions containing images or tables (N = 2) and local questions (N = 5) were excluded. Local questions were excluded because they concern theories, frameworks or models that have only been described in Dutch and are only applicable to Belgium and the Netherlands; literal translation of these questions leads to nonsense questions in English.

Details on how and when the bots were used can be found in Table 1. By coincidence, the authors found out that when Bard refuses to answer a medical question, prompting it with “please regenerate draft” may force it to answer the question anyhow. This was not the case for the other bots. In all cases where Bard refused to answer, this additional prompt was used.

Table 1. Overview of the tested generative chat bots.

| Bot | Large language model | Properties | Avoiding memory retention | Log in | Access dates | Price |
| --- | --- | --- | --- | --- | --- | --- |
| Bing | GPT-4 | Conversation style = "More precise" | "New topic" function used after each question | Microsoft account | 7–9/6/2023 | Free |
| Bard | PaLM 2 | Accessed using a virtual private network to emulate a US location | "Reset chat" function used after each question | Google account | 12–14/06/2023 | Free |
| ChatGPT | GPT-3.5 | Accessed through Poe* | New chat started using the broom button | Poe* log in | 12–26/06/2023 | Free; a paid version based on GPT-4 exists |
| Claude+ | Claude version 1 | Accessed through Poe* | Broom button | Poe* log in | 12–26/06/2023 | Free trial on Poe, paid afterwards |
| Claude Instant | Lighter version of Claude version 1 | Accessed through Poe* | Broom button | Poe* log in | 12–26/06/2023 | Free trial on Poe, paid afterwards |
| GPT-4 | GPT-4 | Accessed through Poe* | Broom button | Poe* log in | 12–26/06/2023 | Free trial on Poe, paid afterwards |

GPT: generative pre-trained transformer

PaLM: Pathways Language Model

*: Poe (Platform for Open Exploration, Quora) was used because it allows fluent testing of multiple bots at the same time. A trial subscription of one week was used.

Outcomes

The primary outcome was the performance on the exam expressed as a proportion of correct answers (score). This outcome was also measured in the same way as the students were rated on this exam (adapted score): eleven questions contained a second-best answer (an acceptable alternative to the best answer), for which a score of 0.33 was awarded; twenty questions contained a fatal answer (an option that is dangerous for the patient), leading to a score of -1. For the calculation of the students’ scores, the image, table, and local questions were excluded as well.
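
As an illustration, a minimal sketch of this adapted scoring rule follows (a hypothetical helper written by us, not the authors’ actual grading script; the structure of the answer key is assumed):

```python
# Minimal, hypothetical sketch of the adapted scoring rule described above:
# +1 for the best answer, +0.33 for a second-best answer, -1 for a fatal answer,
# and 0 for any other wrong, unclear, or missing answer.

def adapted_score(answers, key):
    """answers: chosen option per question (None if no answer);
    key: per question a dict with 'best' and optional 'second_best'/'fatal' options."""
    total = 0.0
    for chosen, q in zip(answers, key):
        if chosen == q["best"]:
            total += 1.0
        elif chosen is not None and chosen == q.get("second_best"):
            total += 0.33
        elif chosen is not None and chosen == q.get("fatal"):
            total -= 1.0
    return 100 * total / len(key)  # expressed as a percentage of the maximum score

# Three-question example: one best, one second-best and one fatal answer.
key = [{"best": "a"}, {"best": "b", "second_best": "c"}, {"best": "d", "fatal": "a"}]
print(round(adapted_score(["a", "c", "a"], key), 1))  # (1 + 0.33 - 1) / 3 * 100 = 11.0
```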

The primary outcomes were assessed in four subsets of questions. Firstly, by the difficulty of the questions: thirteen questions were difficult (recorded P-value in the question bank below 0.30, meaning that less than 30% of the students answered the question correctly [22]), 36 were easy (recorded P-value above 0.80) and 46 were moderate (recorded P-value between 0.30 and 0.80); this categorisation is sketched below. Secondly, by the grammar of the questions: negatively formulated questions (e.g., “which statement is not correct?”) versus positive statements; five questions were negatively formulated. Thirdly, by the type of question: theory (50 questions) or describing a patient (clinical vignette, 45 questions). Finally, questions with versus without fatal answers.
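
As referenced above, a minimal sketch of the difficulty categorisation based on the recorded P-value (the thresholds are those stated in the text; the helper function itself is our illustration):

```python
# Hypothetical helper reproducing the difficulty categories described above,
# based on the recorded P-value (proportion of students answering correctly).
def difficulty(p_value: float) -> str:
    if p_value < 0.30:
        return "difficult"
    if p_value > 0.80:
        return "easy"
    return "moderate"

print([difficulty(p) for p in (0.25, 0.55, 0.90)])  # ['difficult', 'moderate', 'easy']
```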

In those cases where a bot answered a question incorrectly with a fatal answer, the proportion of selected fatal answers among all wrong answers was calculated.

The primary outcome was also assessed for a virtual bot (called Ensemble Bot), whose answer was the most common value (mode) of the answers of all six bots [23]. The reasoning behind an ensemble bot is that combining several bots may improve a decision system’s robustness and accuracy by reducing variance [24].
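
A minimal sketch of this majority-vote (mode) rule, assuming letter-coded answers; the study does not specify how ties are broken, so the behaviour below (first-seen answer wins) is our assumption:

```python
from collections import Counter

# Hypothetical sketch of the Ensemble Bot: for each question, take the most common
# (mode) answer among the six bots. Counter.most_common returns tied answers in
# first-seen order, which serves as the assumed tie-breaking rule here.
def ensemble_answer(bot_answers):
    """bot_answers: the six bots' answers to one question, e.g. ['a', 'a', 'b', 'a', 'c', 'b']."""
    return Counter(bot_answers).most_common(1)[0][0]

print(ensemble_answer(["a", "a", "b", "a", "c", "b"]))  # -> 'a'
```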

Three additional outcomes were assessed. Firstly, the proportion of hallucinations, as rated by the authors, among the incorrect answers of each bot. Authors VV and DM read all incorrect answers and judged whether they contained a hallucination; in case of discordance, author SM made a final decision. A hallucination was previously defined as content that is nonsensical or untruthful in relation to certain sources [25]. This definition is not usable for the current research, so the authors defined a hallucination as content that either contains clear reasoning errors or is untruthful in relation to the current evidence-based medical literature. Detecting reasoning errors requires no medical knowledge, for example: “the risk is about 1 in 100 (3%)”. To detect untruthful answers, the authors had to use their own background knowledge combined with common online resources to verify the AI answers. One clear example of an untruthful answer given by several bots: “This is a commonly used mnemonic to remember the order: "NAVEL"—Nerve, Artery, Vein, Empty space (from medial to lateral).” The bots suggested this is the order of the inguinal structures from medial to lateral; the mnemonic does exist, but it should be applied from lateral to medial. Because a multiple-choice exam was studied, hallucinations could not be found in the answer itself but in the arguments supporting the selected answer; bots never answer with a simple letter, they all produce written-out answers of varying length. The authors intended to report reasoning errors and untruthful answers separately but found that the two were often both present in a bot’s answer, so this distinction was abandoned.

Secondly, the proportion of possible weak questions among the incorrect answers of the best scoring bot was assessed. For this outcome, all authors discussed all incorrect answers of the best scoring bot and reached unanimous consensus.

Thirdly, the interrater variability was examined. Originally, the authors planned to test whether user interpretation of the answers would differ from a strict interpretation of the bot’s answer, as this difference was significant in a previous study [9]. This outcome was suspended because such cases occurred only with ChatGPT and Bard.

Analysis

The differences in performance among the bots/students, differences in performance among categories of questions, and differences in the proportion of hallucinations were tested with a one-way ANOVA test and pairwise unpaired two-sample t-tests. P-values were two-tailed where applicable, and a p-value of less than 0.05 was considered statistically significant; a p-value between 0.05 and 0.10 was considered a trend. For the wrong answers on questions with a fatal answer, a chi-squared test was used to assess the difference between the bots’ proportion of fatal answers and the random proportion of fatal answers (which equals 0.33). Fleiss’ kappa was used to assess the overall agreement among the bots, and Cohen’s kappa to assess pairwise interrater agreement between the different bots. Raw data were collected using Excel 2023 (Microsoft). JMP Pro version 17 (JMP Statistical Discovery LLC) was used for all analyses except Fleiss’ kappa, which was calculated in R version 4.3.1 (DescTools package).
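
For readers who want to reproduce this type of analysis with open-source tools, a hedged Python sketch follows; the data layout, the placeholder values and the library choices are our assumptions (the authors used Excel, JMP Pro and R):

```python
import numpy as np
from scipy import stats
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa
from sklearn.metrics import cohen_kappa_score

# Assumed layout: rows = questions, columns = bots; 1 = correct, 0 = incorrect.
rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(95, 6))        # placeholder data, not the study data

# One-way ANOVA across the six bots, followed by a pairwise unpaired two-sample t-test.
f_stat, p_anova = stats.f_oneway(*[correct[:, j] for j in range(6)])
t_stat, p_pair = stats.ttest_ind(correct[:, 0], correct[:, 1])

# Chi-squared goodness-of-fit test of the observed fatal-answer proportion among the
# 34 wrong answers (5 fatal) against the chance proportion of 0.33.
chi2 = stats.chisquare([5, 34 - 5], f_exp=[0.33 * 34, 0.67 * 34])

# Agreement: Fleiss' kappa over all bots, Cohen's kappa for one pair of bots.
letters = rng.choice(list("abcd"), size=(95, 6))   # placeholder answer letters
counts, _ = aggregate_raters(letters)
print(p_anova, p_pair, chi2.pvalue, fleiss_kappa(counts),
      cohen_kappa_score(letters[:, 0], letters[:, 1]))
```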

Results

Overall exam performance

See Table 2 for an overview of the scores of the tested bots. Bing and GPT-4 scored the best, with 76% correct answers and an adapted score (the way students were rated) of 76% as well. The mean score of all bots was 68%; the scores of the individual bots were not significantly different from this mean (p = 0.12). However, Bing and GPT-4 scored significantly better than Bard (p = 0.03) and Claude Instant (p = 0.03). GPT-4 had the same score as Bing but more wrong answers (18 versus 13). Claude+ did not score significantly better than Claude Instant. All bots gave one fatal answer (on different questions) except Bard, which did not give any fatal answers. Bing gave four second-best answers, ChatGPT, Bard and GPT-4 three each, Claude+ two and Claude Instant only one. For thirteen questions, Bard refused to answer; after prompting Bard up to five times with “regenerate draft”, it still refused to answer four questions, seven were answered correctly and two incorrectly. The performance of the bots using the adapted score was very similar because the points added for second-best answers were smoothed out by the points lost due to fatal answers. The mean score of the 95 students was 61% (standard deviation 9); the mean adapted score for students was 60% (standard deviation 21). The Ensemble Bot (answering with the most common answer among the six bots) scored the same as Bing (72 correct answers, 76%).

Table 2. Performance of generative chat bots on the University of Antwerp Medical License Exam (95 questions).

| Bot | Correct answers (N) | Score (%) | 95% CI lower | 95% CI upper | No answer (N) | Refusal to answer (N) | Several answers without clear choice (N) | Unclear answer (N) | Wrong answer (N) | Adapted score* (%) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bing | 72 | 76 | 66 | 83 | 3 | 1 | 1 | 5 | 13 | 76 |
| ChatGPT | 64 | 67 | 57 | 76 | 1 | 0 | 3 | 2 | 25 | 67 |
| Bard | 58 | 61 | 51 | 70 | 0 | 4 | 2 | 0 | 31 | 62 |
| GPT-4 | 72 | 76 | 66 | 83 | 1 | 0 | 3 | 1 | 18 | 76 |
| Claude+ | 64 | 67 | 57 | 76 | 1 | 2 | 5 | 0 | 23 | 67 |
| Claude Instant | 60 | 63 | 53 | 72 | 2 | 2 | 3 | 0 | 28 | 62 |

*This is the score that was used to assess students. A second-best answer was rated as +0.33 and a fatal answer as -1.

CI: confidence interval for the score (%)

To illustrate this performance, S1 Table contains a question and the responses from all selected bots.

Performance for subsets of questions

The bots scored on average 73% on easy questions and 62% on difficult questions (p = 0.06). The students scored on average 75% on easy questions and 32% on difficult questions (p<0.01). Considering difficult questions only, ChatGPT performed best with a score of 77%, while Bing and GPT-4 scored 69%; the students’ 32% on difficult questions was significantly lower than ChatGPT, Bing, and GPT-4 (p<0.01). A similar but smaller effect was found for moderate questions (Bing versus students, 72% versus 59%, p = 0.07) but not for easy questions (69% versus 74%, p = 0.30).

No significant difference in performance on negative versus positive questions (p = 0.16) or on clinical vignettes versus theory questions (p = 0.16) was found. Such a difference was not found for the students either (p = 0.54 and 0.38, respectively). When examining individual questions, errors on clinical vignette questions were often caused by Bing missing an important clue in the context or the history of the patient. For example, in a question concerning the timing of a flu vaccine for a pregnant patient consulting in August, Bing answered that the flu vaccine was necessary immediately. Bing missed the clue about August: flu vaccines should be given later and are generally not yet available in August (in Europe) [26].

The bots scored on average 72% on questions with a fatal answer, which is not significantly different from questions without a fatal answer (68%, p = 0.39). Among the 34 wrong answers, the fatal answer was chosen five times (15%), which is lower than can be expected by chance alone (11 wrong answers or 33%, p = 0.09). The students did perform worse on questions with a fatal answer (mean 64% without versus 52% with a fatal answer, p = 0.03). Among the 843 wrong student answers, the fatal option was chosen in 111 answers (13%).

Detection of hallucinations

Hallucinations were found in 7% of Bing’s and GPT-4’s answers. This was significantly lower than Bard (22%, p<0.01) and Claude Instant (19%, p = 0.02). ChatGPT had 15% hallucinations and Claude+ 12%, which was not significantly different from Bing and GPT-4 (p>0.10 for all these comparisons).

See Table 3 for a question on which five bots hallucinated (reasoning errors).

Table 3. Example of all generative chatbot hallucinations on one question. Reasoning errors are indicated in bold.

Question: What is the approximate risk of having a child with Down syndrome for a 40-year-old woman?
    a) 1% (correct answer)
    b) 3%
    c) 5%
    d) 10%

ChatGPT: The approximate risk of having a child with Down syndrome for a 40-year-old woman is d) 10%. As women age, the risk of having a child with Down syndrome increases. At age 40, the risk is approximately 1 in 100 or 10%. This is compared to a risk of approximately 1 in 1,250 or 0.08% at age 25.

Bard: The correct answer is b) 3%. The risk of having a child with Down syndrome increases with the mother’s age. For a 40-year-old woman, the risk is about 1 in 100 (3%). This means that about 3 out of every 100 babies born to a 40-year-old woman will have Down syndrome.

Claude Instant: For a 40-year-old woman, the approximate risk of having a child with Down syndrome is about 1 in 50 or 2%. So the closest answer is b) 3%. The risk of Down syndrome increases with the mother’s age. At age 25, the risk is about 1 in 1,250. At age 35, the risk is about 1 in 400. By age 40, the risk increases to about 1 in 100. And by age 45, the risk is about 1 in 30.

Claude+: The risk of having a child with Down syndrome increases with maternal age. For a 40-year-old woman, the risk is approximately: c) 5% The risk of Down syndrome roughly doubles every 5 to 6 years after age 30: Under 30 years old: 0.1% At age 35: 1% At age 40: About 2% to 5% At age 45: About 5% to 10% So for a 40-year old woman, the approximate risk is in the ballpark of 5%, making C the closest option.

Detection of weak questions using Bing

Among the 23 incorrect answers of Bing, three questions were unclearly written and two were not in line with current literature. An example of a detected weak question is one concerning renal replacement therapy: “Complete. Renal function replacement therapy is indicated: a) in any symptomatic patient with an eGFR <15 ml/min/1.73m2. b) only in patients under 65 years of age. c) in anyone with an eGFR <6 ml/min/1.73m2. d) only when urea is elevated”. Bing answered “a)”. After reviewing the current literature, the authors judge that an eGFR below 15 is indeed a commonly used cut-off value for starting renal replacement therapy, but it is not the only reason to start dialysis. Because statement a contains “any”, Bing’s answer is wrong, but the authors understand why Bing gave this answer and why a student might give this answer as well. The same argument applies to answer c, which is supposed to be the correct answer. Moreover, the eGFR cut-off of six is odd. This question needs improvement.

Interrater variability among the studied bots

For 34 questions (36%), all bots agreed. Fleiss’ Kappa for all raters was 0.54 (moderate agreement). The agreement between ChatGPT and GPT-4 was the highest (Cohen’s Kappa = 0.66, substantial agreement). The agreement between Bing and Bard was the lowest (Cohen’s Kappa = 0.48, moderate agreement).

Discussion

In this study, significant differences in the performance of publicly available AI chatbots on the Antwerp Medical License Exam were found. GPT-4 and Bing scored the best, but Bing turned out to be more reliable as it produced fewer wrong answers. This performance is in line with previous research [15–17]. An ensemble bot combining all tested bots scored equally well, so we cannot recommend its use based on the current study. The proportion of hallucinations was much lower for Bing than for Bard and Claude Instant.

The improvement of these new bots, both in scores and in the proportion of hallucinations, is impressive. It might, however, increase risk: as the bots (in general) answer more correctly, users will place more confidence in wrong or even dangerous answers. The risk of replicating biases in the data on which these models are trained remains. Other authors have already pointed out the implication of these results: bots can pass exams, but this does not make them medical doctors, as that requires far more capacities than the reproduction of knowledge alone. The current study raises the question of whether a multiple-choice exam is a useful way to assess the competencies modern doctors need (mostly concerning human interactions) [27]. Bing performed equally to GPT-4 but with fewer wrong answers, so currently it is not worth paying for a bot in order to test a medical exam, nor is it useful to create an ensemble bot based on the mode of all bots’ answers. Ensemble bots based on more complex rules than just the mode of all answers should be studied further.

We can recommend the use of Bing to detect weak questions among the wrong answers. This is a time-efficient way to improve the quality of a multiple-choice exam. In this study, the labour-intensive work of discussing and revising questions was narrowed down from all 95 included questions to the 23 questions on which Bing answered incorrectly. Bing’s argumentation was used to check these questions. Machine translating, inputting into Bing and recording the answers for the entire exam took about two working hours. Three questions were improved for future exams. Further research on the efficiency of this method is necessary.

The trend we found towards better bot performance on easy questions is in line with previous research [13]. However, the difference in performance between students and bots was large for difficult questions and absent for easy questions. This compelling new finding demands further research. Maybe bots are most useful in those situations that are difficult for humans?

The lack of a significant difference in performance between positive and negative questions, and between clinical vignettes and theory questions, needs confirmation on larger datasets and on other exams. The result for clinical vignettes has been reported before [12].

Next to the field of medical education, bots might also be useful in clinical practice [28,29]. Numerous authors in various fields have tried to pose clinical questions. The results are variable, but all authors conclude that thus far AI cannot compete with a real doctor [30–34]. In a study on paediatric emergencies, for example, ChatGPT/GPT-4 reliably advised calling emergency services in only 54% of the cases, gave correct first aid instructions in 45%, and incorrectly advised advanced life support techniques to parents in 13.6% [35]. However, some companies are developing new AI tools that might assist clinicians. Google’s medicine-specific large language model, Med-PaLM, delivered highly accurate answers to multiple-choice and long-form medical questions but fell short of clinicians’ responses to those queries [36,37]. The aim of this study was not to assess this aspect, but by coincidence we noticed that in some cases bots refuse to answer because they are not medical doctors. The creators of all studied bots try, to a certain extent, to prevent their bots from being used as a medical doctor. None of the tested bots succeeded, as none refused to answer all clinical case questions. Only Claude+ and Claude Instant refused (at times) to answer the question and closed the conversation; for all other bots, users can try to persuade them to answer the question anyhow. This finding was most compelling for Bard: after entering the same questions repeatedly, Bard did answer in nine out of thirteen cases.

The rise of generative AI also raises many ethical and legal issues: the enormous energy consumption, the use of data sources without permission, the use of sources protected by copyright, the lack of reporting guidelines and many more. Before widely implementing AI in medical exams, more legislation and knowledge on these topics are necessary [38,39].

The strengths of this study mainly concern its novelty: a comparison of six different bots had not been published yet. The bots tested are available to the public, so our methodology can easily be reused. This study, however, has several limitations as well. It concerned only one exam with a moderately sized set of questions. No usable definition of hallucinations, nor a validated approach to detect them, was available at the time of writing. The definition we used (chatbot-generated content that either contains clear reasoning errors or is untruthful in relation to the current evidence-based medical literature) might inspire other authors, although we found that a distinction between reasoning errors and untruthful statements was not feasible. The exclusion of table, local and image questions limits the comparison with real students. Future bots will most likely be able to process such questions as well. Finally, the exam was translated into English to make the current paper understandable for a broad audience. Further research in other languages is necessary.

Conclusion

Six generative AI chatbots passed the Antwerp multiple-choice exam necessary for obtaining a license as a medical doctor. Bing (and to a lesser extent GPT-4) outperformed all other bots and the students. Bots performed worse on difficult questions but outperformed students on those questions by an even larger margin. Bing can be used to detect weak multiple-choice questions. Creators should improve their bots’ algorithms if they do not want them to be used as a tool for medical advice.

Supporting information

S1 Table. Responses from all selected bots on an example question.

(DOCX)

S1 Data. Selected Study Data. Study data excluding selected columns. See Data Availability Statement for more information.

(XLSX)

S2 Data. Study Data Variables Overview. An overview of the properties of all variables used in file S1 Data.

(DOCX)

Acknowledgments

The authors would like to thank Professor David Martens for proofreading this manuscript.

Data Availability

The questions of this exam cannot be made publicly available because they will be used again in future exams. Consequently, the authors cannot share all the AI responses. Access to the study data can be requested by contacting fampop@uantwerpen.be and will be granted as long as the requestor can guarantee that the data will not be made publicly available and that no students will have access to them. As supplementary material, we do provide a datasheet with our raw data excluding the answers and the questions (S1 Data and S2 Data). Individual student results, even anonymised, will never be shared, as it is impossible to ask permission from all students.

Funding Statement

The author(s) received no specific funding for this work.

References

  • 1.Rudolph J, Tan S, Tan S. ChatGPT: Bullshit spewer or the end of traditional assessments in higher education? Journal of Applied Learning and Teaching. 2023;6(1). [Google Scholar]
  • 2.Chatterjee J, Dethlefs N. This new conversational AI model can be your friend, philosopher, and guide… and even your worst enemy. Patterns. 2023;4(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Kung TH, Cheatham M, Medinilla A, ChatGPT, Sillos C, De Leon L, et al. Performance of ChatGPT on USMLE: Potential for AI-Assisted Medical Education Using Large Language Models. medRxiv. 2022:2022.12.19.22283643. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Mohammad B, Supti T, Alzubaidi M, Shah H, Alam T, Shah Z, et al. The Pros and Cons of Using ChatGPT in Medical Education: A Scoping Review. Studies in health technology and informatics. 2023;305:644–7. doi: 10.3233/SHTI230580 . [DOI] [PubMed] [Google Scholar]
  • 5.Ji Z, Lee N, Frieske R, Yu T, Su D, Xu Y, et al. Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12):1–38. [Google Scholar]
  • 6.Lum ZC. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clinical Orthopaedics and Related Research. 2022:10.1097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Huh S. Are ChatGPT’s knowledge and interpretation ability comparable to those of medical students in Korea for taking a parasitology examination?: a descriptive study. J Educ Eval Health Prof. 2023;20(1). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology. 2023:230582. doi: 10.1148/radiol.230582 [DOI] [PubMed] [Google Scholar]
  • 9.Morreel S, Mathysen D, Verhoeven V. Aye, AI! ChatGPT passes multiple-choice family medicine exam. Med Teach. 2023;45(6):665–6. Epub 20230311. doi: 10.1080/0142159X.2023.2187684 . [DOI] [PubMed] [Google Scholar]
  • 10.Li SW, Kemp MW, Logan SJ, Dimri PS, Singh N, Mattar CN, et al. ChatGPT outscored human candidates in a virtual objective structured clinical examination in obstetrics and gynecology. American Journal of Obstetrics and Gynecology. 2023. [DOI] [PubMed] [Google Scholar]
  • 11.Subramani M, Jaleel I, Krishna Mohan S. Evaluating the performance of ChatGPT in medical physiology university examination of phase I MBBS. Advances in Physiology Education. 2023;47(2):270–1. doi: 10.1152/advan.00036.2023 [DOI] [PubMed] [Google Scholar]
  • 12.Weng TL, Wang YM, Chang S, Chen TJ, Hwang SJ. ChatGPT failed Taiwan’s Family Medicine Board Exam. J Chin Med Assoc. 2023;86(8):762–6. Epub 20230609. doi: 10.1097/JCMA.0000000000000946 . [DOI] [PubMed] [Google Scholar]
  • 13.Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. 2023;86(7):653–8. Epub 20230705. doi: 10.1097/JCMA.0000000000000942 . [DOI] [PubMed] [Google Scholar]
  • 14.Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a Radiology Board-style Examination: Insights into Current Strengths and Limitations. Radiology. 2023;307(5):e230582. doi: 10.1148/radiol.230582 [DOI] [PubMed] [Google Scholar]
  • 15.Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial Intelligence in Ophthalmology: A Comparative Analysis of GPT-3.5, GPT-4, and Human Expertise in Answering StatPearls Questions. Cureus. 2023;15(6):e40822. Epub 20230622. doi: 10.7759/cureus.40822 ; PubMed Central PMCID: PMC10362981. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Sullivan PLZ, et al. Performance of ChatGPT, GPT-4, and Google bard on a neurosurgery oral boards preparation question bank. Neurosurgery. 2022:10.1227. [DOI] [PubMed] [Google Scholar]
  • 17.Oh N, Choi G-S, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Annals of Surgical Treatment and Research. 2023;104(5):269. doi: 10.4174/astr.2023.104.5.269 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 18.Oh N, Choi GS, Lee WY. ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models. Ann Surg Treat Res. 2023;104(5):269–73. Epub 20230428. doi: 10.4174/astr.2023.104.5.269 ; PubMed Central PMCID: PMC10172028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312. doi: 10.2196/45312 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Rashid H, Coppola KM, Lebeau R. Three Decades Later: A Scoping Review of the Literature Related to the United States Medical Licensing Examination. Acad Med. 2020;95(11S Association of American Medical Colleges Learn Serve Lead: Proceedings of the 59th Annual Research in Medical Education Presentations):S114–s21. doi: 10.1097/acm.0000000000003639 . [DOI] [PubMed] [Google Scholar]
  • 21.Mehdi Y. Confirmed: the new Bing runs on OpenAI’s GPT-4. 2023 [accessed 09/08/2023]. [Google Scholar]
  • 22.Miller MD, Linn RL. Measurement and assessment in teaching. 11th ed. Boston: Pearson; 2013. xviii, 538 p. [Google Scholar]
  • 23.Dietterich TG. Ensemble Methods in Machine Learning. Berlin, Heidelberg: Springer Berlin Heidelberg; 2000. [Google Scholar]
  • 24.Polikar R. Ensemble Learning. In: Zhang C, Ma Y, editors. Ensemble Machine Learning: Methods and Applications. New York, NY: Springer New York; 2012. p. 1–34. [Google Scholar]
  • 25.OpenAI. GPT-4 technical report. arXiv. 2023:2303.08774.
  • 26.Centers for Disease Control and Prevention. Key Facts About Seasonal Flu Vaccine. 2022 [accessed 11/08/2023]. [Google Scholar]
  • 27.Mbakwe AB, Lourentzou I, Celi LA, Mechanic OJ, Dagan A. ChatGPT passing USMLE shines a spotlight on the flaws of medical education. PLOS Digit Health. 2023;2(2):e0000205. Epub 20230209. doi: 10.1371/journal.pdig.0000205 ; PubMed Central PMCID: PMC9931307. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Cascella M, Montomoli J, Bellini V, Bignami E. Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios. Journal of Medical Systems. 2023;47(1):33. doi: 10.1007/s10916-023-01925-4 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Sallam M. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel). 2023;11(6). Epub 20230319. doi: 10.3390/healthcare11060887 ; PubMed Central PMCID: PMC10048148. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Temel MH, Erden Y, Bağcıer F. Information Quality and Readability: ChatGPT’s Responses to the Most Common Questions About Spinal Cord Injury. World Neurosurg. 2023. Epub 20231122. doi: 10.1016/j.wneu.2023.11.062 . [DOI] [PubMed] [Google Scholar]
  • 31.Fournier A, Fallet C, Sadeghipour F, Perrottet N. Assessing the Applicability and Appropriateness of ChatGPT in Answering Clinical Pharmacy Questions. Ann Pharm Fr. 2023. Epub 20231120. doi: 10.1016/j.pharma.2023.11.001 . [DOI] [PubMed] [Google Scholar]
  • 32.Barclay KS, You JY, Coleman MJ, Mathews PM, Ray VL, Riaz KM, et al. Quality and Agreement With Scientific Consensus of ChatGPT Information Regarding Corneal Transplantation and Fuchs Dystrophy. Cornea. 2023. Epub 20231128. doi: 10.1097/ICO.0000000000003439 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Pagano S, Holzapfel S, Kappenschneider T, Meyer M, Maderbacher G, Grifka J, et al. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. 2023;24(1):61. Epub 20231128. doi: 10.1186/s10195-023-00740-4 . [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 34.Daher M, Koa J, Boufadel P, Singh J, Fares MY, Abboud JA. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int. 2023;7(6):2534–41. Epub 20230904. doi: 10.1016/j.jseint.2023.07.018 ; PubMed Central PMCID: PMC10638599. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35.Bushuven S, Bentele M, Bentele S, Gerber B, Bansbach J, Ganter J, et al. "ChatGPT, Can You Help Me Save My Child’s Life?"—Diagnostic Accuracy and Supportive Capabilities to Lay Rescuers by ChatGPT in Prehospital Basic Life Support and Paediatric Advanced Life Support Cases—An In-silico Analysis. J Med Syst. 2023;47(1):123. Epub 20231121. doi: 10.1007/s10916-023-02019-x ; PubMed Central PMCID: PMC10663183. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, et al. Publisher Correction: Large language models encode clinical knowledge. Nature. 2023;620(7973):E19-E. doi: 10.1038/s41586-023-06455-0 [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Harris E. Large Language Models Answer Medical Questions Accurately, but Can’t Match Clinicians’ Knowledge. JAMA. 2023;330(9):792–4. doi: 10.1001/jama.2023.14311 [DOI] [PubMed] [Google Scholar]
  • 38.van Dis EAM, Bollen J, Zuidema W, van Rooij R, Bockting CL. ChatGPT: five priorities for research. Nature. 2023;614(7947):224–6. doi: 10.1038/d41586-023-00288-7 . [DOI] [PubMed] [Google Scholar]
  • 39.Cacciamani GE, Collins GS, Gill IS. ChatGPT: standard reporting guidelines for responsible use. Nature. 2023;618(7964):238. doi: 10.1038/d41586-023-01853-w . [DOI] [PubMed] [Google Scholar]
PLOS Digit Health. doi: 10.1371/journal.pdig.0000349.r001

Decision Letter 0

Imon Banerjee

22 Nov 2023

PDIG-D-23-00311

Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

PLOS Digital Health

Dear Dr. Morreel,

Thank you for submitting your manuscript to PLOS Digital Health. After careful consideration, we feel that it has merit but does not fully meet PLOS Digital Health's publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript within 30 days Dec 22 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at digitalhealth@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pdig/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

* A rebuttal letter that responds to each point raised by the editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

* A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

* An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

We look forward to receiving your revised manuscript.

Kind regards,

Imon Banerjee

Section Editor

PLOS Digital Health

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article's retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Please address primary comments of the reviewers including primary contribution, proper literature review.

[Note: HTML markup is below. Please do not edit.]

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

3. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

--------------------

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: Thank you for the opportunity to review the manuscript titled "AI Chatbots in the Antwerp Medical License Exam." The paper endeavors to shed light on the performance capabilities of several publicly available AI chatbots in the realm of medical knowledge assessments. With the rapid growth and adoption of AI tools in various sectors, understanding their potential utility, strengths, and weaknesses in the field of medicine becomes imperative. Utilizing the University of Antwerp's Medical License Exam, the authors critically analyze the performance of six different chatbots, with particular attention to correct answers, hallucinations, and weaknesses in multiple-choice questions. The findings aim to provide clarity on the feasibility and potential roles of AI in medical examinations and its broader implications on medical education.

Here are some minor comments that will improve the manuscript:

Comment 1:

The ensemble bot's introduction in line 133 requires further elaboration. The citation provided broadly addresses ensemble methods. For a clearer comprehension of the study's scope, it would be beneficial to offer an overview of this ensemble bot's specifics, including how to use it, and its primary functionalities. Further, its role and comparative significance in relation to other bots merit some discussion.

Comment 2:

In line 265, the statement alluding to the creators' intentions to restrict chatbots from being utilized as medical doctors could use some evidential support. A citation or reference validating this assertion would enhance the credibility of the claim. Additionally, considering the advancements in AI medical applications, such as Google's MedPalm, it might be worthwhile to delve into the potential aspirations of some companies to position their chatbots as medical consultants.

Comment 3:

In line 196 I noticed an inconsistency in the reporting of p-values within the manuscript. Specifically, the use of a percentage sign after the p-value in the line "P=0.06%" is atypical. Generally, p-values are reported as decimals, such as "P=0.0006" or "P<0.01". I recommend standardizing the presentation of p-values throughout the manuscript for clarity and consistency.

Comment 4:

The statement in the conclusion, "Bots should improve their algorithm if they do not want to be used as a medical," could be further refined for clarity, perhaps by rephrasing it to "used in a medical capacity."

Comment 5:

The suggestion that Bing can serve as a tool for identifying weaker questions is intriguing. Elaborating on the mechanics of this process and its efficiency relative to other methods would further solidify this proposition.

Reviewer #2: Good work. 1-Improve your references to the literature, have a literature review section. Inform your reader what other work did before and what it is your work is adding to the literature.

2- Organization: what the difference is between (overall) outcome and (detailed) results sections may be confusing to the reader.

--------------------

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Prof. Arya Rahgozar

--------------------

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLOS Digit Health. doi: 10.1371/journal.pdig.0000349.r003

Decision Letter 1

Imon Banerjee

10 Jan 2024

Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam

PDIG-D-23-00311R1

Dear Dr. Morreel,

We are pleased to inform you that your manuscript 'Microsoft Bing outperforms five other generative artificial intelligence chatbots in the Antwerp University multiple choice medical license exam' has been provisionally accepted for publication in PLOS Digital Health.

Before your manuscript can be formally accepted you will need to complete some formatting changes, which you will receive in a follow-up email from a member of our team. 

Please note that your manuscript will not be scheduled for publication until you have made the required changes, so a swift response is appreciated.

IMPORTANT: The editorial review process is now complete. PLOS will only permit corrections to spelling, formatting or significant scientific errors from this point onwards. Requests for major changes, or any which affect the scientific understanding of your work, will cause delays to the publication date of your manuscript.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they'll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact digitalhealth@plos.org.

Thank you again for supporting Open Access publishing; we are looking forward to publishing your work in PLOS Digital Health.

Best regards,

Imon Banerjee

Section Editor

PLOS Digital Health

***********************************************************

I would like to thank the author for addressing reviewer comments. Please correct the grammar and spelling errors before publications.

Reviewer Comments (if any, and for reference):

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #2: All comments have been addressed

**********

2. Does this manuscript meet PLOS Digital Health’s publication criteria? Is the manuscript technically sound, and do the data support the conclusions? The manuscript must describe methodologically and ethically rigorous research with conclusions that are appropriately drawn based on the data presented.

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available (please refer to the Data Availability Statement at the start of the manuscript PDF file)?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception. The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #2: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS Digital Health does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #2: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #2: Thank you for trying to address the feedback. There are some tiny English issues across, for example there is an extra "to" in line 313, remove. Use editor tools to detect and correct.

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

Do you want your identity to be public for this peer review? If you choose “no”, your identity will remain anonymous but your review may still be made public.

For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #2: No

**********
