Abstract
Objective
Large language models (LLMs) such as ChatGPT are being developed for use in research, medical education and clinical decision systems. However, as their usage increases, LLMs face ongoing regulatory concerns. This study aims to analyse ChatGPT’s performance on a postgraduate examination to identify areas of strength and weakness, which may provide further insight into the role of LLMs in healthcare.
Design
We evaluated the performance of ChatGPT 4 (24 May 2023 version) on official MRCP (Membership of the Royal College of Physicians) parts 1 and 2 written examination practice questions. Statistical analysis was performed using Python. Spearman rank correlation assessed the relationship between the probability of correctly answering a question and two variables: question difficulty and question length. Incorrectly answered questions were analysed further using a clinical reasoning framework to assess the errors made.
Setting
Online using ChatGPT web interface.
Primary and secondary outcome measures
Primary outcome was the score (percentage questions correct) in the MRCP postgraduate written examinations. Secondary outcomes were qualitative categorisation of errors using a clinical decision-making framework.
Results
ChatGPT achieved accuracy rates of 86.3% (part 1) and 70.3% (part 2). Weak but significant correlations were found between ChatGPT’s accuracy and both just-passing rates in part 2 (r=0.34, p=0.0001) and question length in part 1 (r=−0.19, p=0.008). Eight types of error were identified, with the most frequent being factual errors, context errors and omission errors.
Conclusion
ChatGPT performance greatly exceeded the passing mark for both exams. Multiple choice examinations provide a benchmark for LLM performance which is comparable to human demonstrations of knowledge, while also highlighting the errors LLMs make. Understanding the reasons behind ChatGPT’s errors allows us to develop strategies to prevent them in medical devices that incorporate LLM technology.
Keywords: Health informatics, Clinical Decision-Making, MEDICAL EDUCATION & TRAINING, QUALITATIVE RESEARCH
STRENGTHS AND LIMITATIONS OF THIS STUDY
Analysing ChatGPT’s performance within an SBA examination provides an objective assessment of its clinical knowledge and reasoning abilities.
This study builds on previous work comparing ChatGPT performance on question banks by analysing the number and classification of errors made by ChatGPT, which may provide insight into the role and limitations of LLMs within healthcare.
Evaluating responses from ChatGPT using the framework and language derived from the established literature on cognitive biases and clinical reasoning provides a structure for users to evaluate responses for potential errors using concepts familiar to clinicians.
The classification of errors was performed by consensus between two independent authors using a predetermined set of potential errors. There may be additional errors present which were not captured by the predetermined list. Without an automated method of detecting and classifying errors, the generalisability and credibility of the results may be limited.
A large proportion of questions needed to be excluded from our analysis due to containing images or ECGs.
Introduction
The application of large language models (LLMs) in healthcare has sparked widespread interest in the medical community. LLMs are artificial neural networks that are trained on massive textual datasets, allowing them to comprehend and generate human-like language patterns. One of the prominent examples of LLMs is GPT (Generative Pre-trained Transformer), developed by OpenAI.1
GPT has been developed in a number of iterations, building from version 1 in 2018, trained on a structured collection of text with around 117 million parameters, to version 3.5 in 2022, which used 175 billion parameters and added a web chat interface (‘ChatGPT’). Most recently, version 4 in 2023 uses reinforcement learning from human feedback, fine-tuning of responses and further adversarial training, and is rumoured to have over 1 trillion parameters.2
The potential roles for LLMs within healthcare have been expounded,3 4 with potential applications in research, medical education, information transfer and retrieval, as well as in clinical decision systems.
The chatbot feature of ChatGPT allows for interactive conversations with users, enabling successive replies to refine and improve its responses. This iterative feedback loop helps the model learn from its mistakes and continuously enhance its performance over time. As a result, ChatGPT can provide increasingly accurate and relevant answers to users’ inquiries.
While ChatGPT has not been specifically trained for medical applications, earlier this year ChatGPT made headlines when it was shown to have performed at a passing level for the US Medical Licensing Examination (USMLE) in some domains.5 6 The USMLE is a comprehensive assessment that evaluates medical students’ and graduates’ readiness to practice medicine in the USA. It encompasses various question formats, including single-best answer (SBA) questions and short-answer questions, making it a challenging milestone to have surpassed.
Following its success in the USMLE, ChatGPT’s capabilities have been further explored and compared in various medical assessments. Recent publications have delved into its proficiency in a wide range of evaluations from medical student exams all the way through to specialty certificate examinations which are taken prior to completion of postgraduate training in the UK (table 1). It should be noted that many of the assessments investigated use practice questions produced by third parties, rather than official questions produced by the examining board/college.
Table 1.
Performance of ChatGPT and other large language models on medical examinations
| Model | Country | Examination | Score | Pass mark | Result |
|---|---|---|---|---|---|
| Galactica | USA | USMLE (MedQA)23 | 52.9% | 60% | Fail |
| Flan-PaLM | USA | USMLE (MedQA)9 | 67.6% | 60% | Pass |
| ChatGPT 3.5 | USA | USMLE5 | 44%–64.4% | 60% | Mixed |
| ChatGPT 3.5 | USA | USMLE6 | 42.1%–65.2% | 60% | Mixed |
| ChatGPT 3.5 | USA | American Heart Association life support exams24 | 68%–76.3% | 84% | Fail |
| ChatGPT 3.5 & 4 | USA | Plastic Surgery In-Service Training Exam25 | ChatGPT 3.5: 3rd (2021) and 8th (2022) decile; ChatGPT 4: 99th (2021) and 88th (2022) decile | – | – |
| ChatGPT 4 | USA | USMLE12 | 85% | 60% | Pass |
| ChatGPT 3.5 | UK | General Practitioner (GP) AKT26 | 60.17% | 70.42% | Fail |
| ChatGPT 3.5 & 4 | USA | Ophthalmology Board Exam27 | ChatGPT 3.5: 63.1%; ChatGPT 4: 76.9% | 65% | ChatGPT 3.5: Fail; ChatGPT 4: Pass |
| Med-PaLM 2 | USA | USMLE (MedQA)17 | 86.5% | 60% | Pass |
| ChatGPT 3.5 and 4 | USA | Neurosurgical Board Exam28 | ChatGPT 3.5: 62.4%; ChatGPT 4: 85.2% | – | – |
| ChatGPT 3.5 | UK | FRCA Primary29 | 69.7% | 71.3% | Fail |
| ChatGPT 3.5 and 4 | UK | Dermatology SCE30 | ChatGPT 3.5: 63.1%; ChatGPT 4: 90.5% | 70%–72% | ChatGPT 3.5: Fail; ChatGPT 4: Pass |
| ChatGPT 3.5 and 4 | UK | Neurology SCE31 | ChatGPT 3.5: 57%; ChatGPT 4: 64% | 58% | ChatGPT 3.5: Fail; ChatGPT 4: Pass |
| ChatGPT 3.5 | USA | Neonatal Board Exams32 | 45.3% | – | – |
AKT, Applied Knowledge Test; FRCA, Fellow of the Royal College of Anaesthetists; SCE, Specialty Certificate Examination; USMLE, United States Medical Licensing Examination.
ChatGPT as a medical device
The use of LLMs in medical applications is currently restricted due to various regulatory concerns, including their lack of explainability, potential biases and the need for robust oversight.7 Nevertheless, we must determine how the medical accuracy and performance of LLMs may be best assessed in an objective and structured manner prior to use in clinical settings.
While existing studies have compared the efficacy of ChatGPT against question banks as benchmarks (table 1) and used automated natural language processing measures to compare open-answer responses,8 limited work has been performed in evaluating the reasons for inaccuracy. However, work on other LLMs has examined the scientific accuracy, reasoning and potential for harm and bias.9 10
UK Membership of the Royal College of Physicians examinations
The UK Membership of the Royal College of Physicians (MRCP) is a series of postgraduate assessments required for progression of doctors to higher specialty training in internal medical specialties.11 It consists of two written SBA exams (parts 1 and 2) and a practical assessment of clinical skills (PACES). Part 1 focuses on recall of basic medical science knowledge and clinical science, while part 2 integrates clinical knowledge with interpretation of clinical scenarios. PACES integrates practical assessment of clinical examination skills with the ability to recognise signs and form a diagnosis and management plan.
MRCP SBA questions are an ideal assessment tool to gauge ChatGPT’s medical knowledge and evaluative abilities. These questions often involve complex clinical scenarios that require a synthesis of medical knowledge, critical thinking and decision-making, often through multistep reasoning. This draws parallels with the clinical landscape, where there is often a large amount of patient information to assimilate and a degree of uncertainty to navigate. By offering one correct answer alongside other plausible distractors, SBAs also test skills of clinical judgement and prioritisation. By subjecting ChatGPT to a curated set of MRCP SBA questions, we aimed to highlight any areas of weakness and analyse these to identify the sources of the error or potential bias. This could potentially inform the model’s developers and medical educators about specific clinical scenarios or decision-making domains where further improvement is needed. This process is analogous to training medical students and doctors, where identifying areas of weakness enables targeted interventions to enhance competency.
Methods
We assessed the performance of ChatGPT 4 (24 May 2023 version) on the official practice questions for MRCP parts 1 and 2 produced by MRCPUK.11 The question and any supplied information (laboratory test results, descriptions of imaging, endoscopy or histology) were used directly as the chat prompt without any further supplement to the prompt (‘zero-shot’ prompting). ChatGPT does not handle images, so questions containing an image were input without the image included. The answer that ChatGPT indicated was correct was selected. If ChatGPT did not express a clear choice of answer or otherwise failed to process the question, no answer was selected and the question was marked incorrect.
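The study used the ChatGPT web interface, so no code was involved in prompting; as a minimal sketch for readers who want to reproduce a similar zero-shot workflow programmatically, the snippet below submits a question verbatim via the OpenAI Python SDK. The model identifier and helper function are illustrative assumptions, not part of the original method, and the API-served model may differ from the web version used in the study.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_exam_question(question_text: str) -> str:
    """Submit one SBA question verbatim as a zero-shot prompt (no added instructions)."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study used the 24 May 2023 ChatGPT 4 web version
        messages=[{"role": "user", "content": question_text}],
    )
    return response.choices[0].message.content
```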
Microsoft Excel was used for data entry and Python with scientific and statistical libraries for analysis and graphing.
Questions containing images or ECGs as part of the prompt were excluded from all further quantitative and qualitative analysis. The remaining questions were classified by one of the authors (SM) into 1 of 15 MRCP curriculum medical specialties according to the specialty of the correct answer. Part 2 questions include data on the percentage of just-passing (as defined by MRCP) candidates who correctly answered the question the last time it was used. These data were used as a measure of the comparative difficulty of a question to assess the correlation (Spearman rank) between difficulty and likelihood of ChatGPT answering correctly.
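As a minimal sketch of this correlation analysis, assuming one record per part 2 question holding the just-passing percentage and whether ChatGPT answered correctly (the variable names and values below are illustrative, not the study data), the Spearman rank correlation can be computed with SciPy:

```python
from scipy.stats import spearmanr

# One entry per part 2 question: percentage of just-passing candidates who
# answered it correctly (difficulty proxy) and whether ChatGPT was correct (1/0).
# Values are illustrative only.
just_passing_pct = [34.0, 58.5, 71.2, 45.0, 88.3, 62.7, 50.1, 93.4]
chatgpt_correct = [0, 1, 1, 0, 1, 1, 0, 1]

rho, p_value = spearmanr(just_passing_pct, chatgpt_correct)
print(f"Spearman r = {rho:.2f}, p = {p_value:.4f}")
```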
To assess whether any of the MRCP practice questions were in ChatGPT’s training data set, the Memorisation Effects Levenshtein Detector (MELD)12 was applied. This involves splitting text into two parts, inputting half into ChatGPT and asking it to complete the second half. This output is then compared with the actual second half using a text similarity metric (Levenshtein distance ratio, via Python library levenshtein13). We performed this on each of the part 1 and part 2 questions.
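A minimal sketch of this memorisation check, assuming a simple split at the midpoint of each question (the split rule and function name are illustrative; the full procedure follows Nori et al12), uses the levenshtein library’s13 normalised similarity ratio:

```python
import Levenshtein  # Python package 'levenshtein' (maxbachmann/Levenshtein)

def meld_ratio(question_text: str, model_completion: str) -> float:
    """Compare the model's attempted continuation of a question with the true
    second half; a ratio near 1.0 (>0.95) would suggest memorisation."""
    midpoint = len(question_text) // 2
    true_second_half = question_text[midpoint:]
    # model_completion is the text returned when the model is prompted with
    # question_text[:midpoint] and asked to continue the passage verbatim.
    return Levenshtein.ratio(true_second_half, model_completion)
```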
All questions which ChatGPT answered incorrectly were collated for further qualitative analysis. Two authors (AM and RF) independently evaluated each question and identified the number and type of errors made by ChatGPT. The errors were categorised using a predetermined list adapted from established literature describing types of cognitive bias and clinical reasoning errors made by humans when performing clinical reasoning tasks.14 Additional errors specific to the evaluation of responses from an LLM were also included, for example, confabulation. Where the categorisation of errors was discordant between the two independent evaluators, a discussion took place to reach a consensus. If the authors failed to meet a consensus then the overall decision would be made by a third independent author.
Patient and public involvement
Neither patients nor the public were involved in our research.
Results
Quantitative analysis
For part 1, ChatGPT provided responses for 196 out of 197 questions, achieving an accuracy rate of 86.3% with 170 questions answered correctly (figure 1). In part 2, it provided responses for 127/128 questions, correctly answering 9/22 (40.9%) image-based questions and 81/106 (76.4%) non-image questions.
Figure 1.
Study flow chart.
The average question length differed significantly between the exams (two-sample t-test, p<0.005), with a mean of 985 characters in part 1 (SD=403.1) and 1649 characters in part 2 (SD=20.5). ChatGPT’s performance was significantly worse in part 2 compared with part 1 (p=0.01).
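The t-test on question lengths is named in the text, but the test behind the p=0.01 accuracy comparison is not; the sketch below shows one plausible reproduction using SciPy and statsmodels, with simulated lengths matching the reported means/SDs and a two-proportion z-test on the reported totals. Both the simulated data and the choice of z-test are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

rng = np.random.default_rng(0)
# Simulated character counts matching the reported means/SDs (illustrative only).
part1_lengths = rng.normal(loc=985, scale=403.1, size=197)
part2_lengths = rng.normal(loc=1649, scale=20.5, size=128)
t_stat, p_length = ttest_ind(part1_lengths, part2_lengths)

# Accuracy comparison using the reported totals (170/197 for part 1, 90/128 for
# part 2); whether the authors compared these totals or another subset is not stated.
z_stat, p_accuracy = proportions_ztest(count=[170, 90], nobs=[197, 128])

print(f"length t-test p = {p_length:.3g}; accuracy z-test p = {p_accuracy:.3g}")
```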
A weak but highly significant correlation (r=0.34, p=0.0001) was observed between the percentage of just-passing candidates correctly answering a part 2 question and ChatGPT’s probability of answering it correctly. A further weak but significant negative correlation existed between question length and the probability of ChatGPT answering a question correctly for part 1 questions (r=−0.19, p=0.008), but this was not the case for part 2 questions (r=0.05, p=0.575).
A breakdown of performance by question specialty is shown in figure 2. ChatGPT did not perform significantly worse in any specialty compared with the others.
Figure 2.
ChatGPT performance on MRCP exam questions grouped by specialty. MRCP, Membership of the Royal College of Physicians.
To test whether any questions existed in ChatGPT’s training set, we measured the MELD ratio. If a question had been memorised, its MELD ratio would be expected to exceed 95%. However, the average MELD ratio across the sampled questions was only 24.4% (SD 7.4%), with the highest ratio among all questions being 33.7%, indicating that no sampled questions existed in the training set.
Qualitative analysis
Two authors (AM and RF) independently reviewed the 52 questions where ChatGPT answered incorrectly. Prior to review, definitions were agreed for categories of error in the response which were adapted from Croskerry and Rylander and Guerrasio,14 15 as outlined in box 1. In addition, a few categories of error were created to highlight specific issues related to evaluating responses from LLMs, for example, confabulation. Of the 52 questions which were answered incorrectly, the raters (AM and RF) initially agreed on 32 answers, with consensus met through discussion for the remainder. An independent rater was, therefore, not required.
Box 1. Clinical reasoning and diagnostic errors.
Anchoring bias
Focusing on a specific diagnosis early in the diagnostic process, based on initial features, and failing to incorporate further contradictory information.
Base-rate neglect
Ignoring or misjudging the true prevalence of a disease and therefore misperceiving the diagnostic likelihood.
Confirmation bias
Acknowledging evidence that confirms a particular diagnosis and ignoring or not looking for contradictory information.
Context errors
Distortion of the importance of clinical information due to the context in which it is presented.
Premature closure
Acceptance of a diagnosis before all necessary confirmatory evidence is provided.
Misunderstanding the question
Not answering the specific question asked but instead answering a different, possibly related, question.
Omission error
Ignoring or not acknowledging a crucial piece of information.
Confabulation (or hallucination)
Inventing information that is not present in the information given.
Factual error
An assertion made that is factually incorrect.
Adapted from Croskerry and Rylander and Guerrasio.14 15
A single error was identified in 41 of the answers, two errors in 9 of the answers, and three errors in 1 question. The types of error are laid out in figure 1.
The examples provided below are illustrative of the types of errors in the ChatGPT responses. The questions and ChatGPT answers in full are laid out in online supplemental file 1.
bmjopen-2023-080558supp001.pdf (154KB, pdf)
Factual error: prescribing penicillin to an allergic patient
Part 1: question 158
This question described a young man with severe cellulitis and a previous anaphylactic reaction to amoxicillin. The lead-in asked, ‘what is the most appropriate empirical treatment?’
ChatGPT noted the ‘history of a severe allergic reaction to amoxicillin’ but reasoned ‘piperacillin with tazobactam is a broad-spectrum antibiotic…suitable choice for severe infections requiring hospitalisation and intravenous antibiotics, like this case.’ This has therefore been categorised as a factual error in clinical reasoning, as a penicillin antibiotic was prescribed despite a clear penicillin allergy being described.
Omission error: inappropriate antihypertensive prescribed by ignoring key information
Part 2: question 116
This question asked ‘what is the most appropriate initial antihypertensive agent?’ in a woman with haematuria, high creatinine, renal asymmetry and intermittent claudication in her leg.
ChatGPT noted ‘hypertension and kidney disease as evidenced by the elevated serum creatinine and haematuria.’ However, it did not mention the intermittent claudication or renal asymmetry, which both point towards a diagnosis of renal artery stenosis. By omitting these details, it selected ramipril as the answer, which would be contraindicated.
Context error: assuming fungal infection due to immunosuppression
Part 2: question 21
This question described a man on prednisolone and azathioprine (immunosuppression) for microscopic polyangiitis. He presented with chest pain, a productive cough and ‘patchy shadowing at the right base’ on chest X-ray. The question asked ‘what is the most appropriate treatment?’
ChatGPT identified that the clinical and radiological signs were indicative of a ‘lung infection’ but did not select the standard antibiotics for a community-acquired pneumonia. It focused specifically on the background of immunosuppression and therefore selected a regimen that covers for fungal infections, rather than responding to the clinical information presented.
Confabulation error: insisting a normal test result was elevated
Part 1: question 192
This question described a young woman with erythema nodosum and an anterior mediastinal mass. ChatGPT reasons that the most likely diagnosis is sarcoidosis given the ‘elevated serum ACE level’ but the question stem actually states that the ACE level is 78 which is within the provided normal range of 25–82. The false claim of an elevated ACE level is used by ChatGPT to rule out the other diagnosis options, including the correct answer of Hodgkin’s disease.
Not providing an answer to questions
There was one question (part 2, question 50) for which ChatGPT provided no answer and no reasoning, despite multiple attempts at inputting the question. There was one further question (part 1, question 117) which ChatGPT processed but did not answer. This question asked, ‘what is most predictive of completed suicide in the next week?’ and outlined five possible risk factors based on the clinical vignette. ChatGPT described the three most immediate predictors of suicide risk (previous suicide attempts, specific plans to commit suicide and the means to carry out plans), none of which were among the answer choices. It then declined to choose between the options given, stating that ‘none stands out as the single most predictive factor.’
Discussion
Understanding the reasons behind ChatGPT’s errors allows us to develop strategies to prevent them in medical devices that incorporate LLM technology. By addressing and mitigating errors and biases, we can work towards increasing the safety and reliability of LLMs in medical applications.
LLMs are known to ‘hallucinate’ plausible sounding, but false, facts, information and even references.16 This is because they are trained to produce the next token in a sequence (usually analogous to the next word in a sentence), not to establish truth. Previous attempts to benchmark the accuracy of LLMs have focused on subjective ratings by experts of responses given by humans and LLMs to questions, with a clear preference observed for LLMs.9 17 Given that LLMs are trained on textbooks and internet sources, their writing style is inevitably more formal and akin to a textbook in comparison to the variable and potentially conversational explanations given by a human. Therefore, the authoritative-sounding clear responses provided by ChatGPT may be trusted, even though they may be incorrect and potentially even unsafe.
Assessing the performance of LLMs in specific domains essentially becomes a Chinese Room Problem.18 This thought experiment asks how we may distinguish a fluent Chinese translator (‘strong’ AI) from a person with an extensive rulebook on which symbols to output in response to given inputs with no knowledge of Chinese (‘weak’ AI). Differentiating weak from strong AI means measuring the depth of understanding AI can demonstrate. In medical education, various frameworks or learning paradigms are used to evaluate learning and performance, examples include Bloom’s taxonomy,19 Miller’s pyramid20 and Dreyfus and Dreyfus’ developmental framework.21 Applying the same learning paradigms to LLMs provides a consistent framework and language to use when comparing with medical students or clinicians who are in the comparator group.
Measurement of LLM performance needs to be objective with respect to the accuracy and breadth of information conveyed, regardless of the format of the text. Single-best answer questions are used throughout undergraduate and postgraduate examinations as a reliable assessment of knowledge-based competency. Given the likely knowledge-based applications for LLMs, written SBA examinations provide a good benchmark for model performance which is readily comparable to human performance. However, although ChatGPT reached the examination pass mark, the relatively high proportion of factual and context errors is unlikely to be acceptable for a trustworthy clinical decision system.
Furthermore, SBAs do not assess the higher levels of clinical competency essential for a medical registrar. Miller’s pyramid is a hierarchical division of elements of clinical competence with factual recall (‘knows’) and application of knowledge (‘knows how’) ranked below demonstrable behaviours required for clinical competence. These are typically assessed in practical examinations and the workplace environment (‘shows how’ and ‘does’).20 As such, to pass MRCP doctors must also pass a PACES and complete a portfolio of achieved competencies in keeping with the higher tiers of Miller’s pyramid.
ChatGPT is built from, and recalls, vast amounts of text data mined from internet sources. Given that the questions utilised are available online, it is reasonable to question whether ChatGPT simply recalled the answer from its extensive dataset, rather than attempting to understand the clinical vignette. We feel this is unlikely, however, as the practice questions are only available after logging in with an email address and solving a CAPTCHA. Furthermore, the answers are only available after completing the exam and are not otherwise reproduced elsewhere on the internet. This was nonetheless verified by use of MELD, which indicated that none of the sampled questions were present within ChatGPT’s training dataset. It is important to note, however, that MELD is specific for finding data in a training set, but its sensitivity is unknown, since the original training datasets are proprietary and have not been independently tested.
Despite this, it is possible that ChatGPT may be led by cues within questions to select the correct answer (‘painless jaundice’=‘pancreatic cancer’), but this is an exam technique also used by doctors and would be difficult to eliminate. This may also have led to a proportion of the context and omission errors, where ChatGPT appears to focus on certain aspects of the vignette while neglecting other crucial information.
Conclusions
ChatGPT is able to answer MRCP written examination questions, without additional prompts, to a level that would equate with a comfortable pass for a human candidate. However, it is still prone to making errors, and sometimes multiple errors, within its responses. If LLMs are to be useful and trusted as medical devices or in medical education, it is crucial that the frequency and types of errors that are possible are fully understood. This will facilitate iterative improvements in the technology but also allow users to be informed how to use them safely.
Evaluating responses from ChatGPT using the framework and language derived from the established literature on cognitive biases and clinical reasoning provides a structure for users to evaluate responses for potential errors, including the relative likelihood of specific types of error, using concepts familiar to clinicians. This could allow users to critically appraise responses efficiently, using the types of errors described, so that LLMs augment the users’ clinical practice or education while minimising the risk of error.
Study limitations
The approach that we have used requires meticulous appraisal of the justification ChatGPT provides for each answer. To scale over a wider dataset, automated ways of detecting the types of errors we have found should be developed, and an ‘examiner’ LLM may be required as an adversarial companion (widely used in general-purpose GPT development1 4) to ensure improvement.
Interpreting images (ECGs, X-rays, biopsy samples) remains an important part of postgraduate examinations and clinical reasoning. ChatGPT remains a language model, with limited image-processing capabilities at present, and therefore, a proportion of the examination questions have had to be excluded. Recently, several multimodal models have been published,1 9 22 indicating that this is unlikely to remain a long-term limitation.
Future research
Much work remains to explore the role of LLMs such as ChatGPT in medical assessment.
ChatGPT is a general-purpose LLM, with no specific training on medical datasets. While its performance is significantly better than that of earlier, smaller LLMs trained exclusively on medical datasets, such as Galactica,23 newer models combining large-scale pretraining on diverse datasets with fine-tuning on medical question-and-answer datasets have improved performance,17 and it is possible that this domain-specific training will lead to greater performance than general-purpose LLMs in the near future.
Future model development and scrutiny could be provided in partnership with royal medical colleges, entering medical models as candidates in their exams and providing feedback on areas of weakness to ensure maintenance of standards and performance. The regulatory benchmark for a future LLM in the UK may, therefore, be membership of the relevant college.
Acknowledgments
We would like to thank the Federation of the Royal Colleges of Physicians for allowing reproduction of a select sample of practice MRCP questions for this paper.
Footnotes
Twitter: @StuMaitland
Contributors: AM and SM were involved in study conceptualisation, data curation, analysis. RF was involved in analysis. All authors contributed to writing. SM is the guarantor for the study.
Funding: The NIHR Newcastle Biomedical Research Centre (BRC) is a partnership between Newcastle Hospitals NHS Foundation Trust, Newcastle University and Cumbria, Northumberland and Tyne and Wear NHS Foundation Trust and is funded by the National Institute for Health and Care Research (NIHR). This paper presents independent research supported by the NIHR Newcastle Biomedical Research Centre (BRC).
Competing interests: None declared.
Patient and public involvement: Patients and/or the public were not involved in the design, or conduct, or reporting, or dissemination plans of this research.
Provenance and peer review: Not commissioned; externally peer reviewed.
Supplemental material: This content has been supplied by the author(s). It has not been vetted by BMJ Publishing Group Limited (BMJ) and may not have been peer-reviewed. Any opinions or recommendations discussed are solely those of the author(s) and are not endorsed by BMJ. BMJ disclaims all liability and responsibility arising from any reliance placed on the content. Where the content includes any translated material, BMJ does not warrant the accuracy and reliability of the translations (including but not limited to local regulations, clinical guidelines, terminology, drug names and drug dosages), and is not responsible for any error and/or omissions arising from translation and adaptation or otherwise.
Data availability statement
Data are available on reasonable request. All ChatGPT chat-logs and questions and answers will be made publicly available.
Ethics statements
Patient consent for publication
Not applicable.
Ethics approval
Ethical approval was obtained from Newcastle University Ethics Committee (ref 39385/2023).
References
- 1. OpenAI. GPT-4 technical report. 2023. Available: 10.48550/arXiv.2303.08774
- 2. Cay Y. All OpenAI’s GPT models: from GPT-1 to GPT-4 explained [ChatGPT Plus]. 2023. Available: https://chatgptplus.blog/all-gpt-models/ [Accessed 16 Aug 2023].
- 3. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023;388:1233–9. 10.1056/NEJMsr2214184
- 4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med 2023;29:1930–40. 10.1038/s41591-023-02448-8
- 5. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States medical licensing examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ 2023;9:e45312. 10.2196/45312
- 6. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023;2:e0000198. 10.1371/journal.pdig.0000198
- 7. Gilbert S, Harvey H, Melvin T, et al. Large language model AI chatbots require approval as medical devices. Nat Med 2023;29:2396–8. 10.1038/s41591-023-02412-6
- 8. Jin D, Pan E, Oufattole N, et al. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences 2021;11:6421. 10.3390/app11146421
- 9. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. arXiv 2022. 10.48550/arXiv.2212.13138
- 10. Feng SY, Khetan V, Sacaleanu B, et al. CHARD: clinical health-aware reasoning across dimensions for text generation models. arXiv 2023. 10.48550/arXiv.2210.04191
- 11. MRCPUK. MRCP(UK) examinations. Available: https://www.mrcpuk.org/mrcpuk-examinations [Accessed 10 Jul 2023].
- 12. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on medical challenge problems. arXiv 2023. 10.48550/arXiv.2303.13375
- 13. Bachmann M. Levenshtein: python extension for computing string edit distances and similarities. Available: https://github.com/maxbachmann/Levenshtein [Accessed 15 Nov 2023].
- 14. Croskerry P. The importance of cognitive errors in diagnosis and strategies to minimize them. Acad Med 2003;78:775–80. 10.1097/00001888-200308000-00003
- 15. Rylander M, Guerrasio J. Heuristic errors in clinical reasoning. Clin Teach 2016;13:287–90. 10.1111/tct.12444
- 16. Maynez J, Narayan S, Bohnet B, et al. On faithfulness and factuality in abstractive summarization. arXiv 2020. 10.48550/arXiv.2005.00661
- 17. Singhal K, Tu T, Gottweis J, et al. Towards expert-level medical question answering with large language models. arXiv 2023. 10.48550/arXiv.2305.09617
- 18. Searle JR. Minds, brains, and programs. Behav Brain Sci 1980;3:417–24. 10.1017/S0140525X00005756
- 19. Bloom BS, Engelhart MD, Furst EJ, et al. Taxonomy of educational objectives: the classification of educational goals. In: Handbook 1: Cognitive domain. New York: McKay, 1956.
- 20. Miller GE. The assessment of clinical skills/competence/performance. Acad Med 1990;65:S63–7. 10.1097/00001888-199009000-00045
- 21. Dreyfus SE, Dreyfus HL. A five-stage model of the mental activities involved in directed skill acquisition; 1980.
- 22. Lin B, Chen Z, Li M, et al. Towards medical artificial general intelligence via knowledge-enhanced multimodal pretraining. arXiv 2023. 10.48550/arXiv.2304.14204
- 23. Taylor R, Kardas M, Cucurull G, et al. Galactica: a large language model for science. arXiv 2022. 10.48550/arXiv.2211.09085
- 24. Fijačko N, Gosak L, Štiglic G, et al. Can ChatGPT pass the life support exams without entering the American Heart Association course. Resuscitation 2023;185. 10.1016/j.resuscitation.2023.109732
- 25. Freedman JD, Nappier IA. GPT-4 to GPT-3.5: ‘Hold My Scalpel’ -- A look at the competency of OpenAI’s GPT on the plastic surgery in-service training exam. arXiv 2023. 10.48550/arXiv.2304.01503
- 26. Thirunavukarasu AJ, Hassan R, Mahmood S, et al. Trialling a large language model (ChatGPT) in general practice with the applied knowledge test: observational study demonstrating opportunities and limitations in primary care. JMIR Med Educ 2023;9:e46599. 10.2196/46599
- 27. Lin JC, Younessi DN, Kurapati SS, et al. Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination. Eye (Lond) 2023;37:3694–5. 10.1038/s41433-023-02564-2
- 28. Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank. Neurosurgery 2023;93:1090–8. 10.1227/neu.0000000000002551
- 29. Birkett L, Fowler T, Pullen S. Performance of ChatGPT on a primary FRCA multiple choice question bank. Br J Anaesth 2023;131:e34–5. 10.1016/j.bja.2023.04.025
- 30. Passby L, Jenko N, Wernham A. Performance of ChatGPT on dermatology specialty certificate examination multiple choice questions. Clin Exp Dermatol 2023:llad197. 10.1093/ced/llad197
- 31. Giannos P. Evaluating the limits of AI in medical specialisation: ChatGPT’s performance on the UK neurology specialty certificate examination. BMJ Neurol Open 2023;5:e000451. 10.1136/bmjno-2023-000451
- 32. Beam K, Sharma P, Kumar B, et al. Performance of a large language model on practice questions for the neonatal board examination. JAMA Pediatr 2023;177:977–9. 10.1001/jamapediatrics.2023.2373