Medicine. 2024 Mar 1;103(9):e37325. doi: 10.1097/MD.0000000000037325

Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank

Go Un Lee a, Dae Young Hong b,*, Sin Young Kim a, Jong Won Kim b, Young Hwan Lee b, Sang O Park b, Kyeong Ryong Lee b
PMCID: PMC10906566  PMID: 38428889

Abstract

Large language models (LLMs) have been deployed in diverse fields, and the potential for their application in medicine has been explored through numerous studies. This study aimed to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Emergency Medicine Board Examination question bank in the Korean language. Of the 2353 questions in the question bank, 150 questions were randomly selected, and 27 containing figures were excluded. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information in response were classified as lower-order questions. The answers and explanations obtained by inputting the 123 questions into the LLMs were analyzed and compared. ChatGPT-4 (75.6%) and Bing Chat (70.7%) showed higher correct response rates than ChatGPT-3.5 (56.9%) and Bard (51.2%). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (75.6%, 68.3%, 52.8%, and 50.4%, respectively). ChatGPT-4 and Bing Chat outperformed ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language.

Keywords: artificial intelligence, Bard, Bing Chat, ChatGPT, emergency medicine, large language model

1. Introduction

Presently, the possibility of using artificial intelligence (AI) in various fields is increasing, and innovative changes in this technology are occurring at such a rapid pace that this period is being called the era of AI. In the field of medicine, AI shows great potential in early disease diagnosis, individualized patient management, complex data analysis, medical drug production, and medical education.[1] In particular, large language models (LLMs), such as ChatGPT (OpenAI, San Francisco, CA), Bing Chat (Microsoft, Redmond, WA), and Bard (Google, Mountain View, CA), are attracting attention for their performance and usability because they show considerable potential in providing medical information and advice and in interacting with patients.[2–4]

ChatGPT-3.5 is an LLM-based chatbot developed by OpenAI and released in November 2022; it can be used freely by anyone with an OpenAI account. ChatGPT-4, released in March 2023, is equipped with the more advanced, multimodal GPT-4 engine, enabling it to understand context and distinguish nuances and thereby produce more accurate and coherent responses. Bing Chat is an AI chatbot based on GPT-4 and integrated into the Bing search engine; it was introduced by Microsoft in February 2023. Bard is an LLM-based chatbot released by Google in March 2023 that is still under development.

Emergency medicine is a field that requires particularly rapid and accurate judgment. The role of LLMs in emergency medicine includes not only providing data-based insights but also helping clinicians make informed and rapid clinical decisions during the treatment process. Because emergency departments have become increasingly overcrowded, there is growing interest in the role of LLMs in assisting decision-making. However, few studies have evaluated the performance of LLMs in the context of emergency medicine, especially in languages other than English.[5,6]

The Emergency Medicine Board Examination in Korea is structured to simultaneously evaluate in-depth knowledge and rapid judgment ability. Therefore, by using this test as an evaluation tool, it would be possible to evaluate the professional medical knowledge and judgment ability of different LLMs. This study compared the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean Emergency Medicine Board Examination question bank, and sought to compare the appropriateness of the explanations for the answers provided by the 4 LLMs.

2. Materials and methods

2.1. Study setting and ethics statement

The Korean Emergency Medicine Board Examination comprises 150 single-choice questions with 5 possible options. It is divided into 23 detailed categories and covers: resuscitation; trauma; pediatrics; cardiovascular diseases; gastrointestinal disorders; toxicology; pulmonary disorders; neurology; critical care; environmental injuries; musculoskeletal disorders; endocrine disorders; infectious diseases; obstetrics and gynecology; renal and genitourinary disorders; dermatology; ophthalmology; disaster management; psychosocial disorders; ear, nose, throat, and oral disorders; hematologic and oncologic disorders; resuscitative procedures; and others. The total score is 100 points, and a score ≥ 60 points is required to pass this examination.

Because the examination questions are not made public according to the policy of the Korean Academy of Medical Sciences, a question bank is used to prepare for it. This question bank is a document file that is updated every year and passed on to junior emergency medicine residents, and anyone can use it freely. Among the 2353 questions in the bank, 150 single-choice questions were randomly selected in proportion to the number of questions in each of the 23 categories. The RANDBETWEEN function in Microsoft Excel (Microsoft, Redmond, WA) was used for random selection, and an additional draw was performed whenever a duplicate number was drawn. Questions that required abilities such as analysis, creative thinking, evaluation, and synthesis were classified as higher-order questions, and those that required only recall, memory, and factual information were classified as lower-order questions. Two physicians (DYH and KRL) with extensive medical school teaching experience classified the question types. Questions containing images were excluded because the LLMs could answer only the text portion of a question; thus, only text-based questions were included.
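As an illustration of this selection procedure, the following Python sketch performs stratified random sampling with re-drawing on duplicates, mirroring the RANDBETWEEN approach described above; it is not the authors' actual Excel workflow, and the category names and counts shown are hypothetical.

import random

def sample_question_numbers(category_sizes, category_quotas, seed=None):
    # category_sizes: category name -> number of questions in the bank for that category
    # category_quotas: category name -> number of questions to select from that category
    rng = random.Random(seed)
    selected = {}
    for category, quota in category_quotas.items():
        pool_size = category_sizes[category]
        chosen = set()
        while len(chosen) < quota:
            # analogous to Excel's RANDBETWEEN(1, pool_size); a duplicate triggers another draw
            chosen.add(rng.randint(1, pool_size))
        selected[category] = sorted(chosen)
    return selected

# Hypothetical example: draw 9 resuscitation and 10 pediatrics question numbers
print(sample_question_numbers({"Resuscitation": 140, "Pediatrics": 160},
                              {"Resuscitation": 9, "Pediatrics": 10}, seed=2023))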

Because there were no human or animal research subjects in this study, approval by the institutional review board was not required.

2.2. Data collection and analysis

Four LLMs were used in this study: ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard. All questions included in the study were in Korean and were input into the LLMs without being translated into other languages, such as English. The response data were obtained by manually entering each question into the latest version of each LLM on the same computer (by DYH) from September 8 to September 10, 2023, and each question was entered only once. “More creative” was selected as the conversation style option for Bing Chat. A response was considered correct if it matched the answer provided by the question bank. If the LLM failed to respond to the question (e.g., if the LLM stated, “I’m just a language model, and I can’t help you because I don't have the ability to understand and respond to it.”), the response was considered incorrect. Two experienced emergency medicine physicians (YHL and KRL) independently judged whether the responses of each LLM were appropriate or contained incorrect rationales or artificial hallucinations. Artificial hallucination was defined as an LLM generating inaccurate information as if it were correct. If the 2 physicians disagreed, the final decision was made after discussion among 3 physicians (DYH, YHL, and KRL).
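A minimal sketch, assuming a simple coding scheme rather than the authors' actual data sheet, of how each response could be scored under the rules above (an exact match with the question-bank answer counts as correct, and an explicit refusal counts as incorrect); the refusal phrases and field names are illustrative.

REFUSAL_MARKERS = (
    "I'm just a language model",
    "I don't have the ability to understand and respond",
)

def score_response(llm_response, chosen_option, answer_key):
    # llm_response: raw text returned by the LLM
    # chosen_option: the option extracted from that text (None if no option was given)
    # answer_key: the correct option from the question bank
    if chosen_option is None or any(m in llm_response for m in REFUSAL_MARKERS):
        return "incorrect"  # failure to respond is treated as incorrect
    return "correct" if chosen_option.strip() == answer_key.strip() else "incorrect"

# Hypothetical usage
print(score_response("The answer is option 3 because ...", "3", "3"))                 # correct
print(score_response("I'm just a language model, and I can't help you.", None, "2"))  # incorrect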

The obtained data were entered into Microsoft Excel, and all statistical analyses were conducted using IBM SPSS Statistics 28 (IBM Corp., Armonk, NY). Categorical variables were expressed as frequencies and percentages, and the chi-square or Fisher exact test was used to compare distributions between groups. All tests were 2-tailed, and P < .05 was considered significant.
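For illustration only, the following Python sketch (using scipy rather than SPSS, an assumption of this sketch) shows the type of pairwise comparison used here. With Pearson's chi-square and no continuity correction, comparing ChatGPT-4 (93/123 correct) with Bing Chat (87/123 correct) gives P ≈ .388, the value reported in Section 3.1 below; Fisher's exact test would be used for sparsely populated categories.

from scipy.stats import chi2_contingency, fisher_exact

def compare_correct_rates(correct_a, correct_b, n_a, n_b, use_fisher=False):
    # Build the 2 x 2 table of correct/incorrect counts for two LLMs
    table = [[correct_a, n_a - correct_a],
             [correct_b, n_b - correct_b]]
    if use_fisher:
        # Fisher exact test, preferred when expected cell counts are small
        _, p = fisher_exact(table)
    else:
        # Pearson chi-square without Yates continuity correction
        _, p, _, _ = chi2_contingency(table, correction=False)
    return p

# Overall comparison of ChatGPT-4 vs Bing Chat (93/123 vs 87/123): P ≈ .388
print(round(compare_correct_rates(93, 87, 123, 123), 3))
# A small category (e.g., 2 questions per LLM) would instead use Fisher's exact test
print(round(compare_correct_rates(2, 0, 2, 2, use_fisher=True), 3))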

3. Results

3.1. Overall and category-specific performance

Of the 150 randomly selected questions, 27 containing figures were excluded, leaving a total of 123 questions. The overall and category-specific correct response rates for each LLM are shown in Table 1. The overall correct response rates of ChatGPT-4 and Bing Chat were 75.6% and 70.7%, respectively, which were higher than those of ChatGPT-3.5 (56.9%) and Bard (51.2%). No statistically significant difference in the overall correct response rate was observed between ChatGPT-4 and Bing Chat (P = .388).

Table 1.

Comparison of overall and category-specific correct response rates for each LLM.

Category ChatGPT-3.5, n (%) ChatGPT-4, n (%) Bing Chat, n (%) Bard, n (%)
Total (n = 123) 70 (56.9)**,*** 93 (75.6)*,**** 87 (70.7)*,**** 63 (51.2)**,***
Resuscitation (n = 9) 5 (55.6) 8 (88.9) 7 (77.8) 4 (44.4)
Trauma (n = 9) 4 (44.4) 8 (88.9) 5 (55.6) 3 (33.3)
Pediatrics (n = 10) 3 (30.0) 7 (70.0) 6 (60.0) 4 (40.0)
Cardiovascular disease (n = 4) 2 (50.0) 3 (75.0) 3 (75.0) 1 (25.0)
Gastrointestinal disorders (n = 10) 8 (80.0) 8 (80.0) 8 (80.0) 6 (60.0)
Toxicology (n = 8) 3 (37.5) 5 (62.5) 5 (62.5) 2 (25.0)
Pulmonary disorders (n = 8) 4 (50.0) 8 (100) 6 (75.0) 3 (37.5)
Neurology (n = 8) 2 (25.0) 5 (62.5) 5 (62.5) 3 (37.5)
Critical care (n = 7) 6 (85.7) 5 (71.4) 6 (85.7) 3 (42.9)
Environmental injuries (n = 7) 5 (71.4) 4 (57.1) 3 (42.9) 5 (71.4)
Musculoskeletal disorders (n = 2) 1 (50.0) 1 (50.0) 2 (100) 2 (100)
Endocrine disorders (n = 4) 2 (50.0) 2 (50.0) 3 (75.0) 3 (75.0)
Infectious disease (n = 6) 5 (83.3) 6 (100) 6 (100) 5 (83.3)
Obstetrics and gynecology (n = 4) 3 (75.0) 4 (100) 4 (100) 4 (100)
Renal and genitourinary disorders (n = 6) 3 (50.0) 4 (66.7) 4 (66.7) 4 (66.7)
Dermatology (n = 2) 1 (50.0) 2 (100) 0 (0) 1 (50.0)
Ophthalmology (n = 1) 1 (100) 1 (100) 1 (100) 1 (100)
Disaster management (n = 3) 2 (66.7) 1 (33.3) 0 (0) 0 (0)
Psychosocial disorders (n = 2) 2 (100) 2 (100) 2 (100) 1 (50.0)
Ear, nose, throat, and oral disorders (n = 2) 1 (50.0) 1 (50.0) 2 (100) 1 (50.0)
Hematologic and oncologic disorders (n = 4) 3 (75.0) 3 (75.0) 3 (75.0) 3 (75.0)
Others (n = 7) 4 (57.1) 5 (71.4) 6 (85.7) 4 (57.1)

LLM = large language model.

* P < .05 vs ChatGPT-3.5; ** P < .05 vs ChatGPT-4; *** P < .05 vs Bing Chat; **** P < .05 vs Bard.

All 4 LLMs answered the same 34 questions (27.6%) correctly and the same 10 questions (8.1%) incorrectly. The number of questions answered correctly by only one LLM was 5 for ChatGPT-3.5, 7 for ChatGPT-4, 1 for Bing Chat, and 1 for Bard; the number answered incorrectly by only one LLM was 13 for ChatGPT-3.5, 1 for ChatGPT-4, 7 for Bing Chat, and 13 for Bard.

ChatGPT-4 and Bing Chat each responded correctly to all questions in 6 categories, and ChatGPT-3.5 responded correctly to all questions in 2 categories. No significant difference in the percentage of correct answers was observed among the 4 LLMs within any individual category (all P > .05).

3.2. Type-specific performance

A total of 102 (82.9%) questions were higher-order. Table 2 shows the correct response rates by question type. No difference was observed in correct response rates between higher-order and lower-order questions in each of the 4 LLMs (all P > .05). ChatGPT-4 showed the highest correct response rate for the higher-order questions at 76.5%, and Bard and Bing Chat showed the highest rate for the lower-order questions at 71.4%. Bard had the lowest correct response rate for higher-order questions, which was not significantly different from that of ChatGPT-3.5. No statistically significant difference was noted between the 4 LLMs for the lower-order questions.

Table 2.

Comparison of type-specific correct response rates for each LLM.

LLM Higher-order questions (n = 102) Lower-order questions (n = 21) P value
ChatGPT-3.5, n (%) 58 (56.9)** 12 (57.1) .981
ChatGPT-4, n (%) 78 (76.5)*,**** 12 (57.1) .624
Bing Chat, n (%) 72 (70.6)**** 15 (71.4) .939
Bard, n (%) 55 (53.9)**,*** 15 (71.4) .186

LLM = large language model.

* P < .05 vs ChatGPT-3.5; ** P < .05 vs ChatGPT-4; *** P < .05 vs Bing Chat; **** P < .05 vs Bard.

3.3. Appropriateness of explanation

The appropriateness of the explanation for the answer was significantly higher for ChatGPT-4 and Bing Chat than for ChatGPT-3.5 and Bard (Table 3). No difference in the appropriateness of explanations was observed between ChatGPT-4 and Bing Chat (P = .201).

Table 3.

Comparison of the appropriateness of explanation for the responses of each LLM.

ChatGPT-3.5, n (%) ChatGPT-4, n (%) Bing Chat, n (%) Bard, n (%)
Overall (n = 123) 65 (52.8)**,*** 93 (75.6)*,**** 84 (68.3)*,**** 62 (50.4)**,***

LLM = large language model.

* P < .05 vs ChatGPT-3.5; ** P < .05 vs ChatGPT-4; *** P < .05 vs Bing Chat; **** P < .05 vs Bard.

The explanation was appropriate but the answer was incorrect for 2 questions each for ChatGPT-4, Bing Chat, and Bard, and for no questions for ChatGPT-3.5. Conversely, the explanation was inappropriate but the answer was correct for 5 questions for ChatGPT-3.5, 2 for ChatGPT-4, 5 for Bing Chat, and 3 for Bard (Table 4).

Table 4.

Comparison of answers with the appropriateness of explanation for each LLM.

ChatGPT-3.5, n (%) ChatGPT-4, n (%) Bing Chat, n (%) Bard, n (%)
Correct answer with appropriate explanation 65 (52.8) 91 (74.0) 82 (66.7) 60 (48.8)
Correct answer with inappropriate explanation 5 (4.1) 2 (1.6) 5 (4.1) 3 (2.4)
Incorrect answer with appropriate explanation 0 (0) 2 (1.6) 2 (1.6) 2 (1.6)
Incorrect answer with inappropriate explanation 53 (43.1) 28 (22.8) 34 (27.6) 58 (47.2)

LLM = large language model.

4. Discussion

The present study is the first to evaluate and compare the performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard in solving a random selection of Emergency Medicine Board Examination questions in the Korean language. ChatGPT-4 and Bing Chat performed better than ChatGPT-3.5 and Bard. The overall correct response rates of ChatGPT-4 and Bing Chat were 75.6% and 70.7%, respectively, considerably higher than the 60% score required to pass the Korean Emergency Medicine Board Examination. Similar to our findings, Toyama et al reported that ChatGPT-4 outperformed ChatGPT-3.5 and Bard on the Japan Radiology Society Board Examination (65.0%, 40.8%, and 38.8%, respectively).[7] On the Specialty Certificate Examination in Dermatology, ChatGPT-4 performed significantly better than ChatGPT-3.5 in both the Polish and English versions.[8] Patil et al also showed that ChatGPT-4 answered significantly more accurately than Bard on a radiology examination (87.1% vs 70.4%; P < .001).[9] In a study using a question bank of the American College of Emergency Medicine, ChatGPT-4 outperformed ChatGPT-3.5 (82.1% vs 61.4%), approaching human-level performance.[5] These performance rates of ChatGPT-4 and ChatGPT-3.5 are similar to those observed in our study (75.6% and 56.9%, respectively).

Previous studies, especially those in non-English languages, have rarely evaluated Bing Chat. In our study, Bing Chat performed significantly better than ChatGPT-3.5 and Bard, and no difference was noted between Bing Chat and ChatGPT-4 (P = .388). Raimondi et al reported that Bing Chat outperformed Bard and ChatGPT-3.5 on the Royal College of Ophthalmologists fellowship examination in English.[10] ChatGPT-4 and Bing Chat are based on the same GPT-4 model, and no difference was noted in their overall performance in our study. However, ChatGPT-4 and Bing Chat sometimes responded differently to the same question. These differences may arise because the 2 LLMs are implemented by different companies and differ in chatbot interface, user interaction method, and the data collection and processing used for their improvement.

Interestingly, Bard responded to 2 questions as follows: “I’m just a language model, and I can’t help you because I don’t have the ability to understand and respond to it.” This may occur if the question is very complex or the LLM has not learned the information needed to respond. However, we found that even simple questions were sometimes not answered, suggesting that the ability to understand and process Korean prompts was incomplete.

A recent study reported that ChatGPT-4 performed better than ChatGPT-3.5 for higher-order radiologic questions,[11] although its performance was not better than that of ChatGPT-3.5 for lower-order radiologic questions. We also found that the correct response rate of ChatGPT-4 was significantly higher than those of ChatGPT-3.5 and Bard for higher-order questions (76.5%, 56.9%, and 53.9%, respectively). However, no difference was noted among the correct response rates of the 4 LLMs for lower-order questions (P = .103). In contrast, another study found that the correct response rate of ChatGPT-4 for lower-order Japanese radiologic questions was significantly higher than those of Bard and ChatGPT-3.5, whereas no difference was found among the LLMs for higher-order questions.[7] Because most of the medical training information provided to LLMs is in English, differences in the amount of training data and in the understanding of information in languages other than English, such as Korean and Japanese, may have partly contributed to these findings.

Some recent studies have reported differences in the performance of LLMs depending on the question category.[12,13] In our study, the number of questions per category was small, and no statistically significant difference was found. In future work, a larger number of questions per category is required.

Artificial hallucination, in which an LLM generates factually inaccurate or nonsensical information as if it were correct, is particularly common in generative models such as LLMs. It may appear in various situations, such as when LLMs are not trained with sufficient data or are trained with incorrect data, when they perform complex tasks, or when the training environment differs from the actual working environment. Hallucination and fabrication were reported to be reduced in ChatGPT-4 compared with ChatGPT-3.5. When short papers were produced using ChatGPT-3.5 and ChatGPT-4, 55% and 18% of the citations generated by ChatGPT-3.5 and ChatGPT-4, respectively, were fabricated; among the real citations, 43% of those from ChatGPT-3.5 and 24% of those from ChatGPT-4 contained errors.[14] In our study, 24.4% of the explanations given by ChatGPT-4 were untrue or contained errors, compared with 47.2% for ChatGPT-3.5 and 49.6% for Bard.

Although appropriate explanations usually accompanied correct responses, sometimes the explanation was appropriate, but the answer was incorrect. By contrast, ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard correctly answered 5, 2, 5, and 3 questions, respectively, but provided inappropriate explanations. Hallucination and fabrication are expected to decrease as the capabilities of LLMs improve.

Natural language processing (NLP) is a component of AI that enables the understanding of text and speech in a manner similar to that of humans. Because LLMs are most commonly trained on English text, their performance in English is better than that in other languages. The NLP ability of ChatGPT-3.5 is reported to be lower in non-English languages than in English.[15] For example, ChatGPT-3.5 passed the United States Medical Licensing Examination in English but failed the Taiwan Family Medicine Board Exam in Chinese.[16,17] On the Taiwanese Pharmacist Licensing Examination, the ChatGPT-3.5 score on the English test was 1.4%–37% higher than that on the Chinese test.[18] For the Japanese Medical Licensing Examination in its original Japanese format, ChatGPT-3.5 did not meet the passing criteria, whereas ChatGPT-4 passed and demonstrated advanced NLP capabilities.[19] In our study in the Korean language, because ChatGPT-4 showed higher correct response rates than ChatGPT-3.5, we can infer that the Korean NLP ability of ChatGPT-4 is better than that of ChatGPT-3.5. However, on the Specialty Certificate Examination in Dermatology, no difference in performance between the English and Polish versions was reported for either ChatGPT-3.5 or ChatGPT-4.[8]

Some study limitations should be noted. First, the study included a small number of questions, and only one response was obtained from each LLM for each question; this may not fully reflect the ability of the LLMs to solve medical problems. Because the performance of LLMs is improving daily, different findings might be obtained if the study were repeated using the same method in the near future. Second, some questions reflected characteristics specific to the Korean medical environment. Korea has a national health insurance system that allows access to expensive diagnostic tests, such as computed tomography (CT) and magnetic resonance imaging, in emergency departments; conversely, certain recommended medications for specific diseases have not been introduced in Korea and therefore cannot be used in its emergency departments. Third, we excluded questions that included images, such as electrocardiograms, radiographs, CT images, and ultrasound images, which are commonly used for diagnosis in emergency departments in Korea.

5. Conclusion

ChatGPT-4 and Bing Chat showed better performance than ChatGPT-3.5 and Bard in answering a random selection of Emergency Medicine Board Examination questions in the Korean language. Although the performance of LLMs is improving on a daily basis, they still produce incorrect results, and their increasing use needs to be carefully evaluated when applied to clinical practice and education.

Author contributions

Conceptualization: Dae Young Hong.

Data curation: Dae Young Hong, Jong Won Kim.

Formal analysis: Sang O Park.

Investigation: Dae Young Hong, Young Hwan Lee, Kyeong Ryong Lee.

Methodology: Go Un Lee, Sin Young Kim.

Supervision: Dae Young Hong.

Writing – original draft: Go Un Lee, Dae Young Hong.

Writing – review & editing: Go Un Lee, Dae Young Hong.

Abbreviations:

AI = artificial intelligence; LLM = large language model; NLP = natural language processing.

The authors have no funding and conflicts of interest to disclose.

The datasets generated during and/or analyzed during the current study are available from the corresponding author on reasonable request.

How to cite this article: Lee GU, Hong DY, Kim SY, Kim JW, Lee YH, Park SO, Lee KR. Comparison of the problem-solving performance of ChatGPT-3.5, ChatGPT-4, Bing Chat, and Bard for the Korean emergency medicine board examination question bank. Medicine 2024;103:9(e37325).

Contributor Information

Go Un Lee, Email: lkrer@kuh.ac.kr.

Sin Young Kim, Email: 20130296@kuh.ac.kr.

Jong Won Kim, Email: 20130296@kuh.ac.kr.

Young Hwan Lee, Email: lkrer@kuh.ac.kr.

Sang O Park, Email: empso@kuh.ac.kr.

Kyeong Ryong Lee, Email: lkrer@kuh.ac.kr.

References

[1] Liu PR, Lu L, Zhang JY, et al. Application of artificial intelligence in medicine: an overview. Curr Med Sci. 2021;41:1105–15.
[2] Sharma S, Pajai S, Prasad R, et al. ChatGPT as a potential substitute for diabetes educators. Cureus. 2023;15:e38380.
[3] Seth I, Cox A, Xie Y, et al. Evaluating chatbot efficacy for answering frequently asked questions in plastic surgery: a ChatGPT case study focused on breast augmentation. Aesthet Surg J. 2023;43:1126–35.
[4] Haver HL, Lin CT, Sirajuddin A, et al. Use of ChatGPT, GPT-4, and Bard to improve readability of ChatGPT's answers to common questions on lung cancer and lung cancer screening. AJR Am J Roentgenol. 2023;221:701–4.
[5] Jarou ZJ, Dakka A, McGuire D, et al. ChatGPT versus human performance on emergency medicine board preparation questions. Ann Emerg Med. 2024;83:87–8.
[6] Dahdah JE, Kassab J, Helou MCE, et al. ChatGPT: a valuable tool for emergency medical assistance. Ann Emerg Med. 2023;82:411–3.
[7] Toyama Y, Harigai A, Abe M, et al. Performance evaluation of ChatGPT, GPT-4, and Bard on the official board examination of the Japan Radiology Society. Jpn J Radiol. 2024;42:201–7.
[8] Lewandowski M, Łukowicz P, Świetlik D, et al. ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the dermatology specialty certificate examinations. Clin Exp Dermatol. 2023.
[9] Patil NS, Huang RS, van der Pol CB, et al. Comparative performance of ChatGPT and Bard in a text-based radiology knowledge assessment. Can Assoc Radiol J. 2023:8465371231193716.
[10] Raimondi R, Tzoumas N, Salisbury T, et al. Comparative analysis of large language models in the Royal College of Ophthalmologists fellowship exams. Eye (Lond). 2023;37:3530–3.
[11] Bhayana R, Bleakney RR, Krishna S. GPT-4 in radiology: improvements in advanced reasoning. Radiology. 2023;307:e230987.
[12] Hoch CC, Wollenberg B, Lüers JC, et al. ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Otorhinolaryngol. 2023;280:4271–8.
[13] Ali R, Tang OY, Connolly ID, et al. Performance of ChatGPT, GPT-4, and Google Bard on a neurosurgery oral boards preparation question bank. Neurosurgery. 2023;93.
[14] Walters WH, Wilder EI. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep. 2023;13:14045.
[15] Seghier ML. ChatGPT: not all languages are equal. Nature. 2023;615:216.
[16] Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312.
[17] Weng TL, Wang YM, Chang S, et al. ChatGPT failed Taiwan's family medicine board exam. J Chin Med Assoc. 2023;86:762–6.
[18] Wang YM, Shen HW, Chen TJ. Performance of ChatGPT on the pharmacist licensing examination in Taiwan. J Chin Med Assoc. 2023;86:653–8.
[19] Takagi S, Watari T, Erabi A, et al. Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: comparison study. JMIR Med Educ. 2023;9:e48002.
