PLOS One. 2024 Apr 4;19(4):e0301702. doi: 10.1371/journal.pone.0301702

Performance of ChatGPT on Chinese Master’s Degree Entrance Examination in Clinical Medicine

Ke-Cheng Li 1, Zhi-Jun Bu 2, Md Shahjalal 3, Bai-Xiang He 4, Zi-Fan Zhuang 5, Chen Li 6, Jian-Ping Liu 2, Bin Wang 1,*, Zhao-Lan Liu 2,*
Editor: Harpreet Singh Grewal
PMCID: PMC10994287  PMID: 38573944

Abstract

Background

ChatGPT is a large language model designed to generate responses based on a contextual understanding of user queries and requests. This study utilised the entrance examination for the Master of Clinical Medicine in Traditional Chinese Medicine to assess the reliability and practicality of ChatGPT within the domain of medical education.

Methods

We selected 330 single- and multiple-choice questions from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations, none of which included images or tables. To ensure the test's accuracy and authenticity, we preserved the original wording of the question stems and answer options, without any modifications or explanations.

Results

Both ChatGPT3.5 and GPT-4 attained average scores surpassing the admission threshold. Notably, ChatGPT achieved its highest accuracy, 93.75%, in the Medical Humanities section. However, ChatGPT3.5 recorded its lowest accuracy, 37.5%, in the Pathology section, while GPT-4's lowest accuracy, 60.23%, was in the Biochemistry section. An analysis by question type revealed that ChatGPT performs well on single-choice questions but poorly on multiple-choice questions.

Conclusion

ChatGPT exhibits a degree of medical knowledge and the capacity to aid in diagnosing and treating diseases. Nevertheless, enhancements are warranted to address its accuracy and reliability limitations. Its use must therefore be accompanied by rigorous evaluation and oversight, together with proactive measures to overcome its current constraints.

1. Introduction

A large language model (LLM) is a computer program that employs artificial intelligence and natural language processing technology to comprehend and generate natural language text from extensive data, utilizing deep learning techniques [1]. Developed by OpenAI, ChatGPT is a large language model that interacts with users through dynamic dialogue, responding to inquiries and requests. Notably, ChatGPT has garnered significant recognition for its exceptional performance, particularly within the medical domain [2, 3]. Presently, ChatGPT offers users two versions: ChatGPT3.5, available for free, and GPT-4, which requires payment. Compared with ChatGPT3.5, GPT-4 is reported to increase the model size from 175 billion to 170 trillion parameters, incorporates rule-based reward modeling (RBRM), and further refines its generated text through reinforcement learning from human feedback (RLHF). These advancements contribute significantly to the reliability and security of GPT-4 [4, 5].

Impressively, ChatGPT has demonstrated an accuracy rate of 60%, approaching the pass threshold of the United States Medical Licensing Examination (USMLE), without the prerequisite of prior input of relevant background knowledge [6]. Additionally, its performance on the NBME-Free-Step1 dataset from the American Board of Medical Examiners surpasses the 60% pass threshold, indicative of skills comparable to a third-year medical student [7]. This accuracy underscores ChatGPT's solid grasp of medical knowledge and remarkable proficiency in logical reasoning and disease diagnosis. ChatGPT also exhibits variable performance across different medical specializations. For instance, both ChatGPT3.5 and GPT-4 scored below the passing threshold and below the average score of nephrology candidates on the American Society of Nephrology (ASN) Nephrology Self-assessment Program and Renal Self-assessment Program tests [8]. Furthermore, ChatGPT displayed an incorrect rate of 66% on PACES, the French medical school entrance exam [9]. While large-scale language models achieve profound semantic comprehension through extensive data and context, their efficacy in managing intricate or detailed information and in processing non-English input remains limited. Moreover, their proficiency in handling medical information in non-English texts is currently insufficient and necessitates further enhancement and development.

In China, prospective Master's degree students are required to take the Nationwide Master's Program Unified Admissions Examination (NMPUA), a government-organized assessment facilitating entry into their desired Master's programs. The Comprehensive Examination of Clinical Medicine is obligatory for those pursuing a professional Master's degree in clinical medicine and aims to comprehensively evaluate the clinical performance of clinical medicine undergraduates in clinical scenarios. The NMPUA is a critically important examination for aspiring Master's degree students, assessing the clinical reasoning, knowledge, diagnostic capabilities, and decision-making proficiency of medical undergraduates in a clinical context. The examination encompasses five question types: A1 (knowledge-based multiple choice), A2 (case-based multiple choice), A3 (case-group-based multiple choice), B (matching), and X (multiple choice), totaling 165 multiple-choice questions with a maximum score of 300. The exam spans six sections: physiology, biochemistry, pathology, internal medicine, surgery, and medical humanities. Each single-choice question features one correct answer and three incorrect distractor options, while each multiple-choice (X-type) question includes at least two correct answers. All questions and options were presented in text format, eliminating the need to analyze pictures or tables visually. Following rigorous review and screening by two independent researchers, all 330 questions met the study's criteria and were included in the test battery.

The main objective of this study was to assess the accuracy, robustness, and limitations of ChatGPT3.5 and GPT-4 in the context of the Chinese Master's Comprehensive Examination in Clinical Medicine. This evaluation aimed to ascertain the effectiveness and reliability of ChatGPT within the Chinese medical domain and to provide guidance and references to assist Chinese medical students in their examination preparation. On the one hand, ChatGPT proves beneficial in aiding candidates to identify areas of fundamental knowledge that require improvement, thereby enhancing their performance in the final examination. On the other hand, educators can optimize the utility of ChatGPT by providing candidates access to more specialized test questions accompanied by custom-generated feedback. This approach facilitates greater automation and efficiency in the marking and feedback processes.

2. Methods and materials

2.1. Data collection

The test battery employed in this study comprised the Chinese Master's Comprehensive Clinical Medicine Examinations (code 306) for the years 2021 and 2022, encompassing a total of 330 questions. These questions were distributed across various types, including 127 A1-type questions, 23 A2-type questions, 80 A3-type questions, 40 B-type questions, and 60 X-type questions. Notably, the Physiology, Biochemistry, and Pathology sections exclusively featured A1, A2, B, and X-type questions, whereas Medical Humanities consisted solely of A1-type questions. Internal Medicine and Surgery encompassed all question types.

The entirety of each examination paper was composed in Chinese, with occasional inclusion of English acronyms for specific medicines or proper nouns, which were preserved in their original untranslated form. Marks were allocated for questions 1–40 and 116–135 at a rate of 1.5 marks each, while questions 41–115 and 136–165 carried a weight of 2 marks each (60 × 1.5 + 105 × 2 = 300 marks per paper, consistent with the stated maximum score).

2.2. Study design

In this study, we replicated and sent 330 multiple-choice questions from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations to both ChatGPT3.5 and GPT-4, in the order they appeared in the examination papers. The request was for them to simulate the role of a doctor and provide answers accordingly.

Each question was limited to a single response, which was accepted irrespective of its accuracy. We intentionally refrained from prompting ChatGPT to furnish an analysis of the options or an explanation for its choices. To minimize the influence of extraneous factors on the results, the models were instructed to answer the multiple-choice questions as medical professionals without offering any justifications. Responses were meticulously recorded in Excel and cross-verified against the correct answers to ensure precise evaluation of performance on the Master's Comprehensive Clinical Medicine Examination. Through the computation of the percentage of accurate responses and the derivation of scores, our objective was to elucidate the potential advantages and challenges associated with the utilization of ChatGPT in the application of medical knowledge. The methodology employed in our study seeks to provide a comprehensive understanding of the practicality of ChatGPT's current applications in the medical domain and to shed light on its prospective role in medical education, as well as in the diagnosis and treatment of ailments.
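To make the scoring procedure concrete, the following is a minimal sketch, assuming the recorded answers and the official key are available as simple Python records; the mark scheme follows Section 2.1 (questions 1–40 and 116–135 worth 1.5 marks each, questions 41–115 and 136–165 worth 2 marks each, per exam paper). The data layout and helper names are illustrative assumptions, not the authors' actual workflow, which was carried out in Excel.

```python
# Minimal grading sketch (illustrative; not the authors' actual workflow).
# Each record: (question_number, model_answer, correct_answer), with answers
# written as sorted strings of option letters, e.g. "A" or "ACD" for X-type questions.

def mark_value(question_number: int) -> float:
    """Mark scheme from Section 2.1 (per exam paper of 165 questions):
    Q1-40 and Q116-135 carry 1.5 marks; Q41-115 and Q136-165 carry 2 marks."""
    return 1.5 if question_number <= 40 or 116 <= question_number <= 135 else 2.0

def grade(records):
    """Return (accuracy, total_score) for a list of records."""
    correct = 0
    score = 0.0
    for number, given, key in records:
        if given == key:          # full credit only for an exact match
            correct += 1
            score += mark_value(number)
    return correct / len(records), score

# Hypothetical example with three questions:
records = [(1, "B", "B"), (50, "ACD", "AC"), (120, "D", "D")]
accuracy, score = grade(records)
print(f"accuracy = {accuracy:.2%}, score = {score}")  # accuracy = 66.67%, score = 3.0
```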

In examining ChatGPT's performance characteristics, our focus was directed towards understanding the impact of modifying temperature values on the reliability of generated responses. Temperature values play a crucial role in large language models because they directly influence the randomness of the generated content. Generally ranging from 0 to 1, lower temperatures (below 0.3) tend to produce more dependable and consistent outcomes, while higher temperatures (above 0.7) result in more varied and imaginative outputs [10, 11]. Notably, the default temperature setting in ChatGPT is typically 0.7. To gain insight into how temperature adjustments may affect answer correctness, we systematically tested the performance of both ChatGPT3.5 and GPT-4 at four temperature values (0, 0.3, 0.7, and 1). The experiment aimed to reveal how adjusting the temperature influences the reliability and diversity of answers generated by ChatGPT in the medical field. The findings provide valuable insights for model optimization, shedding light on the role of the temperature parameter in ChatGPT, and enhance our understanding of ChatGPT's performance tuning in practical applications within the medical field.
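For readers who wish to reproduce a comparable temperature sweep programmatically, the sketch below shows one way to do it with the OpenAI chat completions API. The paper does not state which interface the authors used, so the client setup, model names, system prompt, and question placeholder are all assumptions for illustration only.

```python
# Hypothetical temperature-sweep sketch (assumes the OpenAI Python SDK >= 1.0;
# model names, prompt wording, and question text are placeholders, not the study's materials).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a doctor taking a medical examination. "
    "Answer each multiple-choice question with the option letter(s) only, "
    "without any explanation."
)

def ask(model: str, question: str, temperature: float) -> str:
    """Submit one exam question at a given temperature and return the raw answer text."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()

question = "..."  # one exam question, pasted verbatim in Chinese
for model in ("gpt-3.5-turbo", "gpt-4"):
    for temperature in (0, 0.3, 0.7, 1):
        print(model, temperature, ask(model, question, temperature))
```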

We meticulously documented ChatGPT's accuracy rates across the various subjects and question types. The overarching objective was to deepen our understanding of ChatGPT's proficiency in diverse knowledge domains and its ability to address different question types. Special attention was dedicated to evaluating whether ChatGPT adhered to the rules governing the answering of questions. Additionally, questions that received either entirely incorrect or entirely correct responses underwent detailed analysis to comprehensively assess ChatGPT's answering abilities and distinctive features.

2.3. Statistical analysis

The data for this study were collected and analyzed using Microsoft Excel Mac 16.66.1 (Microsoft Corp., USA), and accuracy and scoring rates are presented as percentages. The Python programming language was employed for charting, visualization, and in-depth analysis to enhance the clarity and presentation of the findings.
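As an illustration of the charting step, a minimal matplotlib sketch in the spirit of Figs 1–7 might look as follows; the accuracy values are placeholders, not the study's data.

```python
# Illustrative plotting sketch (placeholder numbers, not the study's data).
import matplotlib.pyplot as plt

temperatures = [0, 0.3, 0.7, 1]
accuracy_gpt35 = [0.55, 0.56, 0.54, 0.55]   # hypothetical values
accuracy_gpt4 = [0.75, 0.76, 0.75, 0.76]    # hypothetical values

plt.plot(temperatures, accuracy_gpt35, marker="o", label="ChatGPT3.5")
plt.plot(temperatures, accuracy_gpt4, marker="s", label="GPT-4")
plt.xlabel("Temperature")
plt.ylabel("Accuracy")
plt.title("Accuracy at different temperature settings")
plt.legend()
plt.savefig("accuracy_by_temperature.png", dpi=300)
```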

3. Results

At temperatures of 0, 0.3, 0.7, and 1, GPT-4 demonstrated a notable advantage over ChatGPT3.5, exhibiting a significantly higher total accuracy rate (Fig 1). Additional data analysis from the years 2021 and 2022 revealed GPT-4’s consistent response to each question type across all temperature levels. This consistency indicates its reliability throughout the specified period and reflects its stable performance under varying temperature conditions (Fig 2).

Fig 1. The performance of ChatGPT at various temperatures.


(①: ChatGPT3.5, temperature 0; ②: ChatGPT3.5, temperature 0.3; ③: ChatGPT3.5, temperature 0.7; ④: ChatGPT3.5, temperature 1; ⑤: GPT-4, temperature 0; ⑥: GPT-4, temperature 0.3; ⑦: GPT-4, temperature 0.7; ⑧: GPT-4, temperature 1).

Fig 2. Correctness of ChatGPT3.5 and GPT-4 in different years at different temperatures.


Concerning variations in performance across question types, ChatGPT3.5 and GPT-4 exhibited distinct capabilities in answering specific questions. Specifically, ChatGPT3.5 performs best on A1-type questions, while GPT-4 performs best on A3-type questions. Notably, both ChatGPT3.5 and GPT-4 demonstrated a similar weakness, scoring lowest on X-type questions. This comparison offers valuable insight into the divergences and similarities between the two models across various cognitive domains (Fig 3). The performance of ChatGPT remains relatively consistent across A1, A2, A3, and B-type questions, with only marginal effects observed from adjusting the temperature for each question type. However, substituting ChatGPT3.5 with GPT-4 at all temperature settings substantially enhances accuracy across all question types (Fig 4).

Fig 3. Correctness of ChatGPT3.5 and GPT-4 on different question types.


Fig 4. Correctness of ChatGPT3.5 and GPT-4 on different question types at different temperatures.


(①: ChatGPT3.5, temperature 0; ②: ChatGPT3.5, temperature 0.3; ③: ChatGPT3.5, temperature 0.7; ④: ChatGPT3.5, temperature 1; ⑤: GPT-4, temperature 0; ⑥: GPT-4, temperature 0.3; ⑦: GPT-4, temperature 0.7; ⑧: GPT-4, temperature 1).

ChatGPT’s exceptional performance in the medical humanities section underscores its proficiency in managing medical ethics and humanistic care. Simultaneously, its comparable performance in the remaining five sections demonstrates its ability to handle multidisciplinary knowledge domains in a balanced manner (Fig 5). It is noteworthy that both ChatGPT3.5 and GPT-4 exhibited identical accuracy rates in the medical humanities section at all four temperatures, whereas the correct rates fluctuated to some degree in the remaining five sections with changes in temperature (Fig 6).

Fig 5. Correctness of ChatGPT3.5 and GPT-4 on different subjects.


Fig 6. Correctness of ChatGPT3.5 and GPT-4 on different subjects at different temperatures.


(①: ChatGPT3.5, temperature 0; ②: ChatGPT3.5, temperature 0.3; ③: ChatGPT3.5, temperature 0.7; ④: ChatGPT3.5, temperature 1; ⑤: GPT-4, temperature 0; ⑥: GPT-4, temperature 0.3; ⑦: GPT-4, temperature 0.7; ⑧: GPT-4, temperature 1).

In addition to the statistics on accuracy rates, we calculated the scores achieved by both ChatGPT3.5 and GPT-4 in the years 2021 and 2022, including the overall score, which was higher for GPT-4 than for ChatGPT3.5 (Fig 7). Furthermore, we analyzed violations of the answering rule in X-type multiple-choice questions. The higher violation percentage for ChatGPT3.5 (26.67%) compared with GPT-4 (10%) underscores the latter's superiority in adhering to the prescribed answer format. The outcomes of this analysis, which delved into the specific behaviors behind the data, contribute to our understanding of the relative strengths of ChatGPT3.5 and GPT-4 in comprehensive medical examinations and provide insights into future directions for improvement.

Fig 7. Overall score rate, 2021 score rate, 2022 score rate for ChatGPT3.5 vs GPT-4.

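The rule-violation analysis for X-type questions reduces to a simple check: a response is non-compliant if it selects fewer than two options. The helper below is an illustrative sketch of that check with hypothetical responses, not the authors' code.

```python
# Illustrative check for the X-type answering rule (at least two options must be selected).
def violates_x_rule(answer: str) -> bool:
    """Return True if an X-type response selects fewer than two of the options A-D,
    e.g. a single letter such as "B"."""
    options = {letter for letter in answer.upper() if letter in "ABCD"}
    return len(options) < 2

x_answers = ["B", "ACD", "A", "BD"]  # hypothetical responses to four X-type questions
violation_rate = sum(violates_x_rule(a) for a in x_answers) / len(x_answers)
print(f"violation rate = {violation_rate:.0%}")  # 50%
```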

4. Discussion

In China, the qualifying examination for admission to the Master's program in clinical medicine encompasses three subjects: Comprehensive Clinical Medicine, Politics, and English, of which Comprehensive Clinical Medicine is the subject investigated in this research. The national authorities determine the minimum thresholds for each subject and for the total score, considering the difficulty of the test questions and the average scores of candidates in the current year. Candidates are required to meet the corresponding thresholds for each subject and for the total score to fulfill the admission requirements. To ensure educational equity, the government categorizes universities into Zone A and Zone B based on the economic development and educational resources of each province. This categorization leads to distinct admission requirements: because the economic and educational levels of Zone A are generally more advanced, the admission scores for Zone A are higher than those for Zone B.

Official data reveal that the score threshold for the 2021 Clinical Medicine Comprehensive Examination was 123 in Zone A and 114 in Zone B; in 2022, the threshold rose to 129 in Zone A and 120 in Zone B. On the 2021 test questions, ChatGPT3.5 scored the lowest (135 points) at a temperature of 0.7, while GPT-4 scored the highest (208 points) at a temperature of 1. On the 2022 test, ChatGPT3.5 achieved its lowest score of 144.5 points at a temperature of 0.7, while GPT-4 achieved its highest score of 209 points at a temperature of 0.3. Compared with the national minimum score, ChatGPT met the eligibility requirements for admission at any temperature. Bhayana and colleagues [12] reported positive results for ChatGPT on the Royal College of Canada Diagnostic Radiology Examination. Similarly, Fuentes-Martín and colleagues [13] found that ChatGPT successfully passed the Andalusian Health Authority's 2022 Competitive Examination for Thoracic Surgery Specialist Positions. These findings provide robust evidence supporting the successful implementation of ChatGPT.

When comparing the accuracy rates of ChatGPT3.5 and GPT-4 across different years, a noticeable improvement of approximately 20% is evident following the upgrade from ChatGPT3.5 to GPT-4. Moreover, the extreme deviation (range) of GPT-4's accuracy (0.61%) is smaller than that of ChatGPT3.5 (2.88%). These gains in accuracy and stability for GPT-4 reflect OpenAI's successful training and upgrading of the model, consistent with previous reports [14, 15].

In our comprehensive assessment of ChatGPT, it becomes evident that variations exist in its responses to different types of queries. ChatGPT3.5 demonstrates relatively precise performance in answering explicit inquiries based on factual information, whereas GPT-4 exhibits superior performance in handling case-based questions, which demand a higher level of comprehensive judgment. However, when addressing intricate queries of type X, both ChatGPT3.5 and GPT-4 display insufficient logical and critical thinking abilities, indicating the need for further refinement in forthcoming iterations and upgrades. This observation aligns with the research of Miao and colleagues [8], suggesting that ChatGPT excels at handling straightforward fact-based inquiries but remains inadequate when confronted with demanding inquiries requiring profound understanding and meticulous computation. This study underscores the challenges encountered in utilizing ChatGPT in the medical field and sets the stage for future technological advancements to facilitate a more comprehensive adaptation of ChatGPT to the varied task demands specific to the medical domain.

Both ChatGPT3.5 and GPT-4 demonstrated a high level of accuracy (93.75%) in the medical humanities field, securing the top position among the disciplines. This accuracy showcases their proficiency not only in professional knowledge but also in effective communication with patients, along with their familiarity with medical ethics and relevant regulations. However, ChatGPT's performance varied across sub-disciplines. ChatGPT3.5 exhibited its lowest accuracy in Pathology (37.5%), while GPT-4 was least proficient in Biochemistry (60.23%). Notably, GPT-4 displayed its most significant improvement over ChatGPT3.5 in Pathology (30.11%). The Pathology section primarily comprised A1 and X-type questions, which were relatively uncomplicated and direct in nature, primarily evaluating candidates' ability to recall relevant information without necessitating complex logical reasoning and analysis. Consequently, once GPT-4 has been trained with more data and has developed a more extensive foundation of background knowledge, it will be better equipped to assess and answer these types of questions with greater precision. At this stage, when students utilize ChatGPT for revision and exam preparation, it is essential for them to engage in independent thought and critical evaluation of the provided answers; blindly relying on ChatGPT is discouraged. In particular, caution is advised when considering ChatGPT's responses to X-type questions and to questions from the Pathology and Biochemistry sections. Throughout the use of ChatGPT, students must apply their own logical reasoning and judgment to verify the accuracy and reliability of the information and answers, thereby enhancing learning outcomes and preparation quality. In future training of ChatGPT, its skill gaps should be addressed by intensifying training on A2-type questions, X-type questions, and the Pathology and Biochemistry sections. This approach will help enhance the accuracy and reliability of ChatGPT in these specific areas, rendering it more comprehensive and effective in handling intricate medical knowledge and question types.

Compared to ChatGPT3.5, GPT-4 demonstrated a reduced probability of answering all questions incorrectly (26.36% < 48.18%) and an increased likelihood of answering all questions correctly (64.24% > 43.94%) across all four temperatures. In X-type questions, GPT-4 exhibited a lower rate of anomalous single-option responses (10.00% < 26.67%). This highlights that GPT-4 adheres far more closely to user-prescribed response formats. The resulting implications are highly promising for practical, high-accuracy applications such as clinical medicine, presenting a more dependable and efficient option for intelligent medical assistants. Furthermore, students may utilise ChatGPT to obtain solutions and explanations for these medical examination questions, aiding in the consolidation of knowledge and the enhancement of problem-solving skills. It is heartening to observe that ChatGPT met the minimum admission requirements of the Chinese Master's Degree Entrance Examination in Clinical Medicine without undergoing specific training tailored to the test questions.

The study not only demonstrates the current capabilities of ChatGPT but also highlights its strengths and weaknesses, offering valuable insights for the future training of large language models in the medical domain. Concurrently, the study results underscore the importance of maintaining critical thinking when utilizing ChatGPT. Users should remain cognizant of the fact that its responses may not always be accurate, emphasizing the need to avoid blind reliance and exercise caution. This is a crucial aspect that demands our ongoing attention.

This study has several limitations. Firstly, the test questions used were dated 27th December 2020 and 26th December 2021, while the databases within ChatGPT3.5 and GPT-4 were last updated in January 2022. This creates uncertainty regarding whether these specific test questions were included in ChatGPT's training data, so the accuracy rates measured in this test may be artificially inflated. Secondly, although the study found that adjusting the temperature did not significantly affect the accuracy rate, a comprehensive summary of the temperature's impact on accuracy remains challenging, and more experimental data are needed for further scrutiny. Thirdly, because admission scores are not fully disclosed by each medical school, it was impossible to compare the ChatGPT scores with individual schools' admission score lines to determine whether they met specific school admission requirements. Fourth, as an advanced artificial intelligence tool, ChatGPT's mechanisms of logical operation and decision-making in the process of analyzing matters and making decisions cannot always be fully understood by humans; therefore, when applying ChatGPT to clinical practice, it is crucial to ensure that experienced doctors are involved and supervise the entire process [16]. Furthermore, the test questions used in this study were written in Chinese, while ChatGPT was primarily trained in English. Nuanced discrepancies in grammar rules and other aspects between the Chinese and English languages might affect ChatGPT's effectiveness when used with Chinese; its current performance is restricted by the corpus, and further optimization and adjustment are required [17]. Consequently, the findings of this study provide an incomplete representation of ChatGPT's overall performance level. However, it is anticipated that with more training on a Chinese corpus, the performance of ChatGPT will be further enhanced. Despite these limitations, our study is the first to assess the reliability and utility of ChatGPT in the field of medical education in China, and it provides insight into the use of AI in the medical field.

5. Conclusion

This study innovatively tested the performance of ChatGPT on the Chinese Master's Degree Entrance Examination in Clinical Medicine, filling a knowledge gap at the intersection of medical education and artificial intelligence. While ChatGPT has met the requirements for passing the Chinese Master's Comprehensive Examination in Clinical Medicine, it still fails to respond accurately to approximately 37% of the questions, indicating potential hazards of incorrect judgments in a healthcare environment. Medical students and clinicians should exercise caution when using ChatGPT, recognizing its imperfections and limitations. At the same time, providing more detailed contextual information and relevant knowledge when using it can enhance the accuracy of its responses [18]. Therefore, ChatGPT must continuously develop and enhance its precision to meet increasingly rigorous clinical requirements. The strict supervision of medical professionals remains pivotal at the end of the AI processing chain, ensuring ChatGPT's safe and reliable application.

Furthermore, to better transform clinical practice and medical education, a closer collaboration between artificial intelligence companies and clinical practitioners is essential. This entails the meticulous development of specialised training corpora and the concerted effort to develop a medical professional version of ChatGPT [19]. Such initiatives could address the limitations identified in this study, thereby enhancing the effectiveness of support in medical education and healthcare sectors with greater precision and relevance.

Supporting information

S1 Data

(XLS)


Acknowledgments

All authors have made significant contributions to this study and the field of medical education. As we reach the completion of this paper, I would like to express my sincere gratitude and extend my best wishes to those who have supported and guided me throughout this research and learning journey.

Data Availability

All relevant data are within the manuscript and its Supporting information files.

Funding Statement

This study is supported by a grant from the National Natural Science Foundation of China (Grant No. 82374298) and the Reserve Discipline Leader Funding of Beijing University of Chinese Medicine (Grant No. 90010960920033). There was no additional external funding received for this study.

References

  • 1. OpenAI. GPT-4 technical report. arXiv:2303.08774, 2023.
  • 2. Biswas SS. Role of Chat GPT in Public Health. Ann Biomed Eng. 2023 May. doi: 10.1007/s10439-023-03172-7
  • 3. Kocoń J, Cichecki I, Kaszyca O, et al. ChatGPT: Jack of all trades, master of none. Information Fusion, 2023: 101861.
  • 4. Koubaa A. GPT-4 vs. GPT-3.5: A Concise Showdown. TechRxiv, 2023.
  • 5. Haleem A, Javaid M, Singh RP. An era of ChatGPT as a significant futuristic support tool: A study on features, abilities, and challenges. BenchCouncil Transactions on Benchmarks, Standards and Evaluations, 2022, 2(4): 100089.
  • 6. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb 9. doi: 10.1371/journal.pdig.0000198
  • 7. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023 Feb 8.
  • 8. Miao J, Thongprayoon C, Garcia Valencia OA, et al. Performance of ChatGPT on Nephrology Test Questions. Clin J Am Soc Nephrol. 2023 Oct 18. doi: 10.2215/CJN.0000000000000330
  • 9. Guigue PA, Meyer R, Thivolle-Lioux G, et al. Performance of ChatGPT in French language Parcours d'Accès Spécifique Santé test and in OBGYN. Int J Gynaecol Obstet. 2023 Sep 1.
  • 10. Ippolito D, Kriz R, Kustikova M, et al. Comparison of diverse decoding methods from conditional language models. arXiv:1906.06362, 2019.
  • 11. Lo LS. The CLEAR path: A framework for enhancing information literacy through prompt engineering. The Journal of Academic Librarianship, 2023.
  • 12. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023; 307: e230582. doi: 10.1148/radiol.230582
  • 13. Fuentes-Martín Á, Cilleruelo-Ramos Á, Segura-Méndez B, et al. Can an Artificial Intelligence Model Pass an Examination for Medical Specialists? Archivos de Bronconeumología, 2023: S0300-2896(23)00116.
  • 14. Ray PP. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems, 2023.
  • 15. Liu Y, Han T, Ma S, et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology, 2023: 100017.
  • 16. Grewal H, Dhillon G, Monga V, et al. Radiology Gets Chatty: The ChatGPT Saga Unfolds. Cureus. 2023 Jun 8;15(6):e40135. doi: 10.7759/cureus.40135
  • 17. Weng TL, Wang YM, Chang S, et al. ChatGPT failed Taiwan's Family Medicine Board Exam. J Chin Med Assoc. 2023 Aug 1;86(8):762–766. doi: 10.1097/JCMA.0000000000000946
  • 18. Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023 Oct 2;6(10):e2336483. doi: 10.1001/jamanetworkopen.2023.36483
  • 19. Li J, Dada A, Puladi B, Kleesiek J, et al. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed. 2024 Mar;245:108013. doi: 10.1016/j.cmpb.2024.108013

Decision Letter 0

Harpreet Singh Grewal

11 Mar 2024

PONE-D-24-06964
Performance of ChatGPT on Chinese Master's Degree Entrance Examination in Clinical Medicine
PLOS ONE

Dear Dr. Liu,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

==============================

ACADEMIC EDITOR: Following are recommendations:

1. ChatGPT 4 cannot process image input and assist with image interpretation directly. Were there any questions with images? If yes, how many?

2. Please look at this article to discuss limitations and ethical concerns as that would add to the study. (https://www.cureus.com/articles/161200-radiology-gets-chatty-the-chatgpt-saga-unfolds#!/). AI ethics is important as AI decision are not always intelligible to humans.

3. There have been a few articles about use of ChatGPT in medical examinations. In one of them, they compared ChatGPT to instructGPT3 and chatgpt did outperform it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/). Do you have any such comparisons?

4. The abstract could be more succinct. In the Results section, the comparison between ChatGPT3.5 and GPT-4 could be presented in a more structured manner. Consider organizing the information chronologically or by thematic relevance to facilitate a smoother flow for the reader.

5. Furthermore, the Conclusion could be strengthened by offering specific recommendations for the identified enhancements in ChatGPT. Line 345. ‘Furthermore, the potential of close collaboration between AI companies and clinicians must not be overemphasized, ‘ Can you please elaborate on that?

Following are suggestions but not hard recommendations; however, I feel they will certainly enhance the clarity and strength of the study:

1.The introduction could provide more background information on the Chinese Master's Degree Entrance Examination in Clinical Medicine to help readers understand the context and significance of the study.

2.The methods section should provide more details on how the researchers inputted the questions into ChatGPT and how they recorded and verified the responses.

3.Discussion should include what are some concrete ways ChatGPT could be leveraged to assist students in exam preparation?

4.The limitations of the study could be expanded. For example, the authors note the discrepancy in time between when the test questions were administered vs. when ChatGPT's databases were last updated, but they could discuss more how this may have impacted the results.

5. Conclusion section should address some areas where knowledge gap was filled or knowledge was advanced with this study.

Please ensure that your decision is justified on PLOS ONE’s publication criteria and not, for example, on novelty or perceived impact.

For Lab, Study and Registered Report Protocols: These article types are not expected to include results but may include pilot data. 

==============================

Please submit your revised manuscript by Apr 25 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Harpreet Singh Grewal

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf.

2. Thank you for stating in your Funding Statement:

“This study is supported by a grant from the National Natural Science Foundation of China (Grant No. 82374298 ) and the Reserve Discipline Leader Funding of Beijing University of Chinese Medicine (Grant No. 90010960920033).”

Please provide an amended statement that declares *all* the funding or sources of support (whether external or internal to your organization) received during this study, as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now.  Please also include the statement “There was no additional external funding received for this study.” in your updated Funding Statement.

Please include your amended Funding Statement within your cover letter. We will change the online submission form on your behalf.

3. PLOS requires an ORCID iD for the corresponding author in Editorial Manager on papers submitted after December 6th, 2016. Please ensure that you have an ORCID iD and that it is validated in Editorial Manager. To do this, go to ‘Update my Information’ (in the upper left-hand corner of the main menu), and click on the Fetch/Validate link next to the ORCID field. This will take you to the ORCID site and allow you to create a new iD or authenticate a pre-existing iD in Editorial Manager. Please see the following video for instructions on linking an ORCID iD to your Editorial Manager account: https://www.youtube.com/watch?v=_xcclfuvtxQ

4. Please be informed that funding information should not appear in the Acknowledgments section or other areas of your manuscript. We will only publish funding information present in the Funding Statement section of the online submission form. Please remove any funding-related text from the manuscript.

5. In the online submission form you indicate that your data is not available for proprietary reasons and have provided a contact point for accessing this data. Please note that your current contact point is a co-author on this manuscript. According to our Data Policy, the contact point must not be an author on the manuscript and must be an institutional contact, ideally not an individual. Please revise your data statement to a non-author institutional point of contact, such as a data access or ethics committee, and send this to us via return email. Please also include contact information for the third party organization, and please include the full citation of where the data can be found.

6. Please include a separate caption for each figure in your manuscript.

7. Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Additional Editor Comments (if provided):

Here are a few points/questions

1. ChatGPT 4 cannot process image input and assist with image interpretation directly. Were there any questions with images? If yes, how many?

2. Please look at this article to discuss limitations and ethical concerns as that would add to the study. (https://www.cureus.com/articles/161200-radiology-gets-chatty-the-chatgpt-saga-unfolds#!/). AI ethics is important as AI decision are not always intelligible to humans.

3. There have been a few articles about use of ChatGPT in medical examinations. In one of them, they compared ChatGPT to instructGPT3 and chatgpt did outperform it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/). Do you have any such comparisons?

4. The abstract could be more succinct. In the Results section, the comparison between ChatGPT3.5 and GPT-4 could be presented in a more structured manner. Consider organizing the information chronologically or by thematic relevance to facilitate a smoother flow for the reader.

5. Furthermore, the Conclusion could be strengthened by offering specific recommendations for the identified enhancements in ChatGPT. Line 345. ‘Furthermore, the potential of close collaboration between AI companies and clinicians must not be overemphasized, ‘ Can you please elaborate on that

6.The introduction could provide more background information on the Chinese Master's Degree Entrance Examination in Clinical Medicine to help readers understand the context and significance of the study.

7.The methods section should provide more details on how the researchers inputted the questions into ChatGPT and how they recorded and verified the responses.

8.Discussion should include what are some concrete ways ChatGPT could be leveraged to assist students in exam preparation?

9.The limitations of the study could be expanded. For example, the authors note the discrepancy in time between when the test questions were administered vs. when ChatGPT's databases were last updated, but they could discuss more how this may have impacted the results.

10. Conclusion section should address some areas where knowledge gap was filled or knowledge was advanced with this study.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: N/A

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article provides a comprehensive evaluation of ChatGPT's reliability and utility in the realm of medical education, using the Chinese Clinical Medicine Master's Entrance Examination as a performance benchmark. Great job with the article. While the study sheds light on the strengths and weaknesses of ChatGPT, there are areas that could benefit from refinement to enhance the manuscript's clarity and precision.

Here are a few points/questions

1. ChatGPT 4 cannot process image input and assist with image interpretation directly. Were there any questions with images? If yes, how many?

2. Please look at this article to discuss limitations and ethical concerns as that would add to the study. (https://www.cureus.com/articles/161200-radiology-gets-chatty-the-chatgpt-saga-unfolds#!/). AI ethics is important as AI decision are not always intelligible to humans.

3. There have been a few articles about use of ChatGPT in medical examinations. In one of them, they compared ChatGPT to instructGPT3 and chatgpt did outperform it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/). Do you have any such comparisons?

4. The abstract could be more succinct. In the Results section, the comparison between ChatGPT3.5 and GPT-4 could be presented in a more structured manner. Consider organizing the information chronologically or by thematic relevance to facilitate a smoother flow for the reader.

5. Furthermore, the Conclusion could be strengthened by offering specific recommendations for the identified enhancements in ChatGPT. Line 345. ‘Furthermore, the potential of close collaboration between AI companies and clinicians must not be overemphasized, ‘ Can you please elaborate on that

In summary, while the study presents valuable insights, refining the abstract for brevity, enhancing the organization of results, providing additional context for variations, and offering specific improvement recommendations in the conclusion will contribute to a more polished and impactful manuscript.

Reviewer #2: The introduction could provide more background information on the Chinese Master's Degree Entrance Examination in Clinical Medicine to help readers understand the context and significance of the study.

The methods section should provide more details on how the researchers inputted the questions into ChatGPT and how they recorded and verified the responses.

The discussion could delve deeper into the implications of the findings for medical education and the potential role of AI language models like ChatGPT in this field. What are some concrete ways ChatGPT could be leveraged to assist students in exam preparation?

The limitations of the study could be expanded. For example, the authors note the discrepancy in time between when the test questions were administered vs. when ChatGPT's databases were last updated, but they could discuss more how this may have impacted the results.

The conclusion would benefit from more specific recommendations for future research directions to address the identified limitations and knowledge gaps.

Cohesion:

the first sentence of the 4th paragraph in the Discussion seems abrupt.

The organization of the Discussion section could be improved by using subheadings to clearly delineate the different topics covered (e.g. comparison of ChatGPT versions, performance across question types, implications, limitations).

Grammar:

"This study assess" should be "This study assesses"

"their efficacy in meeting user needs is managing" should perhaps be "their efficacy in meeting user needs in managing"

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Gagandeep Dhillon

Reviewer #2: Yes: Ankit Virmani

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2024 Apr 4;19(4):e0301702. doi: 10.1371/journal.pone.0301702.r002

Author response to Decision Letter 0


18 Mar 2024

Dear Editors and Reviewers of PLOS ONE,

On behalf of all the co-authors, I would like to express our profound gratitude for the diligent work you have invested in the review process. We fully acknowledge the difficulty of this task and extend our highest regards for your professionalism and dedication.

After receiving your valuable feedback, our team engaged in thorough discussions and made extensive revisions to our manuscript accordingly.

Furthermore, we would like to take this opportunity to convey our heartfelt new year wishes to you all, following the recent Lunar New Year celebrations. May the coming year bring you prosperity, success, and health. At the end of this letter, we have attached a detailed response to each review comment and the corresponding revisions made. Once again, we express our deepest appreciation for your invaluable contributions.

Yours sincerely,

Zhao-Lan Liu

Beijing University of Chinese Medicine

Responses

ACADEMIC EDITOR:

Following are recommendations:

1. ChatGPT 4 cannot process image input and assist with image interpretation directly. Were there any questions with images? If yes, how many?

2. Please look at this article to discuss limitations and ethical concerns as that would add to the study. (https://www.cureus.com/articles/161200-radiology-gets-chatty-the-chatgpt-saga-unfolds#!/). AI ethics is important as AI decision are not always intelligible to humans.

3. There have been a few articles about use of ChatGPT in medical examinations. In one of them, they compared ChatGPT to instructGPT3 and chatgpt did outperform it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/). Do you have any such comparisons?

4. The abstract could be more succinct. In the Results section, the comparison between ChatGPT3.5 and GPT-4 could be presented in a more structured manner. Consider organizing the information chronologically or by thematic relevance to facilitate a smoother flow for the reader.

5. Furthermore, the Conclusion could be strengthened by offering specific recommendations for the identified enhancements in ChatGPT. Line 345. ‘Furthermore, the potential of close collaboration between AI companies and clinicians must not be overemphasized, ‘ Can you please elaborate on that ?

Following are suggestions but not hard recommendations however I feel they will certainly enhance the clarity and strength of the study:

1.The introduction could provide more background information on the Chinese Master's Degree Entrance Examination in Clinical Medicine to help readers understand the context and significance of the study.

2.The methods section should provide more details on how the researchers inputted the questions into ChatGPT and how they recorded and verified the responses.

3.Discussion should include what are some concrete ways ChatGPT could be leveraged to assist students in exam preparation?

4.The limitations of the study could be expanded. For example, the authors note the discrepancy in time between when the test questions were administered vs. when ChatGPT's databases were last updated, but they could discuss more how this may have impacted the results.

5. Conclusion section should address some areas where knowledge gap was filled or knowledge was advanced with this study.

Q1. ChatGPT 4 cannot process image input and assist with image interpretation directly. Were there any questions with images? If yes, how many?

Response to Q1

Dear Reviewer, first and foremost, we would like to express our sincere gratitude for your meticulous review and your professional inquiry. Indeed, as you correctly pointed out, GPT-4 does not possess the capability to directly process or interpret image inputs. In alignment with this limitation, I can confirm that none of the questions we utilised involved any images, nor did they assess the examinees' ability to recognise or interpret images. Hence, concerns pertaining to image handling are not applicable in our context. We appreciate your vigilance in ensuring the integrity and relevance of our methodology.

Q2. Please look at this article to discuss limitations and ethical concerns as that would add to the study. (https://www.cureus.com/articles/161200-radiology-gets-chatty-the-chatgpt-saga-unfolds#!/). AI ethics is important as AI decision are not always intelligible to humans.

Response to Q2

Dear Reviewer, thank you for your invaluable suggestion to delve into the limitations and ethical concerns surrounding artificial intelligence, especially considering the intelligibility of AI decisions to humans. Following your advice, we have thoroughly reviewed and cited the article you mentioned. Consequently, we have enriched our discussion section with an in-depth analysis of ChatGPT's limitations and the ethical considerations of AI. This addition not only enhances the content of our paper but also deepens the discourse on these crucial topics. We appreciate your guidance in making our study more comprehensive and thought-provoking.

Modify Details:

Fourth, as an advanced artificial intelligence tool, ChatGPT's mechanisms of logical operation and decision-making in the process of analyzing matters and making decisions cannot always be fully understood by humans. Therefore, when applying ChatGPT to clinical practice, it is crucial to ensure that experienced doctors are involved and supervise the entire process[16].

[16]Grewal H, Dhillon G, Monga V, et al. Radiology Gets Chatty: The ChatGPT Saga Unfolds. Cureus. 2023 Jun 8;15(6):e40135.

Page:16 Line:349-354

Q3. There have been a few articles about use of ChatGPT in medical examinations. In one of them, they compared ChatGPT to instructGPT3 and chatgpt did outperform it (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9947764/). Do you have any such comparisons?

Response to Q3

Dear Reviewer, thank you for your insightful comments and for directing our attention to the recent articles comparing the use of ChatGPT in medical examinations, including its comparison with InstructGPT. Following your advice, we diligently reviewed the article you recommended and further explored the relevant literature to enhance our understanding.

InstructGPT is trained using a technique known as "contrastive learning," which involves comparing different responses generated by the model and selecting the one that best fits a given instruction. This process inherently involves human judgement in choosing the most appropriate answer provided by ChatGPT, a factor we aimed to minimize in our study. Our research was designed to focus more on the accuracy of large language models like ChatGPT in handling medical questions without professional training or human intervention.

Therefore, our study primarily concentrates on the comparison between ChatGPT and GPT-4, rather than including InstructGPT. This approach was chosen to better understand the capabilities and limitations of these models in a more controlled and less human-influenced context.

We hope this clarification addresses your query and appreciate your guidance in refining our study.

Q4. The abstract could be more succinct. In the Results section, the comparison between ChatGPT3.5 and GPT-4 could be presented in a more structured manner. Consider organizing the information chronologically or by thematic relevance to facilitate a smoother flow for the reader.

Response to Q4

Dear Reviewer, thank you for your thorough review and constructive feedback. We acknowledge that the abstract was indeed somewhat verbose and have since revised it to be more concise.

The results section has been reorganized to present our findings in a structured manner, categorized by several thematic aspects: the impact of temperature settings on ChatGPT's performance, the influence of different types of multiple-choice questions, the effect of various knowledge domains on accuracy, and ChatGPT's performance across different years. This reorganization aims to provide a clearer and more coherent flow of information, facilitating easier comprehension for the reader.

Furthermore, we have added distinct titles to each figure in the article, enhancing the clarity and readability of our findings.

We greatly appreciate your guidance in improving the quality and readability of our work.

Modify Details:

Background: ChatGPT is a large language model designed to generate responses based on a contextual understanding of user queries and requests. This study utilised the entrance examination for the Master of Clinical Medicine in Traditional Chinese Medicine to assess the reliability and practicality of ChatGPT within the domain of medical education.

Methods: We selected 330 single and multiple-choice questions from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations, which did not include any images or tables. To ensure the test's accuracy and authenticity, we preserved the original format of the query and alternative test texts, without any modifications or explanations.

Page:2 Line:29-37

At temperatures of 0, 0.3, 0.7, and 1, GPT-4 demonstrated a notable advantage over ChatGPT3.5, exhibiting a significantly higher total accuracy rate (Figure 1). Additional analysis of the 2021 and 2022 data showed that GPT-4 responded consistently to each question type at every temperature level, indicating reliable and stable performance across both years and under varying temperature settings (Figure 2).

Figure 1. The Performance of ChatGPT at Various Temperatures.

(①:ChatGPT3.5(value 0),②:ChatGPT3.5(value 0.3),③:ChatGPT3.5(value 0.7), ④:ChatGPT3.5(value 1),⑤: GPT-4(value 0),⑥:GPT-4(value 0.3),⑦:GPT-4(value 0.7), ⑧:GPT-4(value 1))

Figure 2. Correctness of ChatGPT3.5 and GPT-4 in different years at different temperatures.
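The temperatures compared above (0, 0.3, 0.7, and 1) correspond to the sampling-temperature parameter exposed when querying the models programmatically. As an illustrative aside, the sketch below shows how such a sweep can be run through the OpenAI chat-completions API; the SDK version, model identifier, prompt wording, and the `ask()` helper are assumptions for illustration, not the exact configuration used in this study.

```python
# Minimal sketch (not the study's code): querying a model at the four temperature
# settings compared above. Assumes the `openai` Python SDK (v1.x) is installed and
# OPENAI_API_KEY is set; the model name and prompt are placeholders.
from openai import OpenAI

client = OpenAI()

def ask(question: str, model: str, temperature: float) -> str:
    """Send one multiple-choice question and return the model's raw answer text."""
    response = client.chat.completions.create(
        model=model,
        temperature=temperature,
        messages=[
            {"role": "system", "content": "You are a doctor taking a medical examination."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample_question = "A sample single-choice question with options A-E ..."
    for temp in (0, 0.3, 0.7, 1):
        print(temp, ask(sample_question, model="gpt-4", temperature=temp))
```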

Concerning variations in performance across question types, ChatGPT3.5 and GPT-4 exhibited distinct capabilities. ChatGPT3.5 performed best on A1-type questions, while GPT-4 performed best on A3-type questions. Notably, both models shared the same weakness, scoring lowest on X-type questions; these results highlight the divergences and similarities between the two models across different question types (Figure 3). The performance of ChatGPT remained relatively consistent across A1, A2, A3, and B-type questions, with only marginal effects from adjusting the temperature for each question type. However, replacing ChatGPT3.5 with GPT-4 substantially enhanced accuracy across all question types at every temperature setting (Figure 4).

Figure 3. Correctness of ChatGPT3.5 and GPT-4 on different question types.

Figure 4. Correctness of ChatGPT3.5 and GPT-4 on different question types at different temperatures.

(①:ChatGPT3.5(value 0),②:ChatGPT3.5(value 0.3),③:ChatGPT3.5(value 0.7), ④:ChatGPT3.5(value 1),⑤: GPT-4(value 0),⑥:GPT-4(value 0.3),⑦:GPT-4(value 0.7), ⑧:GPT-4(value 1))

ChatGPT's exceptional performance in the medical humanities section underscores its proficiency in handling medical ethics and humanistic care. At the same time, its relatively even performance across the remaining five sections demonstrates its ability to handle multidisciplinary knowledge domains in a balanced manner (Figure 5). It is noteworthy that both ChatGPT3.5 and GPT-4 exhibited identical accuracy rates in the medical humanities section at all four temperatures, whereas accuracy fluctuated to some degree in the remaining five sections as the temperature changed (Figure 6).

Figure 5. Correctness of ChatGPT3.5 and GPT-4 on different subjects.

Figure 6. Correctness of ChatGPT3.5 and GPT-4 on different subjects at different temperatures.

(①:ChatGPT3.5(value 0),②:ChatGPT3.5(value 0.3),③:ChatGPT3.5(value 0.7), ④:ChatGPT3.5(value 1),⑤: GPT-4(value 0),⑥:GPT-4(value 0.3),⑦:GPT-4(value 0.7), ⑧:GPT-4(value 1))

In addition to the accuracy statistics, we calculated the scores achieved by ChatGPT3.5 and GPT-4 in 2021 and 2022, including the overall score, which was higher for GPT-4 than for ChatGPT3.5 (Figure 7). Furthermore, we analyzed violations of the answer rule in X-type multiple-choice questions. The higher violation rate for ChatGPT3.5 (26.67%) compared with GPT-4 (10%) underscores the latter's superiority in adhering to the answer rules. This analysis of the specific behaviors behind the data contributes to our understanding of the relative strengths of ChatGPT3.5 and GPT-4 in comprehensive medical examinations and provides insights into future directions for improvement.

Figure 7. Overall Score Rate, 2021 Score Rate, and 2022 Score Rate for ChatGPT3.5 vs GPT-4.

Page:9-11 Line:186-241
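To make the score and violation calculations above concrete, the following is a minimal sketch of how an accuracy rate and an X-type violation rate can be computed from recorded answers. The record structure, and the rule that an X-type response selecting fewer than two options counts as a violation, are assumptions introduced for illustration rather than details taken from the study.

```python
# Minimal sketch (not the study's code): accuracy and X-type answer-rule violation rate.
# Assumption: an X-type item requires choosing two or more options, so a response
# selecting fewer than two options is treated as breaking the answer rule.
from dataclasses import dataclass

@dataclass
class Record:
    question_type: str        # "A1", "A2", "A3", "B", or "X"
    model_answer: frozenset   # e.g. frozenset({"A", "C"})
    correct_answer: frozenset

def accuracy(records: list[Record]) -> float:
    """Proportion of questions whose recorded answer matches the key exactly."""
    return sum(r.model_answer == r.correct_answer for r in records) / len(records)

def x_type_violation_rate(records: list[Record]) -> float:
    """Share of X-type items where the response breaks the 'choose two or more' rule."""
    x_items = [r for r in records if r.question_type == "X"]
    violations = sum(len(r.model_answer) < 2 for r in x_items)
    return violations / len(x_items)

records = [
    Record("A1", frozenset({"B"}), frozenset({"B"})),
    Record("X", frozenset({"A"}), frozenset({"A", "C"})),      # single choice on an X item
    Record("X", frozenset({"A", "C"}), frozenset({"A", "C"})),
]
print(f"accuracy = {accuracy(records):.2%}, X-type violations = {x_type_violation_rate(records):.2%}")
```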

Q5. Furthermore, the Conclusion could be strengthened by offering specific recommendations for the identified enhancements in ChatGPT. Line 345: ‘Furthermore, the potential of close collaboration between AI companies and clinicians must not be overemphasized.’ Can you please elaborate on that?

Response to Q5

Dear Reviewer, thank you for your valuable feedback regarding our conclusion and the specific passage you referenced. We have taken your comments into consideration and have accordingly revised the conclusion to articulate specific improvements and recommendations for enhancing the accuracy of ChatGPT.

Regarding the line you mentioned, our apologies for any confusion caused by its previous vagueness. We have since rewritten this section for clarity.

We hope this clarification and the revisions made throughout the manuscript better address your concerns and contribute to a clearer understanding of our study's implications and recommendations.

Modify Details:

Simultaneously, when ChatGPT is used, providing more detailed contextual information and relevant knowledge can enhance the accuracy of its responses [18].

[18] Goodman RS, Patrinely JR, Stone CA Jr, et al. Accuracy and Reliability of Chatbot Responses to Physician Questions. JAMA Netw Open. 2023 Oct 2;6(10):e2336483.

Page:17 Line:373-375

Furthermore, to better transform clinical practice and medical education, closer collaboration between artificial intelligence companies and clinical practitioners is essential. This entails the meticulous development of specialised training corpora and a concerted effort to develop a medical professional version of ChatGPT [19]. Such initiatives could address the limitations identified in this study, providing more precise and relevant support to the medical education and healthcare sectors.

[19] Li J, Dada A, Puladi B, Kleesiek J, et al. ChatGPT in healthcare: A taxonomy and systematic review. Comput Methods Programs Biomed. 2024 Mar;245:108013.

Page:18 Line:379-385

Q6. The introduction could provide more background information on the Chinese Master's Degree Entrance Examination in Clinical Medicine to help readers understand the context and significance of the study.

Response to Q6

Dear Reviewer, thank you for your constructive feedback on the introduction of our manuscript. Acknowledging the importance of context for our international readers, we have expanded the introduction to include a detailed description of the examination's background, its significance in the field of medical education in China, and its role in assessing the competence and readiness of candidates aspiring to pursue a master's degree in clinical medicine.

We believe that these additions will greatly enhance the reader's comprehension of the study's context and the significance of our findings within the broader landscape of medical education and artificial intelligence.

Modify Details:

NMPUA represents a critically important examination for aspiring master's degree students, aimed at assessing the clinical reasoning, knowledge, diagnostic capabilities, and decision-making proficiency of medical undergraduates in a clinical context.

Page:5 Line:93-96

Q7. The methods section should provide more details on how the researchers inputted the questions into ChatGPT and how they recorded and verified the responses.

Response to Q7

Dear Reviewer, thank you for your insightful feedback regarding the methods section of our manuscript. We recognize the importance of transparency and detail in describing our research process. Following your suggestion, we have now provided a comprehensive description of the procedures involved in inputting the questions into ChatGPT, as well as the methods we employed to record and verify the responses.

These enhancements to the methods section aim to provide readers with a clear and thorough understanding of our methodology, reinforcing the credibility and reliability of our findings.

We hope that these revisions adequately address your concerns and contribute to a more comprehensive and transparent presentation of our research process.

Modify Details:

In this study, we copied the 330 multiple-choice questions from the 2021 and 2022 Chinese Master of Clinical Medicine comprehensive examinations and submitted them to both ChatGPT3.5 and GPT-4 in the order they appeared in the examination papers, asking each model to simulate the role of a doctor and answer accordingly.

Page:7 Line:137-140

Responses were meticulously recorded in Excel and cross-verified against the correct answers to ensure a precise evaluation of performance on the Master of Clinical Medicine comprehensive examination.

Page:7 Line:145-148
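As an illustration of the recording and cross-verification workflow just described, the following is a minimal sketch using pandas. The spreadsheet layout, column names, and exact-match grading rule are assumptions for illustration rather than the study's actual Excel structure.

```python
# Minimal sketch (not the study's code): record model answers beside each question in a
# spreadsheet and cross-verify them against the answer key. Assumes pandas plus the
# openpyxl engine, and a file with hypothetical columns "question" and "correct_answer".
import pandas as pd

def record_and_verify(questions_file: str, answers: list[str], output_file: str) -> float:
    """Write recorded answers next to the key and return the overall accuracy."""
    df = pd.read_excel(questions_file)
    df["model_answer"] = answers          # answers collected from ChatGPT, in exam order
    df["is_correct"] = (
        df["model_answer"].str.strip().str.upper()
        == df["correct_answer"].str.strip().str.upper()
    )
    df.to_excel(output_file, index=False)
    return float(df["is_correct"].mean())
```

A call such as record_and_verify("questions_2021.xlsx", collected_answers, "graded_2021.xlsx") would then return the accuracy for that year; the file names here are hypothetical.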

Q8. The Discussion should include some concrete ways in which ChatGPT could be leveraged to assist students in exam preparation.

Response to Q8

Dear Reviewer, thank you for your valuable suggestion to enhance the discussion section of our manuscript by including specific ways in which ChatGPT could be leveraged to assist students in their exam preparation. We have added such concrete methods to the discussion; in doing so, we aim to provide readers with a clearer understanding of how AI, and ChatGPT in particular, can be used innovatively to enrich medical education and examination preparation.

We appreciate your guidance, which has undoubtedly contributed to making our discussion more comprehensive and relevant to educators, students, and the broader academic community.

Modify Details:

Furthermore, students may utilise ChatGPT to obtain solutions and explanations for these medical examination questions, aiding in the consolidation of knowledge and enhancement of problem-solving skills.

Page:15 Line:325-327

Q9. The limitations of the study could be expanded. For example, the authors note the discrepancy in time between when the test questions were administered vs. when ChatGPT's databases were last updated, but they could discuss further how this may have impacted the results.

Response to Q9

Dear Reviewer, thank you for your insightful recommendation to expand upon the limitations section of our study. We have taken your advice to heart and have elaborated on the discussion concerning the temporal discrepancies between the administration of the test questions and the last update of ChatGPT's databases.

By expanding the discussion on this limitation, we aim to provide a more comprehensive understanding of the factors that may influence the results of studies involving AI technologies in educational settings.

We appreciate your guidance in making our study more thorough and reflective of the complexities involved in integrating AI into educational contexts.

Modify Details:

Therefore, there exists a risk that the accuracy rate of ChatGPT, as determined by this test, may be artificially inflated.

Page:16 Line:342-343

Q10. The Conclusion section should address some areas where a knowledge gap was filled or knowledge was advanced by this study.

Response to Q10

Dear Reviewer, thank you for your valuable feedback regarding the conclusion section of our manuscript. In response to your suggestion, we have carefully revised this section to underscore the significance of our study and the ways in which it has contributed to filling knowledge gaps and advancing knowledge within the relevant field.

These additions aim to provide a clear understanding of the contributions our study makes to the field, emphasizing its role in advancing our understanding of AI's potential and limitations in supporting medical education.

Modify Details:

This study innovatively tested the performance of ChatGPT in the Chinese Master's Degree Entrance Examination in Clinical Medicine, filling a knowledge gap in the intersection of medical education and artificial intelligence.

Page:17 Line:366-368

Attachment

Submitted filename: Response to Reviewers.docx

pone.0301702.s002.docx (23.6KB, docx)

Decision Letter 1

Harpreet Singh Grewal

20 Mar 2024

Performance of ChatGPT on Chinese Master's Degree Entrance Examination in Clinical Medicine

PONE-D-24-06964R1

Dear Dr. Liu,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager and clicking the 'Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Harpreet Singh Grewal

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Harpreet Singh Grewal

25 Mar 2024

PONE-D-24-06964R1

PLOS ONE

Dear Dr. Liu,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Harpreet Singh Grewal

Academic Editor

PLOS ONE

Associated Data


    Supplementary Materials

    S1 Data

    (XLS)

    pone.0301702.s001.xls (361.5KB, xls)

    Data Availability Statement

    All relevant data are within the manuscript and its Supporting information files.

