PLoS One. 2023 Aug 29;18(8):e0290691. doi: 10.1371/journal.pone.0290691

ChatGPT versus human in generating medical graduate exam multiple choice questions—A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom)

Billy Ho Hung Cheung 1, Gary Kui Kai Lau 1, Gordon Tin Chun Wong 1, Elaine Yuen Phin Lee 1, Dhananjay Kulkarni 2, Choon Sheong Seow 3, Ruby Wong 4, Michael Tiong-Hong Co 1,*
Editor: Jie Wang
PMCID: PMC10464959  PMID: 37643186

Abstract

Introduction

Large language models, in particular ChatGPT, have showcased remarkable language processing capabilities. Given the substantial workload of university medical staff, this study aims to assess the quality of multiple-choice questions (MCQs) produced by ChatGPT for use in graduate medical examinations, compared to questions written by university professoriate staff based on standard medical textbooks.

Methods

Fifty MCQs were generated by ChatGPT with reference to two standard undergraduate medical textbooks (Harrison's and Bailey & Love's). Another 50 MCQs were drafted by two university professoriate staff using the same medical textbooks. All 100 MCQs were individually numbered, randomized, and sent to five independent international assessors for quality assessment using a standardized assessment score across five domains, namely appropriateness of the question, clarity and specificity, relevance, discriminative power of alternatives, and suitability for the medical graduate examination.

Results

The total time required for ChatGPT to create the 50 questions was 20 minutes 25 seconds, while the two human examiners took a total of 211 minutes 33 seconds to draft their 50 questions. When the mean scores of the questions constructed by A.I. and by humans were compared, the A.I. was inferior to humans only in the relevance domain (A.I.: 7.56 ± 0.94 vs human: 7.88 ± 0.52; p = 0.04). There was no significant difference in question quality between A.I.- and human-drafted questions in the total assessment score or in the other domains. Questions generated by A.I. yielded a wider range of scores, while those created by humans were consistent and fell within a narrower range.

Conclusion

ChatGPT has the potential to generate comparable-quality MCQs for medical graduate examinations within a significantly shorter time.

Introduction

The workload of university medical staff is a pressing issue that requires attention [1]. Medical staff are often tasked with multiple responsibilities, including, but not limited to, patient care, teaching, student assessment, research, and administrative work [2]. This heavy workload can be especially challenging for medical academic staff, who are also required to uphold the standard of undergraduate medical exams. The demand for exam quality has increased in recent years, as students and other stakeholders expect assessments to be fair, accurate, unbiased, and aligned with the predefined learning objectives [3, 4].

In this context, the development of artificial intelligence (A.I.), machine learning, and language models offers a promising solution. A.I. is a rapidly developing field that has the potential to transform many industries, including education [5]. A.I. is a broad field that encompasses many different technologies, such as machine learning, natural language processing, and computer vision [6]. Machine learning is a subset of A.I. that involves training algorithms to make predictions or decisions based on data [7]. This technology has been used in a variety of applications, including speech recognition, image classification, and language generation.

A large language model is a type of A.I. that has been trained on a massive amount of data and can generate human-like text with impressive accuracy [8]. These models are called "large" because they have a huge number of parameters, which are the weights in the model’s mathematical equations that determine its behaviour. The more parameters a model has, the more information it can store and the more complex tasks it can perform.

ChatGPT is a specific type of large language model developed by OpenAI [9]. It is a state-of-the-art model that has been trained on a diverse range of texts, including news articles, books, and websites. This makes it highly versatile and capable of generating text on a wide range of topics with remarkable coherence and consistency. It is a pre-trained language model with knowledge up to 2021. However, it is possible to feed in relevant reference text by the operator, such that updated text or desired text outputs can be generated based on the operator’s command and preference.

The potential implications and possibilities of using ChatGPT for assessment in medical education are significant. A recent publication confirmed that the knowledge provided by ChatGPT is adequate to pass the United States Medical Licensing Exam [10]. It is also believed that ChatGPT could be used to generate high-quality exam questions, provide personalized feedback to students, and automate the grading process, hence reducing the workload of medical staff and improving the quality of assessments [11]. This technology has the potential to be a game-changer in the field of medical education, providing new and innovative ways to assess student learning and evaluate exam quality.

Meanwhile, multiple-choice questions (MCQs) have been used as a form of knowledge-based assessment since the early twentieth century [12]. The MCQ is an important and integral component of both undergraduate and postgraduate exams because of its standardization, equitability, objectivity, cost-effectiveness, and reliability [13]. Compared with essay-type or short-answer questions, MCQs also allow assessment of a broader range of content, since each exam paper can include a large number of MCQs [14]. This makes the MCQ format particularly suitable for summative final examinations. The major drawback of the MCQ format, however, is that high-quality questions are difficult and time-consuming to draft.

We hypothesize that advanced large language models, such as ChatGPT, can reliably generate high-quality MCQs that are comparable to those of an experienced exam writer. Here, with an updated large language model A.I. available, we aim to evaluate the quality of exam questions generated by ChatGPT versus those drafted by university professoriate staff based on international gold-standard medical reference textbooks.

Methods

This is a prospective study comparing the quality of MCQs generated by ChatGPT with those drafted by experienced university professoriate staff for medical exams (Fig 1). The study was conducted in February 2023. To allow a fair comparison, the following criteria were set for question development.

Fig 1. Schematic diagram of the study design.


  1. The questions were designed to meet the standard for a medical graduate exam.

  2. Only four choices were allowed for each question.

  3. The questions were limited to knowledge-based questions only.

  4. The questions were text-based only.

  5. Topics regarding the exam context were set by an independent researcher before the design of the questions.

  6. Neither the professoriate staff who designed the questions nor the researcher operating ChatGPT was allowed to view questions from the other side.

  7. Standardized, internationally used textbooks, namely Harrison's Principles of Internal Medicine, 21st edition, for medicine [15], and Bailey & Love's Short Practice of Surgery, 27th edition, for surgery [16], were used as the references for question generation.

  8. Distractors were allowed to be drawn from sources other than the original reference provided.

  9. No explanation was required for the question.

ChatGPT Plus [17] was used for question creation. This version, released on 1 February 2023, is a small update to the standard ChatGPT. According to the official webpage, ChatGPT Plus can answer follow-up questions and challenge incorrect assumptions, and it provides better access during peak hours, faster response times, and priority access to updates.

Question construction

Fifty multiple-choice questions (MCQs) with four options, covering Internal Medicine and Surgery, were generated by ChatGPT using reference passages selected from the two reference medical textbooks.

ChatGPT was accessed in Hong Kong S.A.R. via a virtual private network to overcome its geographical limitation [18]. The operator using ChatGPT only copied and pasted the instruction "Can you write a multiple-choice question based on the following criteria, with the reference I am providing for you and your medical knowledge?" (the prompt), together with the criteria and the reference text (selected from the reference textbook and entered into the ChatGPT interface as plain text), to yield a question (S1 File). After the request was sent to ChatGPT, a response was automatically generated. Interaction was allowed only for clarification, not for any modification. Questions provided by ChatGPT were copied directly and used as the output for assessment by the independent quality assessment team. Moreover, a new chat session was started in ChatGPT for each entry to reduce memory retention bias.
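The study used the ChatGPT Plus web interface directly; no code was involved. Purely as an illustration, the sketch below shows how the same fixed prompt could be issued programmatically through the OpenAI Chat Completions API. The model name and the placeholder criteria and reference texts are assumptions and do not reflect the study's actual configuration.

```python
# Illustrative sketch only: the study pasted this prompt into the ChatGPT Plus web interface.
# The model name and the placeholder texts below are assumptions, not the study's setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Can you write a multiple-choice question based on the following criteria, "
    "with the reference I am providing for you and your medical knowledge?"
)
criteria = "Graduate-level, knowledge-based, text-only MCQ with exactly four options."  # assumed wording
reference_text = "<passage copied from the reference textbook>"  # placeholder

# Sending each request independently mirrors the study's "new chat session per entry"
# approach, so no previously generated question leaks into the context.
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model; the study used ChatGPT Plus in early 2023
    messages=[{
        "role": "user",
        "content": f"{PROMPT}\n\nCriteria: {criteria}\n\nReference: {reference_text}",
    }],
)
print(response.choices[0].message.content)  # used verbatim, without editing, as in the study
```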

The duration of work by A.I. was defined as the response time ChatGPT needed to generate the questions once the prompt and the material were given; the time taken by the operator to copy and paste the questions was not included. The durations of work by ChatGPT and by the humans were documented and compared.

Another 50 MCQs with four options were drafted by two experienced university professoriate staff (with more than 15 years of clinical practice and academic experience in Internal Medicine and Surgery, respectively) based on the same textbook reference materials.

They were given the same instructions and criteria described above. Additionally, while they were allowed to refer to additional textbooks or online resources, they were not allowed to use any A.I. tool to assist their question writing.

Blinded assessment by a multinational expert panel

All 100 MCQs were individually numbered and randomized using a computer-generated sequence.
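The randomization tool is not specified in the paper; as a minimal sketch under that caveat, such a computer-generated ordering could be produced as follows (the seed is arbitrary and only illustrates reproducibility).

```python
# Minimal sketch of a computer-generated randomization of 100 numbered MCQs.
# The actual software and seed used in the study are not reported.
import random

question_ids = [f"Q{i:03d}" for i in range(1, 101)]  # 50 A.I. + 50 human items, already numbered
rng = random.Random(2023)                            # fixed seed makes the sequence reproducible
rng.shuffle(question_ids)                            # shuffled order sent to the assessors
print(question_ids[:10])
```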

A multiple-choice item consists of the stem, the options, and any auxiliary information [12]. The stem contains the context, content, and/or the question the student is required to answer. The options include a set of alternative answers with one correct option and one or more incorrect options, or distractors. Auxiliary information includes any additional content, in either the stem or the options, required to generate an item. The distractors serve to divert non-competent candidates away from the correct answer, and effective distractors are an important hallmark of a high-quality question [19].

The question set was then sent to five independent, blinded international assessors (from the United Kingdom, Ireland, Singapore, and Hong Kong S.A.R.), who were also authors of this study, for assessment of question quality. Members of the international panel of assessors are experienced clinicians heavily involved in medical education in their localities.

An assessment tool with five domains was specifically designed for this study after a literature review of relevant metrics for MCQ quality assessment [20–23]. Specific instructions on the meaning of each domain were given to the assessors. These were (I) appropriateness of the question, defined as whether the question is correct, well formed, and appropriately constructed with an appropriate length; (II) clarity and specificity, defined as whether the question is clear, specific, answerable, and free of ambiguity, without being under- or over-informative; (III) relevance, defined as relevance to the clinical context; (IV) quality of the alternatives and discriminative power, for the assessment of the alternative choices provided; and (V) suitability for a graduate medical school exam, focused on the level of challenge, including whether the question assesses higher-order learning outcomes such as application, analysis, and evaluation, as elaborated by Bloom and his collaborators [24]. The quality of each MCQ was assessed on a numeric scale of 0–10, where "0" is defined as extremely poor and "10" means the question is of gold-standard quality. The total score for this section ranges from 0 to 50.

In addition, the assessors were asked to determine whether each question was constructed by A.I. or by a human, while blinded to the number of questions created by each arm. The GPT-2 Output Detector, an A.I. output detector [25], was also used to predict whether each question was written by A.I. or a human. This detector assigns a score between 0.02% and 99.98% to each question, with a higher score indicating a greater likelihood that the question was constructed by A.I.
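The GPT-2 Output Detector web demo is backed by a RoBERTa classifier published on Hugging Face. As a hedged sketch, a question could be scored locally as below; the model identifier is assumed to correspond to the demo used in the study, and the question text is a placeholder.

```python
# Illustrative sketch: scoring one question with the RoBERTa-based GPT-2 output detector.
# The model identifier is assumed to match the web demo cited in the study.
from transformers import pipeline

detector = pipeline("text-classification", model="roberta-base-openai-detector")

question_text = "Which of the following is the most common cause of ...?"  # placeholder stem
result = detector(question_text)  # top label ("Real" or "Fake") with its confidence
print(result)  # a high machine-generated ("Fake") score suggests A.I. authorship
```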

Statistical analysis

All data were prospectively collected by a research assistant and entered into a computerized database. All statistical analyses were performed with the Statistical Product and Service Solutions (SPSS) version 29. Comparisons were made between questions created by A.I. and by humans. Student's t-test or the Mann-Whitney U test was used to compare continuous variables for the five domains, individually and combined. A p-value of less than 0.05 was considered statistically significant. The chi-squared test or Fisher's exact test was used to compare discrete variables, namely the assessors' perception of whether a question was produced by A.I. or a human.

A paired t-test was performed to assess systematic differences between the mean measures of each rater (including the GPT-2 Output Detector).
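The analyses were run in SPSS; the sketch below only restates the same tests in Python with SciPy for readers who want to reproduce the approach. The score arrays are simulated placeholders shaped roughly like the reported means and standard deviations, and the 2 x 2 table uses Assessor A's counts from Table 4.

```python
# Re-statement of the study's tests with SciPy (the study itself used SPSS version 29).
# Scores are simulated placeholders; only the contingency table reflects reported counts.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
ai_scores = rng.normal(37.4, 3.9, 50)     # placeholder total scores, A.I. arm
human_scores = rng.normal(38.2, 2.6, 50)  # placeholder total scores, human arm

# Continuous outcomes: Student's t-test, or Mann-Whitney U when normality is doubtful
t_stat, p_t = stats.ttest_ind(ai_scores, human_scores)
u_stat, p_u = stats.mannwhitneyu(ai_scores, human_scores)

# Discrete outcome: an assessor's A.I./human guesses against the true writer.
# Rows: guessed A.I., guessed human; columns: actually A.I., actually human (Assessor A, Table 4)
contingency = np.array([[24, 27],
                        [26, 23]])
chi2, p_chi, dof, _ = stats.chi2_contingency(contingency)
odds_ratio, p_fisher = stats.fisher_exact(contingency)

# Systematic differences between two raters scoring the same items: paired t-test
rater_a = rng.normal(7.7, 0.8, 100)  # placeholder per-question scores from rater A
rater_b = rng.normal(7.5, 0.8, 100)  # placeholder per-question scores from rater B
t_pair, p_pair = stats.ttest_rel(rater_a, rater_b)

print(p_t, p_u, p_chi, p_fisher, p_pair)
```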

Results

Question construction

The question writing was performed by ChatGPT on two separate dates, 11 and 17 February 2023. The work was carried out over a stable Wi-Fi internet connection with a minimum of 15.40 Mbps download and 11.96 Mbps upload speed.

The total time required for ChatGPT to create the 50 questions was 20 minutes 25 seconds, while the two human examiners took a total of 211 minutes 33 seconds (84 minutes 56 seconds for the surgical examiner and 126 minutes 37 seconds for the medical examiner) to draft their 50 questions.

Assessment of question quality

The results of the assessment by the independent blinded assessors are summarized in Table 1. The overall scores were satisfactory, with a mean score above 7 in every domain.

Table 1. Mean score of each domain.

Mean (± SD) Range
Appropriateness of the question 7.78 ± 0.74 5.40–9.80
Clarity and specificity 7.63 ± 0.69 5.60–9.20
Relevance 7.72 ± 0.77 5.60–9.20
Quality of the alternatives & discriminative power 7.31 ± 0.65 5.60–7.31
Suitability for graduate medical school exam 7.32 ± 0.84 5.00–9.20
Total score 37.76 ± 3.34 27.20–46.20

When the mean scores of the questions constructed by A.I. and by humans were compared, the A.I. was inferior to humans only in the relevance domain (A.I.: 7.56 ± 0.94 vs human: 7.88 ± 0.52; p = 0.04, Table 2). There was no significant difference in the other domains of the question quality assessment, nor in the total scores between A.I. and humans (Table 2).

Table 2. Comparison of mean scores between questions generated by AI and human.

AI (± SD) Human (± SD) P
Appropriateness of the question 7.72 ± 0.83 7.84 ± 0.65 0.45
Clarity and specificity 7.56 ± 0.81 7.69 ± 0.55 0.34
Relevance 7.56 ± 0.94 7.88 ± 0.52 0.04
Quality of the alternatives & discriminative power 7.26 ± 0.68 7.36 ± 0.61 0.46
Suitability for graduate medical school exam 7.25 ± 0.94 7.40 ± 0.72 0.39
Total score 37.36 ± 3.92 38.16 ± 2.62 0.23

SD = standard deviation

Questions generated by A.I. yielded a wider range of scores, while those created by humans were consistent and within a narrower range (Fig 2). A similar distribution was also observed across all five domains (Fig 3).

Fig 2. Assessment scores of MCQ quality between A.I. and human.


Fig 3. Assessment scores across all five assessment domains.


A.I. vs human

For questions based on the same reading material, the average scores of the A.I.-generated and human-written questions were compared; a "win" denotes a higher average score. Questions written by humans generally received a higher mark than their A.I. counterparts (Table 3). However, A.I.-drafted questions outperformed the human-written ones in 36% to 44% of cases across the five assessment domains and the total score.

Table 3. Comparison between AI vs human with the same reference.

AI wins Human wins Equal Mean difference (AI–human) (± SD)
Appropriateness of the question 18 (36%) 27 (54%) 5 (10%) - 0.11 ± 1.05
Clarity and specificity 18 (36%) 26 (52%) 6 (12%) - 0.13 ± 1.08
Relevance 18 (36%) 27 (54%) 5 (10%) - 0.32 ± 1.04
Quality of the alternatives & discriminative power 21 (42%) 26 (52%) 3 (6%) - 0.10 ± 0.94
Suitability for graduate medical school exam 22 (44%) 28 (56%) 2 (4%) - 0.14 ± 1.12
Total score 20 (40%) 30 (60%) 0 (0%) - 0.80 ± 4.82
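For readers who want to see how the per-reference tallies in Table 3 could be computed, the sketch below pairs each A.I. question with the human question written from the same reference passage and counts wins, losses, ties, and the mean difference; the example scores are made up.

```python
# Sketch of the per-reference "win" tally behind Table 3 (placeholder scores, assumed pairing).
def tally_wins(ai_means, human_means):
    """Each index pairs the A.I. and human question written from the same reference text."""
    ai_wins = sum(a > h for a, h in zip(ai_means, human_means))
    human_wins = sum(h > a for a, h in zip(ai_means, human_means))
    ties = sum(a == h for a, h in zip(ai_means, human_means))
    diffs = [a - h for a, h in zip(ai_means, human_means)]
    return ai_wins, human_wins, ties, sum(diffs) / len(diffs)

# Example with made-up mean scores for three matched pairs:
print(tally_wins([7.8, 7.2, 7.6], [7.6, 7.4, 7.6]))  # e.g. (1, 1, 1, 0.0)
```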

In addition, when the questions generated by ChatGPT were reviewed, we observed few negative features: negative stems were used only minimally (14% (7/50) of A.I. questions, compared with 12% (6/50) of those by the human examiners), and neither "except" nor "all/none of the above" was used.

Blinded guess of the question writer by a panel of assessors

Assessors were asked to deduce whether each question was written by A.I. or a human, and the results are shown in Table 4, together with the results from the GPT-2 Output Detector. The percentage of correct guesses was consistently low: none of the five assessors achieved a correct guess rate of 90% or above, and only the GPT-2 Output Detector achieved a 90% correct guess rate, for questions written by the human exam writers. In addition, there was no correlation between the assessors' guesses and the actual writer of each question.

Table 4. Blinded guess of question writer (i.e. AI vs human).

AI (total = 50) (correct guess, %) Human (total = 50) (correct guess, %) Correlation p
Assessor A 24, 48% 23, 46% - 0.14–0.26 0.55
Assessor B 14, 28% 41, 82% - 0.38–0.10 0.24
Assessor C 33, 66% 24, 48% - 0.35–0.06 0.16
Assessor D 27, 53% 26, 52% - 0.26–0.14 0.55
Assessor E 26, 52% 32, 64% - 0.36–0.04 0.11
GPT-2 Output Detector 7, 14% 45, 90% - 0.40–0.21 0.54

Discussion

This is the first evidence in the literature showing that a commercially available, pre-trained A.I. can prepare exam material of comparable quality to that of experienced human examiners.

MCQs are a crucial assessment tool in education because they permit the direct measurement of knowledge, skills, and competencies across a broad range of disciplines, testing concepts and principles, judgment, inference, reasoning, interpretation of data, and the application of information [12, 14]. MCQs are also efficient to administer, easy to score objectively, and provide statistical information on class performance on a particular question and on whether the question was appropriate to the context in which it was presented [21, 26]. A standard MCQ consists of the stem, the options, and occasionally auxiliary information [27]. The stem contains the context and content and sets the question. The options include a set of alternative answers with one correct option and other incorrect options known as distractors [22]. Distractors are required to divert non-competent candidates away from the right answer, and they serve as an essential hallmark of a high-quality question [19]. However, the major drawback of the MCQ format is that high-quality questions are difficult, time-consuming, and costly to write [28]. Our results show that even experienced examiners required, on average, more than four minutes to prepare one question. With the encouraging results demonstrated by the current study, we believe that A.I. has the potential to generate quality MCQs for medical education.

Artificial intelligence has been used in education to improve learning and teaching outcomes for several years. The history of A.I. in education can be traced back to the 1950s, when scientists and mathematicians explored the mathematical possibility of A.I. [29]. In recent years, A.I. has been used to analyze students' learning abilities and histories, allowing teachers to create the best learning program for all students [5].

The introduction of ChatGPT, an A.I.-powered chatbot, has also transformed the landscape of A.I. in education. ChatGPT was trained on a large dataset of real human conversations and online data, enabling it to write songs or poems, tell stories, create lists, and even pass exams [10]. In our study, no additional user training or tuning of ChatGPT was required to yield these results, beyond following the commands. As demonstrated, reasonable MCQs can be written by ChatGPT using simple commands and a provided text reference. The questions written by humans were rated superior to those written by A.I. when questions based on the same reference were compared head-to-head (Table 3). However, the differences in our raters' average scores were non-significant except in the relevance domain (Table 2). Thus, despite the apparent superiority of the human-written MCQs, the score differences were narrow and mostly non-significant, indicating considerable potential to explore the use of such tools as assistance in other educational scenarios.

However, ChatGPT and other A.I.-powered tools in education have also raised concerns about their negative impacts on student learning [11]. Most A.I. models, including ChatGPT, are trained on the vast content available on the internet, whose reliability and credibility are questionable. Moreover, many A.I. systems have been found to carry significant bias arising from their training data [30]. Another major potential drawback of natural-language-generating A.I. is hallucination [31]. As with hallucination described in humans, this refers to a phenomenon in which the A.I. generates content that is nonsensical or unfaithful to the provided source input. This has led to the prompt withdrawal of even some initially promising A.I. systems from the largest internet companies, such as Galactica from Meta [32]. Hence, our team proposes that such A.I. be operated by educators, which demonstrates a feasible way of utilizing it for educational purposes. To minimize bias and hallucination, our proposed methodology provides a reliable reference from which the A.I. generates questions, rather than relying entirely on its training data [33].

Consistent with the guidance from the Division of Education, American College of Surgeons, there were minimal negative features in the MCQs written by both ChatGPT and humans (14% (7/50) vs 12% (6/50)), and there was no use of "except" or "All/none of the above", which could create additional confusion for exam candidates [20]. However, we also acknowledge intrinsic limitations of ChatGPT in generating MCQs for medical graduate examinations. First, as a purely language-based model, it cannot create text that correlates with clinical photographs or radiological images, an essential area of the exam that assesses candidates' interpretation skills. In addition, our pre-study tests found that ChatGPT performed poorly when instructed to generate a clinical scenario, possibly because of the high complexity of knowledge and experience required to create a relevant scenario, limiting its use for assessing candidates' ability to apply their knowledge. However, with the continuous effort of the OpenAI team and the rapid evolution of A.I. technology, these barriers may be overcome in the near future [34].

Limitations

The first limitation is that the reference material was taken directly from a textbook, and the length of the text was limited by the A.I. platform, currently 510 tokens, potentially leading to selection bias by the operator. In contrast, human exam writers can recall information associated with the text, resulting in higher-quality questions. Another limitation is the small number of people involved and the limited number of questions generated, which could limit the applicability of the results. Only two professoriate staff, from the Medicine and Surgery departments rather than experts in all fields, developed the MCQs in this study. This differs from the real-life scenario, in which graduate exam questions are generated by a large pool of question writers from various clinical departments and then reviewed and vetted by a panel of professoriate staff. Moreover, only 50 questions were generated by humans and another 50 by A.I., and only five assessors were involved. Real-world performance, in particular the discrimination index, could not be assessed because no actual students sat the test. Efforts were therefore made to improve generalizability, including comprehensive coverage of all areas in both medicine and surgery and the participation of a multinational panel for assessment. The third limitation concerns the absence of human interference in the question-writing process. A.I. generated the questions, and the first available question was captured without any polishing, even though ChatGPT is known to be sensitive to small changes in the prompt and can provide different answers to the same prompt, and extra fine-tuning, in the form of further conversation with ChatGPT, can significantly enhance the quality of its output. In addition, this study only evaluated the use of A.I. in generating MCQs; the full application of A.I. to developing entire medical examinations has yet to be thoroughly assessed. Lastly, with improvements in newer generations of the ChatGPT platform and other adjuncts, A.I. may perform better than observed in this study. Nonetheless, this study provides solid evidence of the ability of ChatGPT and its strong potential for assisting in medical exam MCQ preparation.

Conclusion

This is the first study showing that ChatGPT, a widely available large language model, can be used as an exam question writer for graduate medical exams with performance comparable to that of experienced human examiners. Our study supports continued exploration of how large language model A.I. can help academia improve efficiency while maintaining a consistently high standard. Further studies are required to explore additional applications and limitations of the rapidly evolving A.I. platforms, to enhance reliability with minimal bias.

Supporting information

S1 Checklist. PLOS ONE clinical studies checklist.

(DOCX)

S1 File. ChatGPT webpage.

(DOCX)

Data Availability

All relevant data are within the paper and its Supporting information files.

Funding Statement

No.

References

  • 1. Nassar AK, Reid S, Kahnamoui K, Tuma F, Waheed A, McConnell M. Burnout among Academic Clinicians as It Correlates with Workload and Demographic Variables. Behavioral Sciences. 2020;10(6):94. doi: 10.3390/bs10060094
  • 2. Rao SK, Kimball AB, Lehrhoff SR, Hidrue MK, Colton DG, Ferris TG, et al. The Impact of Administrative Burden on Academic Physicians: Results of a Hospital-Wide Physician Survey. Academic Medicine. 2017;92(2):237–43. doi: 10.1097/ACM.0000000000001461
  • 3. Yeoh KG. The future of medical education. Singapore Med J. 2019;60(1):3–8. doi: 10.11622/smedj.2019003
  • 4. Wong BM, Levinson W, Shojania KG. Quality improvement in medical education: current state and future directions. Med Educ. 2012;46(1):107–19. doi: 10.1111/j.1365-2923.2011.04154.x
  • 5. Chen L, Chen P, Lin Z. Artificial Intelligence in Education: A Review. IEEE Access. 2020;8:75264–78.
  • 6. Scotti V. Artificial intelligence. IEEE Instrumentation & Measurement Magazine. 2020;23(3):27–31.
  • 7. Jordan MI, Mitchell TM. Machine learning: Trends, perspectives, and prospects. Science. 2015;349(6245):255–60. doi: 10.1126/science.aaa8415
  • 8. Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682. 2022.
  • 9. OpenAI. ChatGPT: Optimizing Language Models for Dialogue. 2023. https://openai.com/blog/chatgpt/.
  • 10. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digital Health. 2023;2(2):e0000198. doi: 10.1371/journal.pdig.0000198
  • 11. O'Connor S, ChatGPT. Open artificial intelligence platforms in nursing education: Tools for academic progress or abuse? Nurse Education in Practice. 2023;66:103537. doi: 10.1016/j.nepr.2022.103537
  • 12. Haladyna TM, Downing SM, Rodriguez MC. A Review of Multiple-Choice Item-Writing Guidelines for Classroom Assessment. Applied Measurement in Education. 2002;15(3):309–33.
  • 13. Kilgour JM, Tayyaba S. An investigation into the optimal number of distractors in single-best answer exams. Adv Health Sci Educ Theory Pract. 2016;21(3):571–85. doi: 10.1007/s10459-015-9652-7
  • 14. Dion V, St-Onge C, Bartman I, Touchie C, Pugh D. Written-Based Progress Testing: A Scoping Review. Academic Medicine. 2022;97(5):747–57. doi: 10.1097/ACM.0000000000004507
  • 15. Loscalzo J, Fauci AS, Kasper DL, Hauser S, Longo D, Jameson JL. Harrison's Principles of Internal Medicine, Twenty-First Edition (Vol. 1 & Vol. 2). McGraw Hill LLC; 2022.
  • 16. Williams NS, O'Connell PR, McCaskie AW. Bailey & Love's Short Practice of Surgery. Taylor & Francis Group; 2018.
  • 17. OpenAI. Introducing ChatGPT Plus. 2023. https://openai.com/blog/chatgpt-plus/.
  • 18. OpenAI. Supported countries and territories. 2023. https://platform.openai.com/docs/supported-countries.
  • 19. Kumar D, Jaipurkar R, Shekhar A, Sikri G, Srinivas V. Item analysis of multiple choice questions: A quality assurance test for an assessment tool. Medical Journal Armed Forces India. 2021;77:S85–S9. doi: 10.1016/j.mjafi.2020.11.007
  • 20. Brame CJ. Writing good multiple choice test questions. 2013. https://cft.vanderbilt.edu/guides-sub-pages/writing-good-multiple-choice-test-questions/.
  • 21. Iñarrairaegui M, Fernández-Ros N, Lucena F, Landecho MF, García N, Quiroga J, et al. Evaluation of the quality of multiple-choice questions according to the students' academic level. BMC Med Educ. 2022;22(1):779. doi: 10.1186/s12909-022-03844-3
  • 22. Gierl MJ, Bulut O, Guo Q, Zhang X. Developing, Analyzing, and Using Distractors for Multiple-Choice Tests in Education: A Comprehensive Review. Review of Educational Research. 2017;87(6):1082–116.
  • 23. Shin J, Guo Q, Gierl MJ. Multiple-Choice Item Distractor Development Using Topic Modeling Approaches. Front Psychol. 2019;10:825. doi: 10.3389/fpsyg.2019.00825
  • 24. Adams NE. Bloom's taxonomy of cognitive learning objectives. Journal of the Medical Library Association: JMLA. 2015;103(3):152. doi: 10.3163/1536-5050.103.3.010
  • 25. OpenAI. GPT-2 Output Detector. 2022. https://huggingface.co/openai-detector.
  • 26. Haladyna TM. Developing and validating multiple-choice test items. Routledge; 2004.
  • 27. Vegada B, Shukla A, Khilnani A, Charan J, Desai C. Comparison between three option, four option and five option multiple choice question tests for quality parameters: A randomized study. Indian J Pharmacol. 2016;48(5):571–5. doi: 10.4103/0253-7613.190757
  • 28. Epstein RM. Assessment in Medical Education. New England Journal of Medicine. 2007;356(4):387–96. doi: 10.1056/NEJMra054784
  • 29. Haenlein M, Kaplan A. A brief history of artificial intelligence: On the past, present, and future of artificial intelligence. California Management Review. 2019;61(4):5–14.
  • 30. Howard FM, Dolezal J, Kochanny S, Schulte J, Chen H, Heij L, et al. The impact of site-specific digital histology signatures on deep learning model accuracy and bias. Nat Commun. 2021;12(1):4423. doi: 10.1038/s41467-021-24698-1
  • 31. Maynez J, Narayan S, Bohnet B, McDonald R. On faithfulness and factuality in abstractive summarization. arXiv preprint arXiv:2005.00661. 2020.
  • 32. Heaven WD. Why Meta's latest large language model survived only three days online. MIT Technology Review. 2023. https://www.technologyreview.com/2022/11/18/1063487/meta-large-language-model-ai-only-survived-three-days-gpt-3-science/.
  • 33. Alkaissi H, McFarlane SI. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. Cureus. 2023;15(2):e35179. doi: 10.7759/cureus.35179
  • 34. OpenAI. GPT-4. 2023. https://openai.com/research/gpt-4.

Decision Letter 0

Jie Wang

Transfer Alert

This paper was transferred from another journal. As a result, its full editorial history (including decision letters, peer reviews and author responses) may not be present.

17 Jul 2023

PONE-D-23-14620: ChatGPT versus human in generating medical graduate exam questions – An international prospective study. PLOS ONE

Dear Dr. CO,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 31 2023 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Jie Wang, Ph.D.

Academic Editor

PLOS ONE

Journal requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. Thank you for including your ethics statement:  "N/A".  

a. For studies reporting research involving human participants, PLOS ONE requires authors to confirm that this specific study was reviewed and approved by an institutional review board (ethics committee) before the study began. Please provide the specific name of the ethics committee/IRB that approved your study, or explain why you did not seek approval in this case.

Once you have amended this/these statement(s) in the Methods section of the manuscript, please add the same text to the “Ethics Statement” field of the submission form (via “Edit Submission”).

For additional information about PLOS ONE ethical requirements for human subjects research, please refer to http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research.

b. Please provide additional details regarding participant consent. In the ethics statement in the Methods and online submission information, please ensure that you have specified (1) whether consent was informed and (2) what type you obtained (for instance, written or verbal, and if verbal, how it was documented and witnessed). If your study included minors, state whether you obtained consent from parents or guardians. If the need for consent was waived by the ethics committee, please include this information.

If you are reporting a retrospective study of medical records or archived samples, please ensure that you have discussed whether all data were fully anonymized before you accessed them and/or whether the IRB or ethics committee waived the requirement for informed consent. If patients provided informed written consent to have data from their medical records used in research, please include this information

Once you have amended this/these statement(s) in the Methods section of the manuscript, please add the same text to the “Ethics Statement” field of the submission form (via “Edit Submission”).

For additional information about PLOS ONE ethical requirements for human subjects research, please refer to http://journals.plos.org/plosone/s/submission-guidelines#loc-human-subjects-research.

3. Thank you for stating the following financial disclosure:

“No”

At this time, please address the following queries:

a) Please clarify the sources of funding (financial or material support) for your study. List the grants or organizations that supported your study, including funding received from your institution.

b) State what role the funders took in the study. If the funders had no role in your study, please state: “The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.”

c) If any authors received a salary from any of your funders, please state which authors and which funders.

d) If you did not receive any funding for this study, please state: “The authors received no specific funding for this work.”

Please include your amended statements within your cover letter; we will change the online submission form on your behalf.

4. Thank you for stating the following in your Competing Interests section: 

“No”

Please complete your Competing Interests on the online submission form to state any Competing Interests. If you have no competing interests, please state "The authors have declared that no competing interests exist.", as detailed online in our guide for authors at http://journals.plos.org/plosone/s/submit-now

This information should be included in your cover letter; we will change the online submission form on your behalf.

5. We note that you have stated that you will provide repository information for your data at acceptance. Should your manuscript be accepted for publication, we will hold it until you provide the relevant accession numbers or DOIs necessary to access your data. If you wish to make changes to your Data Availability statement, please describe these changes in your cover letter and we will update your Data Availability statement to reflect the information you provide.

6. We note that Figure 1 in your submission contain copyrighted images. All PLOS content is published under the Creative Commons Attribution License (CC BY 4.0), which means that the manuscript, images, and Supporting Information files will be freely available online, and any third party is permitted to access, download, copy, distribute, and use these materials in any way, even commercially, with proper attribution. For more information, see our copyright guidelines: http://journals.plos.org/plosone/s/licenses-and-copyright.

We require you to either (1) present written permission from the copyright holder to publish these figures specifically under the CC BY 4.0 license, or (2) remove the figures from your submission:

a. You may seek permission from the original copyright holder of Figure 1 to publish the content specifically under the CC BY 4.0 license.

We recommend that you contact the original copyright holder with the Content Permission Form (http://journals.plos.org/plosone/s/file?id=7c09/content-permission-form.pdf) and the following text:

“I request permission for the open-access journal PLOS ONE to publish XXX under the Creative Commons Attribution License (CCAL) CC BY 4.0 (http://creativecommons.org/licenses/by/4.0/). Please be aware that this license allows unrestricted use and distribution, even commercially, by third parties. Please reply and provide explicit written permission to publish XXX under a CC BY license and complete the attached form.”

Please upload the completed Content Permission Form or other proof of granted permissions as an "Other" file with your submission.

In the figure caption of the copyrighted figure, please include the following text: “Reprinted from [ref] under a CC BY license, with permission from [name of publisher], original copyright [original copyright year].”

b. If you are unable to obtain permission from the original copyright holder to publish these figures under the CC BY 4.0 license or if the copyright holder’s requirements are incompatible with the CC BY 4.0 license, please either i) remove the figure or ii) supply a replacement figure that complies with the CC BY 4.0 license. Please check copyright information on all replacement figures and update the figure caption with source information. If applicable, please specify in the figure caption text when a figure is similar but not identical to the original image and is therefore for illustrative purposes only.

7. Please include your tables as part of your main manuscript and remove the individual files. Please note that supplementary tables (should remain/ be uploaded) as separate "supporting information" files

Additional Editor Comments:

The authors should pay careful attention to each of the comments below and address the issues raised by the reviewers. Since the scale of the current study is not sufficient to support a generic conclusion, the authors need to be more cautious when interpreting the results.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Partly

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: I Don't Know

Reviewer #4: No

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: Yes

Reviewer #4: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

Reviewer #4: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: This is an interesting study. The actual quality of MCQ can be assessed by item analysis after testing it on students. The authors used assessment by experts. The study is interesting. needs some changes.

In the title, specify the exam of which country? Different countries have different levels of difficulty.

"An international prospective study" is not needed.

Background in abstract may have more detail about why this study was conducted.

Avoid old references.

Figure 3 is hazy.

Reviewer #2: My main criticism concerns the method used to assess the quality of the questions. For me the only scientifically valid way to evaluate the quality of the questions is to use them in real life, in official exams for example. Only the statistical analysis of students' answers makes it possible to evaluate the quality of a question. This is especially true for distractors. Making this criticism in the discussion is necessary in my opinion.

Reviewer #3: Many thanks for the authors to tackle this hot topic in medical education. I enjoyed reading your manuscript however, the following points should be addressed to improve it

1. The title “ChatGPT versus human in generating medical graduate exam questions” indicate more general term while the study mentioned only MCQ type of questions so either you need to include other types of questions or change the title to indicate the MCQ.

2. Method, how you make sure that the two faculty staff who generated the 50 questions were not using AI to generate questions? Clarify this in method section.

3. Some results mentioned in the discussion section "When the questions generated by ChatGPT were reviewed, we could also observe that they were compatible with the guidance from the Division of Education, American College of Surgeons, with minimal negative features, including a minimal use of negative stem (only 14% (7/50), compared to 12% (6/50) by human examiners) with a lack of "except", "All/none of the above""

This should be mentioned in result section and discussed in the discussion section.

4. Regarding assessor evaluation of quality of questions it would be better to assess the Bloom levels of questions and add it as another parameter to ensure that the questions not only assessing the lower levels of Bloom (remembering and understanding) specially for questions generated by AI.

5. Table 3 need more clarification (how you calculate the percentage in each column)

6. Table 4: the calculation of % should be the number of correctly guessed questions written by AI out of the 50 questions written by AI, and the same for the human correct guesses (e.g. for assessor B it should be 14/50, not 14/23).

7. Attach a strobe checklist for the article and address the missed part in the manuscript.

8. Look for instruction of authors for both incitation and reference section.

Reviewer #4: The authors made a comparative study between ChatGPT and Human Experts in generating medical domain MCQ questions from textbooks. The study offers some insight on the performance of chatGPT in generating medical MCQs.

However, I don't find any technical contribution in this article. What is the contribution of the authors? They only made some comparison between the MCQs generated by ChatGPT and human experts using certain evaluation criteria.

Additionally, the article needs some improvement for common readers:

There is no explanation on how the questions are generated by ChatGPT. I feel a large number of readers do not have specific idea on how to generate MCQs from a textbook (as input) using ChatGPT. So, there should be some discussion on that.

What are the list of human input (like, specific information, selection of specific portion, any tuning, parameter setting) or amount of experience required to generate meaningful questions using ChatGPT? For instance, the authors mentioned, the length of the text was limited by ChatGPT at 510 tokens. What are the other such inputs?

The authors claimed that "There was no significant difference in question quality between questions drafted by AI versus human". This is actually the key highlight of the paper. However, this is not completely supported by evidence. When I study the values in Table 3, I find that the human experts are much better than ChatGPT. For instance, appropriateness of the question: 18 (AI) vs 27 (human); Clarity and specificity 18 (AI) vs 26 (human); Distractor quality 21 (AI) vs 26 (human). The gap is significant. Then how the claim is justified?

Also, the number of questions is too less to make such a generic comment on ChatGPT vs Human. The study should consider much larger number of questions, more subjects as input, more number of human experts.

More number of MCQ evaluation metrics are required to be considered for complete evaluation of the MCQs from all aspects. The evaluation of quality of MCQs is a tricky task, and various metrics have been used in the literature in the past two decades. For example, well-formed, sentence length, sentence simplicity, difficulty level, answerable, sufficient context, relevant to the domain, over-informative, under-informative, grammaticality, item analysis etc. are some metrics I find in the literature.

Finally, the human evaluators give a score against each evaluation metric in a scale of 1-10. What was the specific guidelines given to the evaluators for the scoring? Only the name of a evaluation metric might have certain ambiguity and might carry different meaning to different experts. So, there should be proper guidelines for the evaluators. Please mention those.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: Yes: Pr. Olivier Palombi, Grenoble Université Alpes

Reviewer #3: Yes: Nazdar Ezzaddin Alkhateeb

Reviewer #4: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2023 Aug 29;18(8):e0290691. doi: 10.1371/journal.pone.0290691.r002

Author response to Decision Letter 0


23 Jul 2023

Dear Editor-in-Chief and Reviewers,

Thank you very much for your kind consideration and comprehensive assessment of the article originally titled ChatGPT versus human in generating medical graduate exam questions – An international prospective study [PONE-D-23-14620] - [EMID:26d712de0fc6d895].

We have reviewed your valuable comments and recommendations and here is our reply and amendments.

1. Reviewer #1: This is an interesting study. The actual quality of MCQ can be assessed by item analysis after testing it on students. The authors used assessment by experts. The study is interesting. needs some changes.

In title specify exam of which country? Different countries has different level of difficulty.

"An international prospective study" is not needed.

Reply:

TITLE

ChatGPT versus human in generating medical graduate exam multiple choice questions – A multinational prospective study (Hong Kong S.A.R, Singapore, Ireland, and the United Kingdom)

2. Background in abstract may have more detail about why this study was conducted.

Reply:

ABSTRACT

Introduction

Large language models, in particular ChatGPT, have showcased remarkable language processing capabilities. Given the substantial workload of university medical staff, this study aims to assess the quality of multiple-choice questions (MCQs) produced by ChatGPT for use in graduate medical examinations, compared to questions written by university professoriate staffs based on standard medical textbooks.

3. Avoid old references

(Updated)

4. Figure 3 is hazy.

We have revised the figure. Thank you.

5. Reviewer #2: My main criticism concerns the method used to assess the quality of the questions. For me the only scientifically valid way to evaluate the quality of the questions is to use them in real life, in official exams for example. Only the statistical analysis of students' answers makes it possible to evaluate the quality of a question. This is especially true for distractors. Making this criticism in the discussion is necessary in my opinion.

Reply:

(Thank you for pointing this limitation out, and our team is also aware of this limitation due to the lack of exam candidates for assessment. However, it might not be appropriate at this stage to apply these questions in real exams unless their usability has been assessed and further discussed in the faculty. Hence we think that it may be more suitable to address this in the limitations section.)

Another limitation is the limited number of people involved and the number of questions generated, which could limit the applicability of the results…., and only five assessors were involved. Real-world performance, in particular the discrimination index, was impossible to assess due to the lack of actual students doing the test. Hence, efforts were made to improve the generalizability, including comprehensive coverage of all areas in both medicine and surgery and the participation of a multinational panel for assessment.

Reviewer #3: Many thanks for the authors to tackle this hot topic in medical education. I enjoyed reading your manuscript however, the following points should be addressed to improve it

6. (1.) The title “ChatGPT versus human in generating medical graduate exam questions” indicate more general term while the study mentioned only MCQ type of questions so either you need to include other types of questions or change the title to indicate the MCQ.

Reply:

ChatGPT versus human in generating medical graduate exam multiple choice questions – A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom)

7. (2.) Method: how did you make sure that the two faculty staff who generated the 50 questions were not using AI to generate the questions? Clarify this in the methods section.

Reply:

Question construction

Another 50 MCQs with four options were drafted by two experienced university professoriate staff (with more than 15 years of clinical practice and academic experience in Internal Medicine and Surgery, respectively) based on the same textbook reference materials.

They were given the same instructions and criteria as mentioned above. Additionally, while they were allowed to refer to additional textbooks or online resources, they were not allowed to use any AI tool to assist their question writing.

8. (3.) Some results are mentioned in the discussion section: “When the questions generated by ChatGPT were reviewed, we could also observe that they were compatible with the guidance from the Division of Education, American College of Surgeons, with minimal negative features, including minimal use of negative stems (only 14% (7/50), compared to 12% (6/50) by human examiners) with a lack of “except”, “All/none of the above””

This should be mentioned in result section and discussed in the discussion section.

Reply:

Results

….

In addition, when the questions generated by ChatGPT were reviewed, we observed few negative features: minimal use of negative stems (only 14% (7/50), compared to 12% (6/50) by human examiners) and no use of “except” or “All/none of the above”.

Discussion

Compatible with the guidance from the Division of Education, American College of Surgeons, there were minimal negative features in both the MCQs written by ChatGPT and by humans (14% (7/50) vs 12% (6/50)), and there was also no use of “except” or “All/none of the above”, which could create additional confusion for exam candidates (31).

9. (4.) Regarding the assessors’ evaluation of question quality, it would be better to assess the Bloom levels of the questions and add this as another parameter, to ensure that the questions are not only assessing the lower Bloom levels (remembering and understanding), especially for the questions generated by AI.

Reply:

Blinded assessment by multinational expert panel

and (V) Suitability for the graduate medical school exam, focused on the level of challenge, including whether the question assesses higher-order learning outcomes such as application, analysis and evaluation, as elaborated by Bloom and his collaborators (24).

10. (5.) Table 3 needs more clarification (how did you calculate the percentage in each column?)

Reply:

AI vs human

The average scores of questions based on the same reading material were compared between those generated by AI and those written by humans. A "win" denotes the higher average score within a matched pair. Questions generated by humans generally received higher marks than their AI counterparts (Table 3). However, AI-drafted questions outperformed human ones in 36% to 44% of cases across the five assessment domains and the total score. (For example, if the AI-generated questions scored higher in 18 of the 50 matched pairs for a given domain, the percentage in that column is 18/50 = 36%.)

11. (6.) In Table 4, the percentage should be calculated as the number of correctly guessed AI-written questions out of the 50 questions written by AI, and the same for the human correct guesses (e.g. for assessor B it should be 14/50, not 14/23).

Reply:

Table 4. Blinded guess of question writer (i.e. AI vs Human).

Assessor                  AI correct guesses (total = 50)    Human correct guesses (total = 50)    Correlation        p
Assessor A                24 (48%)                           23 (46%)                              -0.14 to 0.26      0.55
Assessor B                14 (28%)                           41 (82%)                              -0.38 to 0.10      0.24
Assessor C                33 (66%)                           24 (48%)                              -0.35 to 0.06      0.16
Assessor D                27 (53%)                           26 (52%)                              -0.26 to 0.14      0.55
Assessor E                26 (52%)                           32 (64%)                              -0.36 to 0.04      0.11
GPT-2 Output Detector      7 (14%)                           45 (90%)                              -0.40 to 0.21      0.54

Results

… The percentage of correct guesses was consistently low. None of the five assessors achieved a correct “guess” rate of 90% or above. Only the GPT-2 Output Detector achieved a correct “guess” rate of 90% for human exam writers….
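For readers unfamiliar with this kind of detector, the sketch below shows one hypothetical way to run an automated AI-text classifier over a question stem. It assumes the publicly available RoBERTa-based GPT-2 output detector on Hugging Face (model id roberta-base-openai-detector); the study used the detector's web demo, so the model choice, label names and usage here are illustrative assumptions rather than the study's actual pipeline.

```python
# Hypothetical sketch: scoring one MCQ stem with an off-the-shelf AI-text detector.
# Assumes the RoBERTa-based GPT-2 output detector published on Hugging Face;
# the study itself used the detector's web interface, so this is illustrative only.
from transformers import pipeline

detector = pipeline("text-classification", model="roberta-base-openai-detector")

def guess_writer(question_text: str) -> dict:
    """Return the detector's predicted label and confidence for one question."""
    # The pipeline returns a list with one dict, e.g. {"label": ..., "score": ...};
    # the exact label strings depend on the model card.
    return detector(question_text)[0]

# Example usage with a hypothetical question stem:
# print(guess_writer("A 65-year-old man presents with sudden-onset chest pain..."))
```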

12. (7.) Attach a STROBE checklist for the article and address the missing parts in the manuscript.

Reply:

Introduction

… We hypothesize that advanced large language models, such as ChatGPT, can reliably generate high-quality MCQs that are comparable to those of an experienced exam writer….

Question construction

…ChatGPT was accessed in Hong Kong S.A.R. via a virtual private network to overcome its geographical limitation (18)….

Blinded assessment by a multinational expert panel

…The question set was then sent to five independent, blinded international assessors (from the United Kingdom, Ireland, Singapore and Hong Kong S.A.R.), who were also authors of this study, for assessment of question quality….

13. (8.) Look at the instructions for authors for both the in-text citations and the reference section.

Reply: updated

Reviewer #4: The authors made a comparative study between ChatGPT and human experts in generating medical-domain MCQs from textbooks. The study offers some insight into the performance of ChatGPT in generating medical MCQs.

14. However, I don't find any technical contribution in this article. What is the contribution of the authors? They only made some comparison between the MCQs generated by ChatGPT and human experts using certain evaluation criteria.

Reply:

(The primary contribution of this study centers on the novel use of ChatGPT as a tool for multiple-choice question (MCQ) generation guided by a reference text (instead of posing a direct request asking ChatGPT to write a question from scratch), which improves relevance and reliability; on the importance of the request (also known as the “prompt”, around which a branch of specialty coined “prompt engineering” has developed); and on the subsequent evaluation of its applicability. While I acknowledge the reviewer's perspective that the approach may not seem pioneering, given the widespread availability of ChatGPT, its application in this specific context is indeed unique. This work presents a valuable exploration of suitability across multiple dimensions, which will guide faculty members and academic staff in leveraging such technology to assist their work. Moreover, this study also underscores the limitations inherent to this technology, offering a balanced understanding of its potential and constraints.)

Additionally, the article needs some improvement for common readers:

14. There is no explanation of how the questions are generated by ChatGPT. I feel a large number of readers do not have a specific idea of how to generate MCQs from a textbook (as input) using ChatGPT. So there should be some discussion on that.

Reply:

Question construction

Fifty multiple-choice questions (MCQs) with four options covering Internal Medicine and Surgery were generated by ChatGPT with reference articles selected from the two reference medical textbooks.

The operator using ChatGPT only copied and pasted the instruction “Can you write a multiple-choice question based on the following criteria, with the reference I am providing for you and your medical knowledge?” (also known as the prompt), the criteria, and the reference text (selected from the reference textbook and entered into the ChatGPT interface as plain text) to yield a question (Supplementary file 1). After the request was sent to ChatGPT, a response was automatically generated. Interaction was allowed only for clarification, not for any modification. Questions provided by ChatGPT were copied directly and used as the output for assessment by the independent quality assessment team. Moreover, a new chat session was started in ChatGPT for each entry to reduce memory retention bias.

….
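For illustration only, the one-shot procedure quoted above (fixed prompt, criteria and reference text, with a fresh chat per question) could be scripted roughly as in the sketch below. The study used the ChatGPT web interface rather than the API, so the SDK, model name and message layout here are assumptions, not the authors' actual setup.

```python
# Hypothetical sketch of the prompting procedure, scripted with the OpenAI
# Python SDK. The model name and message formatting are assumptions; the
# study itself pasted the same text into the ChatGPT web interface.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

PROMPT = ("Can you write a multiple-choice question based on the following "
          "criteria, with the reference I am providing for you and your "
          "medical knowledge?")

def generate_mcq(criteria: str, reference_text: str, model: str = "gpt-3.5-turbo") -> str:
    """One question per call: each call is a fresh conversation, mirroring the
    study's rule of starting a new chat session to reduce memory retention bias."""
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": f"{PROMPT}\n\nCriteria:\n{criteria}\n\nReference text:\n{reference_text}",
        }],
    )
    # The reply is taken verbatim, with no edits, as in the study.
    return response.choices[0].message.content
```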

Discussion

As demonstrated in our study, a reasonable MCQ can be written by ChatGPT using a simple command with a text reference provided. This indicates great potential to explore the use of such a tool as an aid in other educational scenarios.

….

15. What is the list of human inputs (e.g. specific information, selection of specific portions, any tuning, parameter settings) or the amount of experience required to generate meaningful questions using ChatGPT? For instance, the authors mentioned that the length of the text was limited by ChatGPT to 510 tokens. What are the other such inputs?

Reply:

Discussion

In our study, no additional training of the user or extra tuning of ChatGPT was required to yield similar results, other than the need to follow the commands. As demonstrated in our study, a reasonable MCQ can be written by ChatGPT using simple commands with a text reference provided.

Limitations

And with extra fine-tuning in the form of further conversation with ChatGPT, the output quality from ChatGPT can be significantly enhanced…

16. The authors claimed that "There was no significant difference in question quality between questions drafted by AI versus human". This is actually the key highlight of the paper. However, this is not completely supported by evidence. When I study the values in Table 3, I find that the human experts are much better than ChatGPT. For instance, appropriateness of the question: 18 (AI) vs 27 (human); clarity and specificity: 18 (AI) vs 26 (human); distractor quality: 21 (AI) vs 26 (human). The gap is significant. Then how is the claim justified?

Reply:

Discussion

The questions written by humans were rated superior to those written by A.I. when we compared questions with the same reference head-to-head (Table 3). However, the difference in our raters' average scores was not significant except in the relevance domain (Table 2). This shows that, despite the apparent superiority of the MCQs written by humans, the score difference is actually narrow and mostly not statistically significant. This indicates great potential to explore the use of such tools as an aid in other educational scenarios.

17. Also, the number of questions is too small to make such a generic claim about ChatGPT vs humans. The study should consider a much larger number of questions, more subjects as input, and a larger number of human experts.

Reply:

(Unfortunately, the number of questions written and the assessors involved were limited by the resources available. We agree that these limitations should be addressed in the Limitations section as well.)

Limitations

Another limitation is the limited number of people involved and the number of questions generated, which could limit the applicability of the results. Only two professoriate staff, from the Medicine and Surgery departments but not experts in all fields, developed the MCQs in this study. This differs from the real-life scenario in which graduate exam questions are generated by a large pool of question writers from various clinical departments and then reviewed and vetted by a panel of professoriate staff. Besides, only 50 questions were generated by humans and another 50 by AI, and only five assessors were involved. Efforts were made to improve generalizability, including comprehensive coverage of all areas in both medicine and surgery and the participation of a multinational panel for assessment.

17. More MCQ evaluation metrics need to be considered for a complete evaluation of the MCQs from all aspects. The evaluation of MCQ quality is a tricky task, and various metrics have been used in the literature over the past two decades. For example, well-formed, sentence length, sentence simplicity, difficulty level, answerable, sufficient context, relevant to the domain, over-informative, under-informative, grammaticality, item analysis, etc. are some metrics I find in the literature.

Reply:

(The domains for the assessment were specifically designed after a literature review of relevant metrics for MCQ quality assessment, and specific instructions were given to the assessors on the meaning of each domain.)

Blinded assessment by a multinational expert panel

An assessment tool with five domains was specifically designed for this study after a literature review of relevant metrics for MCQ quality assessment (20-23). Specific instructions were given to the assessors on the meaning of each domain. These include (I) Appropriateness of the question, defined as whether the question is correct, appropriately constructed, of appropriate length and well-formed; (II) Clarity and specificity, defined as whether the question is clear and specific without ambiguity, answerable, and neither under- nor over-informative; (III) Relevance, defined as relevance to the clinical context; (IV) Quality of the alternatives and discriminative power, for the assessment of the alternative choices provided; and (V) Suitability for the graduate medical school exam, focused on the level of challenge, including whether the question assesses higher-order learning outcomes such as application, analysis and evaluation, as elaborated by Bloom and his collaborators (24). The quality of MCQs was objectively assessed on a numeric scale of 0 – 10. “0” is defined as extremely poor, while “10” means that the item is at the quality of a gold-standard question. The total score of this section ranges from 0 to 50.

….
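To make the rubric above concrete, the sketch below shows how the five 0-10 domain scores roll up into the 0-50 total and how mean scores for AI- versus human-written questions could be compared per domain. The independent-samples t-test is an illustrative assumption; the manuscript's statistics section defines the actual analysis.

```python
# Hypothetical sketch of the scoring arithmetic implied by the rubric:
# five 0-10 domain scores per question, summed to a 0-50 total, with mean
# AI vs human scores compared per domain. The t-test is an assumption for
# illustration, not necessarily the study's exact statistical method.
import numpy as np
from scipy import stats

DOMAINS = ["appropriateness", "clarity_specificity", "relevance",
           "alternatives_discrimination", "exam_suitability"]

def total_score(domain_scores: dict) -> int:
    """Sum the five 0-10 domain scores into the 0-50 total."""
    return sum(domain_scores[d] for d in DOMAINS)

def compare_domain(ai_scores: list, human_scores: list) -> dict:
    """Compare mean domain scores of AI- vs human-written questions."""
    ai, human = np.asarray(ai_scores), np.asarray(human_scores)
    t_stat, p_value = stats.ttest_ind(ai, human)
    return {"ai_mean": ai.mean(), "human_mean": human.mean(), "p": p_value}
```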

18. Finally, the human evaluators give a score for each evaluation metric on a scale of 1-10. What were the specific guidelines given to the evaluators for the scoring? The name of an evaluation metric alone might have certain ambiguity and might carry a different meaning for different experts. So there should be proper guidelines for the evaluators. Please mention those.

Reply:

Blinded assessment by a multinational expert panel

… The quality of MCQs was objectively assessed on a numeric scale of 0 – 10. “0” is defined as extremely poor, while “10” means that the item is at the quality of a gold-standard question. The total score of this section ranges from 0 to 50.

We hope that our reply addresses your concerns regarding our research and provides solid evidence of its originality, scientific robustness and importance for consideration for publication in your reputable journal. You are also very welcome to provide further comments and recommendations to improve the work and demonstrate its value to its potential audience.

Yours sincerely,

Dr Michael Co

21/7/2023

Attachment

Submitted filename: Response to Reviewers.docx

Decision Letter 1

Jie Wang

15 Aug 2023

ChatGPT versus human in generating medical graduate exam multiple choice questions – A multinational prospective study (Hong Kong SAR, Singapore, Ireland, and the United Kingdom)

PONE-D-23-14620R1

Dear Dr. CO,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Jie Wang, Ph.D.

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: All comments have been addressed

Reviewer #2: All comments have been addressed

Reviewer #4: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: No

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #4: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #4: Yes

**********

6. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The study addresses an important and relevant topic concerning the potential use of AI language models for generating MCQs in medical education. The research purpose is well-defined, and the study's objectives are clear.

The study adopts a prospective design and employs a systematic approach by comparing ChatGPT-generated MCQs with those developed by human examiners. The use of standardized assessment scores and independent international assessors enhances the reliability and validity of the findings.

The article highlights the significant time advantage of ChatGPT in generating MCQs, with a total time of 20 minutes and 25 seconds to create 50 questions, compared to 211 minutes and 33 seconds taken by human examiners for the same task. This efficiency demonstrates the potential practicality of using AI in question generation.

The study reveals that ChatGPT-generated MCQs show comparable quality to those created by human examiners across most domains, including appropriateness, clarity, discriminative power, and suitability for medical graduate exams. This suggests that ChatGPT is capable of producing high-quality MCQs in these areas.

Reviewer #2: (No Response)

Reviewer #4: Actually, I feel some more experiments are required to validate the claim. But it seems the authors feel their work is sufficient. Anyway ...

**********

7. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Dr. Himel Mondal

Reviewer #2: Yes: Full Professor Olivier Palombi, Université Grenoble Alpes

Reviewer #4: No

**********

Acceptance letter

Jie Wang

18 Aug 2023

PONE-D-23-14620R1

ChatGPT versus human in generating medical graduate exam multiple choice questions – A multinational prospective study (Hong Kong S.A.R., Singapore, Ireland, and the United Kingdom)

Dear Dr. CO:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Jie Wang

Academic Editor

PLOS ONE

Associated Data

    This section collects any data citations, data availability statements, or supplementary materials included in this article.

    Supplementary Materials

    S1 Checklist. PLOS ONE clinical studies checklist.

    (DOCX)

    S1 File. ChatGPT webpage.

    (DOCX)

    Attachment

    Submitted filename: Response to Reviewers.docx

    Data Availability Statement

    All relevant data are within the paper and its Supporting information files.

