PLOS ONE. 2024 Sep 26;19(9):e0306233. doi: 10.1371/journal.pone.0306233

Is ChatGPT 3.5 smarter than Otolaryngology trainees? A comparison study of board style exam questions

Jaimin Patel 1, Peyton Robinson 2, Elisa Illing 1, Benjamin Anthony 1,*
Editor: Harpreet Singh Grewal
PMCID: PMC11426521  PMID: 39325705

Abstract

Objectives

This study compares the performance of the artificial intelligence (AI) platform Chat Generative Pre-Trained Transformer (ChatGPT) to Otolaryngology trainees on board-style exam questions.

Methods

We administered a set of 30 Otolaryngology board-style questions to medical students (MS) and Otolaryngology residents (OR). Thirty-one MSs and 17 ORs completed the questionnaire. The same test was administered to ChatGPT version 3.5 five times. Performance was compared using a one-way ANOVA with a Tukey post hoc test, along with a regression analysis to explore the relationship between education level and performance.

Results

The average scores increased each year from MS1 to PGY5. A one-way ANOVA revealed that ChatGPT outperformed trainee years MS1, MS2, and MS3 (p < 0.001, p = 0.003, and p = 0.019, respectively). PGY4 and PGY5 otolaryngology residents outperformed ChatGPT (p = 0.033 and 0.002, respectively). For years MS4, PGY1, PGY2, and PGY3, there was no statistically significant difference between trainee scores and ChatGPT (p = 0.104, 0.996, 1.000, and 0.242, respectively).

Conclusion

ChatGPT can outperform lower-level medical trainees on Otolaryngology board-style exam questions but still lacks the ability to outperform higher-level trainees. These questions primarily test rote memorization of medical facts; in contrast, the art of practicing medicine is predicated on the synthesis of complex presentations of disease and the multilayered application of knowledge of the healing process. Given that upper-level trainees outperform ChatGPT, it is unlikely that ChatGPT, in its current form, will provide significant clinical utility over an Otolaryngologist.

Introduction

Current developments in artificial intelligence (AI) technology using advanced language models have generated a significant amount of public interest. Chat Generative Pre-Trained Transformer (ChatGPT), an AI-based language model developed by OpenAI, stands out for its ability to generate human-like responses in written format. Recent improvements to ChatGPT have garnered significant attention as this sophisticated AI platform finds its place in modern society. Fueled by vast databases, ChatGPT provides precise, personalized answers, a testament to its prowess in understanding the intricacies of human language. Based on this repository of knowledge, this language model effortlessly mirrors real-life conversations and boasts profound knowledge across diverse subjects [1].

The role of AI in medicine has been met with both hopeful intrigue and skepticism. AI-powered systems like ChatGPT can provide immediate access to information for patients and healthcare providers to augment healthcare decisions. ChatGPT seems to have an obvious role in patient education and medical education due to its ability to generate knowledgeable responses to fact-based questions with categorical answers. ChatGPT could possibly even play a direct role in augmenting patient care decisions and treatment. However, the accuracy and reliability of AI systems like ChatGPT have not yet been firmly established in medicine. Nevertheless, efforts continue to further develop this technology to determine if it holds value for patient care.

ChatGPT has been tested on a diverse list of standardized examinations, such as the Uniform Bar Examination, the Scholastic Assessment Test (SAT), the Graduate Record Examination (GRE), high school Advanced Placement exams, and more [2]. Despite medicine being filled with niche terminology, acronyms, and multidisciplinary topics, ChatGPT has exhibited a broad knowledge of medicine. Indeed, ChatGPT was found to be likely able to pass the USMLE Step 1 examination [3]. With regard to subspecialty fields, the literature has shown that ChatGPT achieves passing or near-passing performance on board exams for Ophthalmology, Pathology, Neurosurgery, Cardiology, and Otolaryngology [3–9]; however, ChatGPT did quite poorly on the multiple-choice Orthopedic board exam [10]. As a repository of advanced medical knowledge, ChatGPT underperformed in comparison with the widely used UpToDate medical reference [11]. AI-based language models could be a great tool when patients desire reliable information on upcoming procedures, prescriptions, and other aspects of their care that carry significant weight for the patient [12], but their utility in advanced medical decision making remains to be investigated.

This project compares the performance of ChatGPT version 3.5 with that of medical trainees at a US medical school and residency program on board-style questions for the Otolaryngology–Head and Neck Surgery board exam. To objectively quantify ChatGPT's knowledge of otolaryngology, we compared it with learners ranging from the earliest stages of medical education to senior-level otolaryngology residents. The questions ranged from fundamental concepts learned during the early years of medical school to the complexities of advanced medical and surgical patient management acquired by the end of residency training. Our primary aim was to assess whether, and at what level of training, ChatGPT can outperform human learners on Otolaryngology board-style questions.

Materials and methods

This study was exempt from requiring approval by the institutional review board at Indiana University. Data were collected from October 2, 2023, through January 5, 2024. Thirty multiple-choice Otolaryngology board-style questions were administered to medical students of all years and to Otolaryngology residents. The same questions were also posed to ChatGPT. Because ChatGPT is an iterative, learning-based model that may give different answers each time a question is asked, the test was administered to ChatGPT five times. The 30 questions were aggregated, with varying degrees of difficulty, from a pre-published board preparation question bank, with slight changes to the question and answer choices to avoid reproducing the source material verbatim. The questions are provided in the Supporting Information for review. The stem of each question and the concepts tested were not changed, to preserve the rigor of a board exam. Neither the human participants nor ChatGPT were asked to explain why they chose their respective answers. However, ChatGPT did provide reasoning for its choices without being prompted.

Questions were distributed via a Google Form to all 1461 medical students (years 1–4, MS1–MS4) through a listserv email and to 17 of the 18 Otolaryngology residents (years 1–5, PGY1–PGY5) at Indiana University School of Medicine. Participants were blinded to the purpose of this exam to avoid bias; thus, they were not provided informed consent regarding the underlying purpose of the study. They were simply asked to answer the questions to test the quality of the questions as written. No compensation or incentives were provided for completion of the questionnaire. The only identifying data collected was the education level of each participant (MS1–PGY5). At the beginning of the study, the participants were given clear instructions: “Thanks so much for taking the time to answer this 30-question quiz that covers topics within Otolaryngology. We ask that you take this quiz in one sitting and do not use outside resources. This will allow us to accurately evaluate the questions written.”

For ChatGPT, the model was prompted with the following: “You are a medical professional and I want you to pick an answer from the multiple-choice question I provide.” For example, in one administration, ChatGPT responded with: “Of course, I would be happy to help you with multiple choice questions related to medical topics. Please provide the question and its options, and I'll do my best to provide you with the correct answer and explanation.” Following this prompt, each of the 30 questions was provided one at a time, and the answer and reasoning were recorded. The test was administered five times, once per day on five different days. This methodology was used to help capture the variability that language models can exhibit; we believe it allowed ChatGPT additional chances to retrieve the correct information from the vast data it draws on. Additionally, while ChatGPT was not asked for explanations, its reasoning was recorded for each response; however, we did not analyze these data further, as that was not the intention of this study.
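
The questions were posed to ChatGPT 3.5 interactively; the sketch below is not the authors' method but illustrates how a comparable protocol could be automated with the OpenAI Python client, using gpt-3.5-turbo as an assumed stand-in for ChatGPT 3.5 and a hypothetical list of question strings.

```python
# Hedged sketch (not the study's actual workflow): repeatedly administering
# multiple-choice questions to a GPT-3.5-class model via the OpenAI Python
# client (v1.x). Assumes OPENAI_API_KEY is set in the environment and that
# `questions` is a list of 30 question strings, each including its answer choices.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a medical professional and I want you to pick an answer "
    "from the multiple-choice question I provide."
)

def ask_question(question_text: str) -> str:
    """Send one board-style question and return the model's free-text reply."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",   # assumed stand-in for ChatGPT 3.5
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": question_text},
        ],
        temperature=1.0,         # default sampling, so answers can vary across runs
    )
    return response.choices[0].message.content

def administer_exam(questions: list[str]) -> list[str]:
    """Administer all questions one at a time, mirroring the study protocol."""
    return [ask_question(q) for q in questions]

# Five separate administrations, as in the study; each run's answers would then
# be scored against the answer key and averaged.
# answers_by_run = [administer_exam(questions) for _ in range(5)]
```

Note that the ChatGPT product and the API can differ in behavior, so a script like this would not necessarily reproduce the scores reported here.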

Participants

The 30-question survey was completed by medical students and Otolaryngology residents at Indiana University (n = 48) and by ChatGPT 3.5 (n = 5 administrations). There were 9 education-level groups across the human participants: MS1 (n = 8), MS2 (n = 7), MS3 (n = 10), MS4 (n = 6), PGY1 (n = 4), PGY2 (n = 4), PGY3 (n = 4), PGY4 (n = 2), and PGY5 (n = 3). See Table 1.

Table 1. Demographics of participants.

Level of Education    Number of Participants
MS1                   8
MS2                   7
MS3                   10
MS4                   6
PGY-1                 4
PGY-2                 4
PGY-3                 4
PGY-4                 2
PGY-5                 3

MS = medical student year; PGY = postgraduate year.

Statistical analysis

Statistical analysis was conducted using the Statistical Package for the Social Sciences (SPSS; IBM). A one-way ANOVA was conducted to compare Otolaryngology board exam scores between human participants at each medical education level and ChatGPT. The ANOVA was implemented to identify whether group differences were present between the 9 education levels (MS1–PGY5) and ChatGPT. Tukey's Honest Significant Difference (HSD) post hoc test was utilized to identify which of the 9 education levels (MS1–PGY5) differed from ChatGPT. A regression analysis was conducted to explore the relationship between education level and score, specifically whether education level predicted score.
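
The analysis was run in SPSS; as an illustration only, the following sketch reproduces the same pipeline (one-way ANOVA, Tukey HSD post hoc comparisons, and a linear regression of score on education level) with SciPy and statsmodels. The file name scores.csv and the column names group, level, and score are assumptions made for the example, not part of the authors' materials.

```python
# Hedged sketch of the reported analysis pipeline using open-source tools
# rather than SPSS. Assumed input: one row per completed exam with a group
# label (MS1..PGY5 or ChatGPT), a numeric education-level code, and a % score.
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.stats.multicomp import pairwise_tukeyhsd

df = pd.read_csv("scores.csv")  # hypothetical file name

# One-way ANOVA across the 10 groups (9 education levels plus ChatGPT).
groups = [g["score"].values for _, g in df.groupby("group")]
f_stat, p_value = stats.f_oneway(*groups)
print(f"ANOVA: F = {f_stat:.3f}, p = {p_value:.4f}")

# Tukey HSD post hoc test to identify which groups differ, e.g. from ChatGPT.
tukey = pairwise_tukeyhsd(endog=df["score"], groups=df["group"], alpha=0.05)
print(tukey.summary())

# Linear regression: does education level predict score (human participants only)?
human = df[df["group"] != "ChatGPT"]
X = sm.add_constant(human["level"])   # level coded 1 (MS1) through 9 (PGY5)
model = sm.OLS(human["score"], X).fit()
print(model.summary())                # reports R-squared, F, and the slope's p-value
```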

Results

A regression analysis revealed that education level significantly predicted score, R² = .765, F(1, 46) = 150.003, p < .001. The average score of human participants increased approximately linearly with education level (MS1 = 28.75%; MS2 = 31.44%; MS3 = 36.00%; MS4 = 37.77%; PGY1 = 49.18%; PGY2 = 56.68%; PGY3 = 70.83%; PGY4 = 81.65%; PGY5 = 84.47%). See Table 2.
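
As a consistency check (not part of the authors' reporting), the F statistic of a simple linear regression follows directly from R² and the sample size, and the reported values agree up to rounding of R²:

```latex
F = \frac{R^2 / 1}{(1 - R^2)/(n - 2)}
  = \frac{0.765}{0.235 / 46}
  \approx 149.7 \approx 150.0,
\qquad n = 48 \text{ human participants.}
```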

Table 2. Percent correct and mean difference between ChatGPT and medical trainees.

Group A: ChatGPT, average 54.66% correct across the five administrations.

Group B    Average % Correct    Mean Difference (A-B)    Sig.      95% CI Lower    95% CI Upper
MS1        28.75                25.91*                   < .001    8.40            43.43
MS2        31.44                23.22*                   .003      5.22            41.21
MS3        36.00                18.66*                   .019      1.83            35.49
MS4        37.77                16.89                    .104      -1.72           35.50
PGY-1      49.18                5.49                     .996      -15.13          26.10
PGY-2      56.68                -2.01                    1.000     -22.63          18.60
PGY-3      70.83                -16.17                   .242      -36.78          4.45
PGY-4      81.65                -26.99*                  .033      -52.70          -1.28
PGY-5      84.47                -29.81*                  .002      -52.25          -7.36

MS = medical student year; PGY = postgraduate year. *Statistically significant difference (p < .05).

The average score of ChatGPT was 54.66% across the 5 administrations. At times, ChatGPT provided different answers to the same question, with different explanations. However, there was not a consistent increase in percent correct over time. By mean score, ChatGPT outperformed human participants from education levels MS1 through PGY1 but underperformed in comparison to PGY2 through PGY5. See Fig 1.

Fig 1. Board exam scores between medical trainees and ChatGPT.

A one-way ANOVA revealed statistically significant differences in average score between at least two of the 10 groups (F(9, 43) = 20.393, p < .001).
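
The degrees of freedom follow from the design and serve as a quick check: with k = 10 groups (9 trainee levels plus ChatGPT) and N = 48 + 5 = 53 observations,

```latex
df_{\text{between}} = k - 1 = 10 - 1 = 9,
\qquad
df_{\text{within}} = N - k = 53 - 10 = 43.
```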

Tukey's HSD test for multiple comparisons was implemented to identify which groups differed significantly from each other, particularly from ChatGPT. Results revealed that scores significantly differed between ChatGPT and MS1 (p < .001, 95% C.I. = 8.3905, 43.4295), MS2 (p = .003, 95% C.I. = 5.2228, 41.2115), MS3 (p = .019, 95% C.I. = 1.8278, 35.4922), PGY-4 (p = .033, 95% C.I. = -52.7016, -1.2784), and PGY-5 (p = .002, 95% C.I. = -52.2496, -7.3637).

Results revealed that scores did not significantly differ between ChatGPT and MS4 (p = .104, 95% C.I. = -1.7154, 35.5020), PGY-1 (p = .996, 95% C.I. = -15.1302, 26.1002), PGY-2 (p = 1.000, 95% C.I. = -22.6302, 18.6002), or PGY-3 (p = .242, 95% C.I. = -36.7802, 4.4502).

Discussion

Language-centric AI models, exemplified by ChatGPT, are gaining momentum for their ability to sustain coherent conversations and to demonstrate aptitude on standardized examinations. Powered by deep machine learning techniques and extensive textual data, ChatGPT iteratively enhances its abilities via user interactions and reinforcement learning [1].

This study compares the otolaryngology knowledge of ChatGPT 3.5 with that of medical trainees. The findings reveal ChatGPT's superiority over early trainees but its eventual inferiority to seasoned Otolaryngology residents on board-style questions targeting otolaryngology knowledge, indicating a progressive convergence in performance as training advances. Our senior residents' average score of roughly 85% tracks with historical data demonstrating that the written otolaryngology board exam has a 97% pass rate for senior residents scoring in the top 3 quartiles on their in-service exams [13]. This suggests that our questions approximate the likely rigor of a board exam.

An additional factor that we believe challenged ChatGPT was the nuanced, context-dependent nature of medical questions. Medical learners exhibited marked growth in their knowledge base, showcasing a linear progression in their average correct responses on the exam over years of continued training. This aligns with our expectations, as their domain-specific knowledge, clinical experience, and ability to interpret complex scenarios increase with seniority. Human participants are adept at synthesizing information, applying critical thinking skills, and adapting responses to the intricacies of each scenario. This foundational skill is nurtured throughout the educational journey, particularly for individuals in the medical field. As a result, senior Otolaryngology residents demonstrate superior deductive abilities in answering multiple-choice questions compared to ChatGPT.

This AI model continues to struggle with advanced otolaryngology topics that require a deep understanding of current medical literature to navigate properly [14, 15]. This may be due in part to its lack of deep understanding of patient-specific factors, of evolving clinical contexts, and of the latest medical research, specifically in Otolaryngology [11]. Future research should explore how AI language models can be trained to answer medical queries more effectively, and further investigation should continue to test the growth of ChatGPT as the model advances.

Although the explanations for selecting an answer were unsolicited from ChatGPT and not part of our intended study, there were instances in which ChatGPT appeared to grapple with a lack of understanding or data support, leading to what appeared to be a guessed, misinformed, or ill-informed answer. This was seen through multiple repetitions of a question yielding similar answer choices with different explanations, and vice versa. This has been demonstrated in multiple other studies in which ChatGPT struggled with the intricacies of medical knowledge, resulting in subpar performance [16, 17]. Overall, while illustrating the robust power of this language model, these inconsistencies raise questions about continued knowledge gaps in specific queries to AI language models. Thus, while the model demonstrated an impressive ability to generate human-like responses in natural language, it continues to struggle with the intricacies and subtleties inherent in otolaryngology, and perhaps in medicine generally. Although this was an incidental finding in our study, it presents an opportunity for further research into ChatGPT's understanding of medical topics.

One limitation of this study was the small number of participants in the medical student group. While significance was found in the comparisons between groups, many students did not answer the survey. This was most likely due to the inability to individually reach out to the large number of medical students at the university without creating bias among students; additionally, the large volume of information communicated to medical students may have caused the invitation to participate to be missed. The number of otolaryngology residents was also limited. To bolster a future study, we would recommend exploring the performance of trainees from multiple institutions.

Conclusion

In conclusion, our findings emphasize the need for caution and meticulous assessment when deploying language models in specialized fields like otolaryngology, and in medicine more broadly, where precision is critical and the stakes are high. ChatGPT showcases remarkable capabilities in natural language understanding and has been shown to pass a host of different board examinations [2–8]. In our study, ChatGPT scored an average of 54.66%, which is similar to the 57% correct reported by Hoch et al. [9]. Considering this, ChatGPT is not yet intelligent enough to become the trusted gold standard for accessing medical information within Otolaryngology.

Additionally, future research should focus on refining and tailoring language models for specific domains, incorporating real-time learning mechanisms, and addressing the interpretability challenges associated with automated systems in complex medical decision-making. With time, AI language models may evolve into indispensable tools for medical professionals, and potentially even for patients, and future research must aim to keep our understanding of their limits and abilities up to date.

Supporting information

S1 Dataset

(XLSX)

pone.0306233.s001.xlsx (33.8KB, xlsx)
S1 File

(DOCX)

pone.0306233.s002.docx (28.1KB, docx)

Data Availability

All relevant data are within the manuscript and its Supporting Information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Schade M. How ChatGPT and Our Language Models Are Developed.
2. V L. AI models like ChatGPT and GPT-4 are acing everything from the bar exam to AP Biology. Here's a list of difficult exams both AI versions have passed. Business Insider. 2023.
3. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
4. Long C, Lowe K, Zhang J, Santos AD, Alanazi A, O'Brien D, et al. A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology-Head and Neck Surgery Certification Examinations: Performance Study. JMIR Med Educ. 2024;10:e49970. doi: 10.2196/49970
5. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the Performance of ChatGPT in Ophthalmology: An Analysis of Its Successes and Shortcomings. Ophthalmol Sci. 2023;3(4):100324. doi: 10.1016/j.xops.2023.100324
6. Sinha RK, Deb Roy A, Kumar N, Mondal H. Applicability of ChatGPT in Assisting to Solve Higher Order Problems in Pathology. Cureus. 2023;15(2):e35237. doi: 10.7759/cureus.35237
7. Ali R, Tang OY, Connolly ID, Fridley JS, Shin JH, Zadnik Sullivan PL, et al. Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank. Neurosurgery. 2023;93(5):1090–8. doi: 10.1227/neu.0000000000002551
8. Ali R, Tang OY, Connolly ID, Zadnik Sullivan PL, Shin JH, Fridley JS, et al. Performance of ChatGPT and GPT-4 on Neurosurgery Written Board Examinations. Neurosurgery. 2023;93(6):1353–65. doi: 10.1227/neu.0000000000002632
9. Hoch CC, Wollenberg B, Luers JC, Knoedler S, Knoedler L, Frank K, et al. ChatGPT's quiz skills in different otolaryngology subspecialties: an analysis of 2576 single-choice and multiple-choice board certification preparation questions. Eur Arch Otorhinolaryngol. 2023;280(9):4271–8. doi: 10.1007/s00405-023-08051-4
10. Lum ZC. Can Artificial Intelligence Pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents Versus ChatGPT. Clin Orthop Relat Res. 2023;481(8):1623–30. doi: 10.1097/CORR.0000000000002704
11. Karimov Z, Allahverdiyev I, Agayarov OY, Demir D, Almuradova E. ChatGPT vs UpToDate: comparative study of usefulness and reliability of Chatbot in common clinical presentations of otorhinolaryngology-head and neck surgery. Eur Arch Otorhinolaryngol. 2024;281(4):2145–51. doi: 10.1007/s00405-023-08423-w
12. Balel Y. Can ChatGPT be used in oral and maxillofacial surgery? J Stomatol Oral Maxillofac Surg. 2023;124(5):101471. doi: 10.1016/j.jormas.2023.101471
13. Puscas L. Otolaryngology resident in-service examination scores predict passage of the written board examination. Otolaryngol Head Neck Surg. 2012;147(2):256–60. doi: 10.1177/0194599812444386
14. Makhoul M, Melkane AE, Khoury PE, Hadi CE, Matar N. A cross-sectional comparative study: ChatGPT 3.5 versus diverse levels of medical experts in the diagnosis of ENT diseases. Eur Arch Otorhinolaryngol. 2024;281(5):2717–21. doi: 10.1007/s00405-024-08509-z
15. Lechien JR, Naunheim MR, Maniaci A, Radulesco T, Saibene AM, Chiesa-Estomba CM, et al. Performance and Consistency of ChatGPT-4 Versus Otolaryngologists: A Clinical Case Series. Otolaryngol Head Neck Surg. 2024;170(6):1519–26.
16. Sahin S, Erkmen B, Duymaz YK, Bayram F, Tekin AM, Topsakal V. Evaluating ChatGPT-4's performance as a digital health advisor for otosclerosis surgery. Front Surg. 2024;11:1373843. doi: 10.3389/fsurg.2024.1373843
17. Mondal H, Dhanvijay AK, Juhi A, Singh A, Pinjar MJ, Kumari A, et al. Assessment of the Capability of ChatGPT-3.5 in Medical Physiology Examination in an Indian Medical School. Interdisciplinary Journal of Virtual Learning in Medical Sciences. 2023;14(4):311–7.

Decision Letter 0

Harpreet Singh Grewal

4 Jul 2024

PONE-D-24-22688

Is ChatGPT smarter than Otolaryngology trainees? A comparison study of board style exam questions

PLOS ONE

Dear Dr. Patel,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Please submit your revised manuscript by Aug 18 2024 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

  • A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.

  • A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.

  • An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: https://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols. Additionally, PLOS ONE offers an option for publishing peer-reviewed Lab Protocol articles, which describe protocols hosted on protocols.io. Read more information on sharing protocols at https://plos.org/protocols?utm_medium=editorial-email&utm_source=authorletters&utm_campaign=protocols.

We look forward to receiving your revised manuscript.

Kind regards,

Harpreet Singh Grewal

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at 

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and 

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

2. We note that your Data Availability Statement is currently as follows: All relevant data are within the manuscript and its Supporting Information files

Please confirm at this time whether or not your submission contains all raw data required to replicate the results of your study. Authors must share the “minimal data set” for their submission. PLOS defines the minimal data set to consist of the data required to replicate all study findings reported in the article, as well as related metadata and methods (https://journals.plos.org/plosone/s/data-availability#loc-minimal-data-set-definition).

For example, authors should submit the following data:

- The values behind the means, standard deviations and other measures reported;

- The values used to build graphs;

- The points extracted from images for analysis.

Authors do not need to submit their entire data set if only a portion of the data was used in the reported study.

If your submission does not contain these data, please either upload them as Supporting Information files or deposit them to a stable, public repository and provide us with the relevant URLs, DOIs, or accession numbers. For a list of recommended repositories, please see https://journals.plos.org/plosone/s/recommended-repositories.

If there are ethical or legal restrictions on sharing a de-identified data set, please explain them in detail (e.g., data contain potentially sensitive information, data are owned by a third-party organization, etc.) and who has imposed them (e.g., an ethics committee). Please also provide contact information for a data access committee, ethics committee, or other institutional body to which data requests may be sent. If data are owned by a third party, please indicate how others may request data access.

Additional Editor Comments:

Well conceptualized manuscript. However, the selection criteria for the questions have to be clarified. Did they represent the old ENT board exam questions? Why were medical students included and not only ENT trainees, since only the latter will potentially take the board exams and not the medical students? There were also some issues with the discussion and introduction sections. Following are some hard recommendations and some suggestions for improvement:

HARD RECOMMENDATIONS:

-Please provide the selection criteria for the questions. How were they selected? Were they graded? Did they represent some of the old ENT board exam questions?

-Please explain why medical students were included in the cohort, since they do not typically take the board exam, which is potentially only taken by residents/fellows. This sets different playing fields for ChatGPT, which could have skewed the results.

-Please touch upon why ChatGPT was asked to provide additional reasoning for a question whereas the human participants were just answering the question. This may have altered the results. Please address the rationale behind this.

-The discussion section needs to be more nuanced, referring to studies that have already done something similar and then comparing your results against those. Please note that the discussion section should always start by summarizing the results, then compare the study's results with some of the similar previous work that has been done. Then add a strengths and limitations section, followed by a conclusion. The comparison narrative in your discussion seems weak and would benefit from referencing more studies that have done similar work previously.

SUGGESTIONS FOR IMPROVEMENT:

-Please shorten the discussion section. Some of the studies mentioned there should actually be moved to your discussion section. This would help root your study in the prior literature, as I have mentioned above.


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: No

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: No

Reviewer #3: No

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: Yes

Reviewer #3: Yes

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: The article compared the performance of ChatGPT with that of humans. Simple and straightforward study. Just need to cite some more recent literature in the discussion, like https://ijvlms.sums.ac.ir/article_49682.html. The version of ChatGPT can also be mentioned in the title of the article so that readers understand at a glance which ChatGPT you are referring to.

Reviewer #2: Thank you for this interesting paper. While the results no doubt add to our understanding of AI in medicine, there are a few comments to be made:

1) You do not provide any information about the questions asked of the two groups. How were they constructed? Are the questions at different levels or graded, and how do we know that? The quality of the questions and answers no doubt has implications for the results and should be shared with the reader. This is particularly relevant given your conclusion that ChatGPT struggles with harder questions, but we don't have the data to know that's where it struggled or that's where higher learners didn't struggle. Also, do we have a comparison to know that higher learners do this well on these tests in general? Is a score of 85% expected from PGY5s?

2) The article suggests you asked ChatGPT to provide an answer and then explain their reasoning. It does not appear this was done for human participants. Why is that? Your article suggests there were times that the answers and the explanations did not match. Could this also not be the case for humans? Again, also knowing the questions could provide insight here. Were there certain questions that were more likely to trip up humans or ChatGPT?

3) What is the denominator of participants? How many people were invited to participate? Is there potential bias introduced based on who participates?

4) Please include potential issues with this study in your discussion.

5) Your discussion includes a paragraph about the ethical implications of AI. It does seems to fit with the paper itself (ie the study doesn't focus on AI ethics) nor are there any references provided. You may want to remove this, and from the conclusion.

Reviewer #3: This is a well written but not rigorously conducted study on the test-taking abilities of ChatGPT 3.5. The strengths of the paper include its simply and coherently written introduction and methods. The weakness of this paper is in the rigor of the experiment design and the construction of the test for ChatGPT. Firstly, it is unclear whether the 30 questions chosen are valid representations of otolaryngology board examination questions. There is no explanation of how they were chosen, how they were vetted and against what criteria, and what mix of topics/difficulty they test. Many articles have been written on similar topics that can be referenced for their methodology. (Ideally, the questions used would be real board questions, but in the absence of that, there are still ways to objectively vet and classify questions. For example, see https://pubs.rsna.org/doi/10.1148/radiol.230582).

Secondly, the discussion does not root the paper within the existing literature, with zero citations or references to other similar work. The discussion actually begins to present new information (anecdotal observations of how ChatGPT is or is not reasoning through questions) that was not shown in the results.

Lastly, the authors should try alternative prompts to see if ChatGPT may perform better if, for example, it was told it is a practicing otolaryngologist rather than a medical professional, etc.

**********

6. PLOS authors have the option to publish the peer review history of their article (what does this mean?). If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

Reviewer #3: No

**********

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

Decision Letter 1

Harpreet Singh Grewal

4 Sep 2024

Is ChatGPT 3.5 smarter than Otolaryngology trainees? 

A comparison study of board style exam questions

PONE-D-24-22688R1

Dear Dr. Patel,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice will be generated when your article is formally accepted. Please note, if your institution has a publishing partnership with PLOS and your article meets the relevant criteria, all or part of your publication costs will be covered. Please make sure your user information is up-to-date by logging into Editorial Manager at Editorial Manager® and clicking the ‘Update My Information' link at the top of the page. If you have any questions relating to publication charges, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Harpreet Singh Grewal

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Acceptance letter

Harpreet Singh Grewal

17 Sep 2024

PONE-D-24-22688R1

PLOS ONE

Dear Dr. Patel,

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now being handed over to our production team.

At this stage, our production department will prepare your paper for publication. This includes ensuring the following:

* All references, tables, and figures are properly cited

* All relevant supporting information is included in the manuscript submission,

* There are no issues that prevent the paper from being properly typeset

If revisions are needed, the production department will contact you directly to resolve them. If no revisions are needed, you will receive an email when the publication date has been set. At this time, we do not offer pre-publication proofs to authors during production of the accepted work. Please keep in mind that we are working through a large volume of accepted articles, so please give us a few weeks to review your paper and let you know the next and final steps.

Lastly, if your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

If we can help with anything else, please email us at customercare@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Harpreet Singh Grewal

Academic Editor

PLOS ONE

Associated Data

Attachment

Submitted filename: Response To Reviewers.docx

pone.0306233.s003.docx (17KB, docx)