Letter. 2023 Jul 13;9:e50336. doi: 10.2196/50336

Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations

Aidan Gilson 1,2, Conrad W Safranek 1, Thomas Huang 2, Vimig Socrates 1,3, Ling Chi 1, Richard Andrew Taylor 1,2,#, David Chartash 1,4,✉,#
Editor: Tiffany Leung

We thank Epstein and Dexter [1] for their close reading of our paper, “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment” [2]. In response to their comments, we present the following points for clarification:

  • While search engines such as Bing (Microsoft Corp) and Google (Google LLC) are known to tune their information retrieval results geographically, there is no evidence or documentation that the version of ChatGPT (OpenAI) used in our work similarly alters its output based on the geolocation of the user or their device. Notably, however, the integration of ChatGPT into other online services, such as Bing or Snapchat (Snap Inc), makes information held by those services (eg, time zone or geolocation) available to ChatGPT [3].

  • Additionally, although dialectal grammatical differences in the English language may produce variability that mimics the variability introduced by prompt engineering, there is no empirical evidence that such differences alter the performance of ChatGPT. Future research examining the correlation between prompt engineering methods and within-sentence grammatical variability may alleviate these concerns.

  • Although it is a medical knowledge–based examination, the American Board of Preventive Medicine Longitudinal Assessment Program pilot for clinical informatics is not equivalent to the USMLE (United States Medical Licensing Examination). ChatGPT’s performance on this maintenance of certification examination has been examined by Kumah-Crystal et al [4], and we defer to their assessment as a more apt comparator.

  • While Epstein and Dexter [1] offer a comparison between ChatGPT 3.5, ChatGPT 4.0, and Google Bard, it is unclear how the three were statistically compared in terms of sample size and answer quality beyond performance on multiple-choice questions. Bootstrapping responses appears to address one element of variability in large language model (LLM) responses; however, a more robust statistical comparison is warranted, alongside a comparison of nonbinarized LLM output performance (see the sketch after this list).

  • While there is no doubt that LLMs respond variably to identical inputs (these tools are nondeterministic in character), we do not believe this devalues the statistical significance or the quantitative validity of our results. Because we evaluated ChatGPT in the same situation as a student examinee, a single response per question is the more applicable measure. Additionally, our large sample of questions accounts for model variability, so we elected not to pose each question multiple times.
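To illustrate the kind of statistical comparison we have in mind, the Python sketch below pairs two models' correct/incorrect responses on a shared question set and bootstraps a confidence interval for the difference in accuracy. This is a minimal sketch of one possible approach, not an analysis from either study; the data are hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical scored responses on the same questions: 1 = correct, 0 = incorrect.
model_a = rng.integers(0, 2, size=350)  # placeholder for, eg, ChatGPT 3.5
model_b = rng.integers(0, 2, size=350)  # placeholder for, eg, ChatGPT 4.0

def paired_bootstrap_diff(a, b, n_boot=10_000, rng=rng):
    """Bootstrap the accuracy difference (b - a) by resampling questions."""
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))  # resampled question indices
    return b[idx].mean(axis=1) - a[idx].mean(axis=1)

diffs = paired_bootstrap_diff(model_a, model_b)
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"accuracy difference: {model_b.mean() - model_a.mean():+.3f} "
      f"(95% CI {lo:+.3f} to {hi:+.3f})")

Resampling questions in pairs preserves question-level difficulty, which a comparison of marginal accuracies alone would discard; it does not, however, address nonbinarized output quality, which requires a separate evaluation.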

Abbreviations

LLM: large language model

USMLE: United States Medical Licensing Examination

Footnotes

Conflicts of Interest: None declared.

References

  1. Epstein R, Dexter F. Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment”. JMIR Med Educ. 2023;9:e48305. doi: 10.2196/48305. https://mededu.jmir.org/2023/1/e48305/
  2. Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How does ChatGPT perform on the United States Medical Licensing Examination? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023 Feb 08;9:e45312. doi: 10.2196/45312
  3. How my AI uses location data. Snapchat Support. [2023-06-25]. https://archive.is/wcmk3
  4. Kumah-Crystal Y, Mankowitz S, Embi P, Lehmann CU. ChatGPT and the clinical informatics board examination: the end of unproctored maintenance of certification? J Am Med Inform Assoc. 2023 Jun 19. doi: 10.1093/jamia/ocad104

