TO THE EDITOR:
We read with great interest the recent article by Pushpanathan et al1 evaluating the performance of OpenAI’s o1 model on ophthalmology-related queries. The study provides timely insights into the growing role of large language models in clinical practice. While we found the study informative, we would like to raise two points: one concerning methodological clarity and another suggesting a possible direction for future research informed by clinical observations.
As large language model outputs are inherently probabilistic, minor changes in parameters, particularly the temperature setting, can lead to significant variation in responses. In the study, the default temperature range (0.7–1.0) was used without specific adjustment. By contrast, Antaki et al2 demonstrated improved consistency and accuracy in ophthalmic tasks when using a lower temperature (0.3). This raises an important methodological question: Did the authors observe any variability in model outputs that might have influenced the grading? We believe a discussion of this point would be helpful, especially given the growing emphasis on reproducibility in large language model–based research.3 If such variability was observed, it might warrant acknowledgment as a study limitation, as was done in the study by Ghanem et al.4
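To make the reproducibility concern concrete, the sketch below (ours, not drawn from the original study) shows how the same ophthalmic prompt could be submitted repeatedly at a fixed temperature with the OpenAI Python SDK so that run-to-run variation can be inspected before grading. The model name, prompt, and number of runs are illustrative assumptions, and some reasoning models such as o1 may not expose the temperature parameter at all.

```python
# Illustrative sketch (not from Pushpanathan et al): repeat one prompt at a fixed
# temperature to gauge output variability before grading. Model name, prompt, and
# run count are assumptions; o1-class models may not accept a temperature setting.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = "What are common causes of sudden painless vision loss?"  # hypothetical query

def sample_responses(model: str, temperature: float, n_runs: int = 5) -> list[str]:
    """Submit the same prompt n_runs times and return the raw responses."""
    responses = []
    for _ in range(n_runs):
        completion = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": PROMPT}],
        )
        responses.append(completion.choices[0].message.content)
    return responses

if __name__ == "__main__":
    # Compare variability at the default temperature (1.0) vs. a lower one (0.3),
    # the setting examined by Antaki et al for GPT-4.
    default_runs = sample_responses("gpt-4o", temperature=1.0)
    low_temp_runs = sample_responses("gpt-4o", temperature=0.3)
    print(len(set(default_runs)), "distinct responses at T=1.0")
    print(len(set(low_temp_runs)), "distinct responses at T=0.3")
```

Repeated sampling of this kind, followed by grading of each run, is one simple way to report output stability alongside accuracy.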
Recently, an increasing number of patients have been using Chat Generative Pre-trained Transformer (ChatGPT) independently, likely powered by a recent model such as o1, to inquire about their symptoms and perform informal self-assessments. Anecdotally, the diagnostic accuracy of ChatGPT in such patient-led queries appears to have improved with newer model versions. However, to our knowledge, no study has systematically evaluated ChatGPT’s diagnostic performance when used directly by patients without clinician guidance. Current studies largely examine ChatGPT’s utility from a clinician or educator perspective. We therefore propose that exploring this patient-facing application could provide valuable insights and would be a natural extension of this research.
Footnotes
Disclosure(s):
All authors have completed and submitted the ICMJE disclosures form.
The author(s) have no proprietary or commercial interest in any materials discussed in this article.
References
- 1. Pushpanathan K., Zou M., Srinivasan S., et al. Can OpenAI's new o1 model outperform its predecessors in common eye care queries? Ophthalmol Sci. 2025;5. doi: 10.1016/j.xops.2025.100745.
- 2. Antaki F., Milad D., Chia M.A., et al. Capabilities of GPT-4 in ophthalmology: an analysis of model entropy and progress towards human-level medical question answering. Br J Ophthalmol. 2024;108:1371–1378. doi: 10.1136/bjo-2023-324438.
- 3. Zhu L., Mou W., Hong C., et al. The evaluation of generative AI should include repetition to assess stability. JMIR Mhealth Uhealth. 2024;12. doi: 10.2196/57978.
- 4. Ghanem Y.K., Rouhi A.D., Al-Houssan A., et al. Dr. Google to Dr. ChatGPT: assessing the content and quality of artificial intelligence-generated medical information on appendicitis. Surg Endosc. 2024;38:2887–2893. doi: 10.1007/s00464-024-10739-5.
