Abstract
Purpose
We aim to evaluate the performance of 5 large language models (LLMs) and human teachers in answering optometry-related questions raised by medical undergraduate students.
Methods
This prospective and comparative study collected 108 questions from 30 students. The questions were sent to their teachers for responses and were also inputted into 5 LLMs, including 2 local models (Mistral-7B and Llama-2-13B) and 3 online models (Claude-3, Gemini-1.0 pro, and GPT-4.0), to generate corresponding answers. All answers were independently evaluated by 2 optometry experts in a blind manner for accuracy, completeness, comprehensibility, and overall quality, using a 5-point scale. Students were asked to complete a 6-item questionnaire about their satisfaction and perspectives on the integration of LLMs.
Results
LLMs responded more quickly and generated more extensive answers compared to humans (P < .001). In terms of overall performance, human teachers ranked fifth among the 6 participants, with scores significantly lower than GPT-4.0 (P < .001), Claude-3 (P < .001), and Gemini-1.0 pro (P < .001). GPT-4.0 received the highest scores for accuracy (3.87/5) and completeness (4.11/5), while Claude-3 excelled in comprehensibility (3.91/5) and overall quality (3.93/5); however, the differences between them were not statistically significant. Online LLMs outperformed both humans and locally deployed LLMs (P < .001). Students agreed that LLMs provided more comprehensive and detailed information (3.80/5), but found human answers easier to understand (4.17/5). They were less supportive of replacing teachers with LLMs for answering questions (2.93/5).
Conclusion
Our findings demonstrate the potential of LLMs to serve as valuable tools in optometry education, particularly in addressing students’ real-world questions.
Keywords: medical education, artificial intelligence, large language models, optometry, answering questions
Introduction
Eye health is a global issue in public health, as well as in economic and social development. Visual impairment and refractive errors affect not only individual quality of life but also workforce productivity and national development. The prevalence of myopia has been increasing worldwide, particularly among children and adolescents, creating growing demands for qualified eye care professionals. 1 However, many countries face a significant shortage of optometrists and trained educators, limiting the capacity to deliver high-quality eye care and education. This disparity highlights the urgent need to strengthen optometry training and explore innovative educational models that can support scalable, interactive, and learner-centered instruction.
Optometry, as a specialized discipline bridging vision science and clinical ophthalmology, plays an essential role in vision correction, disease prevention, and rehabilitation. 2 However, its education remains less developed globally compared with other medical specialties. In many regions, including both developing and developed countries, optometry programs are constrained by limited teaching resources, small faculty pools, and the dual clinical and teaching responsibilities of instructors. 3 These challenges often restrict opportunities for personalized guidance and in-depth student-teacher interaction, which are central to effective medical education. To address such limitations, educators have explored innovative teaching strategies, such as real-time digital imaging systems for slit lamp training, 4 smartphone-based and self-designed eye models for fundus examination practice,5,6 and the integration of massive open online courses (MOOCs) into undergraduate medical education. 7 Despite these efforts, there remains a pressing need for new tools that can provide interactive, adaptive, and high-quality learning experiences.
Recent advances in artificial intelligence (AI), 8 particularly the development of large language models (LLMs) such as the generative pre-trained transformer (GPT), have opened new possibilities for medical education. 9 LLMs can generate coherent, contextually relevant, and human-like language, simulate clinical dialogue, and facilitate self-directed learning. 10 These capabilities enable their application in diverse educational settings—serving as virtual patients, 11 generating medical case studies, 12 and assisting in personalized learning. 13 The integration of LLMs has the potential to enhance student engagement, knowledge acquisition, and skill development. 13 On the other hand, their use also raises important concerns, such as algorithmic bias, factual inaccuracies, academic integrity issues, and the absence of human empathy and real-time feedback.13,14 Furthermore, although LLMs demonstrate satisfactory accuracy in structured multiple-choice questions in medical exams,15,16 their ability to address real-world, highly specialized, and open-ended questions from medical students remains uncertain.
In this study, we propose a unique evaluation framework to assess how LLMs can complement human teaching in an authentic classroom environment. Using the “Ophthalmic Optics” course as a representative example of a conceptually complex and calculation-intensive subject, we collected genuine questions from optometry students during regular instruction. We systematically compared the responses of classroom teachers, locally deployed LLMs, and online LLMs in terms of accuracy, clarity, comprehensiveness, and impact on student engagement. By focusing on real, open-ended student inquiries rather than standardized test items, this study provides a novel application domain and an underexplored educational level (health-professional training in optometry) to illustrate how LLM integration might enhance both the quality and interactivity of specialized medical education.
Subjects and Methods
Subjects, Ethics Approval, and Study Design
This study followed the tenets of the Declaration of Helsinki and was approved by a local Research Ethics Committee in the Joint Shantou International Eye Center (JSIEC) of Shantou University and the Chinese University of Hong Kong (EC 20240316(2)-p22). All participants had been informed of the overall objectives of the study and the procedures, and provided written informed consent. The reporting of this study conforms to the STROBE statement for observational studies 17 (Supplemental file).
The flow chart of this study is demonstrated in Figure 1. Thirty undergraduate medical students in their second year of a 5-year curriculum in Optometric Medicine & Ophthalmology in JSIEC were recruited. All students attended the theoretical course of “Ophthalmic Optics” from March 2024 to May 2024, which comprised 13 sessions, each lasting 2 academic hours (90 min). This course is taught by Dr Huang, Dr Wang, and Dr Chin, all of whom are optometry faculty members with over 3 years of teaching experience. Theoretical sessions include “Lens curvature and thickness” taught by Dr Wang, “Optical prism and prismatic effect” and “Lens power and lens measuring” taught by Dr Huang, and “Lens design, refractive correction, and clinical application” taught by Dr Chin. Following the completion of each section, students were asked to pose questions about what they were interested in or did not understand after class. Notably, all questions were posed independently by the students; no pre-formed questions were provided for selection. This activity resembled an open-ended discussion rather than a structured learning exercise.
Figure 1.
Flowchart of the study.
The inclusion criteria were as follows: (1) second-year optometry students enrolled in the Ophthalmic Optics course; (2) questions had to be written in English; (3) each submission had to constitute a complete and clearly formulated question; and (4) questions were required to be relevant to the field of optometry, including but not limited to theoretical concepts, clinical applications, and optical calculations. The exclusion criteria were as follows: (1) students who were absent from the class were not eligible to participate in that session's question submission; (2) questions containing images were discouraged, as LLMs might not accurately recognize visual information; and (3) questions unrelated to optometry or ophthalmic education were excluded.
The questions were collected and sent to the respective teacher for responses. The questions, exactly as submitted and without any revision or modification, were also independently input into 5 LLMs to generate corresponding answers. Specifically, each student was asked to propose 1 question after each theoretical class. The questions were then answered by both LLMs and human teachers, and the Q&A were provided to the students before they asked the next question in another class.
The time spent on answering the questions by both instructors and LLMs was recorded. All answers were randomized using a computer-generated random number method and were single-blind evaluated by 2 optometry experts. Upon completion of all sessions and Q&A assessments, students were asked to complete a survey questionnaire regarding their satisfaction with the answers.
Large Language Models
In the last 2 years, various companies have developed advanced language models available online. While these models are excellent at reasoning, they still need careful evaluation when providing answers in specialized fields like health. 18 Retrieval-augmented generation (RAG) technology helps by using a knowledge base to improve answers. 19 It turns user questions into vectors and finds matching information in the knowledge base to create accurate responses. It mitigates model hallucinations, enhances the verifiability and source traceability of answers, and is especially effective for question answering and text summarization. However, RAG on online platforms might risk data leakage and cannot handle very large or diverse data sets. 19 Benefiting from the development of Graphics Processing Units (GPUs), it is now also possible to deploy lightweight LLMs locally on personal computers equipped with commercial-grade GPUs. Deploying LLMs with RAG technology locally allows customized personal AI assistants in vertical fields with massive professional knowledge, but without the risk of data leakage.
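The retrieval step described above (turning a question into a vector and matching it against the knowledge base) can be sketched in a few lines of Python. The bag-of-words embedding, stopword list, and tiny knowledge base below are illustrative stand-ins only; a deployed system would use a real sentence encoder (such as the UAE-Large-V1 model mentioned later) and would pass the assembled prompt to the LLM:

```python
import math
import re
from collections import Counter

# Toy stopword list; a real pipeline would rely on the encoder instead.
STOPWORDS = {"a", "an", "the", "of", "in", "on", "is", "its", "why", "does", "to"}

def embed(text):
    """Toy bag-of-words vector; stands in for a sentence-embedding model."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, knowledge_base, top_k=1):
    """Rank knowledge-base passages by similarity to the question vector."""
    q = embed(question)
    ranked = sorted(knowledge_base, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, passages):
    """Ground the model's answer in the retrieved passages."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

# Hypothetical three-passage knowledge base for demonstration.
kb = [
    "A prism displaces an image toward its apex.",
    "Lens power in diopters is the reciprocal of focal length in meters.",
    "Contact lenses rest directly on the tear film of the cornea.",
]
question = "Why does the image move towards the top of the prism?"
print(build_prompt(question, retrieve(question, kb)))
```

Because the answer is generated from the retrieved passages rather than from the model's parameters alone, the source of each claim remains traceable, which is the verifiability benefit noted above.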
In this study, LLMs were primarily used to answer questions raised by students on topics they found confusing or particularly interesting after completing the optometry course. For the local LLMs, “Chat with RTX” was installed on a personal computer equipped with an Nvidia GeForce 3090 GPU (24 GB) along with 2 language models, Mistral-7B and Llama-2-13B. A total of 634 English professional e-books related to ophthalmology and optometry were used to build our knowledge base, and a general sentence vector model, UAE-Large-V1, was used to generate a vector library for querying. Finally, each question was submitted directly to “Chat with RTX,” and the content of each answer and the time required were recorded. We also tested online models, including Claude-3, Gemini-1.0 pro, and GPT-4.0, by asking questions on their official websites.
Assessment of the Answers
All answers were randomized using a computer-generated random number method and were single-blind evaluated by 2 optometry experts, who are senior faculty members in the field of optometry, with over 5 years of experience in teaching and administration. The assessments included several dimensions: “Accuracy,” which referred to the ability to correctly understand the question and provide a concise answer; “Completeness,” which reflected the ability to provide a complete response to the question, including background, explanations, examples, and additional relevant information; “Comprehensibility,” which indicated the clarity, logical coherence, and ease of understanding of the answers; and “Overall evaluation,” which showed the overall satisfaction and approval of the answers. The score was based on a numerical scale from 1 to 5 (1 = very poor, 2 = poor, 3 = average, 4 = good, 5 = excellent) for each aspect. Reliability between the 2 optometry experts was assessed using intraclass correlation coefficients (ICCs). The average scores of the 2 graders were taken.
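The text does not state which ICC form was used; as a minimal sketch, the two-way random-effects, absolute-agreement, single-rater form ICC(2,1) (a common choice for two independent graders scoring the same set of answers) can be computed from first principles:

```python
def icc2_1(scores):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    scores: one row per rated answer, one column per rater.
    """
    n = len(scores)       # number of rated answers
    k = len(scores[0])    # number of raters
    grand = sum(map(sum, scores)) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]
    # Mean squares for rows (answers), columns (raters), and residual error.
    msr = k * sum((m - grand) ** 2 for m in row_means) / (n - 1)
    msc = n * sum((m - grand) ** 2 for m in col_means) / (k - 1)
    sst = sum((x - grand) ** 2 for row in scores for x in row)
    mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Hypothetical example: rater 2 scores every answer 1 point above rater 1,
# so rankings agree perfectly but absolute agreement is penalized.
print(round(icc2_1([[1, 2], [2, 3], [3, 4], [4, 5], [5, 6]]), 3))  # 0.833
```

Because ICC(2,1) measures absolute agreement, a systematic offset between the two graders lowers the coefficient even when their rankings coincide, which makes it a conservative reliability estimate for averaged scores.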
Record of Time Spent Answering Questions
For LLMs, the time spent on answering questions was directly obtained by the system, accurate to hundredths of a second. For teachers, the time required to answer questions was self-recorded by the teachers using a built-in timer on the computer and was accurate to seconds. The analysis of word count was performed using the “Word Count” feature in Microsoft Word. Specifically, spaces, text boxes, special symbols in formulas, and similar elements were not included in the count.
Sample Size
The required number of questions was calculated using an unpaired 1-way analysis of variance (ANOVA). The formula was n = 2σ² × f² × (k − 1) × (1 + 2/α) × δ⁻². According to our preliminary data, the pooled standard deviation (σ) for scoring methods was determined to be 0.8. The least significant difference (LSD; δ) was set at 0.5. The significance level (α) was set to 0.05. The effect size (f) was set to 0.25, corresponding to a medium effect. The number of groups (k) was 6. The required number of questions was calculated to be 66. In this study, 108 questions were finally included, which was sufficient to detect a statistically significant difference.
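The stated parameters reproduce the required sample size directly, rounding up to the next whole question:

```python
import math

sigma = 0.8   # pooled standard deviation from preliminary data
delta = 0.5   # least significant difference (LSD)
alpha = 0.05  # significance level
f = 0.25      # medium effect size
k = 6         # number of groups (3 teachers pooled as 1, plus 5 LLMs)

# n = 2 * sigma^2 * f^2 * (k - 1) * (1 + 2/alpha) / delta^2
n = 2 * sigma**2 * f**2 * (k - 1) * (1 + 2 / alpha) / delta**2
print(math.ceil(n))  # 66
```

The raw value is 65.6, so 66 questions were required and the 108 collected questions comfortably exceed this threshold.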
Questionnaires
After finishing the evaluations, all students were asked to complete a questionnaire on their satisfaction with the Q&A as well as their feelings regarding the integration of AI into the optometry curriculum. They were provided with the responses from both the human teacher and the 5 LLMs, with clear identification of which answer was given by the teacher and which were generated by the models. This approach enabled students to make more informed and comparative assessments. The questionnaire comprises 6 items, including whether students were satisfied with these answers, which answers they preferred, whether AI provided more comprehensive and professional answers, whether teachers' answers were concise and easier to understand, whether AI could replace teachers in answering optometry questions, and whether the integration of AI boosted learning interest. The questionnaire graded acceptance on a numerical scale from 1 as the least agreement to 5 as the most agreement.
Statistical Analysis
Data are presented as mean ± SD and were analyzed using SPSS for Windows version 29.0 (SPSS Inc., Chicago, IL). The distribution of the data was validated by the D'Agostino-Pearson omnibus normality test. The LSD post-hoc test for pairwise multiple comparisons was performed to evaluate the difference in outcomes between the 6 groups. P-values of <.05 were considered statistically significant.
Results
Thirty students participated in this study, with an average age of 20.5 ± 4.1 years, and 19 (63%) of them were female. A total of 112 questions were collected. Four questions were deemed invalid due to containing images that could not be recognized by the LLMs. The remaining 108 questions were included in the study, which were from 30 students, with 19 students each asking 4 questions, 10 students each asking 3 questions, and 1 student asking 2 questions. The questions were roughly categorized as: understanding of optical principles and concepts (92 questions), lens materials and clinical applications (10 questions), and calculation problems (6 questions). According to the sample size calculation, 108 questions were deemed sufficient to obtain statistically significant results.
Answer Assessment
The ICC between the 2 experts was 0.753, indicating substantial agreement between their ratings.
For Accuracy, GPT-4.0 and Claude-3 scored 3.87 ± 0.82 and 3.86 ± 0.93, respectively, ranking top 2 among all participants. Human teachers’ responses ranked fourth with a score of 3.13 ± 0.84, which was significantly higher than Llama-2-13B (P = .005) but lower than Gemini-1.0 pro (P < .05), Claude-3 (P < .001), and GPT-4.0 (P < .001). Of note, the accuracy of Mistral-7B and Llama-2-13B was barely satisfactory, with scores of 2.97 ± 0.96 and 2.78 ± 0.97, respectively (Figure 2A).
Figure 2.
Answering performance of large language models (LLMs) and human teachers. (A-D) Optometry experts’ rating for accuracy (A), completeness (B), comprehensibility (C), and overall assessment (D), on the answers provided by the human teachers and 5 LLMs. Data were presented as mean ± standard deviation and were analyzed using the least-significant difference (LSD) post-hoc test for pairwise multiple comparisons. The differences between the various LLMs were not marked in the figure.
In terms of completeness, GPT-4.0 and Claude-3 scored 4.11 ± 0.84 and 4.07 ± 0.95, respectively, which were significantly higher than those of both human teachers and other LLMs (P < .001). Notably, humans received a relatively low completeness score of 2.55 ± 0.68, indicating a substantial gap in providing comprehensive answers (Figure 2B).
Regarding comprehensibility, Claude-3 and GPT-4.0 demonstrated higher levels of satisfaction, scoring 3.91 ± 0.88 and 3.82 ± 0.81, respectively. Surprisingly, human responses were not considered easier to understand, with an average score of 3.02 ± 0.77, lower than the scores of Claude-3, GPT-4.0, and Gemini-1.0 pro, and comparable to Mistral-7B and Llama-2-13B (Figure 2C).
In overall quality, the optometry experts believed that Claude-3 and GPT-4.0 provided better responses to optometry-related questions, with scores of 3.93 ± 0.89 and 3.85 ± 0.79, respectively. Human responses received a score of 2.81 ± 0.57, ranking fifth among the 6 participants, significantly lower than the scores of GPT-4.0 (P < .001), Claude-3 (P < .001), and Gemini-1.0 pro (P < .001) (Figure 2D).
Time Spent Answering Questions
The average time required for the human teachers to respond to a question was 205.1 ± 112.8 seconds (s), the longest among all participants (P < .001). For LLMs, Mistral-7B had the shortest response time at 10.2 ± 3.8 s/question, while GPT-4.0 had the longest at 43.0 ± 9.3 s/question (Figure 3A). Subgroup analysis based on different teachers revealed similar results to the overall findings (Supplementary information 1).
Figure 3.
Time spent answering questions (A) and word count of the answers (B) from human teachers and 5 LLMs. Data were presented as mean ± standard deviation and were analyzed using the least-significant difference (LSD) post-hoc test for pairwise multiple comparisons. The differences between the various LLMs were not marked in the figure.
Word Count of the Answers
Human teachers used an average of 63.4 ± 34.2 words to answer a question, the fewest among all participants. For LLMs, GPT-4.0 generated the most extensive answers, with an average of 485.4 ± 67.1 words per question, followed by Claude-3 at 435.1 ± 129.2 words. Mistral-7B’s answers were the shortest among the LLMs, averaging 130.9 ± 76.0 words/question (Figure 3B).
Performance among Different Human Teachers
The performance of the 3 human teachers was similar in terms of accuracy, completeness, comprehensibility, and overall quality of their answers. Although Dr Huang and Dr Chin scored slightly higher in some aspects, the differences were not statistically significant. Dr Chin spent less time answering questions compared to the other 2 teachers (128.1 ± 73.4 s/question). Dr Huang provided the lengthiest responses, averaging 77.3 ± 32.0 words per question, although his answers were still considerably shorter than those of the LLMs (Supplementary information 2).
Questionnaires
All 30 students provided valid questionnaires. Students reported general satisfaction with both teachers’ and LLMs’ responses (3.67 out of 5). They showed a higher preference for teachers’ responses (3.60/5). They perceived teachers’ answers as more accurate and easier to understand (4.17/5). On the other hand, they felt that AI-generated answers were more comprehensive and detailed, enabling them to acquire additional knowledge and broaden their perspectives. They also believed that the involvement of AI in the Q&A activities increased their interest in studying optometry (3.77/5). However, students gave relatively low scores (2.93/5) regarding whether AI could replace teachers in answering optometry-related questions. It was noted that 8 out of 30 (27%) students scored 2 points or below for this item, indicating a clear preference for maintaining teacher-led interaction in problem-solving (Figure 4).
Figure 4.
A 6-item questionnaire for satisfaction and comments on the integration of large language models in the optometry course. The scoring results were shown in the right panel, presented as mean ± SD.
Discussion
In this study, we integrated several LLMs into the medical undergraduate curriculum by employing them to answer optometry-related questions posed by students. These responses were assessed by 2 experts and compared to those provided by human teachers. The results demonstrated that the overall performance of LLMs in answering questions was comparable to or surpassed that of human teachers, particularly in terms of completeness and accuracy. In addition, students generally welcomed the incorporation of LLMs in their professional courses and felt that it enhanced their interest and motivation in learning, although they believed that LLMs could not fully replace the role of human teachers.
The performance of GPT in answering medical questions has been investigated. GPT has demonstrated moderate to high accuracy in responding to questions on topics such as cirrhosis and hepatocellular carcinoma, 20 hip arthroplasty, 21 epidemiology and diagnosis, 22 ophthalmology questions from StatPearls, 23 and gold-standard Basic and Clinical Science Course (BCSC) questions. 24 A study explored GPT's performance on medical biochemistry questions requiring higher-order thinking and found that the model received a median score of 4 out of 5, indicating its potential as a tool for addressing complex medical questions. 25 The responses were considered unbiased, evidence-based, and could provide multifaceted advice to patients. 21 However, the model failed to provide comprehensive explanations in certain areas, such as updates in disease guidelines, screening criteria, and detailed treatment protocols.20,22 Cardona et al focused on the application of LLMs in optometry. They posed 10 questions across 3 categories (contact lenses, low vision, vision therapy) and invited experts and students to evaluate the responses. They found a moderate accuracy of GPT (scoring 6-8 out of 10). 26 Another survey revealed that although the majority (72.0%) of optometrists believed LLMs would improve optometric practice, over half of them expressed concerns about their diagnostic accuracy. 27
Both locally deployed LLMs based on RAG technology (Mistral-7B and Llama-2-13B) as well as online LLMs (Claude-3, GPT-4.0, and Gemini-1.0 pro) were tested in this study. Locally deployed LLMs ensure that data is not accessed by external servers, thereby reducing the risk of data leakage. In addition, they can be customized with expertise specific to certain fields, like optometry, making them suitable for specialized industries or applications. 19 Interestingly, the overall performance of locally deployed LLMs was comparable to that of human teachers but inferior to that of the online LLMs. Possible reasons include the limitations in the local models’ knowledge base, which is often restricted in size and diversity due to storage and processing capabilities. In contrast, online models typically possess vast, diverse, and continually updated knowledge bases, allowing them to handle a wide range of questions. We propose that a hybrid approach may be optimal: local models handle structured, curriculum-based questions, particularly in environments where data privacy or limited network access is a concern, while online models are leveraged for open-ended, complex queries, particularly for exploratory learning, interdisciplinary questions, and addressing emerging topics. In all cases, teacher supervision is essential to ensure information accuracy.
This study showed several advantages of LLMs in answering students’ questions. Firstly, LLMs responded quickly and efficiently. Even the slowest model, GPT-4.0, took only 43 s/response, whereas human instructors took approximately 5 times longer, averaging 205 s/response. In addition, LLMs provided answers much longer in word count compared to human teachers, demonstrating their ability to provide more comprehensive information. LLMs were also able to generate comprehensive answers with appropriate supplementary data, such as providing examples to aid understanding (Supplementary information 3), and included well-organized, point-by-point formats (Supplementary information 4). In contrast, human teachers generally provided brief answers with limited elaboration (Supplementary information 5).
The limitations of LLMs, primarily concerning the accuracy in identifying and answering questions, should also be considered. For instance, when a student mentioned the “swimming effect in the lens,” he was in fact asking about the “prismatic effect.” Yet Mistral-7B and Llama-2-13B misunderstood it as a question about the impact of swimming on the eyes while wearing contact lenses (Supplementary information 6). Similarly, when responding to “Why does the image move towards the top of the prism?” Claude-3 responded with, “but I don't see any image that was uploaded” (Supplementary information 7). These misunderstandings may be attributed to the vague or underspecified nature of the questions, whereas the human teachers, having prior course context, could easily address the students’ concerns. Regarding some computational problems, LLMs provided varying solution processes and answers, which confused the students (Supplementary information 8). Therefore, providing more context, such as background information, detailed descriptions of the question, and the full names of key terms, may be an effective way to reduce misunderstandings. Finally, although LLMs can provide relevant images and PDF materials for answers, they currently lack the ability to interpret images embedded in the questions.
Medical students hold a positive attitude towards the integration of AI into medical education.28,29 A survey study involving 3133 medical students from 63 countries revealed that 72.2% of students viewed AI as a partner rather than a competitor. Additionally, they found the development of AI exciting (69.9%) and believed it should be integrated into medical training (85.6%). 30 This study investigated the opinions of optometry students on the integration of LLMs into their traditional curriculum. The results showed that most students were moderately satisfied with both the answers provided by human teachers and AI (3.7/5). Half of the students rated “I prefer the teacher's answers” at 4 or 5, indicating a general preference for human responses. They also found teachers’ answers easier to understand (4.2/5), while LLMs provided more comprehensive responses (3.8/5). However, for the statement “AI can replace teachers in answering my optometry questions,” they rated only 2.9 out of 5, with 8 students (26.7%) scoring 2 or below. This suggests that students prefer having their questions answered by teachers. One possible reason is that when they are unable to determine the accuracy of an answer, they tend to place trust in human teachers over AI. In addition, AI responses are often perceived as rigid and lengthy, lacking interpersonal interaction and emotional engagement.
Although prior studies have compared LLMs with humans in medical and educational tasks, our work addresses a distinct and previously underexplored gap. We designed a novel evaluation framework built around authentic, course-based questions that students generated during real learning processes, rather than using researcher-designed or exam-style prompts. The open-ended, context-rich questions better capture the complexity and ambiguity of real professional education, offering a more rigorous test of LLMs’ ability to support higher-level learning. In addition, by conducting a multiple comparison among classroom instructors, locally deployed LLMs, and online LLMs, our study extends existing research that typically evaluates a single model type. This design examines not only accuracy but also other dimensions—clarity, comprehensiveness, and influence on student engagement—which are central to real-world educational decision-making. Importantly, while optometry served as the initial application context, the framework itself is broadly generalizable to other specialized health-science disciplines that face similar challenges: limited faculty resources, high cognitive demands, and a need for individualized feedback. Thus, our findings speak not only to optometry education but also to the wider question of how LLMs can be meaningfully integrated into professional training environments.
The present study has several limitations. First, the generalizability of the findings to broader medical education contexts may be limited. Second, the study did not include the most up-to-date reasoning models, which have been shown to outperform non-reasoning models on board-style questions,24,31 nor did it include GPT-5. Third, image-based questions were not incorporated, even though some of the evaluated models possess multimodal capabilities. Fourth, the dataset relied heavily on optical principles (92 out of 108 questions), which may restrict the applicability of the results to other areas of ophthalmic or medical education. Finally, a subjective and non-standardized rubric was used for grading, which may have introduced evaluator bias.
Conclusion
AI has been significantly impacting various aspects of medical education. In this study, some LLMs significantly outperformed human teachers in terms of accuracy, completeness, and comprehensibility. This superiority can be attributed to the vast amount of data available to LLMs and their powerful learning capabilities, as well as the fact that teachers often do not have enough time to consult materials for better responses. Our findings suggest that the integration of AI can introduce new dimensions and vitality to ophthalmology and optometry education. While integrating AI into medical education offers great potential to enhance learning and efficiency, it requires careful attention to its accuracy, reliability, ethical concerns, limited understanding of complex medical contexts, and the risk of over-reliance. A balanced approach, combining AI tools with traditional teaching methods and human oversight, is essential to ensure that future healthcare professionals are fully equipped to deliver high-quality patient care.
Supplemental Material
Supplemental material for "Comparative Performance Evaluation of Large Language Models and Human Teachers in Answering Optometry Questions from Medical Undergraduates" by Zijing Huang, Tian Lin, Huini Lin, Yuanjin Zheng, Man Pan Chin, Hongxi Wang, Peigeng Xu and Haoyu Chen in Journal of Medical Education and Curricular Development:
- sj-tif-1-mde-10.1177_23821205251409499
- sj-tif-2-mde-10.1177_23821205251409499
- sj-pdf-3-mde-10.1177_23821205251409499
- sj-pdf-4-mde-10.1177_23821205251409499
- sj-pdf-5-mde-10.1177_23821205251409499
- sj-pdf-6-mde-10.1177_23821205251409499
- sj-pdf-7-mde-10.1177_23821205251409499
- sj-pdf-8-mde-10.1177_23821205251409499
- sj-docx-9-mde-10.1177_23821205251409499
Acknowledgements
We express our gratitude to the 30 undergraduate students of Optometric Medicine & Ophthalmology from the class of 2022 at Shantou University Medical College who raised questions in this study.
Footnotes
ORCID iDs: Zijing Huang https://orcid.org/0000-0003-2909-2538
Man Pan Chin https://orcid.org/0009-0006-2542-9593
Haoyu Chen https://orcid.org/0000-0003-0676-4610
Ethical Considerations: This study followed the tenets of the Declaration of Helsinki and was approved by a local Research Ethics Committee in the Joint Shantou International Eye Center (JSIEC) of Shantou University and the Chinese University of Hong Kong (EC 20240316(2)-p22).
Consent to Participate: All participants had been informed of the overall objectives of the study and the procedures, and provided written informed consent.
Consent for Publication: All participants were informed about the study and provided written informed consent.
Author Contributions: H.C. made a substantial contribution to the concept or design of the work. Z.H., H.L., Y.Z., M.P.C., H.W., and P.X. performed the study. T.L., H.C., and H.L. provided the 5 AI models and ran the questions through the models. Z.H. analyzed and interpreted the data. Z.H. drafted the article, and H.C. revised it critically for important intellectual content. All authors have read, revised, and approved the final manuscript.
Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the Natural Science Foundation of Guangdong Province, China (2024A1515012992), the National Natural Science Foundation of China (82471082), and the Guangdong Postgraduate Education Innovation Project (2022JGXM069).
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability: The data that support the findings of this study are available from the corresponding author (H.C.) upon reasonable request.
Supplemental Material: Supplemental material for this article is available online.
References
1. Holden BA, Fricke TR, Wilson DA, et al. Global prevalence of myopia and high myopia and temporal trends from 2000 through 2050. Ophthalmology. 2016;123(5):1036–1042. doi: 10.1016/j.ophtha.2016.01.006
2. The role of optometry in Vision 2020. Community Eye Health. 2002;15(43):33–36.
3. Gammoh Y, Morjaria P, Block SS, Massie J, Hendicott P. 2023 Global survey of optometry: defining variations of practice, regulation and human resources between countries. Clin Optom (Auckl). 2024;16:211–220. doi: 10.2147/OPTO.S481096
4. Huang Z, Yang J, Wang H, Pang CP, Chen H. Comparison of digital camera real-time display with conventional teaching tube for slit lamp microscopy teaching. Curr Eye Res. 2022;47(1):161–164. doi: 10.1080/02713683.2021.1952606
5. Wang H, Liao X, Zhang M, Pang CP, Chen H. Smartphone ophthalmoscope as a tool in teaching direct ophthalmoscopy: a crossover randomized controlled trial. Med Educ Online. 2023;28(1):2176201. doi: 10.1080/10872981.2023.2176201
6. Wang H, Liao X, Zhang M, Pang CP, Chen H. A simple eye model for objectively assessing the competency of direct ophthalmoscopy. Eye (Lond). 2022;36(9):1789–1794. doi: 10.1038/s41433-021-01730-8
7. Huang Z, Yang J, Wang H, Chen B, Zheng D, Chen H. Integration of massive open online course (MOOC) in ophthalmic skills training for medical students: outcomes and perspectives. Asia Pac J Ophthalmol (Phila). 2022;11(6):543–548. doi: 10.1097/APO.0000000000000548
8. Chan KS, Zary N. Applications and challenges of implementing artificial intelligence in medical education: integrative review. JMIR Med Educ. 2019;5(1):e13930. doi: 10.2196/13930
9. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–1940. doi: 10.1038/s41591-023-02448-8
10. Karanikolas N, Manga E, Samaridi N, Tousidou E, Vassilakopoulos M. Large language models versus natural language understanding and generation. Proceedings of the 27th Pan-Hellenic Conference on Progress in Computing and Informatics. Association for Computing Machinery (ACM); 2023:278–290.
11. Safranek CW, Sidamon-Eristoff AE, Gilson A, Chartash D. The role of large language models in medical education: applications and implications. JMIR Med Educ. 2023;9:e50945. doi: 10.2196/50945
12. Zhang Q, Huang Z, Huang Y, et al. Generative AI in medical education: feasibility and educational value of LLM-generated clinical cases with MCQs. BMC Med Educ. 2025;25(1):1502. doi: 10.1186/s12909-025-08085-8
13. Abd-Alrazaq A, AlSaad R, Alhuwail D, et al. Large language models in medical education: opportunities, challenges, and future directions. JMIR Med Educ. 2023;9:e48291. doi: 10.2196/48291
14. Navigli R, Conia S, Ross B. Biases in large language models: origins, inventory, and discussion. ACM J Data Inf Qual. 2023;15(2):1–21.
15. Longwell JB, Hirsch I, Binder F, et al. Performance of large language models on medical oncology examination questions. JAMA Netw Open. 2024;7(6):e2417641. doi: 10.1001/jamanetworkopen.2024.17641
16. Gilson A, Safranek CW, Huang T, et al. How does ChatGPT perform on the United States Medical Licensing Examination (USMLE)? The implications of large language models for medical education and knowledge assessment. JMIR Med Educ. 2023;9:e45312. doi: 10.2196/45312
17. Vandenbroucke JP, von Elm E, Altman DG, et al. Strengthening the Reporting of Observational Studies in Epidemiology (STROBE): explanation and elaboration. Epidemiology. 2007;18(6):805–835. doi: 10.1097/EDE.0b013e3181577511
18. Kedia N, Sanjeev S, Ong J, Chhablani J. ChatGPT and beyond: an overview of the growing field of large language models and their use in ophthalmology. Eye (Lond). 2024;38(7):1252–1261. doi: 10.1038/s41433-023-02915-z
19. Wang C, Ong J, Wang C, Ong H, Cheng R, Ong D. Potential for GPT technology to optimize future clinical decision-making using retrieval-augmented generation. Ann Biomed Eng. 2024;52(5):1115–1118. doi: 10.1007/s10439-023-03327-6
20. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol. 2023;29(3):721–732. doi: 10.3350/cmh.2023.0089
21. Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. 2023;105(19):1519–1526. doi: 10.2106/JBJS.23.00209
22. Sauder M, Tritsch T, Rajput V, Schwartz G, Shoja MM. Exploring generative artificial intelligence-assisted medical education: assessing case-based learning for medical students. Cureus. 2024;16(1):e51961. doi: 10.7759/cureus.51961
23. Moshirfar M, Altaf AW, Stoakes IM, Tuttle JJ, Hoopes PC. Artificial intelligence in ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions. Cureus. 2023;15(6):e40822. doi: 10.7759/cureus.40822
24. Shean R, Shah T, Sobhani S, et al. OpenAI o1 large language model outperforms GPT-4o, Gemini 1.5 flash, and human test takers on ophthalmology board-style questions. Ophthalmol Sci. 2025;5(6):100844. doi: 10.1016/j.xops.2025.100844
25. Ghosh A, Bir A. Evaluating ChatGPT's ability to solve higher-order questions on the competency-based medical education curriculum in medical biochemistry. Cureus. 2023;15(4):e37023. doi: 10.7759/cureus.37023
26. Cardona G, Argiles M, Perez-Mana L. Accuracy of a large language model as a new tool for optometry education. Clin Exp Optom. 2025;108(3):343–346. doi: 10.1080/08164622.2023.2288174
27. Scanzera AC, Shorter E, Kinnaird C, et al. Optometrist's perspectives of artificial intelligence in eye care. J Optom. 2022;15(Suppl 1):S91–S97. doi: 10.1016/j.optom.2022.06.006
28. Tangadulrat P, Sono S, Tangtrakulwanich B. Using ChatGPT for clinical practice and medical education: cross-sectional survey of medical students' and physicians' perceptions. JMIR Med Educ. 2023;9:e50658. doi: 10.2196/50658
29. Moldt JA, Festl-Wietek T, Madany Mamlouk A, Nieselt K, Fuhl W, Herrmann-Werner A. Chatbots for future docs: exploring medical students' attitudes and knowledge towards artificial intelligence and medical chatbots. Med Educ Online. 2023;28(1):2182659. doi: 10.1080/10872981.2023.2182659
30. Bisdas S, Topriceanu CC, Zakrzewska Z, et al. Artificial intelligence in medicine: a multinational multi-center survey on the medical and dental students' perception. Front Public Health. 2021;9:795284. doi: 10.3389/fpubh.2021.795284
31. Taloni A, Sangregorio AC, Alessio G, et al. Large language models provide discordant information compared to ophthalmology guidelines. Sci Rep. 2025;15(1):20556. doi: 10.1038/s41598-025-06404-z