Graefes Arch Clin Exp Ophthalmol. 2026 Jan 3;264(5):1481–1488. doi: 10.1007/s00417-025-07092-1

Comparing the performance of four mainstream large language models on medical literature review generation: a human expert evaluation in SMILE surgery

Mengyun Zhou 1,2,#, Fu Gui 1,#, Chong Ai 1,#, Ling Ling 3, Xian Zhang 3, Yalin Lu 1, Dongmei Han 4, Bin Zhao 4, Fei Zhong 5, Jie Liu 6, Zeyu Zhu 7, Jiayu Li 8, Fei Huang 1, Chuyang Lin 1, Weifeng Liu 1, Jian Xiong 1
PMCID: PMC13091840  PMID: 41484266

Abstract

Purpose

To systematically evaluate and compare the performance of four leading large language models (LLMs) in generating medical literature reviews across topics of varying research maturity, thereby providing insights for their effective and responsible application in academic writing.

Methods

In this comparative study, using standardized prompts, we instructed four leading LLMs (GPT-4, Gemini 2.5 Pro, Grok-3, and DeepSeek R1) to generate literature reviews on nine topics related to small incision lenticule extraction (SMILE) surgery. These topics were categorized into three groups by research maturity: well-researched, controversial, and open. Seven ophthalmology experts evaluated the generated content across four dimensions: quality, accuracy, bias, and relevance, while all references were verified for authenticity. Performance differences among models were evaluated using group comparison tests followed by post-hoc analysis.

Results

Significant performance variations were identified across all four models and dimensions (p < 0.001). Specifically, Gemini ranked highest in content quality, accuracy, and bias control. In contrast, DeepSeek, despite scoring high on quality, received the lowest relevance score. Grok-3 demonstrated the highest reference authenticity (p < 0.001), whereas GPT-4 showed the lowest (p < 0.001). All models showed diminished performance on open topics and exhibited severe reference fabrication (“hallucinations”).

Conclusion

Rather than excelling universally, LLMs exhibit distinct and task-specific strengths that mandate a task-driven, hybrid strategy in tool selection. Reference fabrication was found to be a pervasive issue across all models, regardless of the task topic, elevating human verification from a best practice to an essential safeguard for academic integrity.

Supplementary Information

The online version contains supplementary material available at 10.1007/s00417-025-07092-1.

Keywords: Large language models, SMILE surgery, Medical content generation, Literature review, AI evaluation

Introduction

Large language models (LLMs), represented by OpenAI’s GPT series and Google’s Gemini, have undergone rapid and remarkable evolution. Initially designed as conversational AI, these models have now demonstrated the ability to generate highly accurate content across multiple domains, including medicine [1, 2]. In the field of ophthalmology, they are being used to assist in disease identification [3], preclinical management planning [4], and education [5]. LLMs are increasingly assessed for their potential to support clinical decision-making and to contribute to innovation in ophthalmology [6]. Beyond clinical applications, they also assist scientific writing by summarizing literature, generating hypotheses, and improving manuscript clarity, underscoring their academic value [7, 8].

Despite their powerful capabilities, LLMs are not without limitations. Studies have shown that these models may exhibit biases originating from training data, model fine-tuning, or human interactions [9]. LLMs are susceptible to generating ‘hallucinations’—content that is not factually grounded. This issue becomes particularly prominent when addressing open-ended or cutting-edge topics [10]. Previous studies have examined individual LLMs in isolated contexts [11, 12], but few have compared multiple mainstream models within a unified framework or assessed their performance across topics of varying maturity. This gap leads to an important question: can LLMs maintain reliability and accuracy when moving from well-established medical knowledge to more exploratory or emerging areas? Answering this question is essential for guiding the responsible and task-appropriate use of LLMs in medical research and education.

With the continued rise in the prevalence of myopia, an increasing number of individuals are opting for small incision lenticule extraction (SMILE) surgery [13], making related clinical management research an emerging focus [14, 15]. The field of SMILE surgery provides an ideal case study, as it contains a rich mix of well-established procedures, ongoing clinical controversies, and emerging areas of innovation. Therefore, this study aims to systematically evaluate and compare the performance of four mainstream LLMs (OpenAI’s GPT-4, Google’s Gemini, xAI’s Grok-3, and DeepSeek AI’s DeepSeek) in generating reviews related to SMILE surgery. The evaluation encompasses topics ranging from well-researched to controversial and even open-ended areas, with the goal of offering clinicians a practical guide for the judicious and effective use of these AI tools.

Methods

Topic selection and classification

The overall process of this study is shown in Fig. 1. This study first identified nine core topics related to the management of SMILE surgery. The topic selection aimed to capture current clinical priorities, emerging trends, and areas of academic controversy in ophthalmology. A systematic PubMed search was conducted for each topic to establish a solid literature foundation. Subsequently, a panel of three domain experts (Dr. Xiong, Dr. Han, and Dr. Ling) classified the topics through a voting process. A classification was finalized when at least two of the three experts reached consensus.

Fig. 1

Study flowchart of topic selection, model testing, and expert evaluation. Nine SMILE-related topics were classified by experts into researched, controversial, and open categories. Four large language models (GPT-4, Gemini 2.5 Pro, Grok-3, and DeepSeek R1) generated literature reviews using standardized prompts. Seven ophthalmology experts independently evaluated the anonymized outputs for quality, accuracy, bias, and relevance, followed by statistical comparisons among models and topic types

The criteria for topic classification were as follows: (1) Researched topics: Topics with extensive peer-reviewed literature, characterized by well-established theories and practices within the field of SMILE surgery. This category tests the model’s ability to synthesize and accurately summarize a large body of established evidence; (2) Controversial topics: Topics that have generated considerable debate, with studies presenting conflicting results or divergent perspectives, making them focal points of ongoing discussion. This category evaluates the model’s capacity to present a balanced view, handle conflicting data, and avoid taking a biased stance; (3) Open topics: Topics for which the existing literature is scarce or lacking, representing potential directions and opportunities for future research and technological innovation in SMILE surgery management. This category assesses the model’s ability to reason from limited information, identify knowledge gaps, and propose logical future directions without resorting to fabrication.

Prompts for LLMs and expert evaluation

Four mainstream LLMs were selected for testing in this study: OpenAI’s GPT-4, Google’s Gemini 2.5 Pro, xAI’s Grok-3, and DeepSeek AI’s DeepSeek R1. To better reflect their accessibility in real-world clinical and research settings with limited resources, the freely available or basic-entry versions of these models were deliberately chosen. Specifically, GPT-4 was accessed through a basic subscription, Gemini 2.5 Pro via Google’s free interface, Grok-3 through its public platform, and DeepSeek R1 via its open-access release. All prompts and generated content were exclusively in English, and no translation services were utilized. This approach enhances the practical relevance of the study findings by simulating conditions under which these tools are likely to be used. For each model, a standardized prompt was applied: “Write a 1,500-word literature review on the topic, summarizing the current state of research and future research directions.” All generated reviews were collected. To ensure rigor in the evaluation, the accuracy and authenticity of the references cited in each review were also verified.
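For illustration only, the sketch below shows how the standardized prompt could be paired with each model and each of the nine topics (listed in Table 1) to yield the 36 review-generation tasks; the generate_review wrapper is hypothetical, since the reviews in this study were obtained through each model’s public chat interface rather than an API.

```python
# Illustrative sketch of the prompting protocol; generate_review() is a
# hypothetical wrapper, as the study used each model's public chat interface.

TOPICS = [
    "Long-term Stability and Safety after SMILE Surgery",
    "Biomechanics of SMILE Surgery",
    "Dry Eye after SMILE Surgery",
    "Ectasia after SMILE Surgery",
    "SMILE for Astigmatism Correction",
    "SMILE for Hyperopia Correction",
    "How to Enhance the Predictability of SMILE",
    "How to Control Intraoperative Complications of SMILE",
    "How to Improve the Quality of Vision of SMILE Surgery",
]
MODELS = ["GPT-4", "Gemini 2.5 Pro", "Grok-3", "DeepSeek R1"]

PROMPT = ("Write a 1,500-word literature review on the topic, summarizing the "
          "current state of research and future research directions.")

def generate_review(model: str, topic: str, prompt: str) -> str:
    """Hypothetical interface to the respective chat platform."""
    raise NotImplementedError

# 4 models x 9 topics = 36 review-generation tasks, matching the study design
tasks = [(model, topic, PROMPT) for model in MODELS for topic in TOPICS]
assert len(tasks) == 36
```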

To quantitatively assess model performance, seven ophthalmologists with extensive experience in SMILE surgery (F. Gui, F. Zhong, X. Zhang, Z. Zhu, J. Li, J. Liu, B. Zhao) were convened. The generated reviews were anonymized, removing any indicators of the source LLM. All experts had more than three years of clinical experience and formal training in SMILE surgical techniques. Using predefined scoring criteria (Appendix 1) [16], they independently evaluated each review across four dimensions—quality, accuracy, bias, and relevance—on a scale from 0 to 100. Quality was assessed based on clarity and logical coherence; accuracy measured factual correctness; bias examined the presence of any discriminatory or skewed content; and relevance evaluated the degree to which the content aligned with the assigned topic.

Data analysis

All statistical analyses were conducted using SPSS software (version 25.0, IBM Corporation, USA). To evaluate the quality of references generated by each model, all cited references underwent a rigorous authenticity check. Each reference in the reviews was manually retrieved and verified through academic databases such as PubMed and Google Scholar. Based on the verification results, each reference was assigned a score as follows: (1) True and relevant = 1 point; (2) True but not relevant = 0 points; (3) Fictitious (hallucination) = − 1 point. The overall reference quality score for each model across different topics was then calculated and statistically compared. To compare the distribution of the overall reference quality scores among the four language models, a Kruskal-Wallis H test was performed. If a significant difference was found, Dunn’s post-hoc test with a Bonferroni correction was used for pairwise comparisons. We used the Pearson Chi-square test to evaluate the association between model type and the distribution of reference categories (‘True and relevant,’ ‘True but not relevant,’ and ‘Fictitious’). Pairwise z-tests for column proportions with a Bonferroni correction were conducted as a post-hoc analysis to identify specific differences between the models for each category.
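Although all analyses in this study were performed in SPSS 25.0, the following sketch illustrates an equivalent reference-authenticity workflow in Python. It assumes a hypothetical long-format table refs with one row per verified citation and columns model, topic, and category, and it uses the third-party scikit-posthocs package for Dunn’s test.

```python
import pandas as pd
from scipy.stats import kruskal, chi2_contingency
import scikit_posthocs as sp  # third-party package providing Dunn's post-hoc test

# Per-citation scoring scheme used in this study
SCORES = {"true_relevant": 1, "true_irrelevant": 0, "fictitious": -1}

def reference_analysis(refs: pd.DataFrame) -> dict:
    """refs: one row per verified citation with columns model, topic, category."""
    refs = refs.assign(score=refs["category"].map(SCORES))

    # Overall reference quality score per model and topic (sum of citation scores)
    quality = refs.groupby(["model", "topic"], as_index=False)["score"].sum()

    # Kruskal-Wallis H test comparing score distributions among the four models
    groups = [g["score"].values for _, g in quality.groupby("model")]
    h_stat, p_kw = kruskal(*groups)

    # Dunn's post-hoc test with Bonferroni correction, only if the omnibus test is significant
    dunn = (sp.posthoc_dunn(quality, val_col="score", group_col="model",
                            p_adjust="bonferroni") if p_kw < 0.05 else None)

    # Pearson chi-square test on the model x category contingency table
    table = pd.crosstab(refs["model"], refs["category"])
    chi2, p_chi, dof, _ = chi2_contingency(table)

    return {"kruskal": (h_stat, p_kw), "dunn": dunn,
            "chi_square": (chi2, p_chi, dof), "contingency": table}
```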

Inter-rater reliability for the seven ophthalmologists was assessed using the intraclass correlation coefficient (ICC) based on a two-way random effects model with an absolute agreement definition. The average measures ICC was calculated for each of the four evaluation dimensions (quality, accuracy, bias, and relevance).
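As a sketch of the same reliability analysis in Python rather than SPSS, pingouin’s intraclass_corr reports the average-measures ICC under a two-way random effects, absolute agreement model as ICC2k; the long-format ratings table with columns review_id, rater, and score is an assumed layout, not the study’s actual data file.

```python
import pandas as pd
import pingouin as pg  # third-party package providing ICC computation

def average_measures_icc(ratings: pd.DataFrame) -> float:
    """ratings: one row per (review, rater) pair for a single evaluation
    dimension, with columns review_id, rater, and score."""
    icc = pg.intraclass_corr(data=ratings, targets="review_id",
                             raters="rater", ratings="score")
    # ICC2k: two-way random effects, absolute agreement, average of k raters
    return float(icc.loc[icc["Type"] == "ICC2k", "ICC"].iloc[0])
```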

The expert evaluation scores for each dimension (quality, accuracy, bias, and relevance) were then analyzed. The data were first summarized using descriptive statistics, including the mean, median, and interquartile range (IQR). The normality of the score distributions was assessed using the Shapiro–Wilk test. As scores were not normally distributed, non-parametric tests were applied for group comparisons from two perspectives. First, to evaluate performance differences among the four LLMs, the Kruskal–Wallis H test was conducted for each evaluation dimension. Second, to examine the influence of topic type (researched, controversial, or open) on the generated content, a separate Kruskal–Wallis H test was applied for each dimension, using topic type as the grouping variable. For any test showing a significant overall difference (p < 0.05), Dunn’s post-hoc test with Bonferroni correction was performed to identify which specific pairs of groups differed significantly.
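The per-dimension comparisons can be sketched in the same illustrative way. The snippet below assumes a long-format scores table with columns model, topic_type, and one column per evaluation dimension, and runs the Shapiro-Wilk check followed by Kruskal-Wallis and Dunn tests from the two grouping perspectives described above.

```python
import pandas as pd
from scipy.stats import shapiro, kruskal
import scikit_posthocs as sp  # third-party package providing Dunn's post-hoc test

DIMENSIONS = ["quality", "accuracy", "bias", "relevance"]

def compare_scores(scores: pd.DataFrame, group_col: str) -> dict:
    """Kruskal-Wallis per dimension with group_col set to 'model' or
    'topic_type', followed by Dunn's test with Bonferroni correction when
    the omnibus test is significant."""
    results = {}
    for dim in DIMENSIONS:
        _, p_norm = shapiro(scores[dim])  # normality of the score distribution
        groups = [g[dim].values for _, g in scores.groupby(group_col)]
        h_stat, p_kw = kruskal(*groups)
        dunn = (sp.posthoc_dunn(scores, val_col=dim, group_col=group_col,
                                p_adjust="bonferroni") if p_kw < 0.05 else None)
        results[dim] = {"shapiro_p": p_norm, "kruskal": (h_stat, p_kw), "dunn": dunn}
    return results

# The two perspectives described above:
# by_model = compare_scores(scores, "model")
# by_topic_type = compare_scores(scores, "topic_type")
```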

Ethics statement

This study involved only AI-generated content and did not include human participants; therefore, traditional ethical concerns such as informed consent and confidentiality did not apply.

Results

Characteristics of AI-Generated reviews

The nine topics selected for this study were evenly divided into three categories: researched, controversial, and open (Table 1). Four LLMs were employed to generate comprehensive reviews for each topic, resulting in a total of 36 manuscripts.

Table 1.

Attributes of topics selected for analysis

No. Topics Categorization
1 Long-term Stability and Safety after SMILE Surgery Researched
2 Biomechanics of SMILE Surgery Researched
3 Dry Eye after SMILE Surgery Researched
4 Ectasia after SMILE Surgery Controversial
5 SMILE for Astigmatism Correction Controversial
6 SMILE for Hyperopia Correction Controversial
7 How to Enhance the Predictability of SMILE Open
8 How to Control Intraoperative Complications of SMILE Open
9 How to Improve the Quality of Vision of SMILE Surgery Open

Abbreviation: SMILE small incision lenticule extraction

Analysis of reference authenticity revealed significant performance differences among the models, as shown in Table 2. Grok demonstrated the best performance, achieving the highest overall reference quality score (4.11 ± 1.90), primarily because it generated the largest proportion of “True and relevant” references (64.0%) and the lowest proportion of “Fictitious” ones (25.6%). In contrast, GPT and Gemini showed the poorest performance in terms of authenticity. GPT recorded the lowest overall score (−1.00 ± 2.12) and, together with Gemini, produced the highest proportions of fictitious references (49.5% and 45.8%, respectively). These differences were statistically significant, particularly highlighting Grok’s markedly lower hallucination rate compared with GPT and Gemini.

Table 2.

Authenticity analysis and pairwise comparison of references generated by four large language models

DeepSeek Gemini GPT Grok p
Total References 200 96 95 86 N/A
Overall Reference Quality Score1 0.006
(Mean ± SD) 1.67 ± 4.66ᵃᵇ 0.44 ± 2.92ᵃᵇ −1.00 ± 2.12ᵇ 4.11 ± 1.90ᵃ
Distribution of Reference Categories2 0.011
True and relevant 48.0%ᵃᵇ 50.0%ᵃᵇ 40.0%ᵇ 64.0%ᵃ
True but not relevant 11.5%ᵃ 4.2%ᵃ 10.5%ᵃ 10.5%ᵃ
Fictitious 40.5%ᵃᵇ 45.8%ᵇ 49.5%ᵇ 25.6%ᵃ
N/A Not Applicable. SD Standard Deviation

1Results from the Kruskal-Wallis H test. Within this row, means that do not share a common superscript letter are significantly different (p < 0.05) according to the Dunn’s post-hoc test with Bonferroni correction

2Results from the Pearson Chi-square test (χ²(6) = 16.53, p = 0.011). Within each category row, percentages that do not share a common superscript letter are significantly different (p < 0.05) according to pairwise z-tests for column proportions with Bonferroni correction

Expert evaluation of AI-Generated reviews

To ensure the consistency of the expert evaluations, inter-rater reliability was calculated. The average measures Intraclass Correlation Coefficient (ICC) indicated good to excellent agreement across all four dimensions, with ICC values ranging from 0.877 to 0.980. The full analysis for each dimension is presented in Appendix 2 (Table S1).

We further analyzed the expert evaluation scores for the 36 reviews. As the score distributions were found to be predominantly non-normal (Appendix 2, Table S2), a Kruskal-Wallis H test was conducted and revealed statistically significant differences among the models across all four dimensions (p < 0.001; Table 3). Subsequent post-hoc analysis with Dunn’s test (Appendix 2, Table S3) identified distinct performance hierarchies for each metric. For Quality, Gemini and DeepSeek significantly outperformed GPT, which was in turn superior to Grok. A more granular stratification was observed for Accuracy, where Gemini was significantly superior to GPT, followed by DeepSeek and then Grok, with all pairwise comparisons yielding significant differences. For Bias (where lower scores indicate better performance), Gemini showed significantly less bias than DeepSeek, which performed significantly better than GPT and Grok, between which no significant difference was found. Finally, regarding Relevance, Gemini, GPT, and Grok formed a statistically indistinguishable top-performing group, all of which scored significantly higher than DeepSeek.

Table 3.

Expert evaluation scores for LLM-generated reviews across four dimensions

Model Quality Accuracy Bias Relevance
DeepSeek 88.00 (9.00)ᵃ 80.00 (11.00)ᶜ 28.00 (12.00)ᵇ 80.00 (3.00)ᵇ
Gemini 88.00 (8.00)ᵃ 90.00 (5.00)ᵃ 22.00 (11.00)ᵃ 87.00 (3.00)ᵃ
GPT 82.00 (10.00)ᵇ 85.00 (6.00)ᵇ 35.00 (14.00)ᶜ 87.00 (5.00)ᵃ
Grok 74.00 (12.00)ᶜ 75.00 (10.00)ᵈ 35.00 (16.00)ᶜ 86.00 (8.00)ᵃ
p-value < 0.001 < 0.001 < 0.001 < 0.001

Values are presented as Median (Interquartile Range)

Overall p-value from the Kruskal-Wallis H test

Within each column, medians that do not share a common superscript letter (a, b, c, d) are significantly different (p < 0.05) based on Dunn’s post-hoc test

In addition to differences among models, the nature of the topic itself significantly influenced the evaluation scores (Table 4). For Quality and Relevance, reviews generated on Researched topics received significantly higher scores than those on both Controversial and Open topics. A similar pattern was observed for Accuracy: Researched topics scored significantly higher than Open topics, but not significantly higher than Controversial topics. A clear gradient was observed for Bias, where lower scores indicate better performance. Researched topics had the lowest bias scores, which were significantly better than those of Controversial topics. In turn, Controversial topics scored significantly better than Open topics.

Table 4.

Comparison of expert evaluation scores across different topic types

Topic type Metric Quality Accuracy Bias Relevance
Researched Mean 87.30 84.60 23.24 87.01
Median (IQR) 89.00 (8.00)ᵃ 85.00 (11.00)ᵃ 22.00 (8.00)ᵃ 88.00 (8.00)ᵃ
Controversial Mean 83.98 82.31 31.06 84.55
Median (IQR) 86.00 (10.00)ᵇ 82.00 (12.00)ᵃᵇ 28.50 (20.00)ᵇ 85.00 (5.00)ᵇ
Open Mean 79.08 80.30 34.62 84.33
Median (IQR) 80.00 (10.00)ᶜ 82.00 (16.00)ᵇ 34.00 (8.00)ᶜ 86.00 (8.00)ᵇ

The primary comparison metric is the Median (IQR), with Mean provided for supplementary information. Within each column, medians that do not share a common superscript letter (a, b, c) are significantly different (p < 0.05) based on Dunn’s post-hoc test. The overall Kruskal-Wallis test was significant for all four dimensions (p < 0.001)

Examples of fabricated references generated by LLMs

Manual verification identified three primary categories of reference fabrication across all LLMs. A detailed description and a representative example for each fabrication type are presented in Appendix 2 (Table S4).

Discussion

This study systematically evaluated four leading LLMs for medical literature review generation, revealing that no single model is universally superior. Instead, each LLM exhibits a distinct performance profile, underscoring the need for a task-oriented selection strategy rather than a one-size-fits-all approach.

Our comparison of the four language models revealed distinct performance profiles, likely stemming from fundamental differences in their architecture, training data, and deployment strategies. Gemini demonstrated the strongest overall performance, achieving the highest scores in quality, accuracy, and bias control, and can thus be positioned as a “drafting scholar” that produces coherent, objective, and unbiased text. These strengths likely come from multimodal pretraining and alignment techniques that integrate textual and factual signals into a unified reasoning process [17, 18]. GPT maintained strong relevance and completeness, though with limited quality, aligning with the profile of a “reliable synthesizer.” This pattern aligns with its broad pretraining and reinforcement learning from human feedback, which emphasize stability and consistency rather than exploratory reasoning [19]. DeepSeek performed well in logical organization and served effectively as an “outlining assistant,” pairing a high quality score with the lowest relevance score. This likely reflects the absence of large proprietary datasets and advanced alignment pipelines, which restrict its ability to capture subtle topic differences despite strong structural clarity [20–22]. Grok exhibited a clear paradox: it achieved the highest rate of legitimate citations but lower quality and accuracy. This may result from its core design, which heavily emphasizes Retrieval-Augmented Generation (RAG) to access real-time information, thereby prioritizing verifiable sources. While this makes Grok a strong “information retriever,” its reliance on retrieval may limit its ability to abstract, integrate, and synthesize information into a coherent academic narrative [23].

Our findings demonstrate that the type of academic topic is a key factor influencing LLM performance. On aggregate, well-established topics consistently achieved higher scores in quality, relevance, and accuracy, accompanied by lower bias. This pattern is likely attributable to the extensive body of peer-reviewed literature available for these domains, which closely aligns with the corpora used during LLM pretraining and facilitates more contextually grounded generation [24, 25]. In contrast, model performance declined for controversial and emerging topics. For subjects characterized by unresolved debates or limited empirical evidence, LLMs face greater difficulty reconciling divergent viewpoints and identifying reliable knowledge boundaries, resulting in reduced accuracy and increased bias. Prior evaluations have shown that when models operate in domains lacking stable factual anchors, they display higher rates of hallucination, overgeneralization, and confabulated synthesis [26, 27]. These findings indicate that the reliability of LLM-generated content varies substantially across topic types. Accordingly, researchers should apply stricter verification procedures when using LLMs for novel or contentious areas, where the probability of generating unsupported or biased information is demonstrably higher [28].

The performance of open-source models compared with proprietary models is a major topic in academic medicine. Our findings reveal a complex performance profile for the open-source model: it matched the leading proprietary model, Gemini, in content quality, while simultaneously scoring the lowest on relevance. This aligns with findings from tasks that prioritize factual recall, where some studies have shown that proprietary models like OpenAI’s o1 Pro clearly outperform DeepSeek-R1 on standardized ophthalmology questions [29]. Other studies have found that DeepSeek-R1 performs better than Gemini and OpenAI models on tasks that require bilingual reasoning and more complex logic [22]. These findings indicate that model selection should depend on task-specific requirements. Open-source models can reduce cost and offer greater flexibility and customization. However, as reported by Liu et al. [30], they may have limitations in handling multimodal data, which is essential for visually oriented fields such as ophthalmology. They also raise concerns regarding transparency and safety. Therefore, although DeepSeek shows potential for certain specialized tasks, proprietary models still provide stronger overall robustness and better readiness for multimodal integration.

The distinct strengths of each LLM allow researchers to assign tasks based on model capabilities, forming the basis of an AI toolkit framework (Fig. 2). Within this framework, GPT is suited for producing broad initial drafts, Gemini enhances clarity and coherence, DeepSeek supports early-stage conceptual planning, and Grok provides rigorous verification of reference accuracy and factual reliability. It is important to note, however, that this framework should not be interpreted as a rigid sequential workflow. For most academic tasks, relying on four separate LLMs would be impractical and inefficient. Instead, the framework promotes a flexible, task-optimized strategy in which researchers select one or more models according to project-specific needs. For example, Gemini may be used to generate a high-quality initial draft, followed by Grok for dedicated citation verification to reduce hallucination-related risks. Alternatively, when cost considerations are relevant in early planning, DeepSeek can be used to construct the conceptual structure. This adaptable approach helps researchers balance efficiency, accuracy, and resource constraints while maintaining academic rigor.

Fig. 2

A conceptual framework for task-oriented LLM applications in medical research. This framework outlines role-specific strengths of LLMs: GPT as a reliable synthesizer for structured drafting, Gemini as a refining scholar enhancing coherence, DeepSeek as an outlining assistant for idea expansion, and Grok as an information retriever for reference verification. Combined within a human-in-the-loop workflow, they enhance efficiency and maintain academic rigor. Abbreviation: LLM = large language model

A key finding of this study is the widespread presence of hallucinations across all evaluated LLMs, indicating a persistent technical limitation that requires careful attention. Nearly half of the references generated by GPT were fabricated, and Grok, although the best performer, still did not reach a reliable standard. These results show that reference lists produced by current LLMs cannot be trusted without manual verification. This issue stems from LLMs generating text through prediction rather than factual retrieval. To fill knowledge gaps or satisfy an expected output pattern, LLMs may confidently produce plausible but entirely fictional references, which pose risks of misinformation and threaten academic integrity. Our proposed “AI toolkit” offers a potential way to mitigate this challenge. By assigning models to complementary roles, we can leverage their respective strengths. For example, Grok, given its advantage in citation authenticity, could serve as a dedicated citation-checking tool, but even this hybrid approach cannot replace the final, indispensable step of manual verification by a human expert.

This study has several limitations. First, our focus on freely accessible models improves the real-world relevance of this work but limits its generalizability. More advanced API versions and the latest models, including Gemini 3.0, GPT-5 and Grok 4, were not evaluated. Second, the small number of experts and the limited range of topics may restrict the applicability of these findings to other medical fields. Third, while inter-rater reliability was high, the 0-100 scoring scale remains inherently subjective, representing a limitation typical of human expert evaluations. Fourth, LLMs undergo frequent updates, and the performance profiles reported here may evolve over time, suggesting the need for periodic reassessment as newer versions become available. Finally, this study employed a standardized prompt and did not explore advanced prompt techniques (e.g., chain-of-thought, providing few-shot examples), which may potentially improve output quality. Future research should address these limitations through longitudinal studies and broader topic selections.

Conclusion

This study identifies distinct LLM strengths—Gemini for quality, GPT for synthesis, DeepSeek for structure, and Grok for authenticity—supporting a task-oriented “AI toolkit” approach. Nevertheless, the universal generation of hallucinated references underscores that LLMs are powerful augmentation tools, not substitutes for the expert human verification required to ensure the integrity of medical research.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary Material 1 (177.9KB, pdf)
Supplementary Material 2 (102.2KB, pdf)

Acknowledgements

We thank the developers of the large language models used in this research: ChatGPT (OpenAI), Gemini (Google), DeepSeek (DeepSeek AI), and Grok (xAI).

Author contributions

Conceptualization, J.X. and M.Z.; data analysis and collation, M.Z., F.G., C.A., L.L., X.Z., D.H., B.Z., F.Z., JY.L., Z.Z., J.L., F.H., and C.L.; writing—original draft preparation, J.X., M.Z., F.G. and C.A.; writing—review and editing, J.X. and W.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by grants from the National Natural Science Foundation of China (82260214), and the Science and Technology Program of Jiangxi Provincial Health Commission (202210631).

Data availability

Data are provided within the supplementary information files. Correspondence and requests for additional data should be addressed to J.X. and M.Z.

Declarations

Ethics approval and consent to participate

Not applicable. This study did not involve human participants.

Consent for publication

Not applicable. The data used in this study are publicly available and do not involve identifiable personal information.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Mengyun Zhou, Fu Gui and Chong Ai are co-first authors.

Contributor Information

Weifeng Liu, Email: 18970040725@163.com.

Jian Xiong, Email: 894040417@qq.com.

References

1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW (2023) Large language models in medicine. Nat Med 29(8):1930–1940. 10.1038/s41591-023-02448-8
2. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL et al (2023) GPT-4 technical report. arXiv:2303.08774
3. Ha DH, Kim US (2025) Large language models in neuro-ophthalmology diseases: ChatGPT vs Bard vs Bing. Int J Ophthalmol 18(7):1231–1236. 10.18240/ijo.2025.07.05
4. Wu H, Su Z, Pan X, Shao A, Xu Y, Wang Y et al (2025) Enhancing diabetic retinopathy query responses: assessing large language model in ophthalmology. Br J Ophthalmol 109(11):1272–1278. 10.1136/bjo-2024-325861
5. Shean R, Shah T, Sobhani S, Tang A, Setayesh A, Bolo K et al (2025) OpenAI o1 large language model outperforms GPT-4o, Gemini 1.5 Flash, and human test takers on ophthalmology board-style questions. Ophthalmol Sci 5(6):100844. 10.1016/j.xops.2025.100844
6. Agnihotri AP, Nagel ID, Artiaga JCM, Guevarra MCB, Sosuan GMN, Kalaw FGP (2025) Large language models in ophthalmology: a review of publications from top ophthalmology journals. Ophthalmol Sci 5(3):100681. 10.1016/j.xops.2024.100681
7. Forero DA, Abreu SE, Tovar BE, Oermann MH (2025) Large language models and the analyses of adherence to reporting guidelines in systematic reviews and overviews of reviews (PRISMA 2020 and PRIOR). J Med Syst 49(1):80. 10.1007/s10916-025-02212-0
8. Wu S, Ma X, Luo D, Li L, Shi X, Chang X et al (2025) Automated literature research and review-generation method based on large language models. Natl Sci Rev 12(6):nwaf169. 10.1093/nsr/nwaf169
9. Gallegos IO, Rossi RA, Barrow J, Tanjim MM, Kim S, Dernoncourt F et al (2024) Bias and fairness in large language models: a survey. Comput Linguistics 50(3):1097–1179. 10.1162/coli_a_00524
10. Schmidgall S, Harris C, Essien I, Olshvang D, Rahman T, Kim JW et al (2024) Addressing cognitive bias in medical language models. arXiv:2402.08113
11. Wei J, Wang X, Huang M, Xu Y, Yang W (2025) Evaluating the performance of ChatGPT on board-style examination questions in ophthalmology: a meta-analysis. J Med Syst 49(1):94. 10.1007/s10916-025-02227-7
12. Taloni A, Coco G, Pellegrini M, Wjst M, Salgari N, Carnovale-Scalzo G et al (2025) Exploring detection methods for synthetic medical datasets created with a large language model. JAMA Ophthalmol 143(6):517–522. 10.1001/jamaophthalmol.2025.0834
13. Kim TI, Alió del Barrio JL, Wilkins M, Cochener B, Ang M (2019) Refractive surgery. Lancet 393(10185):2085–2098. 10.1016/s0140-6736(18)33209-4
14. Ivarsen A, Asp S, Hjortdal J (2014) Safety and complications of more than 1500 small-incision lenticule extraction procedures. Ophthalmology 121(4):822–828. 10.1016/j.ophtha.2013.11.006
15. Hatch KM, Ling JJ, Wiley WF, Cason J, Ciralsky JB, Nehls SM et al (2022) Diagnosis and management of postrefractive surgery ectasia. J Cataract Refract Surg 48(4):487–499. 10.1097/j.jcrs.0000000000000808
16. Koester SW, Catapano JS, Hartke JN, Rudy RF, Cole TS, Naik A et al (2024) Evaluation of ChatGPT in knowledge of newly evolving neurosurgery: middle meningeal artery embolization for subdural hematoma management. J Neurointerv Surg 16(10):1033–1035. 10.1136/jnis-2024-021480
17. Saab K, Tu T, Weng W-H, Tanno R, Stutz D, Wulczyn E et al (2024) Capabilities of Gemini models in medicine. arXiv:2404.18416
18. Gemini Team, Anil R, Borgeaud S, Alayrac J-B, Yu J, Soricut R et al (2023) Gemini: a family of highly capable multimodal models. arXiv:2312.11805
19. Dong H, Xiong W, Pang B, Wang H, Zhao H, Zhou Y et al (2024) RLHF workflow: from reward modeling to online RLHF. arXiv:2405.07863
20. Dai D, Deng C, Zhao C, Xu R, Gao H, Chen D et al (2024) DeepSeekMoE: towards ultimate expert specialization in mixture-of-experts language models. arXiv:2401.06066
21. Sandmann S, Hegselmann S, Fujarski M, Bickmann L, Wild B, Eils R et al (2025) Benchmark evaluation of DeepSeek large language models in clinical decision-making. Nat Med 31(8):2546–2549. 10.1038/s41591-025-03727-2
22. Xu P, Wu Y, Jin K, Chen X, He M, Shi D (2025) DeepSeek-R1 outperforms Gemini 2.0 Pro, OpenAI o1, and o3-mini in bilingual complex ophthalmology reasoning. Adv Ophthalmol Pract Res 5(3):189–195. 10.1016/j.aopr.2025.05.001
23. Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y et al (2023) Retrieval-augmented generation for large language models: a survey. arXiv:2312.10997
24. Zhang Y, Khan SA, Mahmud A, Yang H, Lavin A, Levin M et al (2025) Exploring the role of large language models in the scientific method: from hypothesis to discovery. 1(1):14. 10.1038/s44387-025-00019-5
25. Naveed H, Khan AU, Qiu S, Saqib M, Anwar S, Usman M et al (2025) A comprehensive overview of large language models. 16(5):1–72. 10.1145/3744746
26. Susnjak T, Hwang P, Reyes N, Barczak AL, McIntosh T, Ranathunga S (2025) Automating research synthesis with domain-specific large language model fine-tuning. ACM Trans Knowl Discov Data 19(3):1–39. 10.1145/3715964
27. Mugaanyi J, Cai L, Cheng S, Lu C, Huang J (2024) Evaluation of large language model performance and reliability for citations and references in scholarly writing: cross-disciplinary study. J Med Internet Res 26:e52935. 10.2196/52935
28. Lissack M, Meagher B (2024) Responsible use of large language models: an analogy with the Oxford tutorial system. She Ji J Des Econ Innov 10(4):389–413. 10.1016/j.sheji.2024.11.001
29. Shean R, Shah T, Pandiarajan A, Tang A, Bolo K, Nguyen V et al (2025) A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, OpenAI o1 Pro, and Grok 3 performance on ophthalmology board-style questions. Sci Rep 15(1):23101. 10.1038/s41598-025-08601-2
30. Liu Z, Zhao C, Lin H (2025) DeepSeek-R1 vs open-weight AI in ophthalmology. JAMA Ophthalmol 143(10):842–843. 10.1001/jamaophthalmol.2025.3153
