Author's Reply: Critical Limitations in Systematic Reviews of Large Language Models in Health Care

Andre Python; HongYi Li; Jun-Fen Fu

doi:10.2196/82729

letter

2025 Sep 24;27:e82729. doi: 10.2196/82729


Editor: Tiffany Leung

PMCID: PMC12459737 PMID: 40991734

Introduction

We thank the correspondent for engaging with our original work [1] and raising constructive points in their Letter [2].

Citation Threshold Bias

We acknowledge that the citation criteria applied to select journals may exclude relevant studies from emerging or specialized venues. Given the rapidly expanding literature, however, these criteria were necessary to balance comprehensiveness with methodological quality. To mitigate the risk of omitting innovative research, we (1) screened and incorporated all relevant articles from the main database platforms as well as e-prints and (2) made available an interactive online guideline that offers clinicians up-to-date guidance.

Definition of “Best Performance”

We acknowledge the concerns associated with the performance comparison of models across heterogeneous contexts. To avoid ambiguity and misinterpretation, we stated and discussed in detail that, in our study, the term “best performance” is solely associated with the findings from the reviewed studies. Our analysis helps identify models successfully applied in clinical studies, without aiming at or implying comparison across domains. We direct readers to the excellent recent work by Liu et al [3] for a comparison of lightweight large language models (LLMs) for medical tasks.

Quality Assessment of the Included Studies

We carried out a thorough quality assessment following PRISMA guidelines [4]. This might have escaped the correspondent’s attention, as the details are provided in Multimedia Appendix 2 of our work [1].

Clinical Workflow

The suggested 5-stage workflow neither ignores nor intends to capture the full complexity of clinical practice. Rather, it serves as a framework to associate the reported use of LLMs with tasks and processes familiar to clinicians, in line with a previous study [5]. Our workflow offers a practical assessment of the role and extent of LLMs applied in clinically relevant activities and tasks.

Clinical Validation Gap

We acknowledge and discuss the challenges in assessing the practicality of deploying LLMs in clinical applications. Complementary to work benchmarking LLMs on research datasets, our review covers studies using LLMs in both research and clinical settings. While we identified key challenges of LLMs in real-world applications, a comprehensive assessment of discrepancies between research and clinical settings is beyond the scope of our review.

Safety and Risk Analyses

While our review discusses key concerns about the use of LLMs in clinical settings, including hallucination risks and ethical considerations, a comprehensive risk assessment is beyond its scope. Future research dedicated to tackling this key topic would require substantial effort.

Economic Evaluation

Our review assesses the costs associated with graphics processing unit memory and its cooling requirements by process and clinical task. Our interactive online guideline will regularly incorporate future changes in these requirements and costs, as exemplified by the recent rise of lightweight LLMs that may offer excellent performance on consumer-grade hardware. However, a comprehensive cost-effectiveness or return-on-investment analysis is beyond the scope of the study.

Conclusion

These observations are a timely reminder that our current understanding of the application of LLMs in clinical settings remains provisional and that we need continual reassessment of their current and future roles in health care practice.

Acknowledgments

We declare that no part of this submission has been generated by AI.

Abbreviations

LLM: large language model

Footnotes

Conflicts of Interest: None declared.

References

1. Li H, Fu JF, Python A. Implementing large language models in health care: clinician-focused review with interactive guideline. J Med Internet Res. 2025 Jul 11;27:e71916. doi: 10.2196/71916.
2. Weizman Z. Critical limitations in systematic reviews of large language models in health care. J Med Internet Res. 2025;27:e81769. doi: 10.2196/81769.
3. Liu F, Zhou H, Gu B, et al. Application of large language models in medicine. Nat Rev Bioeng. 2025;3(6):445–464. doi: 10.1038/s44222-025-00279-5.
4. Page MJ, McKenzie JE, Bossuyt PM, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021 Mar 29;372:n71. doi: 10.1136/bmj.n71.
5. Betzler BK, Chen H, Cheng CY, et al. Large language models and their impact in ophthalmology. Lancet Digit Health. 2023 Dec;5(12):e917–e924. doi: 10.1016/S2589-7500(23)00201-7.

