Digital Health. 2025 Sep 12;11:20552076251365059. doi: 10.1177/20552076251365059

The top 100 most-cited articles on large language models in medicine: A bibliometric analysis

Zhi-Qiang Li 1,*, Runbing Xu 2,*, Xin-Ran Gong 3, Cheng-Lu Wang 1, Jian-Ping Liu 1,4
PMCID: PMC12432328  PMID: 40949670

Abstract

Objectives

Large language models (LLMs) are revolutionizing medical research. However, there is a lack of bibliometric analysis that identifies citation trends shaping the history of this field. This study analyzes the top 100 (T100) most-cited articles on LLMs in medicine to assess their impact and characteristics.

Methods

We conducted a bibliometric analysis of top-cited articles in the Web of Science database, using search terms such as “LLMs, generative artificial intelligence, GPT” and covering 2022 to 2025. Two reviewers identified the T100 papers and extracted publication details, citations, and research themes, adhering to the BIBLIO reporting guidelines.

Results

The T100 articles were contributed by 655 authors, and 92 articles were published in 2023. Original research constituted the majority of publications (60 articles). Collectively, these works accumulated 14,847 citations, with individual citations ranging from 50 to 1057 (average 148.47). The U.S. led global contributions with 56 articles, with Stanford University emerging as the most prolific institution (8 articles). The T100 appeared in 70 peer-reviewed journals; the top seven journals accounted for 31% of the T100, and the Journal of Medical Internet Research published the largest share (8 articles). The most-cited article is “Evolutionary-scale prediction of atomic-level protein structure with a language model” (Lin et al., Science 2023; 1057 citations). The research themes centered on evaluating LLMs’ performance in exam-style evaluations, medical knowledge synthesis, and question-answering tasks in medicine.

Conclusion

This analysis provides a core overview of high-impact LLMs research in medicine, guiding future applications. The findings highlight the remarkable progress in clinical decision support, drug discovery, multimodal medical imaging analysis, and personalized medical question answering. They also stress the need to conduct prospective trials that assess real-world clinical impacts, improve the reliability of LLMs-generated medical information, develop consensus-driven solutions to ethical challenges, and launch global initiatives to democratize LLMs tools.

Keywords: Bibliometric analysis, citation analysis, large language models, medical research

Introduction

The advent of large language models (LLMs) has precipitated a transformative shift in artificial intelligence (AI) applications across scientific disciplines. These models are pretrained on extensive corpora of textual data, which enables them to both produce and comprehend natural language text. Since the release of ChatGPT in November 2022, LLMs have demonstrated unprecedented capabilities in medical knowledge synthesis, clinical decision support, and patient–physician communication optimization. 1 Their integration into healthcare systems has sparked extensive scholarly discourse, evidenced by an exponential growth in indexed publications. 2 Subsequently, in March 2023, GPT-4 was introduced, offering enhanced language comprehension and generation capabilities. In May 2024, GPT-4o was released, supporting text, audio, and image inputs and promoting more natural human–machine interaction. Moreover, domain-specific generative transformer language models have developed rapidly.3,4 However, the rapid proliferation of research outputs necessitates systematic evaluations to identify knowledge clusters, intellectual trajectories, and evidence gaps within this field.

Citation analysis serves as a cornerstone of bibliometric research, providing quantitative insights into academic impact and disciplinary evolution. 5 Prior studies have utilized this methodology to map landmark contributions in ChatGPT-related medical domains.6,7 However, there has been no thorough bibliometric analysis on impactful LLMs research in medicine, which is a significant gap considering these models’ ethical issues and potential for practical application. This gap hinders informed resource allocation in research and obscures our grasp of LLMs integration into medical knowledge. This study aims to conduct the first bibliometric analysis of the top 100 (T100) most-cited articles of LLMs applications in medicine.

Methods

Data collection

To ensure reproducibility, this study follows the Preliminary guideline for reporting bibliometric reviews of the biomedical literature (BIBLIO) 8 (Checklist in the Supplementary). The Science Citation Index Expanded of Web of Science (WOS) was queried using a targeted search term to capture the full range of LLMs-related medical research: TS = (“ChatGPT” OR “GPT” OR “Generative pre-trained transformer” OR “Generative artificial intelligence” OR “Generative AI” OR “Gemini” OR “Bard” OR “Claude” OR “Copilot” OR “LLAMA” OR “Deep-seek” OR “Chatbot*” OR “large language model*” OR “LLM”). The selection of these terms was based on a careful consideration of the current landscape of LLMs in medicine.9,10 GPT, Generative AI (GAI), and their variants represent the core architecture of most LLMs in medicine; Gemini, Bard, Claude, and others are major commercial systems with medical applications. Gemini demonstrates remarkable capabilities in multimodal reasoning. Bard and Deep-seek show potential in medical text generation tasks. Claude and Copilot, as sophisticated LLMs, excel at handling complex medical inquiries and providing detailed explanations. Chatbots are simpler, patient-facing tools designed to provide straightforward healthcare advice and support.
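As an illustration of how the topic query above can be assembled reproducibly from a term list, the following minimal Python sketch joins the reported terms with OR inside a TS field tag. The function name `build_topic_query` is hypothetical, straight quotes replace the typographic quotes of the text, and the date limits and document-type filters described in this section are not included.

```python
# Search terms copied from the query reported in the text above.
TERMS = [
    "ChatGPT", "GPT", "Generative pre-trained transformer",
    "Generative artificial intelligence", "Generative AI",
    "Gemini", "Bard", "Claude", "Copilot", "LLAMA", "Deep-seek",
    "Chatbot*", "large language model*", "LLM",
]


def build_topic_query(terms):
    """Quote each term and join the terms with OR inside a TS (topic) field tag."""
    quoted = [f'"{term}"' for term in terms]
    return "TS = (" + " OR ".join(quoted) + ")"


# Hypothetical helper output; paste into an advanced-search box manually.
QUERY = build_topic_query(TERMS)
print(QUERY)
```

Keeping the terms in a list makes it easy to report exactly which terms were searched and to rerun the search after adding or removing a term.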

The timeframe for literature selection was set from 1 November 2022 to 11 February 2025. This period was chosen based on two critical considerations. (a) The release of ChatGPT in November 2022 marked a significant turning point in the application of medical LLMs. This period captures the complete innovation cycle from initial prototype deployment to clinical validation phases. (b) Preliminary analysis revealed that 92% of LLMs-related medical research outputs were published after 2022. Restricting the search to recent publications ensures that the analysis captures this rapidly evolving literature.

Data preprocessing

Articles identified in this original search were then manually reviewed (ZQL and XRG) and filtered by the following criteria: (a) publications focused on LLMs in medicine; (b) document types included only original articles and reviews; (c) the T100 most-cited articles were selected purely by total citation count. The literature search and screening process is shown in the flowchart (Figure 1). The H5-index of journals was obtained primarily from Google Scholar (https://scholar.google.com), while the H-index and G-index of authors were sourced from the WOS. The H5-index reflects the citation impact of a journal's articles over the past 5 years, offering insight into the journal's recent influence, while the H-index and G-index of authors provide a measure of their productivity and the impact of their work over their entire career. Two reviewers evaluated each article, with a third reviewer resolving any discrepancies.
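For readers unfamiliar with these indices, the sketch below shows the standard definitions of the H-index and G-index computed from a list of per-paper citation counts. It is a minimal illustration, not the procedure used by WOS or Google Scholar, and the citation counts in `example` are made up. The H5-index applies the same H-index calculation to the articles a journal published in the last five complete years.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h


def g_index(citations):
    """Largest g such that the top g papers together have at least g**2 citations.

    This variant caps g at the number of papers in the list.
    """
    ranked = sorted(citations, reverse=True)
    total, g = 0, 0
    for rank, cites in enumerate(ranked, start=1):
        total += cites
        if total >= rank * rank:
            g = rank
    return g


# Hypothetical citation counts for one author's five papers.
example = [10, 8, 5, 4, 3]
print(h_index(example), g_index(example))  # prints "4 5"
```

The G-index gives more weight to a few very highly cited papers, because it compares the cumulative citation total with the squared rank rather than checking each paper separately.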

Figure 1.


The flowchart of the top 100 most-cited articles on large language models.

Statistical analysis

Statistical analysis was performed on the extracted data, which included article type, title, publication year, citation count, authors, journal, and research focus areas. The data were analyzed and visualized using CiteSpace (V6.3.R1, Drexel University, PA, USA), the bibliometrix and ggplot2 packages in R software version 4.4.2, and Microsoft Excel (Microsoft Corp, WA, USA).

Bivariate correlation analyses were conducted using coefficients appropriate to the types of variables. Specifically, Spearman's rank-order correlation coefficient was used to assess associations involving continuous variables that did not meet normality assumptions or ordinal variables. The eta-squared coefficient was applied to measure the effect size between nominal categorical variables and continuous variables. The Chi-square test of independence was used to evaluate associations between nominal categorical variables. A correlation heatmap was generated to illustrate correlations among the variables and citation counts. These visualizations facilitated the identification of patterns and connections between elements such as impact factor (IF), research type, and citations. All statistical analyses were carried out using IBM SPSS Statistics (version 26.0), and statistical significance was determined using a two-tailed α of 0.05.
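The two effect measures named above can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions, not the SPSS procedure used in the study: all function names are hypothetical, the data are invented, significance testing is omitted, and the Chi-square test is not shown.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


def eta_squared(groups):
    """Between-group sum of squares divided by the total sum of squares."""
    all_values = [v for g in groups.values() for v in g]
    grand = mean(all_values)
    ss_total = sum((v - grand) ** 2 for v in all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return ss_between / ss_total


# Invented data: journal impact factors and citation counts for five papers.
impact_factor = [2.0, 4.5, 6.1, 30.0, 1.2]
citations = [60, 120, 90, 700, 55]
rho = spearman_rho(impact_factor, citations)

# Invented data: citation counts grouped by a nominal variable (article type).
eta2 = eta_squared({"original": [50, 70], "review": [100, 120]})
print(round(rho, 3), round(eta2, 3))
```

Ranking before correlating is what makes Spearman's coefficient robust to the heavily skewed citation counts typical of top-cited lists, where a single paper can have many times the citations of the rest.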

Results

Publication outputs and basic characteristics

All T100 articles were English-language publications, averaging 148.47 citations (range: 50–1057). Most (92 articles) were published in 2023, reflecting the explosive growth of research following ChatGPT's release. Table 1 shows the top 10 most-cited papers; half were review articles, while the remaining five original studies focused on “protein structure prediction,” “protein sequence generation,” “Chatbot performance,” “Healthcare,” and “clinical knowledge encoding.” The most-cited T100 article, “Evolutionary-scale prediction of atomic-level protein structure with a language model” (Lin et al., Science 2023; 1057 citations), demonstrates the application of LLMs to directly infer atomic-level protein structures from primary sequences, significantly accelerating high-resolution structure prediction and enabling the construction of the ESM Metagenomic Atlas. 11 This provides a powerful tool for exploring the vast diversity and functionality of natural proteins. Notably, among the top 50 highly cited papers, the three published in 2024 had less time to accumulate citations than those from 2023. However, in 2025, these 2024 papers showed a rapid citation increase, significantly surpassing the citation counts of 2023 papers at the same ranking positions. The 2024 papers focused on ChatGPT performance on simplified radiology reports, scGPT for single-cell multi-omics, and a taxonomy and systematic review of ChatGPT in healthcare.4,12,13 Among the bottom 50 ranked papers, those published in 2024 also demonstrated stronger citation acceleration than their 2023 counterparts (eTable 2 in the Supplementary).

Table 1.

Characteristics of the top 10 most-cited articles on large language models in medicine.

Rank | No. of citations | Citations in 2023 | Citations in 2024 | Citations in 2025 | Average citations per year | Research | Journal | Title
1 | 1057 | 234 | 720 | 103 | 352.33 | Protein Structure Prediction | Science | Evolutionary-scale prediction of atomic-level protein structure with a language model
2 | 821 | 230 | 547 | 44 | 273.67 | Systematic Review | Healthcare | ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns
3 | 771 | 51 | 652 | 68 | 257 | Review article | Nature Medicine | Large language models in medicine
4 | 767 | 196 | 528 | 43 | 255.67 | Chatbot Performance | JAMA Internal Medicine | Comparing Physician and Artificial Intelligence Chatbot Responses to Patient Questions Posted to a Public Social Media Forum
5 | 716 | 86 | 573 | 57 | 238.67 | Knowledge Encoding | Nature | Large language models encode clinical knowledge
6 | 681 | 237 | 407 | 37 | 227 | Review article | NEJM | Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine
7 | 432 | 136 | 272 | 24 | 144 | Healthcare | Journal of Medical Systems | Evaluating the Feasibility of ChatGPT in Healthcare: An Analysis of Multiple Clinical and Research Scenarios
8 | 388 | 99 | 272 | 17 | 129.33 | Review article | NEJM | Artificial Intelligence and Machine Learning in Clinical Medicine, 2023
9 | 321 | 99 | 194 | 28 | 107 | Protein Sequence | Nature Biotechnology | Large language models generate functional protein sequences across diverse families
10 | 277 | 97 | 175 | 5 | 92.33 | Review article | Pakistan Journal of Medical Sciences | ChatGPT - Reshaping medical education and clinical management

Note: NEJM, New England Journal of Medicine. All articles in this table were published in 2023. The data extraction of citations was completed on 11 Feb 2025.

Full table is shown in eTable 2 in Supplementary.

AI: artificial intelligence.
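The "Average citations per year" column in Table 1 appears to be the total citation count divided by the three calendar years shown (2023 to 2025); this reading is inferred from the table values rather than stated in the text. A minimal Python sketch that checks this for the first three rows:

```python
# (rank, total citations, citations in 2023, 2024, 2025), copied from Table 1.
ROWS = [
    (1, 1057, 234, 720, 103),
    (2, 821, 230, 547, 44),
    (3, 771, 51, 652, 68),
]


def average_per_year(yearly_counts):
    """Mean of the yearly citation counts, rounded to two decimals."""
    return round(sum(yearly_counts) / len(yearly_counts), 2)


for rank, total, *yearly in ROWS:
    # The yearly counts should add up to the reported total.
    assert sum(yearly) == total
    print(rank, average_per_year(yearly))
```

Because 2025 covers only the first weeks of the year, this simple average understates the annual citation rate of every paper by the same proportion, so it is useful for ranking within the table but not as an absolute yearly rate.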

Across the entire dataset, original research predominated (60%), followed by reviews (39%) and one editorial (1%) (eTable 2 in the Supplementary). Within the original research category, all studies employed an observational cross-sectional design to assess ChatGPT performance, compare LLMs, or develop specific LLMs. However, these can be more accurately categorized as diagnostic test accuracy studies (n = 51), case studies (n = 5), qualitative research (n = 3), and a cross-sectional questionnaire survey (n = 1). This distribution highlights the field's emphasis on empirical validation, reflecting a dynamic and proactive approach towards exploring and confirming the practical utility of AI in healthcare environments. Such validation is crucial for enhancing the reliability and efficacy of AI applications in medical practice.

Country, institution, and author analysis

The United States (U.S.) led global contributions with 56 articles (56%), followed by the UK (8%), Canada (7%), China (7%), Australia (6%), and Germany (6%) (Figure 2(A)). Notably, the U.S., UK, Singapore, Italy, and China exhibited extensive international collaborative networks (Figure 3(A)). The disproportionate geographic distribution of research highlights systemic inequities in LLMs development, possibly because advanced economies with robust computing infrastructure and access to high-performance LLMs are better positioned to pioneer medical AI research. In addition, countries like the U.S. have established agile regulatory frameworks for AI validation in healthcare, enabling faster clinical adoption. A total of 309 institutions contributed to these studies, with Stanford University emerging as the most prolific institution (n = 8), followed by Vanderbilt University, Harvard University, University of California System, and University of Toronto (each with 6 articles; Figure 2(B)). The highest number of citations, on the other hand, goes to New York University, largely because it published four highly cited papers, one of which ranked highest in terms of citations. 11 Nine institutions produced five or more articles, and collaboration networks centered on U.S. elite academic hubs (e.g. Harvard, Stanford) further emphasize the role of resource-rich ecosystems in driving high-impact research (Figure 3(B)). Regional institutions such as Sichuan University (China), University of Toronto (Canada), and National University of Singapore also demonstrated robust collaborative engagement.

Figure 2.


(A) Number of studies per country/region on large language models in medicine. (B) Number of studies per institution on large language models in medicine.

Figure 3.


Scientific publications on large language models in medicine. (A) The co-occurrence network map of different countries. (B) The co-occurrence network map of institutions. (C) Number of publications in top 10 areas of research. (D) The clustering map of keywords.

No dominant authorship network emerged, which suggests decentralized innovation patterns in medical LLMs research. The top three authors (Mesko, Bertalan; Klang, Eyal; and Liu, Siru) contributed three articles each. Notable entries include Nigam H. Shah (H-index 60, G-index 14; Stanford Univ, Center for Biomedical Informatics Research), Stacy Joeb (H-index 61, G-index 12; New York Univ, Med Sch, Department of Psychiatry), and Yeo, Yee Hu (H-index 69, G-index 14; Cedars Sinai Medical Center, Karsh Div Gastroenterol & Hepatol).

Journal analysis

Eleven journals published two or more of the T100 articles (Figure 4(A)), with 31% of the T100 articles concentrated in the top seven peer-reviewed journals. The Journal of Medical Internet Research published the largest share (8 articles; IF: 5.8, H5 = 160), followed by Radiology (6 articles; IF: 12.1, H5 = 140) and npj Digital Medicine (5 articles; IF: 12.4, H5 = 109), while high-impact journals like NEJM (IF: 96.2) and Lancet (IF: 168.9) contributed fewer but influential studies.14,15 Overall, 75% of the T100 papers appeared in Q1 journals.

Figure 4.


(A) Journal contribution rankings by publication volume and impact metrics. (B) Word cloud map of keyword co-occurrence analysis.

Keywords and keyword clustering

The most frequent keywords were “AI” (n = 46), “LLMs” (n = 25), and “natural language processing” (n = 7) (Figure 4(B)). The keyword cloud highlights the dominant research priorities: clinical decision support, patient communication, and multimodal data integration. Technical terms (transformer architectures, natural language processing) intertwine with ethical concerns (medical bias, ethical challenges).

Keyword clustering, which identifies shared themes or concepts, organizes keywords and outlines the main research directions and key points of interest in the literature. This study identified 10 major research clusters, with different colors representing different thematic clusters (Figure 3(D)); the modularity (Q > 0.8) and silhouette score (>0.9) confirm robust thematic cohesion and reliability of the clustering results. The largest clusters, labeled #0 LLMs, #2 Conversational Agent, #4 GAI, and #5 AI, pertain to LLMs and are predominantly connected to subjects such as AI development, algorithms, machine learning, learning models, and ChatGPT. Other notable clusters included #1 Patient Privacy, which centers on concerns related to patient information; #3 Clinical Decision Support, with ties to clinical decision-making and best practices; #6 Equity, related to terms associated with low- and middle-income countries (LMICs), global health, and public health; #7 Medical Education, mainly related to education and examination, continuing medical education, and reshaping medical education; #8 Oncology, with a focus on topics like lung cancer and inquiries regarding cancer; and #9 Extraintestinal Manifestations, prominently linked with terms such as inflammatory bowel disease, period, biomarkers, and diagnosis.

Research disciplines and topics

These publications span 38 disciplines, with Health Care Sciences Services being the leading research direction (23 articles). Medical Informatics and Medicine, General Internal also emerge as key research directions (Figure 3(C)). Analysis of the titles and content of the original studies revealed that 58 studies evaluated LLMs’ performance in generating or answering medical information questions. Common clinical focuses included radiology (n = 10), oncology (n = 7), and medical licensing examinations (n = 7) (Table 2). For instance, numerous studies have assessed the quality of LLMs in answering clinical knowledge questions.16,17 Such answers offer patients valuable information, especially when they are reluctant to consult healthcare professionals or when access to medical advice is restricted. In controlled settings, the quality and even empathy of Chatbot responses were rated significantly higher than those of physician responses. However, the health advice provided by LLMs differed significantly from clinically reported treatment. 18 Furthermore, we also analyzed papers from journals with an IF above 10 and journals with three or more publications (eTable 1 in the Supplementary). We found that highly cited review articles mainly focus on clinical medicine, healthcare, and medical images/radiology. Original research papers, on the other hand, often assess the performance of LLMs in knowledge acquisition or question-answering tasks across specific medical fields.

Table 2.

Research topics of 100 top-cited papers.

Domains | Description | Times
LLMs Performance (ChatGPT/Other tools) (N = 58) | Radiological Information/Reports/Examination | 10
 | Queries About Cancer | 7
 | Orthopaedic Assessment/Plastic Surgery In-service/Pharmacist Licensing/Family Medicine Board/Laboratory Medicine/Neurosurgery Written Board/USMLE Soft Skill Examination | 7
 | Ophthalmic Knowledge Assessment | 5
 | Generated articles/abstracts & literature search | 3
 | Questions in Cosmetic Plastic/Aesthetic Surgery | 3
 | Operating room/surgical education/oral and maxillofacial surgery | 3
 | Digital Health/electronic health records | 3
 | Information in Urology | 3
 | Clinical decision | 2
 | Total Joint/Hip Arthroplasty | 2
 | Obstetrics and gynecology | 2
 | Gastroenterology Health-Related Questions | 2
 | Foundation model for single-cell multi-omics | 1
 | Questions Regarding Bariatric Surgery | 1
 | Biomedical text generation and mining | 1
 | Entire Clinical Workflow | 1
 | Otolaryngology subspecialties | 1
 | Medical Evidence Summarization | 1
Advances and Applications of LLMs (N = 42) | Reflections on LLMs | 9
 | Clinical Practice Applications | 4
 | Scientific writing | 4
 | Medical Education | 4
 | Radiology/Medical Images | 3
 | Ethics and Regulation | 3
 | Dental Medicine | 2
 | Biomedicine | 2
 | Healthcare Education, Research, and Practice | 2
 | Protein Structure Prediction/Protein Sequence Generation | 2
 | Guidelines of LLMs | 2
 | Prompt Engineering | 1
 | Health Behavioral Changes | 1
 | Clinical Knowledge Encoding | 1
 | Ethical, technical, and cultural framework of LLMs | 1
 | Drug Discovery | 1

Abbreviations: LLMs: large language models.

Citation distribution

Figure 5(A) presents the citation distribution of the T100 medical LLMs papers published from 2022 to 2024, grouped by year. Citation counts are highest and most variable for 2023. Monthly citation trends show a significant surge in early 2023, particularly in January and February (Figure 5(B)). Citations then gradually decline from mid-2023 to 2024, indicating fluctuating research interest and impact over time. Figure 5(C) illustrates the citation distribution across JCR quartiles (Q1 to Q4), showing higher and more variable citations in Q1 and Q2. Figure 5(D) presents the annual citation trends, revealing a significant increase in citations for Q1 papers in 2023, along with notable citations for Q2 papers. Most papers in Q3–Q4 have relatively low and stable citation counts across the 3 years.

Figure 5.


Boxplot of total citations for large language models in medicine (published in 2022–2024). (A) Total citations of top 100 cited papers by year; (B) monthly total citations of top 100 cited papers; (C) total citations of top 100 cited papers by JCR quartile; (D) total citations of top 100 cited papers by year and JCR quartile.

Analysis of correlation

The correlation heatmap shows significant correlations among bibliometric indicators in medical LLMs research. Significant associations appear between annual citations (2023–2025) and total citations, the strongest being between 2024 citations and total citations (r = 0.949). The IF shows a moderate correlation with total citations (r = 0.364), and the association between the IF and annual citation counts strengthens over time. However, there is no statistically significant correlation between JCR quartile or research type and citation counts (Figure 6).

Figure 6.


Correlation heatmap in large language models in medicine.

Notes. IF: impact factor; JCR: Journal Citation Reports (Q1–Q4).

The correlation coefficient is a statistical measure of the strength and direction of the relationship between two variables. It typically ranges from −1 to 1. The greater the absolute value of the correlation coefficient, the stronger the correlation between the two variables.

Discussion

This study characterizes and analyzes the T100 most-cited publications, which may shape the history of LLMs in medicine, showing the field's rapid evolution. These findings hold significant implications for researchers and policymakers engaged in integrating LLMs into healthcare. The 14,847 citations accumulated within 28 months post-publication demonstrate unprecedented citation velocity compared to other AI/medical subfields. 7 The concentration of 92% of the top-cited articles in 2023 reflects the pivotal role of ChatGPT's public release (November 2022) in catalyzing a surge of LLMs research in medicine, which in turn drove rapid innovation in clinical decision support, drug discovery, and multimodal medical data analysis. This pattern aligns with the typical 6–12 month latency between technological breakthroughs and academic publication cycles. This temporal distribution also aligns with the Gartner Hype Cycle for AI, where 2023 represents the “Peak of Inflated Expectations” phase for medical LLMs. 19 Key drivers of this growth were GAI milestones. The launch of GPT-4 (March 2023) demonstrated LLMs’ capacity to achieve near-human performance on medical licensing exams, triggering widespread validation studies across clinical tasks. 20

Earlier studies typically accumulate more citations simply because they have had more time to be noticed and cited by researchers, so newer articles naturally have lower citation counts. To compare papers from different publication years fairly, it is useful to rank papers by their annual citation counts and to track citation changes on a yearly basis. This approach helps researchers better understand citation patterns over time. When comparing papers published in 2023 and 2024, we found that many 2024 papers were published in high-quality journals and focused on pioneering topics, which may have accelerated their citation rates12,13 and ultimately shaped the current T100 rankings. Although the correlation between 2025 citations and total citations is lower than that for 2024, this is mainly because our analysis only included citation data up to 11 February 2025. Hence, the half-lives of papers on LLMs in medicine still need to be tracked and analyzed over time.

This study reveals two transformative insights. First, knowledge production has accelerated, but gaps persist. While LLMs research has achieved unprecedented citation velocity (14,847 citations in 28 months), its focus remains narrowly technical. LLMs have demonstrated high accuracy on medical licensing exams and clinical knowledge questions, but less research addresses real-world challenges such as clinician workflow integration or patient consent protocols, which highlights a critical disconnect between technical innovation and clinical needs. Second, there is a geographic and institutional imbalance. The concentration of high-impact research in U.S. institutions (56%) and elite universities (Stanford, Harvard) exposes systemic inequities in AI development. While elite institutions drive innovation through resource-rich ecosystems, LMICs remain underrepresented, risking the perpetuation of healthcare disparities. 14 To address these issues, several concrete proposals could be considered to encourage equitable contributions from LMICs. Firstly, international research funding bodies could establish dedicated grants and collaboration programs that specifically target researchers in LMICs, providing them with the financial resources and infrastructure needed to conduct AI research. This would help to level the playing field and enable researchers from diverse geographical backgrounds to contribute to the field. Secondly, pilot partnerships, including joint research projects, academic exchanges, and mentorship programs, could facilitate knowledge exchange and capacity building, allowing researchers from LMICs to access expertise and resources that might otherwise be unavailable to them. Thirdly, computing resource-sharing and federated data networks could be established to empower LMICs’ participation in model training. Lastly, investing in educational and training programs in LMICs focused on AI literacy and research methodologies would be crucial. By strengthening local capabilities in AI development, these countries can become more active participants in the global research landscape, rather than merely consumers of AI technologies developed elsewhere. This investment in human capital would have long-term benefits for the entire field of AI research.

We advocate for a judicious and ethical use of LLMs, ensuring that their integration into medical practice is both scientifically sound and beneficial to patient well-being. This analysis provides three actionable priorities for healthcare stakeholders. First, while research has evolved from foundational LLMs architectures (e.g. GPT series, BERT) to domain-specific medical adaptations, there is a striking contrast between the focus on medical licensing exams or performance on medical questions and the limited exploration of clinical workflow integration and patient-centered outcomes. 21 Despite the confirmed potential of LLMs in reducing diagnostic errors and enhancing patient communication, more rigorous trials are urgently needed to evaluate LLMs’ real-world impact on diagnostic accuracy, clinical workflow efficiency, and clinician burnout mitigation.22,23 Second, keyword clustering and analysis of highly cited works reveal a significant emphasis on technical benchmarks, such as diagnostic accuracy and radiology report generation. However, there is limited attention to ethical and equity concerns, highlighting a concerning gap between technical advancements and real-world clinical integration. Although clusters such as #1 Patient Privacy and #6 Equity were identified, few studies proposed actionable frameworks to mitigate biases or ensure algorithmic transparency. Given the increasing deployment of LLMs in sensitive applications, emerging challenges, including hallucination, patient data privacy and security, academic integrity, and liability determination, demand consensus-driven solutions and robust ethical and regulatory frameworks. 24 As LLMs rapidly evolve, standardized reporting guidelines are also emerging. For instance, the GAMER Statement provides a standardized guideline for LLMs use in medical research, covering tool specifications, roles, and impacts on findings, and ensuring transparency, integrity, and quality of research. 
Similarly, the TRIPOD-LLM Statement enhances the quality, reproducibility, and clinical applicability of LLMs research in healthcare; the checklist integrates these concepts throughout, ensuring that bias and fairness are considered at every stage of the model's life cycle.25,26 Furthermore, a three-stage framework called HELP-ME has been designed to evaluate and protect privacy in healthcare-oriented LLMs. It includes ethical privacy threat assessment, prompt-focused evaluation, and ethical obfuscation to protect patient data while preserving model utility. The framework's effectiveness has been validated, highlighting its role in upholding ethical standards in clinical practice. 27 However, LLMs used in clinical decision-making still lack consensus on liability for incorrect recommendations. With no clear legal framework, it's crucial to clarify responsibility as these models are integrated into medical decision-making. Third, disparities in early access to foundational LLMs technologies risk exacerbating global healthcare inequities. Efforts to democratize LLMs tools, such as open-source initiatives exemplified by models like Deep-seek, represent critical steps toward ensuring these innovations address region-specific healthcare challenges. However, the success of such initiatives hinges on sustained funding and support for research led by LMICs. 14 This support is essential to ensure that LLMs tools not only advance technological capabilities but also reduce global disparities in healthcare access and quality. By fostering equitable access to these technologies, we can promote the development of contextually appropriate solutions that meet the unique needs of diverse populations, thereby advancing global health equity.

Limitations

The interpretation of our findings is subject to four limitations. First, despite employing a search strategy that combined broad and specific terms, some significant studies may have been inadvertently excluded due to inconsistent terminology. However, the supplementary materials help ensure transparency and reproducibility. Although limiting the search period to November 2022–February 2025 may overlook earlier research, a supplementary search revealed only one predictive model study in the biomedical field that aligns with LLMs,28 and it has no impact on the trend analysis of this study. Second, the inherent recency bias in citation-based metrics may underestimate pioneering contributions from emerging research ecosystems, particularly those from low-resource settings where dissemination delays and limited international visibility disproportionately affect citation accrual. Longitudinal tracking beyond the 2022–2025 window is essential to validate the sustained impact of such studies. Third, the minimum temporal slicing unit of CiteSpace is 1 year, which impedes granular analysis of knowledge trajectory inflection points in a rapidly evolving domain such as LLMs. For example, quarterly fluctuations, such as the Q2 2023 surge in LLM-driven radiology report optimization studies following GPT-4's release,29–31 remain undetectable, potentially obscuring critical shifts in research priorities. Citation acceleration metrics could offer dynamic insights, and citations per month would allow a fairer comparison between articles of different ages; however, citation data are often recorded only annually, and obtaining monthly data would require more granular tracking that is not currently feasible given limitations in data collection and reporting practices. Future studies should continue monitoring this field, with the expectation that, as time progresses, trends in LLMs within this domain will be more effectively captured.
Fourth, AI-related breakthrough studies may first appear on arXiv before formal WOS indexing. Future investigations should adopt a hybrid framework, first identifying core literature via WOS and then supplementing it with arXiv preprints (time-lag adjusted) and ResearchGate attention scores. This dual-layer approach balances rigor with trend sensitivity, particularly for fast-evolving domains such as LLMs.
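As an illustration of the age-normalized comparison discussed above, the following minimal Python sketch computes citations per month and assigns each article to a publication quarter. It is not part of the analysis performed in this study: the records, the snapshot date, and the function names (`months_since`, `citations_per_month`, `quarter_label`) are hypothetical, and it assumes that exact publication dates and a single citation snapshot are available, which, as noted, is often not the case.

```python
from datetime import date


def months_since(published: date, as_of: date) -> int:
    """Whole calendar months between publication and the citation snapshot (minimum 1)."""
    months = (as_of.year - published.year) * 12 + (as_of.month - published.month)
    return max(months, 1)


def citations_per_month(citations: int, published: date, as_of: date) -> float:
    """Age-normalized citation rate, so older articles are not favored by exposure time."""
    return citations / months_since(published, as_of)


def quarter_label(published: date) -> str:
    """Quarterly bin (e.g. '2023-Q2'), finer than a 1-year time slice."""
    return f"{published.year}-Q{(published.month - 1) // 3 + 1}"


# Hypothetical records for illustration only; not data from this study.
records = [
    {"title": "Paper A", "citations": 240, "published": date(2023, 2, 1)},
    {"title": "Paper B", "citations": 150, "published": date(2024, 2, 1)},
]
snapshot = date(2025, 2, 1)  # assumed date on which citation counts were retrieved

# Rank by citations per month rather than by raw citation count.
ranked = sorted(
    records,
    key=lambda r: citations_per_month(r["citations"], r["published"], snapshot),
    reverse=True,
)

for r in ranked:
    rate = citations_per_month(r["citations"], r["published"], snapshot)
    print(r["title"], quarter_label(r["published"]), round(rate, 2))
```

In this hypothetical example, the more recent article ranks first on citations per month despite having fewer total citations, which is the kind of reordering that annual citation counts alone cannot reveal.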

Conclusion

This study reveals the trends, hotspots, and critical gaps in the T100 most-cited LLM studies in medicine. Current research focuses on technical validation, such as diagnostic accuracy and protein structure prediction, whereas underexplored areas such as real-world surgical assistance systems and rare disease diagnostics need urgent attention. To bridge the current “bench-to-bedside” translation gap, three priorities emerge (Figure 7). First, evaluation protocols need to be standardized across clinical specialties, focusing on real-world impact metrics beyond accuracy alone. Second, the dominance of US institutions reflects systemic inequities; global resource-sharing platforms are needed to democratize LLM development, especially for low-resource regions. Third, interdisciplinary collaborations must be fostered to turn technical advances into clinically useful tools. Future research must balance innovation with ethical considerations, ensuring that LLMs enhance both medical knowledge and healthcare equity. By aligning citation trends with unmet clinical needs, the field can shift from exam-centric benchmarks to ethically sound, patient-centered AI solutions.

Figure 7.

Urgent gaps in large language models in medicine that require action.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251365059 - Supplemental material for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis

Supplemental material, sj-docx-1-dhj-10.1177_20552076251365059 for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis by Zhi-Qiang Li, Runbing Xu, Xin-Ran Gong, Cheng-Lu Wang and Jian-Ping Liu in DIGITAL HEALTH

sj-docx-2-dhj-10.1177_20552076251365059 - Supplemental material for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis

Supplemental material, sj-docx-2-dhj-10.1177_20552076251365059 for The top 100 most-cited articles on large language models in medicine: A bibliometric analysis by Zhi-Qiang Li, Runbing Xu, Xin-Ran Gong, Cheng-Lu Wang and Jian-Ping Liu in DIGITAL HEALTH

Acknowledgements

We thank all of the researchers worldwide who have contributed to research on LLMs in healthcare. During revision of the manuscript, the authors used DeepSeek and Kimi to correct typographical and grammatical errors. No generative language models were employed in the ideation or writing process. After using these tools, the authors reviewed and edited the content as necessary and take full responsibility for the content presented.

Footnotes

Ethical approval: There is no need for ethics committee approval or consent to participate, as all data used in this bibliometric analysis were sourced from the WOS and did not involve data from human or animal subjects.

Author contributions: ZQL, RBX, XRG, CLW, and JPL wrote and revised the text. All authors gave final approval of the manuscript.

Funding: The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the High-level traditional Chinese medicine key subject construction project of National Administration of Traditional Chinese Medicine--Evidence-based Traditional Chinese Medicine (grant number 90010951310169).

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability: All data generated or analyzed in this study are derived from the following resources available in the WOS Core Collection (https://clarivate.com/). The original contributions presented in the study can be obtained from the corresponding author and first author.

Supplemental material: Supplemental material for this article is available online.

References

  • 1. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. ChatGPT in medicine: a systematic review of clinical applications, risks, and ethical implications. Lancet Digit Health 2023; 5: e403–e413.
  • 2. Li ZQ, Wang XF, Liu JP. Publication trends and hot spots of ChatGPT's application in the medicine. J Med Syst 2024; 48: 52.
  • 3. Luo R, Sun L, Xia Y, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 2022; 23: bbac409.
  • 4. Cui H, Wang C, Maan H, et al. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 2024; 21: 1470–1480.
  • 5. Hood WW, Wilson CS. The literature of bibliometrics, scientometrics, and informetrics. Scientometrics 2001; 52: 291–314.
  • 6. Chen X, Xie H, Wang FL, et al. A bibliometric analysis of artificial intelligence in biomedical research. Soc Sci Med 2020; 245: 112758.
  • 7. Yalcinkaya T, Cinar Yucel S. Bibliometric and content analysis of ChatGPT research in nursing education: the rabbit hole in nursing education. Nurse Educ Pract 2024; 77: 103956.
  • 8. Montazeri A, Mohammadi SM, Hesari P, et al. Preliminary guideline for reporting bibliometric reviews of the biomedical literature (BIBLIO): a minimum requirements. Syst Rev 2023; 12: 239.
  • 9. Rokhshad R, Bagherianlemraski M, Ehsani SS, et al. Large language models for the screening step in systematic reviews in dentistry. J Dent 2025; 160: 105877.
  • 10. Li Y, Li Z, Li J, et al. The actual performance of large language models in providing liver cirrhosis-related information: a comparative study. Int J Med Inform 2025; 201: 105961.
  • 11. Lin Z, Akin H, Rao R, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023; 379: 1123–1130.
  • 12. Jeblick K, Schachtner B, Dexl J, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2024; 34: 2817–2825.
  • 13. Li J, Dada A, Puladi B, et al. ChatGPT in healthcare: a taxonomy and systematic review. Comput Methods Programs Biomed 2024; 245: 108013.
  • 14. Wang X, Sanders HM, Liu Y, et al. ChatGPT: promise and challenges for deployment in low- and middle-income countries. Lancet Reg Health West Pac 2023; 41: 100905.
  • 15. Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med 2023; 388: 1233–1239.
  • 16. Pan A, Musheyev D, Bockelman D, et al. Assessment of artificial intelligence chatbot responses to top searched queries about cancer. JAMA Oncol 2023; 9: 1437–1440.
  • 17. Ayers JW, Poliak A, Dredze M, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 2023; 183: 589–596.
  • 18. Huo B, Boyle A, Marfo N, et al. Large language models for chatbot health advice studies: a systematic review. JAMA Netw Open 2025; 8: e2457879.
  • 19. Gartner. Hype Cycle for Artificial Intelligence 2024. Available from: https://gartner.com/en/articles/hype-cycle-for-artificial-intelligence (accessed May 05, 2025).
  • 20. Liu M, Okuhara T, Chang X, et al. Performance of ChatGPT across different versions in medical licensing examinations worldwide: systematic review and meta-analysis. J Med Internet Res 2024; 26: e60807.
  • 21. Hartman V, Zhang X, Poddar R, et al. Developing and evaluating large language model-generated emergency medicine handoff notes. JAMA Netw Open 2024; 7: e2448723.
  • 22. Goh E, Gallo R, Hom J, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open 2024; 7: e2440969.
  • 23. Lai H, Ge L, Sun M, et al. Assessing the risk of bias in randomized clinical trials with large language models. JAMA Netw Open 2024; 7: e2412687.
  • 24. Ong JCL, Chang SY, William W, et al. Ethical and regulatory challenges of large language models in medicine. Lancet Digit Health 2024; 6: e428–e432.
  • 25. Luo X, Tham YC, Giuffrè M, et al. Reporting guideline for the use of Generative Artificial intelligence tools in MEdical Research: the GAMER Statement. BMJ Evid Based Med 2025; 13: bmjebm-2025-113825.
  • 26. Gallifant J, Afshar M, Ameen S, et al. The TRIPOD-LLM reporting guideline for studies using large language models. Nat Med 2025; 31: 60–69.
  • 27. Li C, Meng Y, Dong L, et al. Ethical privacy framework for large language models in smart healthcare: a comprehensive evaluation and protection approach. IEEE J Biomed Health Inform. Published online June 4 2025. DOI: 10.1109/JBHI.2025.3576579.
  • 28. Rives A, Meier J, Sercu T, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A 2021; 118: e2016239118.
  • 29. Bhayana R. Chatbots and large language models in radiology: a practical primer for clinical and research applications. Radiology 2024; 310: e232756.
  • 30. Sun Z, Ong H, Kennedy P, et al. Evaluating GPT4 on impressions generation in radiology reports. Radiology 2023; 307: e231259.
  • 31. Fink MA, Bischoff A, Fink CA, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 2023; 308: e231362.


Articles from Digital Health are provided here courtesy of SAGE Publications
