Abstract
Large Language Models (LLMs), exemplified by the Generative Pretrained Transformer (GPT), are profoundly transforming the healthcare sector. Spine medicine, a discipline heavily reliant on complex imaging data, detailed clinical records, and evidence-based practice, serves as an ideal testing ground for exploring and applying these advanced artificial intelligence technologies, which promise to optimize clinical workflows, enhance the quality of diagnostic and treatment decisions, and improve patient communication.
We systematically searched PubMed and Embase from January 2023 to September 2025 for studies investigating LLMs in spinal diseases. Original research articles published in English with a Journal Impact Factor (JIF) ≥ 3.0 were included. Reviews, case reports, animal studies, and non-orthopedic topics were excluded. Data from eligible studies were extracted and narratively synthesized.
This review aims to systematically and comprehensively examine the current state of clinical applications of medical large models and related intelligent systems in the field of spinal diseases. The focus is on analyzing their core technical pathways, specific clinical application scenarios, medical value, and performance evaluation results, thereby identifying current opportunities and key challenges. Furthermore, it anticipates future developments, from leveraging general-purpose models to constructing specialized models based on high-quality, large-scale, multimodal spine-specific datasets.
The translational potential of this article: This review provides a comprehensive roadmap and practical framework for implementing artificial intelligence in spinal surgery. It systematically synthesizes core application scenarios for large language models, including clinical documentation assistance and preoperative planning, and explicitly addresses four critical challenges that must be resolved for successful clinical integration: regulatory compliance, data privacy protection, algorithmic bias mitigation, and workflow integration. It thereby establishes an actionable foundation for collaborative efforts among clinicians, developers, and policymakers to deploy safe, effective, and compliant AI tools in spinal care.
Keywords: Artificial intelligence, Interdisciplinary application, Large language model, Multimodal, Retrieval-augmented generation, Spinal diseases
Graphical abstract

1. Introduction
In recent years, the development of Generative Artificial Intelligence (Generative AI), particularly exemplified by Large Language Models (LLMs), has been penetrating scientific and industrial domains with unprecedented depth and breadth [[1], [2], [3], [4]]. Since the release of OpenAI's ChatGPT [5], its powerful capabilities in natural language understanding, generation, and reasoning have signaled a profound paradigmatic shift, with the healthcare sector emerging as a central stage for this transformation [[6], [7], [8], [9]]. These technologies represent not merely incremental improvements, but a disruptive force with the potential to fundamentally reshape the prevailing models of medical practice, scientific research, and medical education [[10], [11], [12], [13]].
Among various medical specialties, spine medicine, with its inherent complexity and data-intensive nature, has become an excellent field for exploring and applying LLMs [14,15], as shown in Fig. 1. Its distinctiveness is reflected in several aspects.
Fig. 1.
Illustration of medical large language models (MLLMs) applied to spine chordoma. MLLMs integrate multi-modal data (e.g., MRI, CT, WSI, genomic data, EHR) to address multi-task and multi-challenge clinical scenarios. Tasks include disease diagnosis, classification, prognosis analysis, treatment decision-making, surgical plan recommendation, and molecular pathology and genetics analysis. The approach aims to tackle data diversity, decision-making complexity, and high-stakes clinical outcomes, ultimately supporting accurate and evidence-based medical decision-making. (Schematic). CT, Computed Tomography; MRI, Magnetic Resonance Imaging; EHR, Electronic Health Record; WSI, Whole Slide Imaging; WHO, World Health Organization; ESCC, Epidural Spinal Cord Compression; PD-L1, Programmed Death-Ligand 1; VEGF, Vascular Endothelial Growth Factor; PD-1, Programmed Death-1; VEGFR-TKI, Vascular Endothelial Growth Factor Receptor-Tyrosine Kinase Inhibitor.
Data Intensity and Diversity: The diagnosis and treatment of spinal diseases generate a vast amount of heterogeneous data [[16], [17], [18]], including complex imaging materials (e.g., MRI, CT), detailed electronic health records (EHRs), structured surgical reports, biomechanical parameters, and constantly updated clinical practice guidelines. This provides rich “fuel” for data-driven AI models.
Complexity of Decision-Making: From degenerative diseases and spinal deformities to spinal tumors and trauma, the clinical decision-making process is highly complex. It often requires physicians to integrate multimodal information, conduct fine-grained risk assessments, and design individualized treatment plans.
High-Stakes Clinical Outcomes: The precision of spinal surgery directly influences a patient's neurological function and quality of life, and any decision-making error can lead to serious consequences. The demand for evidence-based and precision medicine is therefore extremely urgent, and AI is poised to provide powerful support in this regard.
Despite the promising outlook [[19], [20], [21], [22]], research on the application of LLMs in spine medicine remains in its infancy, with scattered results and no systematic synthesis or critical evaluation. This review aims to fill that gap through a comprehensive analysis of the latest literature as of 2025. It provides a detailed, tutorial-style overview that analyzes the core technologies involved, examines their application status and medical value in specific clinical scenarios, systematically identifies the key challenges they face, and looks forward to future AI-enhanced models of spinal healthcare. The analysis is based on the framework of a systematic review of large models in orthopedics and integrates the latest research findings in the spine specialty.
2. Method
2.1. Literature search strategy
We conducted a systematic literature search to identify studies investigating the application of LLMs and related artificial intelligence systems in spinal diseases. Two major biomedical databases, PubMed and Embase, were queried. The search timeframe was restricted to January 2023 to September 2025, covering the period following the widespread adoption of generative AI in medicine.
2.2. Search queries
The search strategies combined controlled vocabulary (MeSH in PubMed, Emtree in Embase) and free-text terms related to two domains: (1) artificial intelligence and large language models, and (2) spinal diseases and spine surgery. The full search syntax for both databases is provided in Appendix A.
2.3. Study selection process
The study selection process followed the PRISMA 2020 flow diagram (Fig. 2). A detailed, step-by-step account of the screening process is provided in Appendix B. Ultimately, 26 studies met all criteria and were included in the review.
Fig. 2.
PRISMA flow diagram of study selection.
2.4. Inclusion and exclusion criteria
Inclusion Criteria:
• Original research articles published in English between January 2023 and September 2025.
• Studies focusing on the application of LLMs or AI systems in spinal disease diagnosis, treatment, surgical planning, prognosis, or patient communication.
• Articles published in peer-reviewed journals with a Journal Impact Factor (JIF) ≥ 3.0.
• High-quality preprints describing novel AI architectures, if no peer-reviewed version was available.
Exclusion Criteria:
• Reviews, editorials, commentaries, and conference abstracts.
• Animal or in vitro studies.
• Studies not directly related to spinal or orthopedic medicine.
• Duplicate publications or studies with insufficient methodological detail.
2.5. Data extraction and synthesis
Data from the 26 included studies were extracted using a standardized template, capturing study characteristics, AI models evaluated, clinical application scenarios, performance metrics, and reported stages of clinical validation. A narrative synthesis approach was used to summarize findings and identify trends and gaps in the literature.
3. Core technology analysis: From general models to domain-specific intelligence
To enable clinicians and researchers to deeply understand and effectively utilize these emerging technologies, this section will, in a tutorial format, systematically analyze the four core technical pathways currently driving medical AI development, clarifying their working principles, clinical significance, and interrelationships. The details are depicted in Table 1 and Fig. 3.
Table 1.
Summary of core technical pathways in spinal clinical application scenarios.
| Technical Pathway | Core Mechanism |
|---|---|
| Direct Application of General LLMs | Guiding general-purpose LLMs through engineered prompt architectures to ensure clinically relevant and contextually accurate responses. |
| Retrieval-Augmented Generation (RAG) | Implementing Retrieval-Augmented Generation (RAG) for clinical contexts by using semantic search over established guidelines to inform LLM responses. |
| Fine-Tuning | Through specialized fine-tuning on curated datasets of surgical narratives and radiological reports, general-purpose large language models are transformed into expert systems for spinal diseases, developing specialized clinical reasoning capabilities. |
| Multimodal Fusion | Multimodal fusion of medical imaging, EHRs, and clinical narratives using techniques such as weighted averaging, enabling comprehensive diagnostic assessments and predictive analytics. |
Fig. 3.
Four Core Technological Drivers of Medical AI. (A) Carefully designed prompts can guide LLMs to perform clinical reasoning by simulating the role of a specialist. (B) RAG enhances accuracy and reduces hallucinations by retrieving relevant external knowledge and providing it to the LLM, enabling more precise and evidence-based answers. (C) Foundation models (e.g., ChatGPT, Qwen) can be specialized for medical tasks by fine-tuning (e.g., LoRA, QLoRA, adapter tuning) with domain-specific data, resulting in domain-specialized LLMs while maintaining computational efficiency. (D) Multimodal fusion, which combines diverse modalities (e.g., imaging, clinical notes, structured data) through concatenation, summation, or weighted averaging for more robust diagnostic and therapeutic decision-making. (Schematic). LLM, Large Language Model; ChatGPT, Chat Generative Pre-trained Transformer; Qwen, Tongyi Qianwen; LoRA, Low-Rank Adaptation; QLoRA, Quantized Low-Rank Adaptation.
3.1. Direct application of general-purpose LLMs: The “Out-of-the-box” model
Working Mechanism: Currently, the most common application model involves researchers directly calling commercial or open-source general-purpose LLMs, such as OpenAI's GPT series (GPT-4, GPT-4o) [5], Google's Gemini series [23,24], and Anthropic's Claude series [25], via Application Programming Interfaces (APIs), with details provided in Table 2. As illustrated in Fig. 3 (A), these models are pretrained on vast general datasets of internet text and can perform a wide range of natural language tasks without task-specific modifications. In this model, the innovation lies mainly in “Prompt Engineering,” which involves carefully designing instructions to guide and constrain the model's output to achieve desired results.
Table 2.
Summary of key features of major medical large models in spine-related research.
| Model Name | Developer | Type | Key Application/Finding in Spine-Related Research |
|---|---|---|---|
| Claude 3.5 | Anthropic | General LLM | Showed near-perfect reliability (ICC = 0.984) in calculating the SINS scores from imaging reports, outperforming other models and matching human clinicians [26]. |
| NotebookLM | Google | RAG-LLM | When augmented with NASS guidelines as an external knowledge base, it achieved 98.3% accuracy in answering questions about degenerative spine conditions, significantly outperforming un-augmented general LLMs (40.7%) [27]. |
| GPT-4/GPT-4o | OpenAI | General LLM | Achieved 98% accuracy in extracting Pfirrmann degeneration grades from lumbar MRI text reports, with very high agreement with radiology experts (Kappa = 0.975). Also significantly improved the readability of thoracolumbar fracture MRI reports for patients [28]. |
| DeepSeek [29] | DeepSeek | General LLM (Open/Closed Source) | In a comparative study with surgeons, it performed well on knowledge-based theoretical questions (80% accuracy) but poorly on case analysis questions requiring clinical reasoning (44% accuracy) [30]. |
| Llama Series [31] | Meta AI | General LLM (Open Source) | In the task of calculating SINS scores, its performance (ICC = 0.829) was significantly lower than that of Claude 3.5 [26]. However, it can show great potential in specific text extraction tasks after fine-tuning. |
Abbreviations: LLM, Large Language Model; ICC, Intraclass Correlation Coefficient; SINS, Spinal Instability Neoplastic Score; RAG, Retrieval-Augmented Generation; NASS, North American Spine Society; GPT-4/GPT-4o, Generative Pre-trained Transformer 4/Generative Pre-trained Transformer 4, optimized; OpenAI, Open Artificial Intelligence; MRI, Magnetic Resonance Imaging; Kappa, Cohen's Kappa.
Clinical Relevance: Due to its extremely high accessibility and low technical barrier, this method dominates initial exploratory research. It allows a wide range of clinicians to rapidly validate the feasibility of LLMs in specific scenarios, such as testing their ability to answer clinical questions, summarize medical records, or simplify medical terminology [32,33]. However, the limitations of this method are also very apparent, as its reliability is questionable when handling high-risk, high-precision clinical tasks.
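To make the prompt-engineering step concrete, the sketch below assembles a role-constrained clinical prompt of the kind described above. The template, role wording, and constraint lines are illustrative assumptions rather than a validated clinical prompt, and no real model API is called:

```python
def build_clinical_prompt(question: str, role: str = "board-certified spine surgeon") -> str:
    """Assemble a structured prompt that role-constrains a general-purpose LLM.

    The instruction lines below are illustrative; production prompts are
    tuned empirically for each model and task.
    """
    return "\n".join([
        f"You are a {role}.",
        "Answer strictly from established clinical evidence.",
        "If the evidence is insufficient, reply 'insufficient evidence'.",
        "Cite the guideline or study supporting each claim.",
        f"Question: {question}",
    ])

prompt = build_clinical_prompt(
    "What is the first-line non-surgical treatment for lumbar radiculopathy?"
)
print(prompt)
```

In practice, such a string would be sent to a commercial or open-source model endpoint via its API, and variants of the instruction lines would be compared empirically against clinician judgment.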
3.2. Retrieval-Augmented Generation (RAG): Bridging the knowledge gap, ensuring factual accuracy
Working Mechanism Analysis: Retrieval-Augmented Generation (RAG) [34] is a framework designed to combine the fluent generative capabilities of LLMs with the accuracy of external authoritative knowledge bases [[35], [36], [37], [38]]. Fig. 3 (B) presents a streamlined version of the workflow, highlighting the major functional components commonly depicted in standard RAG architectural diagrams. The full RAG pipeline consists of three core stages:
Indexing: First, a trusted external knowledge base is established. For example, all clinical guidelines from the North American Spine Society (NASS), the latest medical textbooks, or internal institutional treatment protocols are digitized. These documents are split into smaller text blocks (chunks), then converted into high-dimensional vector representations by an embedding model. As illustrated in Fig. 3(B), these embeddings are stored in the Vector Store Index, which serves as the searchable index for subsequent retrieval. The vector index then connects to the Database, where the structured knowledge repository is maintained.
Retrieval: When a user submits a query (e.g., “What is the preferred non-surgical treatment for a patient with lumbar spondylolisthesis and radicular symptoms?”), the system first converts the query into an embedding via the same Embedding Model. The query embedding is then matched against the stored vector index to identify the most semantically relevant text segments. In Fig. 3(B), this retrieval operation is shown by the connection from the Database to the Context module. The “Context” box represents the set of retrieved, high-relevance text chunks that will be supplied to the LLM.
Augmentation & Generation: The retrieved context is then combined with the user's original question to form an enriched prompt with a clear instruction: “Please answer this question based on the authoritative materials provided below.” This composite input is passed to the LLM, as shown in Fig. 3(B), enabling the model to generate an evidence-grounded, authoritative answer rather than relying solely on its internal parameters. This design ensures that the LLM's output remains aligned with up-to-date, verifiable clinical knowledge.
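The three stages above can be sketched in a few lines. In the sketch below, a bag-of-words counter stands in for the embedding model, and two invented snippets stand in for a guideline knowledge base; real systems use neural encoders and dedicated vector databases:

```python
import math
from collections import Counter

# Toy "knowledge base": two invented snippets standing in for guideline passages.
chunks = [
    "Lumbar spondylolisthesis with radicular symptoms: structured physical therapy "
    "and NSAIDs are typical first-line non-surgical measures.",
    "SINS scores of 13-18 indicate spinal instability and warrant surgical consultation.",
]

def embed(text: str) -> Counter:
    """Stand-in embedding: bag-of-words term counts (real systems use neural encoders)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

index = [embed(c) for c in chunks]  # Indexing: vectorize and store each chunk

query = "non-surgical treatment for lumbar spondylolisthesis with radicular symptoms"
q_vec = embed(query)
best = max(range(len(chunks)), key=lambda i: cosine(q_vec, index[i]))  # Retrieval

# Augmentation & Generation: retrieved context + instruction + original question
# form the enriched prompt that would be handed to the LLM.
augmented_prompt = (
    "Please answer this question based on the authoritative materials provided below.\n"
    f"Context: {chunks[best]}\n"
    f"Question: {query}"
)
print(augmented_prompt)
```

The key design property is visible even in this toy version: the generator only ever sees text drawn from the curated knowledge base, so updating the guidelines means re-indexing documents, not retraining the model.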
Clinical Significance: RAG represents a fundamental shift from unconstrained probabilistic text generation to rigorous, evidence-based reasoning [[39], [40], [41], [42]]. This is crucial for addressing one of the most serious shortcomings of LLMs in medical applications: “hallucination”. General LLMs generate responses by predicting the next most likely word, which can lead them to “fabricate” information that sounds plausible but is completely false, such as fictitious references (with rates as high as 88.3% reported in one study). By enforcing a “retrieve before generate” paradigm, RAG substantially improves factual accuracy. A landmark study on degenerative spine diseases clearly demonstrated this: when augmented with NASS guidelines, a RAG-LLM (NotebookLM) achieved an accuracy of 98.3% in answering relevant clinical questions, compared with only 40.7% for a state-of-the-art general-purpose LLM (ChatGPT-4o) [27].
Moreover, newer RAG-based architectures have evolved beyond the original framework by incorporating access to up-to-date, authoritative data sources such as PubMed and other verified biomedical databases, and by improving transparency through explicit citation of retrieved evidence. Recent developments further integrate structured medical knowledge graphs (GraphRAG) [43] and multimodal inputs (Multimodal RAG) [44,45], which enhance semantic retrieval, interpretability, and cross-modal reasoning in clinical applications. The performance gap described above highlights that RAG is not merely a plug-in but a core architecture for reliable clinical decision support, enabling continuous knowledge updates without costly retraining of the entire model.
3.3. Fine-tuning: Model specialization and deep customization
Working Mechanism Analysis: Fine-tuning refers to the process of taking a pretrained general-purpose large model and further training it on a smaller yet highly specialized, domain-specific dataset (e.g., thousands of in-house spinal surgery records or imaging reports) [46]. This process adjusts the model's internal parameters (weights), enabling it to better adapt to the linguistic style, professional terminology, and intrinsic logic of a specific domain, thus becoming a “domain expert” [47,48]. Modern techniques like QLoRA (Quantized Low-Rank Adaptation) [49], LoRA (Low-Rank Adaptation) [50], and adapter tuning [51] have made this process far more computationally efficient [[52], [53], [54]]. The details are shown in Fig. 3 (C).
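The efficiency gain of low-rank methods can be seen in the parameter arithmetic alone: for a d × k weight matrix, full fine-tuning updates d·k parameters, whereas a rank-r adapter trains only r·(d + k). The layer dimensions below are assumed for illustration and are not taken from any cited model:

```python
# Parameter arithmetic behind low-rank adapters: instead of updating a full
# d x k weight matrix W, LoRA trains two factors B (d x r) and A (r x k),
# so the adapted weight is W + B @ A with r << min(d, k).
def lora_trainable_params(d: int, k: int, r: int) -> tuple[int, int]:
    full = d * k              # parameters touched by full fine-tuning of this layer
    low_rank = r * (d + k)    # parameters trained by a rank-r adapter
    return full, low_rank

# Illustrative dimensions for a single attention projection: a 4096 x 4096
# weight with a rank-8 adapter.
full, low_rank = lora_trainable_params(d=4096, k=4096, r=8)
print(full, low_rank, full // low_rank)  # ~256x fewer trainable parameters
```

QLoRA pushes this further by keeping the frozen base weights in quantized (e.g., 4-bit) form, so only the small adapter factors are held in full precision during training.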
Clinical Significance: Fine-tuning enables the creation of highly specialized “expert models” whose performance on specific clinical tasks can surpass that of state-of-the-art general-purpose models. General LLMs often lack a nuanced understanding of subspecialty-specific details. By fine-tuning a model on a professional orthopedic dataset, it can “internalize” the knowledge graph of that field. For example, a fine-tuned RoBERTa [55] model achieved an Area Under the Curve (AUC) value of 0.99 in identifying preoperative frailty from clinical notes. This means that hospitals or research institutions with high-quality, structured data can develop high-performance, customized AI models tailored to their specific needs (e.g., predicting surgical complication risks for their local patient population) without bearing the enormous cost of pretraining a large model from scratch [56].
3.4. Multimodal fusion: Comprehensive diagnostic and therapeutic capabilities beyond text
Working Mechanism Analysis: Clinical decision-making in spinal surgery is inherently a multimodal process, requiring physicians to synthesize and analyze a patient's verbal history (text), physical examination findings (observation), imaging studies (images), and even biomechanical data. The goal of multimodal AI is precisely to simulate this process by building models that can simultaneously understand and integrate different types of data [[57], [58], [59]]. Fig. 3 (D) illustrates the fusion methods, which typically include multiple parallel processing streams. For instance, a Convolutional Neural Network (CNN) extracts visual features from CT or MRI images, a Transformer model extracts textual features from electronic medical records, and a “fusion mechanism” then intelligently weighs and combines these heterogeneous features to produce a unified diagnosis or prediction [[60], [61], [62], [63], [64]].
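Two of the fusion operators named above (weighted averaging and concatenation) reduce to simple vector arithmetic once each modality has been encoded. The toy feature vectors and fixed weights below are illustrative only; in a trained system both the encoders and the fusion weights are learned:

```python
# Toy fusion of two modality-specific feature vectors; all values illustrative.
image_features = [0.8, 0.1, 0.3]   # e.g., CNN features from an MRI slice
text_features = [0.4, 0.9, 0.5]    # e.g., transformer features from the EHR

def weighted_average(a, b, w_a=0.6, w_b=0.4):
    """Fuse two aligned feature vectors; the weights would normally be learned."""
    return [w_a * x + w_b * y for x, y in zip(a, b)]

def concatenate(a, b):
    """Stack modalities side by side; a downstream classifier sees both."""
    return a + b

fused_avg = weighted_average(image_features, text_features)
fused_cat = concatenate(image_features, text_features)
print(fused_avg)  # same dimensionality as each input
print(fused_cat)  # dimensionality is the sum of the two inputs
```

The choice of operator matters downstream: averaging keeps the fused representation compact but assumes the modality spaces are aligned, while concatenation preserves all information at the cost of a larger input to the prediction head.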
Clinical Significance: The “performance paradox” exhibited by general LLMs in evaluations (excellent text-processing ability but extremely weak image interpretation, e.g., only 11.2% accuracy in diagnosing trauma X-rays [65]) is an inevitable consequence of their text-centric pre-training. This indicates that simply adding a generic “vision” module to an LLM is far from sufficient to handle the complexity and precision required for medical imaging. The path toward clinically reliable diagnostic AI necessarily lies in multimodal integration. Cutting-edge research has already pointed the way: a multimodal framework called “FracturaX” demonstrated superior performance in fracture detection by fusing deep features from CT and X-rays compared to single-modality methods. Another study developed a multimodal model that integrated clinical and MRI data to predict the neurological prognosis of patients with cervical spinal cord injury, achieving an AUC of up to 0.94 [17]. Therefore, the future of spine diagnostic AI lies not in making language models “see” better, but in building new hybrid intelligent systems that can deeply integrate the logical reasoning abilities of language models with the visual perception capabilities specifically designed for medical imaging.
4. Clinical application scenarios and medical value analysis
This section will systematically review and analyze the specific practices, performance, and potential medical value of large language models in four core application scenarios within spine medicine. The findings are listed in Table 3.
Table 3.
Performance summary of medical large models in spinal clinical application scenarios.
| Application Scenario | Specific Task | Evaluated Model | Key Performance Metric | Result | Stage of Clinical Adoption |
|---|---|---|---|---|---|
| Clinical Diagnosis (Text-based) | Extract Pfirrmann grade from lumbar MRI reports | GPT-4 | Accuracy/Cohen's Kappa | 98%/0.975 | Retrospective Validation [66] |
| Clinical Diagnosis (Image-based) | Direct diagnosis of trauma X-rays | GPT-4o | Accuracy | 11.2% | Prototype Verification [65] |
| Clinical Score Automation | Calculate SINS score from imaging reports | Claude 3.5 | ICC | 0.984 | Internal Validation [26] |
| Guideline Adherence (RAG) | Answer NASS guideline questions | NotebookLM (RAG) | Accuracy | 98.3% | Prototype Verification [27] |
| Guideline Adherence (General LLM) | Answer NASS guideline questions | ChatGPT-4o | Accuracy | 40.7% | Experimental [27] |
| Prognosis Prediction (Multimodal) | Predict neurological recovery after cervical spinal cord injury | LightGBM (Multimodal) | AUC | 0.94 | Prospective Pilot Study [17] |
| Patient Education (Readability) | Simplify thoracolumbar fracture MRI reports | GPT-4o | Flesch-Kincaid Grade Level | From 12th grade to 10th grade | Proof of Concept [28] |
| Patient Education (Understanding) | Simplify spine MRI reports | In-house Secure LLM | Patient understanding score (1–10) | From 6.56 to 8.50 (P < .001) | Prospective Pilot Study [67] |
| Medical Education | OITE Exam for Orthopedic Residents | GPT-4 | Accuracy | Up to 73.6% (PGY-5 level) | Validation (Non-Clinical) [68] |
Abbreviations: MRI, Magnetic Resonance Imaging; GPT-4, Generative Pre-trained Transformer 4; GPT-4o, Generative Pre-trained Transformer 4, optimized; SINS, Spinal Instability Neoplastic Score; ICC, Intraclass Correlation Coefficient; RAG, Retrieval-Augmented Generation; NASS, North American Spine Society; NotebookLM, Notebook Language Model; LightGBM, Light Gradient Boosting Machine; AUC, Area Under the Curve; OITE, Orthopaedic In-Training Examination; PGY, Post-Graduate Year.
Note: The clinical adoption stages are categorized as follows: Prospective Pilot Study denotes prospective, single-center application with human subjects; Retrospective Validation indicates high-accuracy validation using historical datasets; Internal Validation refers to evaluation on in-house, privacy-preserving models within an institution; Prototype Verification represents validation of a technical prototype (e.g., RAG-augmented system) in a controlled setting; Experimental/Proof of Concept encompasses early exploratory studies with limited clinical readiness.
4.1. Clinical decision support and diagnostic assistance
This is the area with the most potential and the greatest challenges for LLM applications [[69], [70], [71], [72]], where performance is sharply polarized.
The “Performance Paradox” in Spinal Diagnosis: The current capabilities of LLMs in diagnostic tasks clearly illustrate a “performance paradox.” On one hand, they are excellent “knowledge engines,” performing well in standardized theoretical exams. For example, a systematic review reported that GPT-4 achieved an accuracy of 73.6% on the Orthopaedic In-Training Examination (OITE), equivalent to a senior resident's level [68]. However, when faced with real clinical cases requiring complex reasoning directly from images, their ability is significantly deficient. A study comparing AI models with human surgeons found that AI was highly accurate on knowledge-based theoretical questions (e.g., DeepSeek at 80%), but its accuracy dropped sharply on case analysis questions requiring comprehensive judgment (44%), far below that of experienced surgeons (88.8%) [30]. This indicates that the current value of LLMs lies mainly in processing and organizing information that has already been interpreted by human experts, rather than in performing independent, de novo diagnostic reasoning.
4.1.1. Case study 1: Extracting structured data from clinical narrative text
This is one of the most successful applications of LLMs to date, capable of transforming unstructured, natural language text written by physicians into machine-readable, structured data, thereby greatly improving the efficiency of clinical data utilization. Details are illustrated in Fig. 4.
Fig. 4.
Three example responses from LLMs in spinal care. Case 1 demonstrates the extraction of structured data (e.g., vertebral level, disc height loss, Pfirrmann grade) from unstructured radiology reports. Case 2 shows automated standardized scoring, where clinical findings are translated into a SINS. Case 3 highlights guideline adherence, where RAG enables the LLM to provide accurate, evidence-based antimicrobial prophylaxis recommendations in alignment with NASS guidelines, surpassing the baseline model without external knowledge integration. (Schematic). MRI, Magnetic Resonance Imaging; JSON, JavaScript Object Notation; L4-L5, Lumbar 4 - Lumbar 5; T2, Transverse Relaxation Time 2; L5-S1, Lumbar 5 - Sacral 1; SINS, Spinal Instability Neoplastic Score; y/o, years old; T7-T8, Thoracic 7 - Thoracic 8; CT, Computed Tomography; PET, Positron Emission Tomography; FDG, Fluorodeoxyglucose; TLIF, Transforaminal Lumbar Interbody Fusion; RAG, Retrieval-Augmented Generation; IV, Intravenous; g/mg/kg, gram/milligram/kilogram; L, Liter; min/h, minute/hour; DM, Diabetes Mellitus; MRSA, Methicillin-Resistant Staphylococcus Aureus.
Pfirrmann Grade Extraction: In a study on lumbar disc degeneration, GPT-4 was applied to automatically extract Pfirrmann grades from unstructured lumbar MRI reports. According to an orthopedic systematic review [66], GPT-4 achieved an accuracy of up to 98% and near-perfect agreement with senior radiology experts (Cohen's Kappa = 0.975). The medical value of this application lies in its ability to automate the large-scale processing of historical imaging reports, providing high-quality structured data for epidemiological research and prognostic model building without the need for extensive manual review.
General Information Extraction: Research has shown that natural language processing technology can accurately query key information such as past treatments, positive physical examination signs, and types and sizes of surgical implants from historical progress notes with over 90% sensitivity and specificity [73].
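For intuition, the extraction tasks above can be approximated by a rule-based sketch that maps report text to structured JSON. The report snippet, regular expression, and output field names below are hypothetical simplifications; the cited studies delegated this step to an LLM precisely because real report phrasing varies far beyond what fixed rules capture:

```python
import json
import re

# Hypothetical report snippet; real report phrasing varies far more widely.
report = ("L4-L5: disc height loss with Pfirrmann grade IV degeneration. "
          "L5-S1: Pfirrmann grade II.")

ROMAN = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}

def extract_pfirrmann(text: str) -> list[dict]:
    """Rule-based stand-in for the LLM extraction step: find level/grade pairs."""
    # Longer Roman numerals are listed first so the alternation matches "IV"
    # before settling for a bare "I".
    pattern = r"(L\d-[LS]\d).*?Pfirrmann grade (IV|V|III|II|I)"
    return [
        {"level": level, "pfirrmann_grade": ROMAN[grade]}
        for level, grade in re.findall(pattern, text)
    ]

print(json.dumps(extract_pfirrmann(report), indent=2))
```

The contrast between this brittle pattern and the 98% accuracy reported for GPT-4 on free-text reports is exactly why LLMs add value here: they tolerate paraphrase, abbreviation, and inconsistent formatting that break hand-written rules.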
4.1.2. Case study 2: Automated standardized clinical scoring
This constitutes another high-value, low-risk application scenario, where LLMs are leveraged to perform rule-based computational tasks [74,75]. Fig. 4 displays a process of automatically calculating SINS scores from a patient's medical record.
SINS Score Calculation: The Spinal Instability Neoplastic Score (SINS) is a key tool for assessing spinal stability in patients with spinal metastases, directly influencing treatment decisions. A study evaluated the accuracy of two privacy-preserving, in-house LLMs (Claude 3.5 and Llama 3.1) in calculating SINS scores based on imaging reports and electronic health records [26]. The results showed that Claude 3.5, the better-performing model, had very high consistency with the “gold standard” reference determined by three experts, with an Intraclass Correlation Coefficient (ICC) of 0.984, performing on par with or even better than human residents [26]. Clinically, the significance of this technology is its potential to standardize and automate a time-consuming and somewhat subjective assessment process, thereby reducing the burden on clinicians and improving the consistency and efficiency of decision-making, especially in scenarios where non-spine specialists need to make a preliminary assessment.
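Once free-text findings have been mapped to the six SINS component categories (the step the cited study delegated to the LLMs), the remaining arithmetic is deterministic. A minimal sketch of that downstream step, with component point values following the published SINS system and a purely illustrative patient:

```python
# Component point values follow the published SINS system; mapping free text
# to these categories is the harder step that the LLMs performed.
SINS_POINTS = {
    "location": {"junctional": 3, "mobile": 2, "semi-rigid": 1, "rigid": 0},
    "pain": {"mechanical": 3, "occasional non-mechanical": 1, "painless": 0},
    "bone_lesion": {"lytic": 2, "mixed": 1, "blastic": 0},
    "alignment": {"subluxation/translation": 4, "de novo deformity": 2, "normal": 0},
    "collapse": {">50%": 3, "<50%": 2, ">50% body involved, no collapse": 1, "none": 0},
    "posterolateral": {"bilateral": 3, "unilateral": 1, "none": 0},
}

def sins_total(findings: dict) -> tuple[int, str]:
    """Sum the six component scores and bucket the total (range 0-18)."""
    total = sum(SINS_POINTS[component][value] for component, value in findings.items())
    if total <= 6:
        category = "stable"
    elif total <= 12:
        category = "potentially unstable"
    else:
        category = "unstable"
    return total, category

# Purely illustrative patient: junctional lytic lesion with mechanical pain.
score, category = sins_total({
    "location": "junctional", "pain": "mechanical", "bone_lesion": "lytic",
    "alignment": "normal", "collapse": "<50%", "posterolateral": "unilateral",
})
print(score, category)  # 3+3+2+0+2+1 = 11, i.e. potentially unstable
```

Separating the fuzzy text-to-category mapping from this deterministic tally is what makes the task "high-value, low-risk": the LLM output can be audited component by component before the score is accepted.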
4.1.3. Case study 3: Enhancing adherence to clinical guidelines
The Advantage of RAG Technology: As demonstrated in Fig. 4, general-purpose LLMs show variable performance in providing advice that complies with clinical guidelines. For example, GPT-4 had an accuracy of 81% when answering questions about NASS guidelines for antibiotic prophylaxis in spine surgery [33], and another comparative study indicated a consistency of 67.9% when dealing with guidelines for degenerative spondylolisthesis [76]. However, when RAG technology was used to enhance the model with NASS guidelines as an external knowledge base, the model's accuracy surged to 98.3% [27].
Clinical significance: This striking contrast highlights a clear pathway toward reliable clinical decision support tools. A RAG-based system can deliver point-of-care recommendations that are fully aligned with up-to-date clinical guidelines. Such a capability holds substantial value for promoting evidence-based practice, reducing inter-physician variability in treatment decisions, and ultimately improving the quality of patient care.
4.2. Surgical planning, risk stratification, and prognosis prediction
For complex surgeries such as Adult Spinal Deformity (ASD), precise preoperative planning, risk assessment, and prognosis prediction are crucial. The application of AI in this field is evolving from processing single-modality analysis to multimodal data fusion and multi-objective optimization [77].
Multimodal Predictive Models: Next-generation prognostic models are expected to be multimodal [[78], [79], [80]], incorporating structured clinical data (e.g., age, comorbidities), quantitative imaging features derived from radiomics, and even “digital phenotype” data from wearable devices (e.g., gait, activity levels). A recent study demonstrated a multimodal model integrating clinical and MRI data to predict postoperative neurological recovery in patients with cervical spinal cord injury, achieving an AUC of 0.94 and underscoring its exceptional predictive capability [17].
AI-Assisted Patient Screening: Instead of the traditional binary decision of “surgery” versus “non-surgery,” AI enables more refined risk stratification strategies [81], such as “direct surgery,” “surgery after optimization,” and “surgery not recommended”. For instance, models can identify patients at high risk of postoperative mechanical complications and recommend targeted preoperative interventions, such as bone health assessment and tailored rehabilitation. This approach may convert some patients previously deemed ineligible for surgery into viable candidates, thereby expanding the pool of patients who can safely benefit from surgical treatment.
4.3. Clinical workflow optimization and automation
Although not as eye-catching as diagnostic assistance, workflow automation may represent the fastest-deploying and most widely applicable use of LLMs in healthcare, with its core value lying in freeing physicians from burdensome administrative tasks and enabling them to devote more time to direct patient care [[82], [83], [84]].
Automated Medical Documentation: Studies have shown that LLMs can generate discharge summaries at a speed exceeding that of physicians while maintaining comparable quality. They can also assist in drafting responses to patient messages, thereby alleviating physician workload and mitigating burnout [[85], [86], [87], [88]].
Automated Coding: Another promising direction is the use of natural language processing to extract key information from operative notes and automatically generate the corresponding CPT (Current Procedural Terminology) billing codes. This capability has the potential to substantially enhance both the efficiency and accuracy of hospital operations.
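The extraction step can be prototyped as a prompt template plus a validator that keeps only well-formed five-digit candidates. Everything here is a sketch: the prompt wording and the example codes are illustrative assumptions, and the model call is abstracted as a callable so any LLM client could be plugged in.

```python
import re

# Sketch of LLM-based CPT code extraction from an operative note.
# The prompt wording and example codes are illustrative only; real
# deployments would validate candidates against a licensed CPT table.

PROMPT_TEMPLATE = (
    "Extract the CPT codes for the procedures in this operative note. "
    "Answer with a comma-separated list of 5-digit codes only.\n\nNote:\n{note}"
)

def extract_cpt_codes(note: str, llm) -> list[str]:
    """Ask the model for codes, then keep only well-formed 5-digit candidates."""
    response = llm(PROMPT_TEMPLATE.format(note=note))
    return sorted(set(re.findall(r"\b\d{5}\b", response)))

def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM client."""
    return "Suggested CPT codes: 22612, 22614."

codes = extract_cpt_codes("Posterior lumbar arthrodesis, L4-L5 and L5-S1.", fake_llm)
```

The post-hoc regex filter matters: it discards malformed or hallucinated output before anything reaches the billing system, which is the kind of guardrail such a pipeline would need in practice.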
4.4. Patient communication and health education
In patient-facing applications, LLMs are a “double-edged sword” with great potential that must be used with caution [89].
4.4.1. Successful application: Simplifying professional medical terminology
This is undoubtedly one of the most mature and successful applications to date. Multiple studies have consistently confirmed that LLMs can effectively translate jargon-laden medical reports, which are daunting for lay readers, into plain, understandable language.
Quantitative Evidence: A study on spinal MRI reports found that after AI translation and simplification, patients’ self-rated understanding scores significantly increased from 6.56 to 8.50 (on a 10-point scale, P < .001) [67]. Another study on thoracolumbar fracture MRI reports also showed that AI-generated explanations were far more readable than the original reports (Flesch Readability Ease Score: 53.9 vs 32.15, P < .001), and the reading difficulty level was reduced from 11th-12th grade (high school senior/college) to 10th-11th grade level [28].
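The readability metrics quoted here follow standard published formulas. Below is a minimal implementation using a rough vowel-group heuristic for syllable counting; the heuristic is approximate, so exact scores may differ slightly from the tools used in the cited studies.

```python
import re

# Flesch Reading Ease (FRES) and Flesch-Kincaid Grade Level (FKGL).
# Syllables are estimated by counting vowel groups with a crude
# silent-e correction, so values are approximate.

def count_syllables(word: str) -> int:
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:  # crude silent-e correction
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) for a text; higher FRES means easier reading."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # words per sentence
    spw = syllables / len(words)   # syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fres, fkgl
```

A jargon-dense sentence scores far lower on FRES (and higher on FKGL) than a short plain-language one, which is exactly the shift the cited simplification studies measured.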
4.4.2. The risk: Unreliability of direct Q&A
Using unverified general-purpose LLMs directly as patient-facing Q&A bots poses significant risks.
Factual Inaccuracy: Model-generated responses often require secondary clarification and correction by clinicians.
Poor Readability: Text generated with default settings often sits at a university reading level (Flesch-Kincaid Grade Level 12–14), far exceeding the 6th–8th grade level recommended for the general public.
Reference Hallucination: As revealed by prior research, the model's “fabrication” of references to increase credibility is extremely common (fabrication rate up to 55% [90]), which undermines trust and makes it impossible for patients to trace the source.
4.4.3. An unexpected finding: The “empathy” of AI
An unexpected research finding is that in simulated doctor-patient communication, LLMs may exhibit “empathy” that surpasses humans. In a direct comparison, ChatGPT's responses to online patient queries were rated significantly higher than those of practicing surgeons in terms of both empathy and overall quality (P < .001). This does not mean that AI possesses emotions, but rather that LLMs can be trained to consistently and tirelessly output text that follows best practices for empathetic communication (e.g., using reassuring language, acknowledging patient feelings). This suggests a new application direction: on the premise of ensuring absolute information accuracy (e.g., via RAG), AI may serve as an “auxiliary tool for humanistic care,” providing standardized emotional support, upon which clinicians can further build personalized and in-depth communication.
5. Key challenges and ethical considerations
Although large language models have demonstrated great potential in spine medicine, the path toward safe and reliable clinical application remains beset by challenges. Before widespread deployment, it is essential to acknowledge and address the following four critical issues.
5.1. Information reliability: “Hallucination” and outdated knowledge
The “Hallucination” Problem: This is one of the most fundamental inherent flaws of generative AI. “Hallucination,” also referred to as “confabulation,” denotes the generation of content that appears plausible but is factually incorrect or entirely fabricated [[91], [92], [93], [94], [95]]. This is not a rare programming error but a direct consequence of how current models work: they generate text by probabilistic next-token prediction. In the medical field, where accuracy is paramount, even a single hallucination can lead to misdiagnosis, inappropriate treatment, or irreversible harm [96]. Alarmingly, studies reviewed here found that up to 88.3% of the references AI generated in support of its claims were fabricated.
Knowledge Lag: The knowledge of large language models is “frozen” at the time their training data was collected [97]. Consequently, they are unable to incorporate the latest clinical guidelines, newly approved drugs, or emerging surgical techniques. Relying on such outdated information for clinical decision-making is akin to consulting an obsolete textbook, with obvious risks. This challenge underscores the importance of approaches such as RAG, which enable dynamic access to up-to-date medical knowledge bases.
5.2. Data privacy and security
Risk of Protected Health Information (PHI): Clinical data related to spine disorders, including electronic health records (EHRs) and imaging studies, constitute highly sensitive protected health information (PHI). When such data are uploaded to third-party, cloud-based LLM services, they are exposed to substantial privacy and security risks. Whether through accidental data leakage or malicious cyberattacks, the consequences could be catastrophic.
Compliance Requirements: The use of AI in medical settings must strictly adhere to relevant legal and regulatory frameworks [98], such as the Health Insurance Portability and Accountability Act (HIPAA) in the United States. Therefore, developing and deploying LLMs that can run within a hospital's firewall, either locally or through specially designed privacy-preserving architectures, is a safer and more compliant choice. For instance, the aforementioned study on calculating SINS scores used an institutional privacy-preserving model, providing a template for future clinical applications.
5.3. Algorithmic bias and fairness
The “Garbage In, Garbage Out” Principle: AI models inherently lack values; they merely reflect the data on which they are trained. If the training data contains systemic biases related to race, gender, or socioeconomic status, the models will not only inherit and reproduce these biases but may even amplify them [99].
Threat to Health Equity: Algorithmic bias can lead to serious disparities in healthcare [100]. For example, a model inadequately trained on data from a specific population may have lower diagnostic accuracy for diseases in that group or recommend suboptimal treatment plans [101]. Therefore, rigorous fairness audits and bias mitigation must be conducted at every stage of model development and deployment [102]. RAG technology offers a potential way to alleviate this issue by guiding the model to retrieve information from more diverse and representative knowledge bases.
5.4. The “Black Box Problem” and clinical trust
Opacity of the Decision-Making Process: Large language models, with hundreds of billions or even trillions of parameters, exhibit extremely complex internal decision-making mechanisms. To end users, their operation often resembles a “black box,” in which only the inputs and outputs are visible, while the reasoning behind a specific conclusion remains obscure.
The Need for Explainable AI (XAI): This opacity represents one of the greatest obstacles to building clinical trust. Before adopting an AI's recommendation, a physician needs a clear understanding of the basis for its decision. Without insight into the model's “thought process,” clinicians cannot properly evaluate its reliability, let alone rely on it for high-stakes medical decisions. Therefore, future research must invest heavily in Explainable AI (XAI) to develop more transparent, traceable AI systems that can provide reasonable explanations for their conclusions [103]. Such advancements are essential for earning the trust of clinicians [104].
6. Conclusion
In this paper, we provided a comprehensive survey of large language models and their associated systems within spine medicine, examining their potential to function as a “clinical co-pilot” or “intelligent assistant”. Their core value lies in enhancing rather than replacing human expertise through automated information processing, augmented data insights, and accelerated knowledge acquisition. These systems can substantially improve efficiency in low-risk tasks such as document management and data extraction. However, it is crucial to emphasize that for high-stakes clinical decision-making, the human physician, armed with experience and ethical responsibility, remains the indispensable last line of defense. The transition of LLMs from experimental tools to reliable clinical partners is contingent upon overcoming significant evidence gaps, particularly concerning generalizability across diverse healthcare systems, interoperability with existing electronic health records, and robustness in real-world, multi-center validation.
To advance LLMs from their current role as auxiliary tools to deeply integrated intelligent diagnostic and therapeutic partners, future research should focus on constructing large-scale, privacy-preserving, and spine-specific multimodal datasets through multi-center and cross-institutional collaboration, as such datasets are the foundation of AI progress. In addition, domain-specific, RAG-enhanced, and multi-agent LLMs should be developed and integrated into clinical workflows within a Human-in-the-Loop framework, thereby seamlessly combining the reasoning capabilities of language models, the perceptual power of vision models, and the analytical capacity for structured data. This technical progress must be paralleled by an equally critical focus on the practical hurdles of clinical integration. We therefore propose that subsequent research and development explicitly address a checklist of translational challenges: (1) Regulatory Compliance & Approval: navigating frameworks for software as a medical device; (2) Data Privacy & Security: ensuring strict adherence to protocols in model training and deployment; (3) Bias Mitigation & Governance: implementing auditable frameworks to identify and mitigate dataset and model biases to ensure equity and fairness; and (4) Workflow Integration: designing systems that fit seamlessly into clinical routines without increasing clinicians' cognitive load.
Therefore, a measured optimism, coupled with a pragmatic approach to overcoming these barriers, is essential for guiding the responsible and ethical development of LLMs in spine medicine.
Author contributions
B Chen and WY Tang were responsible for Conceptualization. Z Shen, X Long and DD Yu were responsible for Visualization. WY Tang and RZ Chen were responsible for Writing-original draft. B Chen, WY Tang and RZ Chen were responsible for Writing-review & editing. All authors have read and agreed to the published version of the manuscript.
Declaration of generative AI in scientific writing
Generative artificial intelligence (AI) and AI-assisted technologies were used to improve readability and language.
Funding
The research was supported by the Zhejiang Province's Vanguard Geese Leading Plan Project (Grant No. 2025C02170).
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Footnotes
This article is part of a special issue entitled: Cartilage Repair published in Journal of Orthopaedic Translation.
Appendix A. Full Electronic Search Strategies
PubMed Search Strategy:
("Artificial Intelligence"[MeSH] OR "Natural Language Processing"[MeSH] OR "Large Language Model"[Title/Abstract] OR "LLM"[Title/Abstract] OR "Generative AI"[Title/Abstract] OR "ChatGPT"[Title/Abstract] OR "GPT-4"[Title/Abstract] OR "GPT-4o"[Title/Abstract] OR "Claude"[Title/Abstract] OR "Gemini"[Title/Abstract] OR "Llama"[Title/Abstract] OR "DeepSeek"[Title/Abstract] OR "Retrieval-Augmented Generation"[Title/Abstract] OR "RAG"[Title/Abstract] OR "Prompt Engineering"[Title/Abstract])
AND
(“Spine" [MeSH] OR “Spinal Diseases" [MeSH] OR “Spinal Cord" [MeSH] OR “Vertebra∗" [Title/Abstract] OR “Spine" [Title/Abstract] OR “Spinal" [Title/Abstract] OR “Back Pain" [Title/Abstract] OR “Spondyl∗" [Title/Abstract] OR “Scolio∗" [Title/Abstract] OR “Disc Herniation" [Title/Abstract] OR “Degenerative Disc" [Title/Abstract])
Embase Search Strategy:
('artificial intelligence'/exp OR 'natural language processing'/exp OR 'large language model':ti,ab,kw OR 'generative ai':ti,ab,kw OR 'chatgpt':ti,ab,kw OR 'gpt 4':ti,ab,kw OR 'gpt 4o':ti,ab,kw OR 'claude':ti,ab,kw OR 'gemini':ti,ab,kw OR 'llama':ti,ab,kw OR 'deepseek':ti,ab,kw OR 'retrieval augmented generation':ti,ab,kw OR 'rag':ti,ab,kw)
AND
('spine'/exp OR 'spine disease'/exp OR 'spine surgery'/exp OR 'orthopedics'/exp OR 'vertebra'/exp OR 'intervertebral disk hernia'/exp OR 'spine':ti,ab,kw OR 'spinal':ti,ab,kw OR 'vertebra*':ti,ab,kw OR 'back pain':ti,ab,kw OR 'spondyl*':ti,ab,kw OR 'scolio*':ti,ab,kw OR 'disc herniation':ti,ab,kw)
Appendix B. Detailed Study Selection Protocol
The study selection process consisted of the following steps.
1) Identification:
• A total of 4590 records were identified through database searching (PubMed: n = 2071; Embase: n = 2519).
• After removing 821 duplicate records, 3769 unique records remained for screening.
2) Screening:
• Titles and abstracts of 3769 records were screened for relevance.
• A total of 3243 records were excluded for the following reasons:
 o Not relevant to the study topic (n = 2801)
 o Non-human subjects or non-English language (n = 442)
• This resulted in 526 reports sought for full-text retrieval.
3) Eligibility:
• Full-text articles were assessed for eligibility against predefined inclusion and exclusion criteria.
• Of the 526 reports, 12 could not be retrieved.
• A total of 514 full-text articles were assessed.
• Among these, 488 were excluded for the following reasons:
 o Reviews or case reports (n = 413)
 o Journal Impact Factor < 3.0 (n = 75)
• 26 studies met all eligibility criteria and were included in the final review.
References
- 1. Wu K., Xia Y., Deng P., Liu R., Zhang Y., Guo H., et al. TamGen: drug design with target-aware molecule generation through a chemical language model. Nat Commun. 2024;15(1):9360. doi: 10.1038/s41467-024-53632-4.
- 2. Ruan Y., Lu C., Xu N., He Y., Chen Y., Zhang J., et al. An automatic end-to-end chemical synthesis development platform powered by large language models. Nat Commun. 2024;15(1). doi: 10.1038/s41467-024-54457-x.
- 3. Yu Z., Ouyang X., Shao Z., Wang M., Yu J. Prophet: prompting large language models with complementary answer heuristics for knowledge-based visual question answering. IEEE Trans Pattern Anal Mach Intell. 2025;47(8):6797–6808. doi: 10.1109/TPAMI.2025.3562422.
- 4. Lu J., Song Z., Zhao Q., Du Y., Cao Y., Jia H., et al. Generative design of functional metal complexes utilizing the internal knowledge and reasoning capability of large language models. J Am Chem Soc. 2025;147(36):32377–32388. doi: 10.1021/jacs.5c02097.
- 5. Achiam O.J., Adler S., Agarwal S., Ahmad L., Akkaya I., Aleman F.L., et al. GPT-4 technical report. 2023.
- 6. Shah N.H., Entwistle D., Pfeffer M.A. Creation and adoption of large language models in medicine. JAMA. 2023;330(9):866–869. doi: 10.1001/jama.2023.14217.
- 7. Khera R., Butte A.J., Berkwits M., Hswen Y., Flanagin A., Park H., et al. AI in Medicine-JAMA's focus on clinical outcomes, patient-centered care, quality, and equity. JAMA. 2023;330(9):818–820. doi: 10.1001/jama.2023.15481.
- 8. Pressman S.M., Borna S., Gomez-Cabello C.A., Haider S.A., Haider C.R., Forte A.J. Clinical and surgical applications of large language models: a systematic review. J Clin Med. 2024;13(11). doi: 10.3390/jcm13113041.
- 9. Che H., Jin H., Gu Z., Lin Y., Jin C., Chen H. LLM-driven medical report generation via communication-efficient heterogeneous federated learning. IEEE Trans Med Imaging. 2025. doi: 10.1109/TMI.2025.3591185.
- 10. Zhang L., Liu M., Wang L., Zhang Y., Xu X., Pan Z., et al. Constructing a large language model to generate impressions from findings in radiology reports. Radiology. 2024;312(3). doi: 10.1148/radiol.240885.
- 11. Quer G., Topol E.J. The potential for large language models to transform cardiovascular medicine. Lancet Digit Health. 2024;6(10):e767–e771. doi: 10.1016/S2589-7500(24)00151-1.
- 12. Tu T., Schaekermann M., Palepu A., Saab K., Freyberg J., Tanno R., et al. Towards conversational diagnostic artificial intelligence. Nature. 2025;642(8067):442–450. doi: 10.1038/s41586-025-08866-7.
- 13. Liang S., Zhang J., Liu X., Huang Y., Shao J., Liu X., et al. The potential of large language models to advance precision oncology. EBioMedicine. 2025;115. doi: 10.1016/j.ebiom.2025.105695.
- 14. Zhang C., Liu S., Zhou X., Zhou S., Tian Y., Wang S., et al. Examining the role of large language models in orthopedics: systematic review. J Med Internet Res. 2024;26. doi: 10.2196/59607.
- 15. Yang A.J., Woo J.J., Ramkumar P.N. Editorial commentary: shifting from redundancy to rigor in orthopaedic large language model research. Arthroscopy. 2025. doi: 10.1016/j.arthro.2025.06.020.
- 16. Tappa K., Bird J.E., Arribas E.M., Santiago L. Multimodality imaging for 3D printing and surgical rehearsal in complex spine surgery. Radiographics. 2024;44(3). doi: 10.1148/rg.230116.
- 17. Shimizu T., Inomata K., Suda K., Matsumoto Harmon S., Komatsu M., Ota M., et al. A multimodal machine learning model integrating clinical and MRI data for predicting neurological outcomes following surgical treatment for cervical spinal cord injury. Eur Spine J. 2025;34(9):3747–3755. doi: 10.1007/s00586-025-08873-2.
- 18. Yu Q.S., Shan J.Y., Ma J., Gao G., Tao B.Z., Qiao G.Y., et al. Multi-modal and multi-view cervical spondylosis imaging dataset. Sci Data. 2025;12(1):1080. doi: 10.1038/s41597-025-05403-z.
- 19. Denecke K., May R., LLMHealthGroup, Rivera Romero O. Potential of large language models in health care: delphi study. J Med Internet Res. 2024;26. doi: 10.2196/52399.
- 20. Matheny M.E., Yang J., Smith J.C., Walsh C.G., Al-Garadi M.A., Davis S.E., et al. Enhancing postmarketing surveillance of medical products with large language models. JAMA Netw Open. 2024;7(8). doi: 10.1001/jamanetworkopen.2024.28276.
- 21. Fang M., Wang Z., Pan S., Feng X., Zhao Y., Hou D., et al. Large models in medical imaging: advances and prospects. Chin Med J (Engl). 2025;138(14):1647–1664. doi: 10.1097/CM9.0000000000003699.
- 22. Liu Y., Yuan Y., Yan K., Li Y., Sacca V., Hodges S., et al. Evaluating the role of large language models in traditional Chinese medicine diagnosis and treatment recommendations. npj Digit Med. 2025;8(1):466. doi: 10.1038/s41746-025-01845-2.
- 23. Gemini Team, Anil R., Borgeaud S., Alayrac J.-B., Yu J., Soricut R., et al. Gemini: a family of highly capable multimodal models. 2023. arXiv:2312.11805. https://ui.adsabs.harvard.edu/abs/2023arXiv231211805G
- 24. Comanici G., Bieber E., Schaekermann M., Pasupat I., Sachdeva N., Dhillon I., et al. Gemini 2.5: pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. 2025. arXiv:2507.06261. https://ui.adsabs.harvard.edu/abs/2025arXiv250706261C
- 25. Anthropic. Model card addendum: Claude 3.5 Haiku and upgraded Claude 3.5 Sonnet. 2024. Semantic Scholar CorpusID: 273639283.
- 26. Chan L.Y.T., Chan D.Z.M., Tan Y.L., Yap Q.V., Ong W., Lee A., et al. Evaluating the accuracy of privacy-preserving large language models in calculating the spinal instability neoplastic score (SINS). Cancers (Basel). 2025;17(13). doi: 10.3390/cancers17132073.
- 27. Su A.Y., Knebel A., Xu A.Y., Kaper M., Schmitt P., Nassar J.E., et al. Evaluation of retrieval-augmented generation and large language models in clinical guidelines for degenerative spine conditions. Eur Spine J. 2025. doi: 10.1007/s00586-025-08994-8.
- 28. Sing D.C., Shah K.S., Pompliano M., Yi P.H., Velluto C., Bagheri A., et al. Enhancing magnetic resonance imaging (MRI) report comprehension in spinal trauma: readability analysis of AI-generated explanations for thoracolumbar fractures. JMIR AI. 2025;4(1). doi: 10.2196/69654.
- 29. DeepSeek-AI, Guo D., Yang D., Zhang H., Song J., Zhang R., et al. DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. 2025. arXiv:2501.12948. https://ui.adsabs.harvard.edu/abs/2025arXiv250112948D
- 30. Demir M.T., Kültür Y. A comparative study of orthopedic surgeons and AI models in the clinical evaluation of spinal surgery. The Journal of Turkish Spinal Surgery. 2025.
- 31. Jiang A.Q., Sablayrolles A., Roux A., Mensch A., Savary B., Bamford C., et al. Mixtral of experts. 2024. arXiv:2401.04088. https://ui.adsabs.harvard.edu/abs/2024arXiv240104088J
- 32. Zhu Z., Liu J., Hong C.W., Houshmand S., Wang K., Yang Y. Multimodal large language model with knowledge retrieval using flowchart embedding for forming follow-up recommendations for pancreatic cystic lesions. AJR Am J Roentgenol. 2025;225(1). doi: 10.2214/AJR.25.32729.
- 33. Zaidat B., Shrestha N., Rosenberg A.M., Ahmed W., Rajjoub R., Hoang T., et al. Performance of a large language model in the generation of clinical guidelines for antibiotic prophylaxis in spine surgery. Neurospine. 2024;21(1):128–146. doi: 10.14245/ns.2347310.655.
- 34. Gao Y., Xiong Y., Gao X., Jia K., Pan J., Bi Y., et al. Retrieval-augmented generation for large language models: a survey. 2023. arXiv:2312.10997.
- 35. Ge J., Sun S., Owens J., Galvez V., Gologorskaya O., Lai J.C., et al. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology. 2024;80(5):1158–1168. doi: 10.1097/HEP.0000000000000834.
- 36. Hao P., Wang H., Yang G., Zhu L. Enhancing visual reasoning with LLM-powered knowledge graphs for visual question localized-answering in robotic surgery. IEEE J Biomed Health Inform. 2025. doi: 10.1109/JBHI.2025.3538324.
- 37. Yang R., Ning Y., Keppo E., Liu M., Hong C., Bitterman D.S., et al. Retrieval-augmented generation for generative artificial intelligence in health care. npj Health Systems. 2025;2(1):2.
- 38. Li D., Yang Y., Cui Z., Yin H., Hu P., Hu L. LLM-DDI: leveraging large language models for drug-drug interaction prediction on biomedical knowledge graph. IEEE J Biomed Health Inform. 2025. doi: 10.1109/JBHI.2025.3585290.
- 39. Feng Y., Chan T.H., Yin G., Yu L. Democratizing large language model-based graph data augmentation via latent knowledge graphs. Neural Netw. 2025;191. doi: 10.1016/j.neunet.2025.107777.
- 40. Ke Y.H., Jin L., Elangovan K., Abdullah H.R., Liu N., Sia A.T.H., et al. Retrieval augmented generation for 10 large language models and its generalizability in assessing medical fitness. npj Digit Med. 2025;8(1):187. doi: 10.1038/s41746-025-01519-z.
- 41. Kresevic S., Giuffre M., Ajcevic M., Accardo A., Croce L.S., Shung D.L. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. npj Digit Med. 2024;7(1):102. doi: 10.1038/s41746-024-01091-y.
- 42. Fink A., Rau A., Reisert M., Bamberg F., Russe M.F. Retrieval-augmented generation with large language models in radiology: from theory to practice. Radiol Artif Intell. 2025;7(4). doi: 10.1148/ryai.240790.
- 43. Han H., Wang Y., Shomer H., Guo K., Ding J., Lei Y., et al. Retrieval-augmented generation with graphs (GraphRAG). 2024. arXiv:2501.00309.
- 44. Abootorabi M.M., Zobeiri A., Dehghani M., Mohammadkhani M., Mohammadi B., Ghahroodi O., et al. Ask in any modality: a comprehensive survey on multimodal retrieval-augmented generation. Paper presented at: Annual Meeting of the Association for Computational Linguistics; 2025.
- 45. Xia P., Xia P., Zhu K., Li H., Li H., Shi W., et al. MMed-RAG: versatile multimodal RAG system for medical vision language models. 2024. arXiv:2410.13085.
- 46. Zhou S., Xu Z., Zhang M., Xu C., Guo Y., Zhan Z., et al. Large language models for disease diagnosis: a scoping review. npj Artificial Intelligence. 2025;1(1):9. doi: 10.1038/s44387-025-00011-z.
- 47. Cohen A.B., Adamson B., Larch J.K., Amster G. Large language model extraction of PD-L1 biomarker testing details from electronic health records. AI in Precision Oncology. 2025;2(2):57–64.
- 48. Liu R.J., Forsythe A., Rege J.M., Kaufman P. BIO25-024: real-time clinical trial data library in non-small cell lung (NSCLC), prostate (PC), and breast cancer (BC) to support informed treatment decisions: now a reality with a fine-tuned large language model (LLM). J Natl Compr Canc Netw. 2025;23(3.5). doi: 10.6004/jnccn.2024.7156.
- 49. Dettmers T., Pagnoni A., Holtzman A., Zettlemoyer L. QLoRA: efficient finetuning of quantized LLMs. Adv Neural Inf Process Syst. 2023;36:10088–10115.
- 50. Xu Y., Xie L., Gu X., Chen X., Chang H., Zhang H., et al. QA-LoRA: quantization-aware low-rank adaptation of large language models. 2023. arXiv:2309.14717.
- 51. Houlsby N., Giurgiu A., Jastrzebski S., Morrone B., De Laroussilhe Q., Gesmundo A., et al. Parameter-efficient transfer learning for NLP. Paper presented at: International Conference on Machine Learning; 2019.
- 52. Cai Z., Fang H., Liu J., Xu G., Long Y., Guan Y., et al. Improving unified information extraction in Chinese mental health domain with instruction-tuned LLMs and type-verification component. Artif Intell Med. 2025;162. doi: 10.1016/j.artmed.2025.103087.
- 53. Shi J., Wang Z., Zhou J., Liu C., Sun P.Z.H., Zhao E., et al. MentalQLM: a lightweight large language model for mental healthcare based on instruction tuning and dual LoRA modules. IEEE J Biomed Health Inform. 2025. doi: 10.1109/JBHI.2025.3594133.
- 54. Zhang Z., Liu P., Xu J., Hu R. Fed-HeLLo: efficient federated foundation model fine-tuning with heterogeneous LoRA allocation. IEEE Trans Neural Netw Learn Syst. 2025;36(10):17556–17569. doi: 10.1109/TNNLS.2025.3580495.
- 55. Liu Y., Ott M., Goyal N., Du J., Joshi M., Chen D., et al. RoBERTa: a robustly optimized BERT pretraining approach. 2019. arXiv:1907.11692.
- 56. Wu Z., Dadu A., Nalls M., Faghri F., Sun J. Instruction tuning large language models to understand electronic health records. Adv Neural Inf Process Syst. 2024;37:54772–54786.
- 57. Shen Z., Li Y., Zhang H., Weng Y., Wang J. Rethinking early-fusion strategies for improved multimodal image segmentation. 2025. arXiv:2501.10958. https://ui.adsabs.harvard.edu/abs/2025arXiv250110958S
- 58. Zhao Z., Bai H., Zhang J., Zhang Y., Zhang K., Xu S., et al. Equivariant multi-modality image fusion. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 25912–25921.
- 59. Li P., Shu X., Feng C.-M., Feng Y., Zuo W., Tang J. Surgical video workflow analysis via visual-language learning. npj Health Systems. 2025;2(1):5.
- 60. Huang C., Mees O., Zeng A., Burgard W. Multimodal spatial language maps for robot navigation and manipulation. 2025. arXiv:2506.06862. https://ui.adsabs.harvard.edu/abs/2025arXiv250606862H
- 61. Su D., Zhang Y., Li H., Li J., Liu Y. UniFuse: a unified all-in-one framework for multi-modal medical image fusion under diverse degradations and misalignments. 2025. arXiv:2506.22736. https://ui.adsabs.harvard.edu/abs/2025arXiv250622736S
- 62. Park M., Park H., Kim J. ViTA-PAR: visual and textual attribute alignment with attribute prompting for pedestrian attribute recognition. 2025. arXiv:2506.01411.
- 63. Gertz R.J., Beste N.C., Dratsch T., Lennartz S., Bremm J., Iuga A.I., et al. From dictation to diagnosis: enhancing radiology reporting with integrated speech recognition in multimodal large language models. Eur Radiol. 2025. doi: 10.1007/s00330-025-11929-y.
- 64. Du Y., Chen K., Zhan Y., Low C.H., Islam M., Guo Z., et al. LMT++: adaptively collaborating LLMs with multi-specialized teachers for continual VQA in robotic surgical videos. IEEE Trans Med Imaging. 2025. doi: 10.1109/TMI.2025.3581108.
- 65. Öztürk A., Günay S., Ateş S., Yavuz Yigit Y. Can GPT-4o accurately diagnose trauma X-rays? A comparative study with expert evaluations. J Emerg Med. 2025;73:71–79. doi: 10.1016/j.jemermed.2024.12.010.
- 66. Sertorio A.C., Bernetti C., Di Gennaro G., Zobel B.B., Mallio C.A. GPT-4 to obtain Pfirrmann grade from lumbar spine magnetic resonance imaging (MRI) reports. Quant Imaging Med Surg. 2024;14(9):7012–7017. doi: 10.21037/qims-24-883.
- 67. Encalada S., Gupta S., Hunt C., Eldrige J., Evans J. 2nd, Mosquera-Moscoso J., et al. Optimizing patient understanding of spine MRI reports using AI: a prospective single center study. Interv Pain Med. 2025;4(1). doi: 10.1016/j.inpm.2025.100550.
- 68. Kung J.E., Marshall C., Gauthier C., Gonzalez T.A., Jackson J.B. 3rd. Evaluating ChatGPT performance on the orthopaedic in-training examination. JB JS Open Access. 2023;8(3). doi: 10.2106/JBJS.OA.23.00056.
- 69. Kanjee Z., Crowe B., Rodman A. Accuracy of a generative artificial intelligence model in a complex diagnostic challenge. JAMA. 2023;330(1):78–80. doi: 10.1001/jama.2023.8288.
- 70. Open-source LLM DeepSeek on a par with proprietary models in clinical decision making. Nat Med. 2025;31(8):2496–2497. doi: 10.1038/s41591-025-03850-0.
- 71. Liu X., Liu H., Yang G., Jiang Z., Cui S., Zhang Z., et al. A generalist medical language model for disease diagnosis assistance. Nat Med. 2025;31(3):932–942. doi: 10.1038/s41591-024-03416-6.
- 72. McDuff D., Schaekermann M., Tu T., Palepu A., Wang A., Garrison J., et al. Towards accurate differential diagnosis with large language models. Nature. 2025;642(8067):451–457. doi: 10.1038/s41586-025-08869-4.
- 73. Shah R., Schwab J.H. Large language models in spine surgery: a promising technology. HSS J. 2025. doi: 10.1177/15563316251340696.
- 74. Buyuktoka R.E., Surucu M., Erekli Derinkaya P.B., Adibelli Z.H., Salbas A., Koc A.M., et al. Applying large language model for automated quality scoring of radiology requisitions using a standardized criteria. Eur Radiol. 2025. doi: 10.1007/s00330-025-11933-2.
- 75. Bhayana R., Jajodia A., Chawla T., Deng Y., Bouchard-Fortier G., Haider M., et al. Accuracy of large language model-based automatic calculation of ovarian-adnexal reporting and data system MRI scores from pelvic MRI reports. Radiology. 2025;315(1). doi: 10.1148/radiol.241554.
- 76. Ahmed W., Saturno M., Rajjoub R., Duey A.H., Zaidat B., Hoang T., et al. ChatGPT versus NASS clinical guidelines for degenerative spondylolisthesis: a comparative analysis. Eur Spine J. 2024;33(11):4182–4203. doi: 10.1007/s00586-024-08198-6.
- 77.Wan F., Wang T., Wang K., Si Y., Fondrevelle J., Du S., et al. Surgery scheduling based on large language models. Artif Intell Med. 2025;166 doi: 10.1016/j.artmed.2025.103151. [DOI] [PubMed] [Google Scholar]
- 78.Oh Y., Park S., Byun H.K., Cho Y., Lee I.J., Kim J.S., et al. LLM-driven multimodal target volume contouring in radiation oncology. Nat Commun. 2024;15(1):9186. doi: 10.1038/s41467-024-53387-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 79.Zhou J., He X., Sun L., Xu J., Chen X., Chu Y., et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat Commun. 2024;15(1):5649. doi: 10.1038/s41467-024-50043-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 80.Chen X., Zhang W., Xu P., Zhao Z., Zheng Y., Shi D., et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digit Med. 2024;7(1):111. doi: 10.1038/s41746-024-01101-z. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 81.Schonfeld E., Pant A., Shah A., Sadeghzadeh S., Pangal D., Rodrigues A., et al. Evaluating computer vision, large language, and genome-wide association models in a limited sized patient cohort for pre-operative risk stratification in adult spinal deformity surgery. J Clin Med. 2024;13(3) doi: 10.3390/jcm13030656. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 82.Williams C.Y.K., Subramanian C.R., Ali S.S., Apolinario M., Askin E., Barish P., et al. Physician- and large language model-generated hospital discharge summaries. JAMA Intern Med. 2025;185(7):818–825. doi: 10.1001/jamainternmed.2025.0821. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 83.Hartman V., Zhang X., Poddar R., McCarty M., Fortenko A., Sholle E., et al. Developing and evaluating large language model-generated emergency medicine handoff notes. JAMA Netw Open. 2024;7(12) doi: 10.1001/jamanetworkopen.2024.48723. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 84.Bhayana R., Alwahbi O., Ladak A.M., Deng Y., Basso Dias A., Elbanna K., et al. Leveraging large language models to generate clinical histories for oncologic imaging requisitions. Radiology. 2025;314(2) doi: 10.1148/radiol.242134. [DOI] [PubMed] [Google Scholar]
- 85.Stroop A., Stroop T., Zawy Alsofy S., Nakamura M., Mollmann F., Greiner C., et al. Large language models: are artificial intelligence-based chatbots a reliable source of patient information for spinal surgery? Eur Spine J. 2024;33(11):4135–4143. doi: 10.1007/s00586-023-07975-z. [DOI] [PubMed] [Google Scholar]
- 86.Ayers J.W., Poliak A., Dredze M., Leas E.C., Zhu Z., Kelley J.B., et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023;183(6):589–596. doi: 10.1001/jamainternmed.2023.1838. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 87.Chen S., Kann B.H., Foote M.B., Aerts H., Savova G.K., Mak R.H., et al. Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol. 2023;9(10):1459–1462. doi: 10.1001/jamaoncol.2023.2954. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 88.Chen S., Guevara M., Moningi S., Hoebers F., Elhalawani H., Kann B.H., et al. The effect of using a large language model to respond to patient messages. Lancet Digit Health. 2024;6(6):e379–e381. doi: 10.1016/S2589-7500(24)00060-8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 89.Altorfer F.C.S., Kelly M.J., Avrumova F., Rohatgi V., Zhu J., Bono C.M., et al. The double-edged sword of generative AI: surpassing an expert or a deceptive "false friend"? Spine J. 2025;25(8):1635–1643. doi: 10.1016/j.spinee.2025.02.010. [DOI] [PubMed] [Google Scholar]
- 90.Walters W.H., Wilder E.I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Sci Rep. 2023;13(1) doi: 10.1038/s41598-023-41032-5. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 91.Yu E., Chu X., Zhang W., Meng X., Yang Y., Ji X., et al. Large language models in medicine: applications, challenges, and future directions. Int J Med Sci. 2025;22(11):2792–2801. doi: 10.7150/ijms.111780. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 92.Majovsky M., Cerny M., Kasal M., Komarc M., Netuka D. Artificial intelligence can generate fraudulent but authentic-looking scientific medical articles: pandora's box has been opened. J Med Internet Res. 2023;25 doi: 10.2196/46924. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 93.Farquhar S., Kossen J., Kuhn L., Gal Y. Detecting hallucinations in large language models using semantic entropy. Nature. 2024;630(8017):625–630. doi: 10.1038/s41586-024-07421-0. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 94.Verspoor K. 'Fighting fire with fire' - using LLMs to combat LLM hallucinations. Nature. 2024;630(8017):569–570. doi: 10.1038/d41586-024-01641-0. [DOI] [PubMed] [Google Scholar]
- 95.Asgari E., Montana-Brown N., Dubois M., Khalil S., Balloch J., Yeung J.A., et al. A framework to assess clinical safety and hallucination rates of LLMs for medical text summarisation. npj Digit Med. 2025;8(1):274. doi: 10.1038/s41746-025-01670-7. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 96.Kim Y., Jeong H., Chen S., Li S.S., Lu M., Alhamoud K., et al. Medical hallucinations in foundation models and their impact on healthcare. ArXiv. 2025 abs/2503.05777. [Google Scholar]
- 97.Moëll B., Sand Aronsson F. Harm reduction strategies for thoughtful use of large language models in the medical domain: perspectives for patients and clinicians. J Med Internet Res. 2025;27 doi: 10.2196/75849. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 98.Li H., Moon J.T., Purkayastha S., Celi L.A., Trivedi H., Gichoya J.W. Ethics of large language models in medicine and medical research. Lancet Digit Health. 2023;5(6):e333–e335. doi: 10.1016/S2589-7500(23)00083-3. [DOI] [PubMed] [Google Scholar]
- 99.Zack T., Lehman E., Suzgun M., Rodriguez J.A., Celi L.A., Gichoya J., et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12–e22. doi: 10.1016/S2589-7500(23)00225-X. [DOI] [PubMed] [Google Scholar]
- 100.Lai H., Ge L., Sun M., Pan B., Huang J., Hou L., et al. Assessing the risk of bias in randomized clinical trials with large language models. JAMA Netw Open. 2024;7(5) doi: 10.1001/jamanetworkopen.2024.12687. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 101.Templin T., Perez M.W., Sylvia S., Leek J., Sinnott-Armstrong N. Addressing 6 challenges in generative AI for digital health: a scoping review. PLOS Digit Health. 2024;3(5) doi: 10.1371/journal.pdig.0000503. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 102.Kaul T., Damen J.A.A., Wynants L., Van Calster B., van Smeden M., Hooft L., et al. Assessing the quality of prediction models in health care using the prediction model risk of bias assessment tool (PROBAST): an evaluation of its use and practical application. J Clin Epidemiol. 2025;181 doi: 10.1016/j.jclinepi.2025.111732. [DOI] [PubMed] [Google Scholar]
- 103.Savage T., Nayak A., Gallo R., Rangan E., Chen J.H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. npj Digit Med. 2024;7(1):20. doi: 10.1038/s41746-024-01010-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 104.Su H., Sun Y., Li R., Zhang A., Yang Y., Xiao F., et al. Large language models in medical diagnostics: scoping review with bibliometric analysis. J Med Internet Res. 2025;27 doi: 10.2196/72062. [DOI] [PMC free article] [PubMed] [Google Scholar]