Abstract
Background
Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a major branch of medicine, and orthopedic diseases impose a substantial socioeconomic burden that the application of LLMs could help alleviate. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, few reviews have summarized these studies, and a systematic synthesis of the existing research is absent.
Objective
The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges.
Methods
PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of “large language model,” “generative artificial intelligence,” “ChatGPT,” and “orthopaedics,” were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment.
Results
A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs’ performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. For disease classification tasks, the accuracy of ChatGPT with GPT-4 ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6%, depending on the model and the examination used.
Conclusions
LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
Keywords: large language model, LLM, orthopedics, generative pretrained transformer, GPT, ChatGPT, digital health, clinical practice, artificial intelligence, AI, generative AI, Bard
Introduction
Background
Large language models (LLMs) typically refer to pretrained language models (PLMs) that have a large number of parameters and are trained on massive amounts of data. In recent years, LLMs have emerged as one of the most prominent areas of research in artificial intelligence (AI) innovation [1,2]. What distinguishes LLMs from smaller-scale PLMs is their remarkable emergent ability to solve complex tasks. Studies have found that LLMs, such as generative pretrained transformer (GPT)-3 with approximately 175 billion parameters, exhibit a significant leap in natural language processing (NLP) capabilities compared to PLMs with fewer parameters, such as GPT-2 with approximately 1.5 billion parameters [2,3]. Generative AI applications developed based on LLMs not only possess the ability to understand natural language but can also generate corresponding text, images, and even videos based on input sources. This mode of human-machine interaction holds great potential in medical scenarios.
LLMs have undergone significant advancements in recent years; currently, the most prevalent web-based LLM service is ChatGPT (OpenAI). Launched in November 2022, ChatGPT is a chatbot application built on fine-tuned GPT-3.5 or GPT-4 models, and it can quickly respond to questions posed by users. In addition to ChatGPT, applications include Bard (upgraded to Gemini in December 2023), based on Language Model for Dialogue Applications (Google LLC); Med-PaLM 2 (Google LLC); ERNIE Bot (Baidu); and MOSS (Fudan University). GPT-4 can approach or achieve human-level performance in cognitive tasks across various fields, including medical domains [4]. When answering 2022 United States Medical Licensing Examination questions, without further training or reinforcement, ChatGPT reached or approached the passing threshold in all 3 steps of the examination [5]. However, answering examination questions does not directly reflect the performance of LLMs in clinical applications. The value and safety of a chatbot that is already in public use are still not fully understood, making clinical research imperative. Published narrative reviews and editorials have explored the medical applications of LLM technology from 3 perspectives: clinical practice, education, and research [1,6-8]. These publications also provide a preliminary assessment of the value and safety of LLMs, offering guidance for exploring their use in specialized medical fields.
Orthopedics is a significant branch of medicine, typically encompassing disciplines such as trauma, spine surgery, joint surgery, sports medicine, hand surgery, and bone oncology. Orthopedic diseases have a broad impact on populations and pose a major global health threat. Low back pain, a common symptom in orthopedics or spine surgery, has been identified as the leading cause of disability-related productivity loss globally, according to a large-scale epidemiological study covering 195 countries and territories; in 126 of these countries, low back pain ranks first among the causes of years lived with disability [9]. In traditional health care systems, the annual medical expenditure for low back pain in the United States is estimated to exceed US $100 billion, contributing to a significant socioeconomic burden [10]. Similarly, osteoarthritis is also a critical global health issue. The global prevalence of knee osteoarthritis in adults aged >40 years is 23%, with approximately 61% of adults aged >45 years showing radiographic evidence of knee osteoarthritis [11]. Therefore, applying LLMs in orthopedics holds the potential to alleviate the current heavy socioeconomic burden.
It is worth noting that several pioneers in orthopedics have conducted studies on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies. The published reviews primarily focus on introducing and popularizing the basic concepts of LLMs in orthopedics [12,13], or they offer forward-looking perspectives by categorizing LLM applications in clinical practice, education, and research [14]. A systematic summary of existing research is absent. To the best of our knowledge, this review is the first to systematically summarize existing research findings. In contrast to prior works, we place greater emphasis on the quantitative evaluation methods and results of these studies because we believe that these methods and outcomes can help orthopedic and computer science researchers better understand the current state of LLM research and the performance of LLMs. Regarding application categorization, we consider tasks involving NLP in management as another important application area for LLMs in orthopedics. Therefore, this review adds a category for orthopedic management applications to the existing classification framework.
Objectives
The objective of this review was to comprehensively summarize the research findings on the application of LLMs in orthopedics; outline their advantages, limitations, and methodological evaluations; and explore the potential opportunities and challenges emerging in this era, thereby facilitating interdisciplinary collaboration and advancement among researchers in computer science and orthopedics. The ultimate goal is to contribute to improved efficiency and quality of orthopedic care as well as a reduction in medical costs and the associated socioeconomic burden.
Methods
Search Strategy
The protocol for this systematic review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (the checklist can be found in Multimedia Appendix 1) [15]. PubMed, Embase, and Cochrane Library databases were searched, with the language limited to English. The time frame was set from January 1, 2014, to February 22, 2024. Search terms were divided into 2 categories, with the first category including LLM-related terms and the second containing terms related to orthopedics and its subspecialties (Textbox 1). Terms within each category were connected using “OR,” while the 2 categories were connected using “AND.” The full search strategy can be found in Multimedia Appendix 2.
Categories and terms applied in the search queries.
Category 1
“large language model,” “LLM,” “generative artificial intelligence,” “generative AI,” “ChatGPT,” and “Generative Pre-Trained Transformer”
Category 2
“orthopedics,” “bone,” “musculoskeletal,” “injury,” “wound,” “trauma,” “articular,” “joint,” “sports medicine,” “hand surgery,” “spine,” “spinal,” “cervical vertebrae,” “thoracic vertebrae,” “lumbar vertebrae,” “sacrum,” “coccyx,” “spinal canal,” “vertebral body,” and “intervertebral disc”
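As an illustration of how the 2 categories above are combined, the following Python sketch joins the terms within each category with “OR” and the 2 categories with “AND.” The database-specific field tags used in the actual queries appear in Multimedia Appendix 2, so the string produced here is a simplified, hypothetical form.

```python
# Minimal sketch: build a Boolean search string from the two term categories.
# Field tags (eg, [Title/Abstract]) used in the actual database queries are omitted.
category_1 = ["large language model", "LLM", "generative artificial intelligence",
              "generative AI", "ChatGPT", "Generative Pre-Trained Transformer"]
category_2 = ["orthopedics", "bone", "musculoskeletal", "injury", "wound", "trauma",
              "articular", "joint", "sports medicine", "hand surgery", "spine", "spinal",
              "cervical vertebrae", "thoracic vertebrae", "lumbar vertebrae", "sacrum",
              "coccyx", "spinal canal", "vertebral body", "intervertebral disc"]

def or_block(terms):
    # Quote each term and connect terms within a category with OR
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Connect the two categories with AND
query = f"{or_block(category_1)} AND {or_block(category_2)}"
print(query)
```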
Study Selection
The records were downloaded from the databases and imported into EndNote (version 21.2; Clarivate) for article management. The study selection process was conducted independently by 2 investigators (CZ and SL). The inclusion and exclusion criteria are listed in Textbox 2. The results were cross-checked, and discrepancies were resolved through discussion, with the final determination made by a third investigator (YT).
Inclusion and exclusion criteria.
Inclusion criteria
- Article type: original research
- Language: articles written in English
- Content: studies that use at least 1 large language model (LLM); studies that are relevant to the field of orthopedics
Exclusion criteria
- Article type: reviews, editorials, letters, and study protocols
- Language: articles written in a language other than English
- Content: studies that do not involve LLMs; studies that use LLMs for tasks such as code generation, debugging, or text generation without any performance evaluation of the model
Quality Assessment of Studies
Quality assessment was conducted by 2 investigators (CZ and SL) independently. First, the study designs were identified. Studies that involved only posing questions to LLMs, did not recruit participants, and did not report a study design were classified as surveys. Given the diverse nature of the survey types included in the review, quality assessments were conducted only for studies that recruited participants. The revised Cochrane risk-of-bias tool for randomized trials [16] was used to assess randomized controlled trials (RCTs), and the CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance [17] was used to evaluate prospective or retrospective observational studies. The revised Cochrane risk-of-bias tool (version of August 22, 2019) is designed for assessing RCTs and contains 5 domains: bias arising from the randomization process, bias due to deviations from the intended interventions, bias due to missing outcome data, bias in measurement of the outcome, and bias in selection of the reported result. The CONSORT-AI guidance is a new reporting guideline specifically designed for clinical trials that assess interventions with an AI component. The quality assessment domains under this guidance include a statement of the AI algorithm used, details of how the AI intervention fits within the clinical pathway, inclusion and exclusion criteria for input data, a description of the approaches used to handle unavailable input data, a description of the input data acquisition process for the AI intervention, specifications of human-AI interaction in the collection of input data, the output of the AI algorithm, and explanations of how the AI intervention’s outputs contribute to health behavior changes. The results were cross-checked, and discrepancies were resolved through discussion, with the final determination made by another investigator (YT).
Data Extraction and Synthesis
The studies were categorized into 4 groups based on their application areas: clinical practice, education, research, and management. Data extraction and synthesis were conducted by 2 investigators (CZ and SL) independently. In addition to general characteristics, the composition of extracted data varied depending on the specific category. Details of the data extraction strategy can be found in Multimedia Appendix 3. In cases where there were inconsistencies in the process, a third investigator (XZ) participated in the discussion and made the final decision. For studies with high heterogeneity, we did not synthesize the parameters for model performance evaluation and instead focused on providing a descriptive analysis of the data. Microsoft Excel 2019 was used for data collection, analysis, and visualization.
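For transparency, the descriptive synthesis amounts to tabulating counts, proportions, and monthly publication trends across the extracted records; the authors used Microsoft Excel, but an equivalent sketch in Python with pandas is shown below. The file name and column names are hypothetical and stand in for the actual extraction sheet.

```python
# Hypothetical sketch of the descriptive synthesis (the authors used Microsoft Excel);
# the CSV file and column names are illustrative, not the actual extraction sheet.
import pandas as pd

studies = pd.read_csv("extracted_studies.csv")  # one row per included study

# Count and percentage of studies per application category
by_category = studies["category"].value_counts().rename("n").to_frame()
by_category["percent"] = (100 * by_category["n"] / len(studies)).round(1)
print(by_category)

# Monthly publication trend (the kind of tally visualized in Figure 2)
studies["month"] = pd.to_datetime(studies["publication_date"]).dt.to_period("M")
print(studies.groupby("month").size())
```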
Results
Characteristics of Included Studies
A total of 829 studies were identified; after removing duplicates and screening, 68 (8.2%) studies were selected for inclusion in the review. The inclusion process is shown in Figure 1.
Figure 1.

Flowchart of literature screening based on the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) statement. LLM: large language model.
The application of LLMs in orthopedics involves the fields of clinical practice, education, research, and management. Of the 68 included studies, 47 (69%) focused on clinical practice (Table 1) [18-64], 12 (18%) addressed orthopedic education (Table 2) [65-76], 8 (12%) were related to scientific research (Table 3) [77-84], and 1 (1%) pertained to the field of management (Table 3) [85]. Of the 68 studies, 55 (81%) were classified as surveys; furthermore, only 8 (12%) recruited patients, only 1 (1%) was a high-quality study (RCT), and only 1 (1%) was a prospective study. Since June 2023, research on the application of LLMs in orthopedics has increased month by month (Figure 2).
Table 1.
Characteristics of the included studies focused on clinical practice.
| Study, year | Study design | Task | LLMa tools | Main evaluation metrics for model performance and their values | Enrolled participants, n | Subjective or objective assessment of the model’s performance |
| Agharia et al [18], 2024 | Survey | Formulate clinical decisions | GPT-3.5; GPT-4; Bard | Proportion of most popular response: 68% (GPT-4); 40.2% (GPT-3.5); 45.4% (Bard) | —b | Subjective |
| Anastasio et al [19], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratio of responses in different quality grades: bottom-tier rating 4.5%; middle-tier rating 27.3%; top-tier rating 68.2% | — | Subjective |
| Baker et al [20], 2024 | RCTc | Assist with writing patient histories | GPT-4 | Mean time: 69.8 (SD 26.2) s; mean word count: 135.8 (SD 40.3); mean PDQI-9d score: 35.6 (SD 3.1); mean overall rating: 3.8 (SD 0.6); ratio of erroneous documents: 36% | 11 | Subjective |
| Christy et al [21], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratio of appropriate responses in total responses: 78%; intraclass correlation coefficient: 0.12 | — | Subjective |
| Coraci et al [22], 2023 | Cross-sectional study | Create questionnaire for assessment | GPT-3.5 | Correlation: acceptable correlation with ODIe and QBPDSf; no statistical correlation with RMDQg or NRSh | 20 | Subjective |
| Crook et al [23], 2023 | Survey | Generate answers to clinical questions | GPT-3 | DISCERN score: 58; JAMAi benchmark score: 0/4; FREj score: 34; FKGLk score: 15 | — | Subjective |
| Daher et al [24], 2023 | Prospective study | Diagnose and manage patients | GPT-3 | Accuracy of diagnosis: 93%; accuracy of management: 83% | 29 | Objective |
| Decker et al [25], 2023 | Cross-sectional study | Generate informed consent documentation | GPT-3.5 | Mean readability, accuracy, and completeness scores (surgeons vs LLMs): readability= 15.7 vs 12.9; risks=1.7 vs 1.7; benefits=1.4 vs 2.3; alternatives=1.4 vs 2.7; overall impression=1.9 vs 2.3; composite: 1.6 vs 2.2 | — | Subjective |
| Draschl et al [26], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | 5-point Likert scores, mean: completeness=3.80 (SD 0.63); misleading=4.04 (SD 0.67); errors=4.14 (SD 0.58); up-to-dateness=3.90 (SD 0.45); suitability for patients=3.69 (SD 0.64); suitability for surgeons=3.63 (SD 0.95) | — | Subjective |
| Dubin et al [27], 2023 | Survey | Generate answers to clinical questions | GPT-3 | 25% of the questions were similar when performing a Google web search and a search of ChatGPT for all search terms; 75% of the questions were answered by government websites; 55% of the answers were different between Google web search and ChatGPT in terms of numerical questions | — | Subjective |
| Duey et al [28], 2023 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | Accuracy: 33% (GPT-3.5); 92% (GPT-4) | — | Subjective |
| Fabijan et al [29], 2023 | Cross-sectional study | Classify cases of single-curve scoliosis | GPT-4; Microsoft Bing with GPTl; Scholar AI Premium | GPT-4 and Scholar AI Premium excelled in classifying single-curve scoliosis with perfect sensitivity (100%) and specificity (100%) | 56 | Objective |
| Fahy et al [30], 2024 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | GPT-3.5 vs GPT-4 mean DISCERN score: 55.4 vs 62.09; mean reading grade level score: 18.08 vs 17.90 | — | Subjective |
| Gianola et al [31], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Internal consistency: 49%; accuracy: 33% | — | Subjective |
| Hurley et al [32], 2024 | Survey | Generate answers to clinical questions | ChatGPT | DISCERN score: 60; JAMA benchmark score: 0; FRE score: 26.2; FKGL score: considered to be that of a college graduate | — | Subjective |
| Johns et al [33], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Satisfaction rate: 60% | — | Subjective |
| Johns et al [34], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | DISCERN score: 41; FKGL score: 13.4; satisfaction rate: 40% | — | Subjective |
| Kaarre et al [35], 2023 | Survey | Generate answers to clinical questions | GPT-4 | Average correctness of responses for patients and physicians: 1.69 and 1.66, respectively (on a scale ranging from 0=incorrect, 1=partially correct, and 2=correct) | — | Subjective |
| Kasthuri et al [36], 2024 | Survey | Generate answers to clinical questions | Microsoft Bing with GPT-4 | Mean completeness score: 2.03; mean accuracy score: 4.49 | — | Subjective |
| Kienzle et al [37], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Mean DISCERN score in overall quality: 3.675 | — | Subjective |
| Kirchner et al [38], 2023 | Survey | Rewrite patient education materials | GPT-3.5 | Mean FKGL score in patient education materials related to herniated lumbar disk, scoliosis, stenosis, TKAm, and THAn: before rewrite=9.5, 12.6, 10.9, 12.0, and 6.3, respectively; after rewrite=5.0, 5.6, 6.9, 11.6, and 6.1, respectively | — | Subjective |
| Kuroiwa et al [39], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratios of correct answers: 25/25, 1/25, 24/25, 16/25, and 17/25 for carpal tunnel syndrome, cervical myelopathy, lumbar spinal stenosis, knee osteoarthritis, and hip osteoarthritis, respectively | — | Objective |
| Li et al [40], 2023 | Survey | Generate answers to clinical questions | GPT-4 | Mean accuracy score (out of 5): 4.3; mean completeness score (out of 3): 2.8 | — | Subjective |
| Li et al [41], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | 1 response was excellent, requiring no clarification; 4 responses were satisfactory, requiring minimal clarification; 3 responses were satisfactory, requiring moderate clarification; 2 responses were unsatisfactory | — | Subjective |
| Lower et al [42], 2023 | Survey | Deliver safe and coherent medical advice | GPT-4 | Mean Likert scale score: 3.2 | — | Subjective |
| Magruder et al [43], 2024 | Survey | Generate answers to clinical questions | ChatGPT | Answer grades (from 1 to 5), mean: relevance=4.43 (SD 0.77); clarity=4.22 (SD 0.86); accuracy=4.10 (SD 0.90); evidence based=3.92 (SD 1.01); completeness=3.91 (SD 0.88); consistency=3.54 (SD 1.10) | — | Subjective |
| Mika et al [44], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | 2 responses were excellent, requiring no clarification; 4 responses were satisfactory, requiring minimal clarification; 3 responses were satisfactory, requiring moderate clarification; 1 response was unsatisfactory | — | Subjective |
| Pagano et al [45], 2023 | Retrospective observational study | Formulate diagnosis and potential treatment suggestions | GPT-4 | Diagnostic accuracy: 100% for the total cases; concordance in therapeutic recommendations: 83% for the total cases | 100 | Objective |
| Mejia et al [46], 2024 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | Accuracy: 52% (GPT-3.5); 59% (GPT-4); overconclusiveness: 48% (GPT-3.5); 45% (GPT-4) | — | Subjective |
| Russe et al [47], 2023 | Retrospective observational study | Provide accurate fracture classification based on radiology reports | FraCChat; GPT-3.5-Turbo; GPT-4 | Accuracy: GPT 3.5=3%; GPT 4=2%; FraCChat 3.5=48%; FraCChat 4=71% | — | Objective |
| Schonfeld et al [48], 2024 | Retrospective cohort study | Predict outcome of adult spinal deformities | Gatortron | AUCo scores: 0.565 (pulmonary complication); 0.559 (neurological complication); 0.557 (sepsis); 0.508 (delirium); F1-scores: 0.545 (pulmonary complication); 0.250 (neurological complication); 0.383 (sepsis); 0.156 (delirium) | 209 | Objective |
| Seth et al [49], 2023 | Survey | Generate answers to clinical questions | ChatGPT | Mean Likert scale score: 3.1 | — | Subjective |
| Shrestha et al [50], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Accuracy: 44%-65% for different guideline variations | — | Subjective |
| Sosa et al [51], 2024 | Survey | Generate answers to clinical questions | GPT-4; Bard; Bing AI | Ratios of appropriate answers to questions related to bone physiology: 83.3% (GPT-4); 23.3% (Bing AI); 16.7% (Bard) | — | Subjective |
| Stroop et al [52], 2023 | Survey | Generate answers to clinical questions | ChatGPT | Ratio of medically complete correct answers: 52%; ratio of medically complete and comprehensive answers: 55% | — | Subjective |
| Suthar et al [53], 2023 | Retrospective observational study | Generate diagnosis | GPT-4 | Accuracy rate in spine cases: 55% | — | Objective |
| Taylor et al [54], 2024 | Survey | Generate answers to clinical questions | ChatGPT | Ratio of surgeons who reported that the questions had been appropriately answered: 91% | — | Subjective |
| Temel et al [55], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Ensuring Quality Information for Patients score: mean 43.02 (SD 6.37); FRE score: mean 26.24 (SD 13.81); FKGL score: mean 14.84 (SD 1.79) | — | Subjective |
| Tharakan et al [56], 2024 | Survey | Generate answers to clinical questions | GPT-3 | Answers provided by ChatGPT cited more academic references than those provided by a Google search (80% vs 50%) | — | Subjective |
| Truhn et al [57], 2023 | Retrospective observational study | Prioritize treatment recommendations | GPT-4 | The overall quality of the treatment recommendations was rated as good or better | 20 | Subjective |
| Warren et al [58], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Answers to fact, policy, and value questions (mean scores): DISCERN=51, 53, and 55, respectively; JAMA benchmark=0, 0, and 0, respectively; FRE=48.3, 42.0, and 38.4, respectively; FKGL=10.3, 10.9, and 11.6, respectively | — | Subjective |
| Wilhelm et al [59], 2023 | Survey | Generate treatment recommendations | Claude-instant-v1.0; GPT 3.5-Turbo; Command-xlarge-nightly; Bloomz | Mean DISCERN quality scores: 3.4 (Claude-instant-v1.0); 2.8 (GPT 3.5-Turbo); 2.2 (Command-xlarge-nightly); 1.1 (Bloomz) | — | Subjective |
| Wright et al [60], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Mean accuracy score: 4.26; mean comprehensiveness score: 3.79 | — | Subjective |
| Yang et al [61], 2024 | Retrospective observational study | Generate diagnosis | GPT-3.5 | Accuracy: 0.87; sensitivity: 0.99; specificity: 0.73 | 1366 | Objective |
| Yang et al [62], 2024 | Survey | Generate answers to clinical questions | ChatGPT; Bard | Concordance with the AAOSp Clinical Practice Guidelines: 80% (ChatGPT); 60% (Bard) | — | Subjective |
| Yapar et al [63], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Accuracy: 79.8%; applicability: 75.2%; comprehensiveness: 70.6%; communication clarity: 75.6% | — | Subjective |
| Zhou et al [64], 2024 | Case study | Generate answers to clinical questions related to the case | GPT-3.5 | No statistical results | — | Subjective |
aLLM: large language model.
bNot applicable.
cRCT: randomized controlled trial.
dPDQI-9: Physician Documentation Quality Instrument-9.
eODI: Oswestry Disability Index.
fQBPDS: Quebec Back Pain Disability Scale.
gRMDQ: Roland-Morris Disability Questionnaire.
hNRS: numerical rating scale.
iJAMA: Journal of the American Medical Association.
jFRE: Flesch reading ease.
kFKGL: Flesch-Kincaid grade level.
lGPT: generative pretrained transformer.
mTKA: total knee arthroplasty.
nTHA: total hip arthroplasty.
oAUC: area under the curve.
pAAOS: American Academy of Orthopaedic Surgeons.
Table 2.
Characteristics of the included studies focused on orthopedic education.
| Study, year | Study design | Task | LLMa tools | Source | Questions, n | Scores or accuracy (%) |
| Cuthbert and Simpson [65], 2023 | Survey | Examination | GPT-3.5 | UKITEb | 134 | 35.8 |
| Ghanem et al [66], 2023 | Survey | Examination | GPT-4 | OITEc | 201 | 61.2 |
| Han et al [67], 2024 | Survey | Examination | GPT-3.5 | ASSHd | 1583 | 36.2 |
| Hofmann et al [68], 2023 | Survey | Examination | GPT-3.5; GPT-4 | OITE | 410 (GPT-3.5); 396 (GPT-4) | GPT-3.5: 46.3; GPT-4: 63.4 |
| Jain et al [69], 2024 | Survey | Examination | GPT-3.5 | OITE | 360 | 52.8 |
| Kung et al [70], 2023 | Survey | Examination | GPT-3.5; GPT-4 | OITE | 360 | GPT-3.5: 54.3; GPT-4: 73.6 |
| Lum [71], 2023 | Survey | Examination | GPT-3.5 | OITE | 207 | 47 |
| Massey et al [72], 2023 | Survey | Examination | GPT-3.5; GPT-4 | ResStudy Orthopaedic Examination Question Bank | 180 | GPT-3.5: 29.4; GPT-4: 47.2 |
| Ozdag et al [73], 2023 | Survey | Examination | GPT-3.5 | OITE | 102 | 45 |
| Rizzo et al [74], 2024 | Survey | Examination | GPT-3.5-Turbo; GPT-4 | OITE | 2022: 207; 2021: 213; 2020: 215 | 2022: GPT-4=67.63; GPT 3.5-Turbo=50.24; 2021: GPT-4=58.69; GPT 3.5-Turbo=47.42; 2020: GPT-4=59.53; GPT 3.5-Turbo=46.51 |
| Saad et al [75], 2023 | Survey | Examination | GPT-4 | Mock FRCS Orthe Part A | 240 | 67.5 |
| Traoré et al [76], 2023 | Survey | Examination | GPT-3.5 | EBHSf diploma examination | 18 | 0 |
aLLM: large language model.
bUKITE: United Kingdom and Ireland In-Training Examination.
cOITE: Orthopaedic Surgery In-Training Examination.
dASSH: American Society for Surgery of the Hand.
eFRCS Orth: Orthopaedic Fellow of the Royal College of Surgeons.
fEBHS: European Board of Hand Surgery.
Table 3.
Characteristics of the included studies focused on orthopedic research and management.
| Study, year | Study design | Task | LLMa tools | Input | Key findings |
| Gill et al [77], 2024 | Survey | Improve readability | GPT-3.5 | IRBb-approved orthopedic surgery research consent forms | ChatGPT can significantly improve the readability of orthopedic clinical research consent forms; 63.2% of the post-ChatGPT consent forms had at least 1 error |
| Hakam et al [78], 2024 | Survey | AIc-generated scientific literature | GPT-3.4; You.com | 5 abstracts about meniscal injuries | The AI-generated texts could not be successfully identified |
| Kacena et al [79], 2024 | Survey | Write scientific review articles | GPT-4 | Prompts | AI reduced the time for writing but had significant inaccuracies |
| Lawrence et al [80], 2024 | Survey | Generate abstract | GPT-3 | A standard set of input commands | Interrater reliability for abstract quality scores was moderate |
| Lotz et al [81], 2023 | Survey | Assist new research hypothesis exploration | Toolkit based on GPT-3.5 | Prior studies | LLMs may be useful for analyzing and distinguishing publications, as well as determining the degree to which the literature supports or contradicts emergent hypotheses |
| Methnani et al [82], 2023 | Survey | Calculate sample size | GPT-3.5 | All necessary data, such as mean, percentage, SD, normal deviation, and study design | In 1 (25%) of the 4 trials, the sample size was correctly calculated |
| Nazzal et al [83], 2024 | Survey | Write a review article | GPT-4 | Prompts | The AI-only paper was the most inaccurate, with inappropriate reference use, and the AI-assisted paper had the greatest incidence of plagiarism |
| Sanii et al [84], 2023 | Survey | Perform an orthopedic surgery literature review | GPT-3; Perplexity | Standard prompts | The current iteration of ChatGPT cannot perform a reliable literature review, and Perplexity is only able to perform a limited review of the medical literature |
| Zaidat et al [85], 2023 | Retrospective cohort study | Predict CPTd codes | GPT-4 | Surgical operative notes | The AUROCe score was 0.87, and the AUPRCf score was 0.67 |
aLLM: large language model.
bIRB: institutional review board.
cAI: artificial intelligence.
dCPT: current procedural terminology.
eAUROC: area under the receiver operating characteristic curve.
fAUPRC: area under the precision-recall curve.
Figure 2.

Trends in the number of publications.
Quality Assessment of Studies
We conducted quality assessments for the 8 studies that recruited participants (Multimedia Appendices 4 and 5). The RCT was evaluated using the revised Cochrane risk-of-bias tool for randomized trials and was found to have a low risk of bias in all 5 domains—bias arising from the randomization process, bias due to deviations from the intended interventions, bias due to missing outcome data, bias in measurement of the outcome, and bias in selection of the reported result—indicating high study quality. The remaining studies (7/8, 88%; observational studies) were evaluated using the CONSORT-AI guidance and were found to be of good quality.
Distribution of LLM Tools
Among all the LLM tools applied, ChatGPT was the most commonly mentioned. Other LLM tools included Bard, Microsoft Bing, Scholar AI, Perplexity, Gatortron, Claude, Command-xlarge-nightly, and Bloomz, as well as software developed by researchers based on commonly used LLM kernels. Currently, ChatGPT is built on 2 main model families: GPT-3 (including GPT-3.5 and GPT-3.5-Turbo) and GPT-4. Most of the studies (48/68, 71%) specified the version of the tool used. The majority of these studies (25/48, 52%) used only GPT-3 or GPT-3.5, likely reflecting publication lag, as they were conducted before the release of GPT-4. Given that GPT-4 outperforms GPT-3 in most tasks, future research should primarily use GPT-4.
Model Performance Evaluation
As shown in Tables 1-3, there is considerable heterogeneity in the definition, measurement, and evaluation of LLM performance across the included studies. Currently, there is no unified research paradigm for the application of LLMs in medicine. Therefore, this review focused on different model performance evaluation metrics according to the various application categories. For clinical applications of LLMs, we were particularly concerned with the accuracy of model reasoning and the readability of the generated text; unfortunately, the majority of the studies (39/47, 83%) relied on subjective assessments of the model’s performance. In studies with objective evaluations, the heterogeneity in the subtasks performed by the LLMs (including diagnosis, classification, clinical case analysis, and case text generation) prevented us from pooling the data. For diagnostic tasks alone, the accuracy ranged from 55% to 93% [24,53]. When performing disease classification tasks, GPT-4’s accuracy ranged from 2% to 100% [29,47] (Table 1). In studies on readability, the most commonly used metrics are the Flesch reading ease (FRE) and Flesch-Kincaid grade level (FKGL) scores. The FKGL metric maps reading difficulty to years of education, providing a straightforward reflection of the readability of generated materials. In these studies, the generated texts had FKGL scores ranging from a minimum of 5.0 [38], indicating primary school reading difficulty, to a college-graduate reading level [32], showing significant variability. This variability is likely due to differences in the research questions, methodologies, prompts, and evaluators. In the educational applications (eg, answering questions from examination papers), the most frequently used test source (7/12, 58%) was the Orthopaedic Surgery In-Training Examination (OITE). The test scores are widely recognized as performance evaluation metrics for the models, with final OITE scores ranging from 45% to 73.6%, depending on the model and question selection (Table 2). For applications of LLMs in research and management, the flexible and varied nature of the tasks led to substantial differences in performance measurement and evaluation. Therefore, we collected the model inputs and major findings for a descriptive presentation (Table 3).
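For reference, the FRE and FKGL metrics mentioned above are computed from simple word, sentence, and syllable counts. The sketch below uses the standard published formulas; the syllable count is the main implementation-dependent step and the example counts are purely illustrative.

```python
# Standard Flesch-Kincaid grade level (FKGL) and Flesch reading ease (FRE) formulas.
# The syllable count is the main implementation-dependent step and is passed in here.
def fkgl(total_words: int, total_sentences: int, total_syllables: int) -> float:
    # Higher values correspond to more years of education needed to read the text
    return 0.39 * (total_words / total_sentences) + 11.8 * (total_syllables / total_words) - 15.59

def fre(total_words: int, total_sentences: int, total_syllables: int) -> float:
    # Higher values correspond to easier-to-read text (roughly 0-100)
    return 206.835 - 1.015 * (total_words / total_sentences) - 84.6 * (total_syllables / total_words)

# Example with illustrative counts for a short patient education paragraph
print(round(fkgl(total_words=120, total_sentences=8, total_syllables=190), 1))  # ~8.9
print(round(fre(total_words=120, total_sentences=8, total_syllables=190), 1))   # ~57.7
```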
Discussion
Overview
Despite the relatively short time since their introduction and the absence of rigorous and comprehensive performance evaluation in highly specialized fields such as orthopedics, it is an undeniable fact that LLMs have already been made accessible to the public. Given the increasing acceptance and widespread adoption of LLMs, it is imperative for orthopedic surgeons to possess a comprehensive understanding of their operational mechanisms and limitations. Users should also delineate the safe application boundaries while harnessing the benefits offered by LLMs, all while mitigating potential risks in their daily clinical practice. This section presents a comprehensive overview of application examples and model performance across diverse fields, while providing strategic approaches to address LLMs based on our findings. In addition, in this section, we critically evaluate research methodologies and offer potential recommendations for future investigations.
Application of LLMs in the Field of Orthopedic Education
LLMs can not only provide answers but also offer explanations and even engage in further discussions on a given topic, demonstrating potential value in orthopedic education. Several studies have evaluated the performance of ChatGPT in answering questions related to orthopedics and further discussed its value in the field of orthopedic education. The sources of questions include the OITE [66,68-71,73,74], the ResStudy Orthopaedic Examination Question Bank [72], the Fellowship of the Royal College of Surgeons (Trauma and Orthopaedic Surgery) examination [65,75], and hand surgery examinations in the United States and Europe [67,76]. Accuracy (scores) and whether the answers meet the standard are important evaluation criteria for LLM performance. Another educational indicator is the correctness and reasonableness of answer explanations. Studies evaluating OITE questions usually convert accuracy into postgraduate year (PGY) levels for evaluation. Due to differences in software applications and question selection, different studies have reported varying performances of ChatGPT. ChatGPT with GPT-4 performed at an average level ranging from PGY-2 to PGY-5 [66,68,70,74], while ChatGPT with GPT-3.5 performed slightly better than PGY-1 or below the average PGY-1 level [68-71,73,74]. For correct answers, ChatGPT can provide explanations and reasoning processes consistent with those of examiners, which helps students understand the questions and general orthopedic principles [66,69]. However, ChatGPT failed to pass the Fellowship of the Royal College of Surgeons (Trauma and Orthopaedic Surgery) examination and hand surgery examinations in the United States and Europe [65,67,75,76]. In addition, as a language model, ChatGPT cannot analyze medical images correctly [70], limiting its role in orthopedic imaging education.
Although LLMs currently cannot fully replace orthopedic instructors, they can still serve as a valuable supplementary tool for learning. Integrating their responses with authoritative resources for verification and using appropriate prompts can optimize their capacity to offer logical explanations and foster critical thinking.
Application of LLMs in Clinical Practice
Medical Consultation and Physician-Patient Communication
One challenge faced by orthopedic physicians is that, unlike in the case of other clinical interventions, LLMs have already been integrated as medical consultation tools in the diagnosis and treatment process of numerous diseases without sufficient clinical evidence and regulatory review from authorities such as the US Food and Drug Administration. LLMs can be considered an alternative approach for patients who have sustained injuries or experience discomfort before seeking guidance from primary care physicians or specialists. When confronted with medical issues, individuals who rely heavily on the internet for problem-solving in their personal and professional lives often exhibit a tendency to seek treatment decisions on the web [30]. Compared to traditional search engines or Wikipedia, LLMs could potentially become a significant source of medical consultation information, especially in cases of nonacute diseases such as lower back pain or joint pain. Meanwhile, many physicians also hope that LLMs can help alleviate their burden of simple medical consultations and repetitive paperwork related to physician-patient communication (such as preoperative consent forms), which is considered 1 of the important factors contributing to physician burnout [86]. Although LLMs can provide concise, clarified, or simplified responses related to the given topic and deliver high-quality and empathetic answers [66,87], given their imperfect performance in addressing questions related to orthopedics [65-76], caution should be exercised regarding their reliability in orthopedic consultation scenarios.
Studies have evaluated the performance of LLMs in answering questions related to hand surgery [23], spinal cord injuries [55], joint and sports medicine [19,27,35,41,56,58], and preoperative physician-patient communication for lumbar disk herniation [52] and hip replacement surgery [44]. In these studies, the evaluation criteria of interest typically encompass the model’s answer accuracy, readability, completeness, and information sources. Evaluation methods often encompass scale assessments or subjective ratings conducted by researchers. The DISCERN score is commonly used to evaluate answer quality [19,23,58,88], while FKGL and Flesch Reading Ease scores are commonly used to measure readability [23,55,58]. The accuracy of LLMs’ responses is closely correlated with the specific topic. Questions in the field of joint and sports medicine often receive high-quality responses, while there are serious issues with the quality of answers regarding spinal cord injuries. There are also significant differences in the evaluation of the readability or comprehensibility of LLMs’ answers, with some researchers considering them to be easily understood [44,52], while studies using Flesch-related scales suggest that LLMs’ answers require a reading level of at least 10 years of education or even university level for full comprehension [23,55,58]. The underlying factors contributing to this phenomenon can be attributed to variations in question topics, prompts, and evaluation methodologies used for answer assessment. Consequently, orthopedic surgeons should exercise caution when interpreting the findings of these studies.
Although LLMs can offer more scholarly health information in comparison to search engines [27,56], they still cannot replace orthopedic physicians in medical consultation and physician-patient communication. Using LLMs as a guiding tool and maintaining communication with physicians during further diagnosis and treatment decisions may be a safer and more effective strategy.
Clinical Workflow
The performance of LLMs in orthopedic examinations suggests that they cannot handle complex tasks independently, but they hold potential to serve as valuable assistants for orthopedic physicians. One possible application is using LLMs to automate simple, repetitive tasks such as writing medical records for common orthopedic diseases [20]. In the context of complex disease management tasks, LLMs can possess a more extensive and specialized knowledge base than less experienced newly graduated physicians and assist them in various aspects of disease management. Some researchers have tested the performance of LLMs using specific clinical questions or guidelines [21,26,28,50], while others have directly inputted clinical case data into the model, allowing it to summarize and provide corresponding diagnostic or treatment decisions autonomously [18,24,45,57,64]. Currently, there is no further research on introducing LLMs into orthopedic operations, likely because of the limited availability of intelligent terminals and digital scenarios that may combine operative procedures with LLMs. Potential docking scenarios for the LLM model could include intelligent surgical applications such as mixed reality operating rooms [89,90] and autonomous laminectomy robots [91,92].
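As a concrete illustration of the workflow several included studies describe (inputting clinical case data and asking the model for diagnostic or treatment suggestions), the sketch below uses the OpenAI Python client. The model name, prompt, and case vignette are illustrative assumptions, and any real use would require de-identified data and specialist oversight.

```python
# Minimal sketch of prompting a chat model with a de-identified case vignette,
# as several included studies did; model name, prompt, and case text are illustrative.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

case_vignette = (
    "62-year-old with 3 months of neurogenic claudication relieved by sitting; "
    "MRI shows L4-5 central canal stenosis. No red-flag symptoms."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; the included studies used GPT-3.5/GPT-4 era models
    messages=[
        {"role": "system", "content": "You are assisting an orthopedic surgeon. "
                                      "List a differential diagnosis and initial management options."},
        {"role": "user", "content": case_vignette},
    ],
)
print(response.choices[0].message.content)  # output still requires specialist review
```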
In the context of clinical practice, apart from the fundamental requirement of accurate response, time consumption and work efficiency also serve as crucial reference indicators for evaluating LLMs’ performance. Despite variations in the assessment of model accuracy across the included studies, potentially attributed to differences in research objectives, prompt design, evaluation criteria, and assessment tools, no study has presented evidence indicating that LLMs can independently perform clinical work. Therefore, the current models still require rigorous supervision during their use. An RCT study evaluating the performance of ChatGPT in assisting with orthopedic clinical documentation found that there was no significant efficiency advantage in using ChatGPT: the time taken to complete medical history writing was not superior to voice input, and instances of fabricated content were observed within the ChatGPT-generated medical histories [20].
Although LLMs currently have limitations, they remain valuable tools for orthopedic surgeons in their daily practice. It is important to treat the responses provided by LLMs with caution and to seek additional evidence and explanations from the model when faced with unclear answers. By incorporating evidence-based medicine tools, we can ultimately achieve superior clinical diagnoses and treatment plans, thereby elevating the quality of care delivered by physicians.
Application of LLMs in the Field of Research
Research is generally considered a creative endeavor, and introducing LLMs into the field of research may offer more flexibility. Currently, there are limited attempts to use LLMs in orthopedic research. A study found that lowering the reading threshold of professional texts through LLMs can assist in improving the readability of informed consent forms for orthopedic clinical research, but the forms did not meet the recommended sixth-grade reading level set by the American Medical Association [77]. In addition, the literature summarization and generation capabilities of LLMs can contribute to independent or assisted writing of literature reviews in the orthopedic field [79,83]. On the other side of the coin, concerns about integrity arise when scholars find that the model’s output can be deceptively realistic. The abstracts generated by LLMs for studies on meniscal injuries and joint replacement were indistinguishable from those written by human researchers [78,80]. However, web-based LLMs do not perform well in tasks such as literature review or sample size estimation in sports medicine research [82,84]. Possible reasons may include the potential limitations of LLMs in meeting logical reasoning requirements and the inappropriate use of prompts. For more complex tasks, an optimization approach could involve developing task-specific toolkits based on the fundamental architecture of LLMs. The feasibility of this approach has been validated in interdisciplinary research on the management of back pain [81].
Application of LLMs in Management
Trained NLP models can convert natural language into structured data and have demonstrated superior performance in tasks involving current procedural terminology codes for identifying spinal surgery records [93]. However, ChatGPT, despite its much larger parameter count, performs worse than these NLP models in the task of identifying spinal surgery current procedural terminology codes [85]. One possible reason is that traditional NLP models have been trained on more targeted datasets, whereas researchers cannot fine-tune the backend model of ChatGPT using these data. Although the current model’s performance limitations hinder its further application in this field, advancements in fine-tuning techniques may enable LLMs to assume a more influential role in orthopedic management in the future.
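To make the reported metrics concrete, the sketch below shows how the AUROC and AUPRC cited for the code-prediction study could be computed for a binary code-assignment task using scikit-learn; the labels and predicted probabilities are synthetic and purely illustrative, not study data.

```python
# Illustrative computation of AUROC and AUPRC for a binary CPT code assignment task;
# the labels and predicted probabilities below are synthetic, not study data.
from sklearn.metrics import roc_auc_score, average_precision_score

# 1 = operative note truly maps to the target CPT code, 0 = it does not
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
# model-assigned probability that the note maps to the target code
y_score = [0.91, 0.20, 0.75, 0.62, 0.35, 0.10, 0.88, 0.45, 0.05, 0.58]

print("AUROC:", round(roc_auc_score(y_true, y_score), 3))
print("AUPRC:", round(average_precision_score(y_true, y_score), 3))
```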
Current Advantages and Limitations of LLMs in Orthopedic Applications
Overview
In contrast to conventional pretrained machine learning models, LLMs exhibit the advantage of versatility by accurately addressing problems across various domains without necessitating additional training on specific samples. In the field of orthopedics, another advantage of LLMs is their user-friendly and convenient nature. Users do not need to go through the long process of waiting and referral from general practitioners to specialists. By simply accessing apps equipped with LLMs, users can inquire about diverse subspecialties in orthopedics at any time and from anywhere, receiving answers promptly at a minimal cost or even free of charge. This service surpasses the capabilities of current health care systems and is unlikely to be replicated in the foreseeable future.
However, as mentioned previously, these advantages are based on unverified answers and unpredictable risks. The answers provided by LLMs for questions related to orthopedics are less robust than those for everyday common knowledge and have significant limitations in terms of accuracy, readability, reliability, and timeliness, as detailed in the following subsections.
Accuracy
Almost all studies (66/68, 97%) found errors in LLMs’ responses, with more noticeable inaccuracies in specialized areas such as the hip and knee joints and hand surgery [62,67,76]. Some answers even contradicted fundamental orthopedic knowledge [52]. Therefore, some researchers argue that current expectations for guidance provided by AI platforms should be tempered by both physicians and patients [62]. Possible reasons include the limited availability of publicly accessible orthopedic data for training, especially for specialized diseases, as well as privacy concerns that restrict public access to large amounts of data. In the future, besides waiting for more powerful next-generation LLMs, fine-tuning existing LLMs on orthopedic case data may be a potential solution to improve accuracy.
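As one hedged illustration of the fine-tuning approach suggested above, supervised fine-tuning services for chat models generally expect example conversations in a JSONL file, as sketched below. The file name, clinical content, and exact schema are assumptions that would need to be checked against the chosen provider’s documentation, and training examples would require de-identification and expert verification.

```python
# Hedged sketch: assembling a supervised fine-tuning file of orthopedic Q&A pairs in the
# chat-message JSONL format used by several providers. Content and schema are illustrative.
import json

examples = [
    {
        "messages": [
            {"role": "system", "content": "You answer orthopedic clinical questions accurately and concisely."},
            {"role": "user", "content": "What imaging is first line for a suspected scaphoid fracture with normal radiographs?"},
            {"role": "assistant", "content": "MRI, or repeat radiographs at 10-14 days if MRI is unavailable."},
        ]
    },
    # ...additional de-identified, expert-verified examples would be appended here
]

with open("orthopedic_finetune.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```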
Readability
Some of the included studies (3/68, 4%) suggest that the content generated by LLMs is not satisfactory in terms of readability for the general population [23,55,58]. The potential reasons may include not only the limited amount of training data but also the quality of the training data. By incorporating more popular science content and common clinical responses, it may be possible to address the issue of readability through fine-tuning the model.
Reliability
Different ways of asking the same question may yield completely different answers [21]. This instability, particularly in response to specific prompts, not only affects users’ experience and trust but also greatly interferes with researchers’ homogenized evaluations. It is imperative to establish standardized questioning processes and prompt criteria.
Timeliness
Training LLMs from scratch is both costly and time consuming, leading to significant retraining expenses. However, unlike everyday common knowledge, orthopedics is constantly evolving with new diagnostic and treatment approaches as well as surgical techniques. Therefore, outdated information becomes an important risk factor leading to inaccurate answers, necessitating caution in this context.
Methodological Limitations of the Selected Studies
Although there are 47 studies related to clinical issues, only 8 (17%) recruited patients [20,22,24,29,45,48,57,61]. Many of the studies (39/47, 83%) focus only on investigation and evaluation, lacking rigorous clinical research methods, such as RCTs. Furthermore, there is a lack of research end points directly linked to patient outcomes, such as cure rates or improvements in quality of life, making it difficult to find direct evidence of prognosis. Most of the studies (46/68, 68%) rely on subjective methodologies, such as expert ratings, for model evaluation and lack objective criteria and approaches for assessment, which undermines the reliability of the research results. In addition, the absence of standardized questioning paradigms has led to instability in LLM responses, posing challenges for reproducibility and limiting the reliability and clinical significance of the study findings.
Limitations of This Review
This systematic review has several limitations. First, only English-language articles were included, which may have led to the exclusion of relevant studies published in other languages. Second, due to significant heterogeneity in study designs, model tasks, and evaluation parameters among the included studies, we did not perform a comprehensive synthesis of most of the data, nor did we conduct a meta-analysis. Third, our search was restricted to commonly used medical research databases such as PubMed, Embase, and Cochrane Library, potentially overlooking relevant studies from other sources, including conference papers and gray literature. Finally, given the limited availability of rigorous clinical studies, we included a considerable number of subjective surveys. Although our objective was to provide a broad overview of LLM-related information, this may have introduced bias into the findings. These limitations are expected to be addressed as more standardized, high-quality clinical studies become available in future research.
Conclusions
Due to the current limitations of LLMs, they cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. In addition, developing task-specific downstream tools based on LLMs is also a potential solution to improve model performance for further use.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (82272577). The funding organization had no involvement in the design of the study, data collection, data analysis, data interpretation, writing of the report, or the decision to publish. Its role was strictly limited to providing financial support. We used the generative AI tool ChatGPT with GPT-4 [94] for language refinement purposes. The AI tool was only used to perform grammar checks and assist with enhancing the clarity and fluency of the writing. All intellectual contributions, including research design, data collection, analysis, and interpretation, were made by the authors. The final manuscript has been reviewed and approved by the authors, who take full responsibility for its content.
Abbreviations
- AI
artificial intelligence
- CONSORT-AI
Consolidated Standards of Reporting Trials–Artificial Intelligence
- FKGL
Flesch-Kincaid grade level
- GPT
generative pretrained transformer
- LLM
large language model
- NLP
natural language processing
- OITE
Orthopaedic Surgery In-Training Examination
- PGY
postgraduate year
- PLM
pretrained language model
- PRISMA
Preferred Reporting Items for Systematic Reviews and Meta-Analyses
- RCT
randomized controlled trial
Multimedia Appendix 1: PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) checklist.
Multimedia Appendix 2: Search strategy.
Multimedia Appendix 3: Data extraction strategy.
Multimedia Appendix 4: Revised Cochrane risk-of-bias tool for randomized trials template for assessment completion.
Multimedia Appendix 5: Quality assessment of studies of large language models based on CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance.
Data Availability
All data generated or analyzed during this study are included in this published paper and its supplementary information files.
Footnotes
Authors' Contributions: NX, WL, CZ, and SL made contributions to conception and design. XZ, SZ, and YT made contributions to the acquisition, analysis, and interpretation of data. CZ, SL, and XZ drafted the manuscript, and NX, WL, SZ, YT, and SW revised it critically for important intellectual content. All authors approved the version to be published.
Conflicts of Interest: None declared.
References
- 1. Thirunavukarasu AJ, Ting DS, Elangovan K, Gutierrez L, Tan TF, Ting DS. Large language models in medicine. Nat Med. 2023 Aug 17;29(8):1930-40. doi: 10.1038/s41591-023-02448-8
- 2. Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, Min Y, Zhang B, Zhang J, Dong Z, Du Y, Yang C, Chen Y, Chen Z, Jiang J, Ren R, Li Y, Tang X, Liu Z, Liu P, Nie JY, Wen JR. A survey of large language models. arXiv. Preprint posted online on March 31, 2023. https://arxiv.org/abs/2303.18223
- 3. Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, Yogatama D, Bosma M, Zhou D, Metzler D, Chi EH, Hashimoto T, Vinyals O, Liang P, Dean J, Fedus W. Emergent abilities of large language models. arXiv. Preprint posted online on June 15, 2022. https://arxiv.org/abs/2206.07682
- 4. Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, Almeida D, Altenschmidt J, Altman S, Anadkat S, Avila R, Babuschkin I, Balaji S, Balcom V, Baltescu P, Bao H, Bavarian M, Belgum J, Bello I, Berdine J, Bernadett-Shapiro G, Berner C, Bogdonoff L, Boiko O, Boyd M, Brakman AL, Brockman G, Brooks T, Brundage M, Button K, Cai T, Campbell R, Cann A, Carey B, Carlson C, Carmichael R, Chan B, Chang C, Chantzis F, Chen D, Chen S, Degry R, Deutsch N, Deville D, Dhar A, Dohan D, Dowling S, Dunning S, Ecoffet A, Eleti A, Eloundou T, Farhi D, Fedus L, Felix N, Fishman SP, Forte J, Fulford I, Gao L, Georges E, Gibson C, Goel V, Gogineni T, Goh G, Gontijo-Lopes R, Gordon J, Grafstein M, Gray S, Greene R, Gross J, Gu SS, Guo Y, Hallacy C, Han J, Harris J, He Y, Heaton M, Heidecke J, Hesse C, Hickey A, Hickey W, Hoeschele P, Houghton B, Hsu K, Hu S, Hu X, Huizinga J, Jain S, Jain S. GPT-4 technical report. arXiv. Preprint posted online on March 15, 2023. https://arxiv.org/abs/2303.08774
- 5. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, Madriaga M, Aggabao R, Diaz-Candido G, Maningo J, Tseng V. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023 Feb;2(2):e0000198. doi: 10.1371/journal.pdig.0000198
- 6. Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA. 2023 Sep 05;330(9):866-9. doi: 10.1001/jama.2023.14217
- 7. Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large language models in medicine: the potentials and pitfalls. Ann Intern Med. 2024 Feb;177(2):210-20. doi: 10.7326/m23-2772
- 8. Tang YD, Dong ED, Gao W. LLMs in medicine: the need for advanced evaluation systems for disruptive technologies. Innovation (Camb). 2024 May 06;5(3):100622. doi: 10.1016/j.xinn.2024.100622
- 9. GBD 2017 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. 2018 Nov 10;392(10159):1789-858. doi: 10.1016/S0140-6736(18)32279-7
- 10. Knezevic NN, Candido KD, Vlaeyen JW, Van Zundert J, Cohen SP. Low back pain. Lancet. 2021 Jul 03;398(10294):78-92. doi: 10.1016/S0140-6736(21)00733-9
- 11. Duong V, Oo WM, Ding C, Culvenor AG, Hunter DJ. Evaluation and treatment of knee pain: a review. JAMA. 2023 Oct 24;330(16):1568-80. doi: 10.1001/jama.2023.19675
- 12. Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current concepts review: large language models in orthopaedics: definitions, uses, and limitations. J Bone Joint Surg Am. 2024 Jun 19. doi: 10.2106/JBJS.23.01417
- 13. Fayed AM, Mansur NS, de Carvalho KA, Behrens A, D'Hooghe P, de Cesar Netto C. Artificial intelligence and ChatGPT in orthopaedics and sports medicine. J Exp Orthop. 2023 Jul 26;10(1):74. doi: 10.1186/s40634-023-00642-8
- 14. Merrell LA, Fisher ND, Egol KA. Large language models in orthopaedic trauma: a cutting-edge technology to enhance the field. J Bone Joint Surg Am. 2023 Sep 06;105(17):1383-7. doi: 10.2106/JBJS.23.00395
- 15. Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, Shamseer L, Tetzlaff JM, Akl EA, Brennan SE, Chou R, Glanville J, Grimshaw JM, Hróbjartsson A, Lalu MM, Li T, Loder EW, Mayo-Wilson E, McDonald S, McGuinness LA, Stewart LA, Thomas J, Tricco AC, Welch VA, Whiting P, Moher D. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. 2021 Mar 29;372:n71. doi: 10.1136/bmj.n71
- 16. Sterne JA, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, Cates CJ, Cheng HY, Corbett MS, Eldridge SM, Emberson JR, Hernán MA, Hopewell S, Hróbjartsson A, Junqueira DR, Jüni P, Kirkham JJ, Lasserson T, Li T, McAleenan A, Reeves BC, Shepperd S, Shrier I, Stewart LA, Tilling K, White IR, Whiting PF, Higgins JP. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. 2019 Aug 28;366:l4898. doi: 10.1136/bmj.l4898
- 17. Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK, SPIRIT-AI and CONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ. 2020 Sep 09;370:m3164. doi: 10.1136/bmj.m3164
- 17.Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK, SPIRIT-AICONSORT-AI Working Group Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ. 2020 Sep 09;370:m3164. doi: 10.1136/bmj.m3164. http://www.bmj.com/lookup/pmidlookup?view=long&pmid=32909959 . [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Agharia S, Szatkowski J, Fraval A, Stevens J, Zhou Y. The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: an analysis of ChatGPT 3.5, ChatGPT 4, and Bard. J Orthop. 2024 Apr;50:1–7. doi: 10.1016/j.jor.2023.11.063. https://linkinghub.elsevier.com/retrieve/pii/S0972-978X(23)00339-2 .S0972-978X(23)00339-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Anastasio AT, Mills FB 4th, Karavan MP Jr, Adams SB Jr. Evaluating the quality and usability of artificial intelligence-generated responses to common patient questions in foot and ankle surgery. Foot Ankle Orthop. 2023 Oct 22;8(4):24730114231209919. doi: 10.1177/24730114231209919. https://europepmc.org/abstract/MED/38027458 .10.1177_24730114231209919 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Baker HP, Dwyer E, Kalidoss S, Hynes K, Wolf J, Strelzow JA. ChatGPT's ability to assist with clinical documentation: a randomized controlled trial. J Am Acad Orthop Surg. 2024 Feb 01;32(3):123–9. doi: 10.5435/JAAOS-D-23-00474.00124635-990000000-00834 [DOI] [PubMed] [Google Scholar]
21. Christy M, Morris MT, Goldfarb CA, Dy CJ. Appropriateness and reliability of an online artificial intelligence platform's responses to common questions regarding distal radius fractures. J Hand Surg Am. 2024 Feb;49(2):91–8. doi: 10.1016/j.jhsa.2023.10.019
22. Coraci D, Maccarone MC, Regazzo G, Accordi G, Papathanasiou JV, Masiero S. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol. 2023 Dec 15;33(4):12114. doi: 10.4081/ejtm.2023.12114
23. Crook BS, Park CN, Hurley ET, Richard MJ, Pidgeon TS. Evaluation of online artificial intelligence-generated information on common hand procedures. J Hand Surg Am. 2023 Nov;48(11):1122–7. doi: 10.1016/j.jhsa.2023.08.003
24. Daher M, Koa J, Boufadel P, Singh J, Fares MY, Abboud JA. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int. 2023 Nov;7(6):2534–41. doi: 10.1016/j.jseint.2023.07.018
25. Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, Bongiovanni T, Melton GB, Wick E. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. 2023 Oct 02;6(10):e2336997. doi: 10.1001/jamanetworkopen.2023.36997
26. Draschl A, Hauer G, Fischerauer SF, Kogler A, Leitner L, Andreou D, Leithner A, Sadoghi P. Are ChatGPT's free-text responses on periprosthetic joint infections of the hip and knee reliable and useful? J Clin Med. 2023 Oct 20;12(20):6655. doi: 10.3390/jcm12206655
27. Dubin JA, Bains SS, Chen Z, Hameed D, Nace J, Mont MA, Delanois RE. Using a Google web search analysis to assess the utility of ChatGPT in total joint arthroplasty. J Arthroplasty. 2023 Jul;38(7):1195–202. doi: 10.1016/j.arth.2023.04.007
28. Duey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko LC, Shrestha N, Rajjoub R, Ahmed W, Hoang T, Saturno MP, Tang JE, Gallate ZS, Kim JS, Cho SK. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J. 2023 Nov;23(11):1684–91. doi: 10.1016/j.spinee.2023.07.015
29. Fabijan A, Polis B, Fabijan R, Zakrzewski K, Nowosławska E, Zawadzka-Fabijan A. Artificial intelligence in scoliosis classification: an investigation of language-based models. J Pers Med. 2023 Dec 09;13(12):1695. doi: 10.3390/jpm13121695
30. Fahy S, Oehme S, Milinkovic D, Jung T, Bartek B. Assessment of quality and readability of information provided by ChatGPT in relation to anterior cruciate ligament injury. J Pers Med. 2024 Jan 18;14(1):104. doi: 10.3390/jpm14010104
31. Gianola S, Bargeri S, Castellini G, Cook C, Palese A, Pillastrini P, Salvalaggio S, Turolla A, Rossettini G. Performance of ChatGPT compared to clinical practice guidelines in making informed decisions for lumbosacral radicular pain: a cross-sectional study. J Orthop Sports Phys Ther. 2024 Mar;54(3):222–8. doi: 10.2519/jospt.2024.12151
32. Hurley ET, Crook BS, Lorentz SG, Danilkowicz RM, Lau BC, Taylor DC, Dickens JF, Anakwenze O, Klifto CS. Evaluation high-quality of information from ChatGPT (artificial intelligence-large language model) artificial intelligence on shoulder stabilization surgery. Arthroscopy. 2024 Mar;40(3):726–31.e6. doi: 10.1016/j.arthro.2023.07.048
33. Johns WL, Kellish A, Farronato D, Ciccotti MG, Hammoud S. ChatGPT can offer satisfactory responses to common patient questions regarding elbow ulnar collateral ligament reconstruction. Arthrosc Sports Med Rehabil. 2024 Apr;6(2):100893. doi: 10.1016/j.asmr.2024.100893
34. Johns WL, Martinazzi BJ, Miltenberg B, Nam HH, Hammoud S. ChatGPT provides unsatisfactory responses to frequently asked questions regarding anterior cruciate ligament reconstruction. Arthroscopy. 2024 Jul;40(7):2067–79.e1. doi: 10.1016/j.arthro.2024.01.017
35. Kaarre J, Feldt R, Keeling LE, Dadoo S, Zsidai B, Hughes JD, Samuelsson K, Musahl V. Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surg Sports Traumatol Arthrosc. 2023 Nov;31(11):5190–8. doi: 10.1007/s00167-023-07529-2
36. Kasthuri VS, Glueck J, Pham H, Daher M, Balmaceno-Criss M, McDonald CL, Diebo BG, Daniels AH. Assessing the accuracy and reliability of AI-generated responses to patient questions regarding spine surgery. J Bone Joint Surg Am. 2024 Jun 19;106(12):1136–42. doi: 10.2106/JBJS.23.00914
37. Kienzle A, Niemann M, Meller S, Gwinner C. ChatGPT may offer an adequate substitute for informed consent to patients prior to total knee arthroplasty-yet caution is needed. J Pers Med. 2024 Jan 05;14(1):69. doi: 10.3390/jpm14010069
38. Kirchner GJ, Kim RY, Weddle JB, Bible JE. Can artificial intelligence improve the readability of patient education materials? Clin Orthop Relat Res. 2023 Nov 01;481(11):2260–7. doi: 10.1097/CORR.0000000000002668
39. Kuroiwa T, Sarcon A, Ibara T, Yamada E, Yamamoto A, Tsukamoto K, Fujita K. The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res. 2023 Sep 15;25:e47621. doi: 10.2196/47621
40. Li J, Gao X, Dou T, Gao Y, Zhu W. Assessing the performance of GPT-4 in the field of osteoarthritis and orthopaedic case consultation. medRxiv. Preprint posted online on August 09, 2023. doi: 10.1101/2023.08.06.23293735
41. Li LT, Sinkler MA, Adelstein JM, Voos JE, Calcei JG. ChatGPT responses to common questions about anterior cruciate ligament reconstruction are frequently satisfactory. Arthroscopy. 2024 Jul;40(7):2058–66. doi: 10.1016/j.arthro.2023.12.009
42. Lower K, Seth I, Lim B, Seth N. ChatGPT-4: transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J Orthop. 2023 Sep;57(9):1527–44. doi: 10.1007/s43465-023-00967-7
43. Magruder ML, Rodriguez AN, Wong JCJ, Erez O, Piuzzi NS, Scuderi GR, Slover JD, Oh JH, Schwarzkopf R, Chen AF, Iorio R, Goodman SB, Mont MA. Assessing ability for ChatGPT to answer total knee arthroplasty-related questions. J Arthroplasty. 2024 Aug;39(8):2022–7. doi: 10.1016/j.arth.2024.02.023
44. Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. 2023 Oct 04;105(19):1519–26. doi: 10.2106/JBJS.23.00209
45. Pagano S, Holzapfel S, Kappenschneider T, Meyer M, Maderbacher G, Grifka J, Holzapfel DE. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. 2023 Nov 28;24(1):61. doi: 10.1186/s10195-023-00740-4
46. Mejia MR, Arroyave JS, Saturno M, Ndjonko LC, Zaidat B, Rajjoub R, Ahmed W, Zapolsky I, Cho SK. Use of ChatGPT for determining clinical and surgical treatment of lumbar disc herniation with radiculopathy: a North American Spine Society guideline comparison. Neurospine. 2024 Mar;21(1):149–58. doi: 10.14245/ns.2347052.526
47. Russe MF, Fink A, Ngo H, Tran H, Bamberg F, Reisert M, Rau A. Performance of ChatGPT, human radiologists, and context-aware ChatGPT in identifying AO codes from radiology reports. Sci Rep. 2023 Aug 30;13(1):14215. doi: 10.1038/s41598-023-41512-8
48. Schonfeld E, Pant A, Shah A, Sadeghzadeh S, Pangal D, Rodrigues A, Yoo K, Marianayagam N, Haider G, Veeravagu A. Evaluating computer vision, large language, and genome-wide association models in a limited sized patient cohort for pre-operative risk stratification in adult spinal deformity surgery. J Clin Med. 2024 Jan 23;13(3):656. doi: 10.3390/jcm13030656
49. Seth I, Xie Y, Rodwell A, Gracias D, Bulloch G, Hunter-Smith DJ, Rozen WM. Exploring the role of a large language model on carpal tunnel syndrome management: an observation study of ChatGPT. J Hand Surg Am. 2023 Oct;48(10):1025–33. doi: 10.1016/j.jhsa.2023.07.003
50. Shrestha N, Shen Z, Zaidat B, Duey AH, Tang JE, Ahmed W, Hoang T, Restrepo Mejia M, Rajjoub R, Markowitz JS, Kim JS, Cho SK. Performance of ChatGPT on NASS clinical guidelines for the diagnosis and treatment of low back pain: a comparison study. Spine (Phila Pa 1976). 2024 May 01;49(9):640–51. doi: 10.1097/BRS.0000000000004915
51. Sosa BR, Cung M, Suhardi VJ, Morse K, Thomson A, Yang HS, Iyer S, Greenblatt MB. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J Orthop Res. 2024 Jun 21;42(6):1276–82. doi: 10.1002/jor.25782
52. Stroop A, Stroop T, Zawy Alsofy S, Nakamura M, Möllmann F, Greiner C, Stroop R. Large language models: are artificial intelligence-based chatbots a reliable source of patient information for spinal surgery? Eur Spine J. 2023 Oct 11 (forthcoming). doi: 10.1007/s00586-023-07975-z
53. Suthar PP, Kounsal A, Chhetri L, Saini D, Dua SG. Artificial intelligence (AI) in radiology: a deep dive into ChatGPT 4.0's accuracy with the American Journal of Neuroradiology's (AJNR) "case of the month". Cureus. 2023 Aug;15(8):e43958. doi: 10.7759/cureus.43958
54. Taylor WL 4th, Cheng R, Weinblatt AI, Bergstein V, Long WJ. An artificial intelligence chatbot is an accurate and useful online patient resource prior to total knee arthroplasty. J Arthroplasty. 2024 Aug;39(8S1):S358–62. doi: 10.1016/j.arth.2024.02.005
55. Temel MH, Erden Y, Bağcıer F. Information quality and readability: ChatGPT's responses to the most common questions about spinal cord injury. World Neurosurg. 2024 Jan;181:e1138–44. doi: 10.1016/j.wneu.2023.11.062
56. Tharakan S, Klein B, Bartlett L, Atlas A, Parada SA, Cohn RM. Do ChatGPT and Google differ in answers to commonly asked patient questions regarding total shoulder and total elbow arthroplasty? J Shoulder Elbow Surg. 2024 Aug;33(8):e429–37. doi: 10.1016/j.jse.2023.11.014
57. Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, Nebelung S. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep. 2023 Nov 17;13(1):20159. doi: 10.1038/s41598-023-47500-2
58. Warren E Jr, Hurley ET, Park CN, Crook BS, Lorentz S, Levin JM, Anakwenze O, MacDonald PB, Klifto CS. Evaluation of information from artificial intelligence on rotator cuff repair surgery. JSES Int. 2024 Jan;8(1):53–7. doi: 10.1016/j.jseint.2023.09.009
59. Wilhelm TI, Roos J, Kaczmarczyk R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J Med Internet Res. 2023 Oct 30;25:e49324. doi: 10.2196/49324
60. Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, Schmidt CM, Mir HR. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open. 2024 Feb 15;5(2):139–46. doi: 10.1302/2633-1462.52.BJO-2023-0113.R1
61. Yang F, Yan D, Wang Z. Large-scale assessment of ChatGPT's performance in benign and malignant bone tumors imaging report diagnosis and its potential for clinical applications. J Bone Oncol. 2024 Feb;44:100525. doi: 10.1016/j.jbo.2024.100525
62. Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat generative pretrained transformer (ChatGPT) and Bard: artificial intelligence does not yet provide clinically supported answers for hip and knee osteoarthritis. J Arthroplasty. 2024 May;39(5):1184–90. doi: 10.1016/j.arth.2024.01.029
63. Yapar D, Demir Avcı Y, Tokur Sonuvar E, Eğerci ÖF, Yapar A. ChatGPT's potential to support home care for patients in the early period after orthopedic interventions and enhance public health. Jt Dis Relat Surg. 2024 Jan 01;35(1):169–76. doi: 10.52312/jdrs.2023.1402
64. Zhou Y, Moon C, Szatkowski J, Moore D, Stevens J. Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop Surg Traumatol. 2024 Feb;34(2):927–55. doi: 10.1007/s00590-023-03742-4
65. Cuthbert R, Simpson AI. Artificial intelligence in orthopaedics: can Chat Generative Pre-trained Transformer (ChatGPT) pass section 1 of the Fellowship of the Royal College of Surgeons (trauma and orthopaedics) examination? Postgrad Med J. 2023 Sep 21;99(1176):1110–4. doi: 10.1093/postmj/qgad053
66. Ghanem D, Covarrubias O, Raad M, LaPorte D, Shafiq B. ChatGPT performs at the level of a third-year orthopaedic surgery resident on the orthopaedic in-training examination. JB JS Open Access. 2023;8(4):e23.00103. doi: 10.2106/JBJS.OA.23.00103
67. Han Y, Choudhry HS, Simon ME, Katt BM. ChatGPT's performance on the hand surgery self-assessment exam: a critical analysis. J Hand Surg Glob Online. 2024 Mar;6(2):200–5. doi: 10.1016/j.jhsg.2023.11.014
68. Hofmann HL, Guerra GA, Le JL, Wong AM, Hofmann GH, Mayfield CK, Petrigliano FA, Liu JN. The rapid development of artificial intelligence: GPT-4's performance on orthopedic surgery board questions. Orthopedics. 2024;47(2):e85–9. doi: 10.3928/01477447-20230922-05
69. Jain N, Gottlich C, Fisher J, Campano D, Winston T. Assessing ChatGPT's orthopedic in-service training exam performance and applicability in the field. J Orthop Surg Res. 2024 Jan 03;19(1):27. doi: 10.1186/s13018-023-04467-0
70. Kung JE, Marshall C, Gauthier C, Gonzalez TA, Jackson JB 3rd. Evaluating ChatGPT performance on the orthopaedic in-training examination. JB JS Open Access. 2023;8(3):e23.00056. doi: 10.2106/JBJS.OA.23.00056
71. Lum ZC. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clin Orthop Relat Res. 2023 Aug 01;481(8):1623–30. doi: 10.1097/CORR.0000000000002704
72. Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations. J Am Acad Orthop Surg. 2023 Dec 01;31(23):1173–9. doi: 10.5435/JAAOS-D-23-00396
73. Ozdag Y, Hayes DS, Makar GS, Manzar S, Foster BK, Shultz MJ, Klena JC, Grandizio LC. Comparison of artificial intelligence to resident performance on upper-extremity orthopaedic in-training examination questions. J Hand Surg Glob Online. 2024 Mar;6(2):164–8. doi: 10.1016/j.jhsg.2023.10.013
74. Rizzo MG, Cai N, Constantinescu D. The performance of ChatGPT on orthopaedic in-service training exams: a comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education. J Orthop. 2024 Apr;50:70–5. doi: 10.1016/j.jor.2023.11.056
75. Saad A, Iyengar KP, Kurisunkal V, Botchu R. Assessing ChatGPT's ability to pass the FRCS orthopaedic part A exam: a critical analysis. Surgeon. 2023 Oct;21(5):263–6. doi: 10.1016/j.surge.2023.07.001
76. Traoré SY, Goetsch T, Muller B, Dabbagh A, Liverneaux PA. Is ChatGPT able to pass the first part of the European Board of Hand Surgery diploma examination? Hand Surg Rehabil. 2023 Sep;42(4):362–4. doi: 10.1016/j.hansur.2023.06.005
77. Gill B, Bonamer J, Kuechly H, Gupta R, Emmert S, Kurkowski S, Hasselfeld K, Grawe B. ChatGPT is a promising tool to increase readability of orthopedic research consents. J Orthop Trauma Rehab. 2024 Jan 22. doi: 10.1177/22104917231208212
78. Hakam HT, Prill R, Korte L, Lovreković B, Ostojić M, Ramadanov N, Muehlensiepen F. Human-written vs AI-generated texts in orthopedic academic literature: comparative qualitative analysis. JMIR Form Res. 2024 Feb 16;8:e52164. doi: 10.2196/52164
79. Kacena MA, Plotkin LI, Fehrenbacher JC. The use of artificial intelligence in writing scientific review articles. Curr Osteoporos Rep. 2024 Feb;22(1):115–21. doi: 10.1007/s11914-023-00852-0
80. Lawrence KW, Habibi AA, Ward SA, Lajam CM, Schwarzkopf R, Rozell JC. Human versus artificial intelligence-generated arthroplasty literature: a single-blinded analysis of perceived communication, quality, and authorship source. Int J Med Robot. 2024 Feb 13;20(1):e2621. doi: 10.1002/rcs.2621
81. Lotz JC, Ropella G, Anderson P, Yang Q, Hedderich MA, Bailey J, Hunt CA. An exploration of knowledge-organizing technologies to advance transdisciplinary back pain research. JOR Spine. 2023 Dec;6(4):e1300. doi: 10.1002/jsp2.1300
82. Methnani J, Latiri I, Dergaa I, Chamari K, Ben Saad H. ChatGPT for sample-size calculation in sports medicine and exercise sciences: a cautionary note. Int J Sports Physiol Perform. 2023 Oct 01;18(10):1219–23. doi: 10.1123/ijspp.2023-0109
83. Nazzal MK, Morris AJ, Parker RS, White FA, Natoli RM, Fehrenbacher JC, Kacena MA. Using AI to write a review article examining the role of the nervous system on skeletal homeostasis and fracture healing. Curr Osteoporos Rep. 2024 Feb 13;22(1):217–21. doi: 10.1007/s11914-023-00854-y
84. Sanii RY, Kasto JK, Wines WB, Mahylis JM, Muh SJ. Utility of artificial intelligence in orthopedic surgery literature review: a comparative pilot study. Orthopedics. 2024;47(3):e125–30. doi: 10.3928/01477447-20231220-02
85. Zaidat B, Lahoti YS, Yu A, Mohamed KS, Cho SK, Kim JS. Artificially intelligent billing in spine surgery: an analysis of a large language model. Global Spine J. 2023 Dec 26:21925682231224753. doi: 10.1177/21925682231224753
86. Tajirian T, Stergiopoulos V, Strudwick G, Sequeira L, Sanches M, Kemp J, Ramamoorthi K, Zhang T, Jankowicz D. The influence of electronic health record use on physician burnout: cross-sectional survey. J Med Internet Res. 2020 Jul 15;22(7):e19274. doi: 10.2196/19274
87. Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. 2023 Jun 01;183(6):589–96. doi: 10.1001/jamainternmed.2023.1838
88. Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. 1999 Feb;53(2):105–11. doi: 10.1136/jech.53.2.105
89. Dennler C, Bauer DE, Scheibler AG, Spirig J, Götschi T, Fürnstahl P, Farshad M. Augmented reality in the operating room: a clinical feasibility study. BMC Musculoskelet Disord. 2021 May 18;22(1):451. doi: 10.1186/s12891-021-04339-w
90. Verhey JT, Haglin JM, Verhey EM, Hartigan DE. Virtual, augmented, and mixed reality applications in orthopedic surgery. Int J Med Robot. 2020 Apr;16(2):e2067. doi: 10.1002/rcs.2067
91. Li Z, Jiang S, Song X, Liu S, Wang C, Hu L, Li W. Collaborative spinal robot system for laminectomy: a preliminary study. Neurosurg Focus. 2022 Jan;52(1):E11. doi: 10.3171/2021.10.FOCUS21499
92. Li Z, Wang C, Song X, Liu S, Zhang Y, Jiang S, Ji X, Zhang T, Xu F, Hu L, Li W. Accuracy evaluation of a novel spinal robotic system for autonomous laminectomy in thoracic and lumbar vertebrae: a cadaveric study. J Bone Joint Surg Am. 2023 Jun 21;105(12):943–50. doi: 10.2106/JBJS.22.01320
93. Kim JS, Vivas A, Arvind V, Lombardi J, Reidler J, Zuckerman SL, Lee NJ, Vulapalli M, Geng EA, Cho BH, Morizane K, Cho SK, Lehman RA, Lenke LG, Riew KD. Can natural language processing and artificial intelligence automate the generation of billing codes from operative note dictations? Global Spine J. 2023 Sep;13(7):1946–55. doi: 10.1177/21925682211062831
94. ChatGPT. https://chatgpt.com [accessed 2024-04-29]
