Abstract
Background
Large language models (LLMs) are increasingly used in healthcare for patient education and clinical decision support. However, systematic benchmarking in real-world clinical contexts remains limited, particularly for high-risk conditions such as hip fractures.
Objective
To evaluate and compare the performance of three state-of-the-art LLMs—DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5—in answering standardized patient questions on hip fracture management.
Methods
Thirty standardized questions covering general knowledge, diagnosis, treatment, and rehabilitation were developed by three specialists in orthopedics and traumatology. Each LLM generated responses independently. Three experienced orthopedic surgeons assessed accuracy (4-point scale) and comprehensiveness (5-point scale). Statistical analyses included Kruskal–Wallis and chi-squared tests.
Results
All models demonstrated high reliability, with 96.7% of responses rated “Good” or “Excellent” and none rated “Poor.” Mean accuracy scores were comparable across models, and comprehensiveness averaged 4.8/5. DeepSeek-V3-FW tended to provide longer, structured answers and performed best in general knowledge, while Gemini 2.0 Flash excelled in diagnosis and rehabilitation and produced the most concise responses. ChatGPT-4.5 offered shorter, conversational answers with similar accuracy and detail.
Conclusions
The three LLMs showed strong capabilities in delivering accurate and comprehensive information on hip fracture care, highlighting their potential as tools for patient education and clinical support. Differences in style and domain-specific strengths suggest complementary roles. Further research is needed to validate safety and integration into clinical workflows.
Keywords: Artificial intelligence, clinical decision support systems, health information management, hip fractures, large language models, patient education as topic
Introduction
Large language models (LLMs) are neural networks with billions of parameters, trained on massive text datasets to understand, generate, and analyze human language across a range of natural language processing tasks.1,2 In recent years, with the rapid advancement of the technology, LLMs have gradually become part of daily life.3 Compared with traditional search engines, LLMs are widely appreciated for their broad knowledge and strong ability to integrate information.4 Owing to advances in natural language processing, users no longer need complex or technical phrasing to receive useful responses.5 This has expanded the role of LLMs in learning, daily tasks, and entertainment.6
Because LLMs are easy to access and use, many nonmedical users, especially older adults, rely on them to solve health-related problems.7,8 This has drawn heightened attention from the medical field, raising the question of whether LLMs can offer reliable medical advice to nonprofessionals.
Many studies have already explored LLMs in medicine. For example, ChatGPT-4o passed the USMLE with 90.4% accuracy, far above the 59.3% average of medical students.9 Another study showed that ChatGPT-4o reached 92.3% diagnostic sensitivity in hip and knee osteoarthritis.10 Even when facing complex cases involving uncommon diseases, LLMs still provided correct diagnoses in two-thirds of cases.11 These results suggest that LLMs hold strong potential for medical applications and for assisting with health-related information.
With industrial growth and an ageing population, the incidence of bone and joint diseases is rising.12–16 Hip fracture, a common skeletal injury, particularly in older adults, typically follows low-energy trauma in patients, many of whom have osteoporosis.17–19 Patients often suffer pain, limited mobility, and complications including disability and mortality.20,21 Accurate emergency diagnosis is therefore critical. Patients and their families increasingly turn to LLMs for medical information, seeking accessible explanations and guidance about such conditions.
This study aimed to evaluate whether current LLMs can provide accurate and comprehensive medical information related to hip fractures. We hypothesized that the three models under evaluation would differ in response quality but that all would meet basic clinical accuracy standards when assessed by specialists in orthopedics and traumatology.
In this study, we selected three widely used LLMs—DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5—to answer common questions about hip fractures. We compared the responses for length, accuracy, and comprehensiveness, and we also reviewed the reliability of the source links provided. This allowed us to assess whether LLMs could serve as a trustworthy source of medical education for patients with hip fractures.
Methods
Ethics
This study did not involve human or animal subjects, so no ethical approval was required. It was an observational study, and we followed the STROBE guidelines to structure the report, including the study objective, methods, results, and limitations.22
Study design
This study was designed as an observational, cross-sectional, model-comparison study using standardized medical questions. The overall study design is shown in Figure 1. The study was conducted from 1 March 2025 to 25 June 2025 at the Department of Orthopedics and Traumatology, Prince of Wales Hospital, The Chinese University of Hong Kong. A group of orthopedic specialists and clinical researchers carefully selected 30 multiple-choice questions related to hip fractures, divided into four categories: general knowledge, diagnosis, treatment, and rehabilitation. This classification enabled evaluation of overall LLM performance and comparison across different types of medical questions.
Figure 1.
Overall experimental flow for the study.
LLM: large language model.
The set of common questions was first created (YZ, TH, CL), and a special interest group (AM, MY, IAH, ST, TM, RMYW) reviewed and revised these questions, selecting those most frequently asked by patients and healthcare providers during clinical care. The special interest group comprised orthopedic surgeons, each with at least 10 publications in the field of hip fractures/fragility fractures.
From 5 March 2025 to 7 April 2025, we used DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5 to generate answers to the 30 questions. To minimize potential bias caused by stochastic variation in LLM outputs, markedly inconsistent responses were regenerated using the same model to ensure comparability before evaluation. DeepSeek-V3-FW is a newly developed open-source model using a Mixture-of-Experts structure, offering strong reasoning ability with low computational cost. Gemini 2.0 Flash supports multimodal input and very long context windows (up to 1,048,576 tokens), and it runs on Google Cloud to deliver low-latency responses with source references. ChatGPT-4.5 is OpenAI's most advanced LLM, emphasizing emotional intelligence, natural dialog, and low hallucination rates. ChatGPT-4.5 and Gemini 2.0 Flash are partially paid models, while DeepSeek-V3-FW is free. To avoid context interference, we started a new conversation for each question.
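To illustrate the fresh-conversation protocol, the sketch below sends each question as an isolated, stateless request using the OpenAI Python SDK. The model identifier, question texts, and loop are illustrative placeholders rather than the study's actual setup; the other two models would be queried analogously through their own interfaces.

```python
# Minimal sketch: each question is sent in a brand-new, stateless request so
# earlier answers cannot leak into later ones. The model name and questions
# below are placeholders, not the study's exact configuration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTIONS = [
    "What is a hip fracture?",           # general knowledge (example)
    "How is a hip fracture diagnosed?",  # diagnosis (example)
    # ... remaining standardized questions
]

def ask_fresh(question: str, model: str = "gpt-4.5-preview") -> str:
    """Send one question with no prior conversation history."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],  # no earlier turns
    )
    return response.choices[0].message.content

answers = [ask_fresh(q) for q in QUESTIONS]
```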
Accuracy evaluation
Three orthopedic surgeons with over 10 years of experience managing hip fractures formed the rating group. We developed a four-level scoring system to assess the accuracy of each response: (a) “Poor”: Contains serious misinformation that may mislead or harm patients; (b) “Moderate”: Has moderate factual errors, unlikely to cause harm, but requires clarification; (c) “Good”: Minor factual errors, unlikely to mislead, minimal clarification needed; (d) “Excellent”: No errors, no clarification needed.
To keep the evaluation objective, all answers were randomly mixed and divided into three sets (each set included all 30 questions, 90 responses in total). Raters independently scored one set per day over three days to reduce fatigue and potential bias (Figure 1). Any identifying information about the model was hidden. In cases where the three raters did not agree, the lowest score was used.
Comprehensiveness evaluation
In addition to accuracy, raters evaluated the comprehensiveness of all responses rated above "Poor." Three specialists in orthopedics and traumatology assessed comprehensiveness by evaluating whether responses covered key clinical domains related to hip fracture management, including etiology/injury mechanism, diagnostic evaluation, treatment options, postoperative risks, rehabilitation strategies, and expected recovery. Based on this, we used a five-point scale: (a) "Not comprehensive": Key details missing; (b) "Slightly comprehensive": Only minimal basic details; (c) "Moderately comprehensive": Reasonable amount of detail; (d) "Comprehensive": Covers most essential aspects; (e) "Very comprehensive": Rich and detailed explanation.
Similar to accuracy scoring, the final comprehensiveness score was determined by taking the lowest score among all raters (Figure 1).
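As a concrete illustration of this lowest-score consensus rule, the minimal sketch below (with invented response IDs and scores) reduces each response's three ratings to their minimum.

```python
# Illustrative only: consensus scoring as described above, where the final
# score for each response is the lowest score given by any of the three
# raters. All IDs and scores here are invented examples.
rater_scores = {
    # response_id: (rater 1, rater 2, rater 3)
    "Q01_DeepSeek-V3-FW": (4, 4, 3),
    "Q01_Gemini-2.0-Flash": (4, 4, 4),
    "Q01_ChatGPT-4.5": (3, 4, 4),
}

consensus = {resp: min(scores) for resp, scores in rater_scores.items()}
print(consensus)
# {'Q01_DeepSeek-V3-FW': 3, 'Q01_Gemini-2.0-Flash': 4, 'Q01_ChatGPT-4.5': 3}
```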
Statistical analysis
All statistical analyses were conducted using GraphPad Prism 10 (GraphPad Software, Boston, MA, USA) and IBM SPSS Statistics for Windows, Version 29.0 (IBM Corp., Armonk, NY, USA). To compare the performance of the three LLMs, statistical tests were selected according to data type and distribution. The Shapiro–Wilk test was used to assess the normality of continuous variables before selecting the appropriate statistical tests. Word count data followed a normal distribution; therefore, differences among LLMs were analyzed using one-way ANOVA, with Tukey's post-hoc test for pairwise comparisons.
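As a rough illustration of this pipeline, the sketch below (Python with SciPy, on simulated word counts; the study itself used GraphPad Prism and SPSS) runs the normality check, the one-way ANOVA described here, and Tukey's post-hoc comparisons.

```python
# Hedged sketch of the word-count analysis: Shapiro-Wilk normality check,
# one-way ANOVA, then Tukey's HSD for pairwise comparisons. The word counts
# are simulated around the study's reported means, not real data.
import numpy as np
from scipy import stats
from scipy.stats import tukey_hsd

rng = np.random.default_rng(0)
deepseek = rng.normal(464, 76, 30)   # simulated word counts per question
gemini   = rng.normal(448, 182, 30)
chatgpt  = rng.normal(270, 77, 30)

for name, sample in [("DeepSeek", deepseek), ("Gemini", gemini), ("ChatGPT", chatgpt)]:
    w, p = stats.shapiro(sample)     # normality check before choosing tests
    print(f"{name}: Shapiro-Wilk W={w:.3f}, p={p:.3f}")

f, p = stats.f_oneway(deepseek, gemini, chatgpt)  # one-way ANOVA
print(f"ANOVA: F={f:.2f}, p={p:.4f}")

print(tukey_hsd(deepseek, gemini, chatgpt))       # pairwise post-hoc table
```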
Accuracy ratings (1–4: Poor, Moderate, Good, Excellent) and comprehensiveness ratings (1–5: Not comprehensive, Slightly comprehensive, Moderately comprehensive, Comprehensive, Very comprehensive) were treated as ordinal variables with restricted ranges and nonnormal distributions. These were compared using the Kruskal–Wallis test, followed by Dunn's post-hoc test for pairwise contrasts. When rating categories were analyzed as nominal data, comparisons across LLMs were performed using Pearson's chi-squared test, with Fisher's exact test applied when expected cell counts were <5. Bonferroni correction was applied to adjust for multiple comparisons.
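A comparable sketch for the ordinal ratings, again on simulated data: Kruskal–Wallis across the three models, Dunn's post-hoc test via the third-party scikit-posthocs package, and a chi-squared test on the model × rating-category contingency table. SciPy's fisher_exact handles only 2×2 tables, so it is not shown here.

```python
# Hedged sketch of the ordinal-rating analysis on simulated 1-4 accuracy
# ratings (1=Poor ... 4=Excellent); not the study's real data.
import numpy as np
from scipy import stats
import scikit_posthocs as sp  # provides Dunn's post-hoc test

rng = np.random.default_rng(1)
deepseek = rng.choice([2, 3, 4], size=30, p=[0.03, 0.40, 0.57])
gemini   = rng.choice([2, 3, 4], size=30, p=[0.03, 0.40, 0.57])
chatgpt  = rng.choice([2, 3, 4], size=30, p=[0.03, 0.54, 0.43])

h, p = stats.kruskal(deepseek, gemini, chatgpt)  # ordinal comparison
print(f"Kruskal-Wallis: H={h:.2f}, p={p:.3f}")

# Dunn's post-hoc with Bonferroni adjustment for the three pairwise contrasts
print(sp.posthoc_dunn([deepseek, gemini, chatgpt], p_adjust="bonferroni"))

# Nominal view: models x rating-category contingency table (ratings 1-4),
# dropping empty categories so expected counts are well defined
table = np.array([np.bincount(g, minlength=5)[1:] for g in (deepseek, gemini, chatgpt)])
chi2, p, dof, _ = stats.chi2_contingency(table[:, table.sum(axis=0) > 0])
print(f"Chi-squared: chi2={chi2:.2f}, dof={dof}, p={p:.3f}")
```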
Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC). ICC values were interpreted according to conventional thresholds: <0.50 (poor), 0.50–0.75 (moderate), 0.75–0.90 (good), and >0.90 (excellent). A p-value < .05 was considered statistically significant.
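For the reliability analysis, one possible implementation of ICC(3, k) uses the pingouin package, as sketched below on randomly generated ratings; the manuscript does not state which software computed its ICC values.

```python
# Hedged sketch of an ICC(3, k) computation via pingouin, on dummy ratings.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(2)
# long-format frame: one row per (response, rater) pair
data = pd.DataFrame({
    "response": [f"Q{i:02d}" for i in range(1, 31)] * 3,  # 30 responses
    "rater": ["R1"] * 30 + ["R2"] * 30 + ["R3"] * 30,     # 3 raters
    "score": rng.integers(3, 5, size=90),                 # dummy 3/4 ratings
})

icc = pg.intraclass_corr(data=data, targets="response",
                         raters="rater", ratings="score")
# "ICC3k" row = two-way mixed effects, consistency, average of k raters
print(icc.loc[icc["Type"] == "ICC3k", ["Type", "ICC", "CI95%"]])
```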
Results
This study recorded and analyzed the performance of the three LLMs in answering 30 standardized questions related to hip fractures. The evaluation focused on response length, accuracy, and comprehensiveness. The full responses to all questions are listed in Supplemental Table S1.
Response length
The length of each response was recorded and summarized in Table 1. DeepSeek-V3-FW produced the longest responses, averaging 464.2 ± 76.0 words overall, followed by Gemini 2.0 Flash (448.1 ± 181.7) and ChatGPT-4.5 (270.4 ± 77.0). Statistical analysis showed that ChatGPT-4.5's responses were significantly shorter than those of the other two models (p < .05), while there was no significant difference between DeepSeek-V3-FW and Gemini 2.0 Flash. A repeated-measures one-way ANOVA confirmed a robust overall effect of model type on word count, F(1.486, 43.11) = 38.80, p < .0001, R squared = 0.572 (Figure 2B).
Table 1.
Response length (words) from LLM-chatbots to hip fracture-related questions.
| Domain | No. of questions | Model | Mean | SD | Median (IQR) | Min | Max |
|---|---|---|---|---|---|---|---|
| General questions | 7 | DeepSeek-V3-FW | 376.1 | 71.6 | 396.0 (346.0–413.0) | 237 | 462 |
| General questions | 7 | Gemini 2.0 Flash | 262.6 | 91.0 | 259.0 (201.0–326.0) | 126 | 396 |
| General questions | 7 | ChatGPT-4.5 | 244.0 | 36.6 | 248.0 (218.0–269.0) | 192 | 301 |
| Diagnosis | 5 | DeepSeek-V3-FW | 462.6 | 45.4 | 483.0 (416.0–499.0) | 397 | 507 |
| Diagnosis | 5 | Gemini 2.0 Flash | 250.4 | 50.9 | 274.0 (196.0–293.0) | 186 | 299 |
| Diagnosis | 5 | ChatGPT-4.5 | 217.8 | 30.8 | 214.0 (188.5–249.0) | 183 | 254 |
| Treatment | 10 | DeepSeek-V3-FW | 479.4 | 57.3 | 479.0 (421.5–535.5) | 411 | 554 |
| Treatment | 10 | Gemini 2.0 Flash | 560.4 | 127.7 | 554.5 (434.5–669.0) | 413 | 554 |
| Treatment | 10 | ChatGPT-4.5 | 300.8 | 64.9 | 317.0 (259.8–352.3) | 165 | 365 |
| Recovery and rehabilitation | 8 | DeepSeek-V3-FW | 523.3 | 45.1 | 533.0 (482.0–560.0) | 456 | 588 |
| Recovery and rehabilitation | 8 | Gemini 2.0 Flash | 593.5 | 51.5 | 605.5 (569.3–634.5) | 485 | 639 |
| Recovery and rehabilitation | 8 | ChatGPT-4.5 | 288.4 | 114.5 | 320.0 (283.0–358.3) | 17 | 377 |
| Total | 30 | DeepSeek-V3-FW | 464.2 | 76.0 | 471.0 (412.8–531.0) | 237 | 588 |
| Total | 30 | Gemini 2.0 Flash | 448.1 | 181.7 | 411.5 (283.8–619.5) | 126 | 769 |
| Total | 30 | ChatGPT-4.5 | 270.4 | 77.0 | 275.5 (220.3–331.3) | 17 | 377 |
LLM: large language model; SD: standard deviation; IQR: interquartile range.
Figure 2.
Three orthopedic surgeons evaluated the accuracy and comprehensiveness of the three LLMs. Heatmap (A) displays the unified accuracy scores; (B) shows the average response length (in words); (C) presents the mean accuracy scores; and (D) illustrates the mean comprehensiveness scores.
ns: not significant; LLM: large language model.
*p < .05, ***p < .001.
Accuracy of responses
Figure 2A shows the average accuracy scores, evaluated independently by the three clinical specialists. The mean scores were DeepSeek-V3-FW: 3.77 ± 0.45 (Median 4, IQR 4.0–4.0), Gemini 2.0 Flash: 3.80 ± 0.42 (Median 4, IQR 4.0–4.0), and ChatGPT-4.5: 3.79 ± 0.44 (Median 4, IQR 4.0–4.0), all within the “Good” or above range. No significant differences were found among the models, F (1.907, 169.7) = 0.16, p > .05, R squared = 0.002 (Figure 2C).
Figure 3A displays the distribution of accuracy levels based on the consensus. DeepSeek-V3-FW and Gemini 2.0 Flash had 56.7% of responses rated as “Excellent,” slightly higher than ChatGPT-4.5 (43.3%). Notably, none of the models produced responses rated as “Poor.” All three had 96.7% of responses rated as “Good” or above, indicating high reliability in handling hip fracture-related questions (Figure 3A).
Figure 3.
(A) Consensus-based accuracy ratings of LLM-chatbot responses to 30 standard hip fracture-related questions. (B) Within the four topics, the accuracy scores of LLM responses showed no statistically significant differences.
ns: not significant; LLM: large language model.
p > .05.
We also assessed inter-rater reliability among the three evaluators scoring responses from the three LLMs using ICC(3, k). DeepSeek-V3-FW demonstrated moderate agreement (ICC = 0.55), Gemini 2.0 Flash showed poor agreement (ICC = 0.29), and ChatGPT-4.5 exhibited negative, poor agreement (ICC = −0.18), reflecting substantial variability in ratings (Supplemental Table S4).
The individual accuracy scores from the three raters are presented in Supplemental Table S2.
Performance by question category
Questions were divided into four categories: general knowledge, diagnosis, treatment, and recovery. Although there were some differences on specific questions, no statistically significant differences were found between LLMs within any topic (p > .05) (Figure 3B). Performance by category was as follows (Table 2):
Table 2.
Categorical statistics for the consensus accuracy score of LLM-chatbots to hip fracture-related questions.
| Domain | No. of questions | Model | Poor (1) | Moderate (2) | Good (3) | Excellent (4) |
|---|---|---|---|---|---|---|
| General questions | 7 | DeepSeek-V3-FW | 0 (0) | 0 (0) | 1 (14.3) | 6 (85.7) |
| General questions | 7 | Gemini 2.0 Flash | 0 (0) | 0 (0) | 4 (57.1) | 3 (42.9) |
| General questions | 7 | ChatGPT-4.5 | 0 (0) | 1 (14.3) | 3 (42.9) | 3 (42.9) |
| Diagnosis | 5 | DeepSeek-V3-FW | 0 (0) | 1 (20) | 3 (60) | 1 (20) |
| Diagnosis | 5 | Gemini 2.0 Flash | 0 (0) | 1 (20) | 1 (20) | 3 (60) |
| Diagnosis | 5 | ChatGPT-4.5 | 0 (0) | 0 (0) | 3 (60) | 2 (40) |
| Treatment | 10 | DeepSeek-V3-FW | 0 (0) | 0 (0) | 6 (60) | 4 (40) |
| Treatment | 10 | Gemini 2.0 Flash | 0 (0) | 0 (0) | 6 (60) | 4 (40) |
| Treatment | 10 | ChatGPT-4.5 | 0 (0) | 0 (0) | 7 (70) | 3 (30) |
| Recovery and rehabilitation | 8 | DeepSeek-V3-FW | 0 (0) | 0 (0) | 2 (25) | 6 (75) |
| Recovery and rehabilitation | 8 | Gemini 2.0 Flash | 0 (0) | 0 (0) | 1 (12.5) | 7 (87.5) |
| Recovery and rehabilitation | 8 | ChatGPT-4.5 | 0 (0) | 0 (0) | 3 (37.5) | 5 (62.5) |
| Total | 30 | DeepSeek-V3-FW | 0 (0) | 1 (3.3) | 12 (40) | 17 (56.7) |
| Total | 30 | Gemini 2.0 Flash | 0 (0) | 1 (3.3) | 12 (40) | 17 (56.7) |
| Total | 30 | ChatGPT-4.5 | 0 (0) | 1 (3.3) | 16 (53.3) | 13 (43.3) |
Notes: Values are counts of responses per rating category; numbers in brackets are percentages. Percentages may not sum to 100 due to rounding.
General knowledge
DeepSeek-V3-FW performed best, with 14.3% of responses rated "Good" and 85.7% rated "Excellent." Gemini 2.0 Flash had 42.9% "Excellent" and 57.1% "Good." ChatGPT-4.5 had 42.9% "Excellent," 42.9% "Good," and 14.3% "Moderate." No statistically significant differences were observed among the LLMs in this category (F(1.531, 19.91) = 0.55, p > .05, R squared = 0.04).
Diagnosis
Gemini 2.0 Flash had the highest "Excellent" rate (60%) but also one "Moderate" score (20%). ChatGPT-4.5 had 60% of responses rated "Good" and 40% rated "Excellent." DeepSeek-V3-FW showed weaker performance (20% "Moderate," 60% "Good," 20% "Excellent"). No statistically significant differences were observed among the LLMs in this category (F(1.262, 11.36) = 1.59, p > .05, R squared = 0.15).
Treatment
All models performed well. DeepSeek-V3-FW and Gemini 2.0 Flash each had 40% "Excellent" and 60% "Good." ChatGPT-4.5 had slightly lower ratings, with 30% "Excellent" and 70% "Good." No statistically significant differences were observed among the LLMs in this category (F(1.753, 33.32) = 0.79, p > .05, R squared = 0.04).
Recovery
All responses were rated "Good" or above. Gemini 2.0 Flash performed best (87.5% "Excellent"), followed by DeepSeek-V3-FW (75% "Excellent") and ChatGPT-4.5 (62.5% "Excellent"). No statistically significant differences were observed among the LLMs in this category (F(1.761, 26.41) = 0.79, p > .05, R squared = 0.05).
Comprehensiveness
As shown in Table 3, the three LLMs had similar performance in comprehensiveness. All models received an average score of 4.8 (Median 5, IQR 5.0–5.0). No statistically significant differences were observed, indicating that all models provided similarly detailed information (F (1.760, 156.6) = 0.02, p > .05, R squared = 0.0002) (Figure 2D).
Table 3.
Response comprehensiveness from LLM-chatbots to hip fracture-related questions.
| Domain | No. of questions | Model | Mean | SD | Median | IQR |
|---|---|---|---|---|---|---|
| General questions | 7 | DeepSeek-V3-FW | 4.9 | 0.4 | 5 | 5–5 |
| General questions | 7 | Gemini 2.0 Flash | 4.7 | 0.6 | 5 | 5–5 |
| General questions | 7 | ChatGPT-4.5 | 4.7 | 0.5 | 5 | 5–5 |
| Diagnosis | 5 | DeepSeek-V3-FW | 4.7 | 0.8 | 5 | 5–5 |
| Diagnosis | 5 | Gemini 2.0 Flash | 4.7 | 0.5 | 5 | 5–5 |
| Diagnosis | 5 | ChatGPT-4.5 | 4.9 | 0.4 | 5 | 5–5 |
| Treatment | 10 | DeepSeek-V3-FW | 4.7 | 0.6 | 5 | 5–5 |
| Treatment | 10 | Gemini 2.0 Flash | 4.9 | 0.3 | 5 | 5–5 |
| Treatment | 10 | ChatGPT-4.5 | 4.8 | 0.4 | 5 | 5–5 |
| Recovery and rehabilitation | 8 | DeepSeek-V3-FW | 4.9 | 0.3 | 5 | 5–5 |
| Recovery and rehabilitation | 8 | Gemini 2.0 Flash | 4.9 | 0.3 | 5 | 5–5 |
| Recovery and rehabilitation | 8 | ChatGPT-4.5 | 4.8 | 0.4 | 5 | 5–5 |
| Total | 30 | DeepSeek-V3-FW | 4.8 | 0.5 | 5 | 5–5 |
| Total | 30 | Gemini 2.0 Flash | 4.8 | 0.4 | 5 | 5–5 |
| Total | 30 | ChatGPT-4.5 | 4.8 | 0.4 | 5 | 5–5 |
LLM: large language model; SD: standard deviation; IQR: interquartile range.
Similarly, we evaluated ICC values for the three raters across the different LLMs in the comprehensiveness domain. Despite generally high comprehensiveness scores, the ICC values revealed certain limitations. DeepSeek-V3-FW showed moderate agreement (ICC = 0.51), Gemini 2.0 Flash exhibited minimal agreement (ICC = 0.09), and ChatGPT-4.5 again demonstrated poor agreement (ICC = −0.03) (Supplemental Table S4).
The individual comprehensiveness scores from the three raters are presented in Supplemental Table S3.
Discussion
This study evaluated the performance of three mainstream LLMs in handling medical questions related to hip fractures using 30 standardized items. All three models performed well in accuracy and comprehensiveness, with most responses rated “Good” or above, indicating strong potential for assisting medical information delivery.
Regarding response length, DeepSeek-V3-FW and Gemini 2.0 Flash tended to provide longer answers, while ChatGPT-4.5 offered shorter responses. Despite this, ChatGPT-4.5 performed comparably in accuracy and comprehensiveness, suggesting strengths in information density and focus. These differences may also reflect variations in model architecture, training data, or fine-tuning strategies. Importantly, response length should not be interpreted as a direct indicator of quality; it may instead reflect a model's understanding of question complexity and its ability to integrate information.23 From a clinical perspective, context determines whether longer or shorter responses are more appropriate. For educational purposes, long and structured answers, as seen with DeepSeek-V3-FW and Gemini 2.0 Flash, may help users build a clearer knowledge framework. In contrast, in time-sensitive settings or when the user already has background knowledge, concise and direct answers, as seen with ChatGPT-4.5, may be more practical. Prior studies suggest that overly long texts can reduce processing efficiency and delay decision-making in urgent situations.24,25 Therefore, differences in response length should be interpreted as variations in communication style rather than as evidence that one model is better than another.
Although overall accuracy did not differ significantly among the three LLMs, module-level analysis revealed domain-specific strengths. DeepSeek-V3-FW performed best in the "General Questions" module, with 85.7% of responses rated "Excellent," surpassing both Gemini 2.0 Flash and ChatGPT-4.5 (Table 2). This may reflect DeepSeek-V3-FW's advantage in general knowledge integration, logical structuring, and educational content generation, making it suitable for public health education.26,27 Gemini 2.0 Flash showed stronger performance in the "Diagnosis" and especially the "Recovery and Rehabilitation" modules, with 87.5% of responses rated "Excellent" in the latter (Table 2). This likely reflects the model's access to Google's extensive knowledge base and its ability to generate time-relevant responses linked to published guidelines. Gemini 2.0 Flash can also include hyperlinks and process multimodal inputs, offering potential for future clinical decision support.28,29 However, its reliance on dynamic online data may introduce risks related to outdated or unreliable information, which would require further attention and regulation.
Notably, in the “Treatment” module, none of the models had more than 50% of responses rated “Excellent” (Table 2). This indicates that current LLMs still have limitations in handling complex treatment decisions. Management of hip fractures requires orthopedic specialists to perform multistep clinical reasoning, integrate individual patient factors, and apply guideline-based surgical indications—elements that LLMs cannot fully account for when responding to standardized prompts. Moreover, treatment recommendations often demand personalization and risk stratification, which remain beyond the capabilities of general-purpose LLMs.30–32 These constraints likely contributed to the comparatively lower performance observed in this module.
Although none of the responses were rated as “Poor,” this does not imply that LLMs can independently provide safe clinical advice. High scores only indicate that, under controlled conditions, LLMs generally deliver good-quality information. In real clinical practice, patient questions are often incomplete, emotionally charged, and may contain misconceptions that require clarification and guidance from physicians. LLMs, however, may still produce overly confident yet ambiguous interpretations based on limited or unclear input, or offer generic treatment recommendations that overlook individual patient differences. They may also provide inappropriate descriptions of risk, potentially causing patient anxiety. Therefore, while LLMs can serve as a useful supplementary resource for patient education or as an aid to clinicians, they cannot replace professional medical consultation or individualized clinical decision-making.
The three models also differed in their functional focus and language style. DeepSeek-V3-FW emphasized logical reasoning and structured expression, often presenting responses in a clear, organized format,33 which suits learners and users seeking knowledge integration and aligns with its strong performance in general questions. Gemini 2.0 Flash was the only model that provided dynamic links to an external search engine, allowing users to access further information and enhancing autonomy in health management.34 Its multimodal capabilities offer promise for future clinical integration, though its dependence on external sources raises concerns about information stability. ChatGPT-4.5 focuses on a conversational style, using plain language and concise content, making it more aligned with real-life communication. This may make ChatGPT-4.5 especially useful for patient education or doctor–patient-like interactions.35
Regarding comprehensiveness, all three models showed similarly high scores, indicating their ability to cover relevant information on hip fracture topics has matured. However, some responses, while broad, lacked targeted advice or personalization. This highlights the need for further improvements in precision and tailored recommendations.
In our study, we also observed that inter-rater agreement for accuracy scores across the three LLMs was generally low. The highest agreement was for DeepSeek-V3-FW (ICC = 0.55), representing only moderate consistency, whereas ChatGPT-4.5 showed outright disagreement (ICC = −0.18). A similar pattern emerged in the comprehensiveness assessment: ratings for DeepSeek-V3-FW demonstrated moderate agreement (ICC = 0.51), Gemini 2.0 Flash exhibited almost no agreement (ICC = 0.09), and ChatGPT-4.5 showed substantial variability among raters.
We believe that these low ICC values, despite generally high absolute scores, do not indicate flaws in our evaluation framework but rather reflect characteristics of the study. First, raters tended to assign high scores overall, resulting in a concentration of ratings in the upper range (ceiling effect), which makes ICC more sensitive to small differences among raters. Second, the responses from all three models were of relatively high quality, leading to limited variability across samples and further reducing ICC values. Therefore, even though the models achieved strong performance in both accuracy and domain-specific evaluations, the low ICC values primarily reflect the distribution of ratings rather than deficiencies in the scoring system.
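To make the ceiling-effect argument concrete, the toy simulation below (assuming the pingouin package and arbitrary noise parameters of our own choosing) shows how ICC(3, k) collapses when true quality clusters at the top of the scale even though rater noise is unchanged.

```python
# Toy simulation of the ceiling effect: identical rater noise, but ICC drops
# when true quality is uniformly high because between-item variance vanishes.
import numpy as np
import pandas as pd
import pingouin as pg

rng = np.random.default_rng(3)

def simulated_icc3k(true_scores: np.ndarray) -> float:
    """ICC(3, k) for three raters who add small random noise to true scores."""
    noisy = np.clip(
        np.round(true_scores[None, :] + rng.normal(0, 0.4, (3, true_scores.size))),
        1, 4,
    )
    df = pd.DataFrame({
        "response": np.tile(np.arange(true_scores.size), 3),
        "rater": np.repeat(["R1", "R2", "R3"], true_scores.size),
        "score": noisy.ravel(),
    })
    res = pg.intraclass_corr(df, targets="response", raters="rater", ratings="score")
    return res.loc[res["Type"] == "ICC3k", "ICC"].iloc[0]

spread = rng.integers(1, 5, 30).astype(float)  # quality varies across items
ceiling = np.full(30, 4.0)                     # uniformly high quality (ceiling)
print("ICC, spread-out true quality:", round(simulated_icc3k(spread), 2))
print("ICC, ceiling-clustered quality:", round(simulated_icc3k(ceiling), 2))
```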
However, we cannot ignore that a low ICC also indicates that LLMs often lack systematic structure when answering similar questions. Their response styles and logical organization are highly variable, and differences in emphasis, detail, and explanatory frameworks contribute to this inconsistency. As a result, even when based on high-quality foundational information, LLM outputs still show limitations in consistency and standardization. They fall short of the systematic, stylistically uniform communication expected from specialists, highlighting a critical gap for real-world clinical application.
This study has several limitations. First, the question set was designed by researchers and, despite efforts to standardize content, may not fully reflect the diversity of real-world patient language and inquiry styles. Second, specialist scoring involves subjective judgment, and although a multirater consensus approach was used, bias may still exist. Future research should incorporate multiturn dialogs, multilingual scenarios, and patient-generated queries to better assess LLMs' performance in complex clinical settings. Further studies are also needed to explore LLM adaptability across cultures and to address issues of ethical risk, data reliability, and accountability frameworks, ensuring the safe and effective integration of LLMs into clinical support systems.
Conclusion
In conclusion, this study systematically evaluated the performance of three mainstream LLMs (DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5) in handling medical questions related to hip fractures. All models demonstrated high accuracy and comprehensiveness, reflecting strong medical knowledge integration and output capabilities. Although differences existed in language style and domain-specific performance, all three showed promising potential for clinical support in hip fracture care. However, LLMs should not be regarded as a replacement for professional clinical consultation. Although they can provide high-quality foundational information, their consistency and standardization remain insufficient, and they have yet to achieve the systematic, stylistically uniform communication that clinicians provide when speaking with patients. Further validation in real clinical environments is necessary to assess their practical value and ensure safe and standardized application in decision-support systems.
Supplemental Material
Supplemental material, sj-docx-1-dhj-10.1177_20552076251412989 for Comparative evaluation of large language models for hip fracture-related patient questions: DeepSeek-V3-FW, Gemini 2.0 Flash, and ChatGPT-4.5 by Yejin Zhang, Tao Huang, Chaoran Liu, Anna N Miller, Minghui Yang, Ian A Harris, Takeshi Sawaguchi, Theodore Miclau, Maoyi Tian, Chun Sing Chui, Ning Zhang, Wing Hoi Cheung and Ronald Man Yeung Wong in DIGITAL HEALTH
Acknowledgments
The authors gratefully acknowledge the invaluable support and contributions of all coauthors throughout the execution and preparation of this study.
Footnotes
ORCID iDs: Yejin Zhang https://orcid.org/0009-0006-2686-7069
Anna N Miller https://orcid.org/0000-0002-7056-8502
Ian A Harris https://orcid.org/0000-0003-0887-7627
Ronald Man Yeung Wong https://orcid.org/0000-0001-5615-6627
Ethics statement: No ethical approval was necessary as this study did not include any research involving human participants or animals.
Author contribution: All authors contributed to the conception and design of the study. Acquisition of data: Y Zhang, T Huang, C Liu, CS Chui, A Miller, M Yang, and I Harris; analysis and/or interpretation of data: Y Zhang, T Huang, C Liu, and RMY Wong; drafting the manuscript: Y Zhang, T Huang, C Liu, and RMY Wong. All authors contributed to reviewing the manuscript critically for important intellectual content and approved the version to be published.
Funding: The authors received no financial support for the research, authorship, and/or publication of this article.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor statement: Dr Ronald Man Yeung Wong, as the corresponding author, serves as the guarantor for this manuscript. He affirms that the study is an honest, accurate, and transparent account of the research conducted. All aspects of the study design, data collection, analysis, and interpretation have been carried out with integrity and scientific rigor. Dr Wong takes full responsibility for the content of the manuscript and confirms that no important aspects of the study have been omitted.
Supplemental material: Supplemental material for this article is available online.
References
- 1. Min B, Ross H, Sulem E, et al. Recent advances in natural language processing via large pre-trained language models: a survey. ACM Comput Surv 2024; 56: 1–40.
- 2. Bowman SR. Eight things to know about large language models. Critical AI 2024; 2. DOI: 10.1215/2834703X-11556011.
- 3. Chang Y-C, Wang X, Wang J, et al. A survey on evaluation of large language models. ACM Trans Intell Syst Technol 2023; 15: 1–45.
- 4. Ai Q, Bai T, Cao Z, et al. Information retrieval meets large language models: a strategic report from Chinese IR community. AI Open 2023; 4: 80–90.
- 5. Pruneski JA, Pareek A, Nwachukwu BU, et al. Natural language processing: using artificial intelligence to understand human language in orthopedics. Knee Surg Sports Traumatol Arthrosc 2023; 31: 1203–1211.
- 6. Chen J, Liu Z, Huang X, et al. When large language models meet personalization: perspectives of challenges and opportunities. World Wide Web 2024; 27: 42.
- 7. Sumner J, Wang Y, Tan SY, et al. Perspectives and experiences with large language models in health care: survey study. J Med Internet Res 2025; 27: e67383.
- 8. Wang A, Piao M, Zhou Y, et al. The acceptance of using large language model among older adults: a study based on TAM. Stud Health Technol Inform 2024; 315: 316–321.
- 9. Bicknell BT, Butler D, Whalen S, et al. ChatGPT-4 omni performance in USMLE disciplines and clinical skills: comparative analysis. JMIR Med Educ 2024; 10: e63430.
- 10. Pagano S, Strumolo L, Michalk K, et al. Evaluating ChatGPT, Gemini and other large language models (LLMs) in orthopaedic diagnostics: a prospective clinical study. Comput Struct Biotechnol J 2025; 28: 9–15.
- 11. Ríos-Hoyo A, Shan NL, Li A, et al. Evaluation of large language models as a diagnostic aid for complex medical cases. Front Med (Lausanne) 2024; 11: 1380148.
- 12. Schwaiger BJ, Gersing AS, Mbapte Wamba J, et al. Can signal abnormalities detected with MR imaging in knee articular cartilage be used to predict development of morphologic cartilage defects? 48-month data from the osteoarthritis initiative. Radiology 2016; 281: 158–167.
- 13. Wang Y, Deng P, Liu Y, et al. Alpha-ketoglutarate ameliorates age-related osteoporosis via regulating histone methylations. Nat Commun 2020; 11: 5596.
- 14. Gomez E, Cazanave C, Cunningham SA, et al. Prosthetic joint infection diagnosis using broad-range PCR of biofilms dislodged from knee and hip arthroplasty surfaces using sonication. J Clin Microbiol 2012; 50: 3501–3508.
- 15. Zhang C, Feng J, Wang S, et al. Incidence of and trends in hip fracture among adults in urban China: a nationwide retrospective cohort study. PLoS Med 2020; 17: e1003180.
- 16. Moldovan F, Moldovan L. The impact of total hip arthroplasty on the incidence of hip fractures in Romania. J Clin Med 2025; 14: 4636.
- 17. Brännström J, Lövheim H, Gustafson Y, et al. Association between antidepressant drug use and hip fracture in older people before and after treatment initiation. JAMA Psychiatry 2019; 76: 172–179.
- 18. Chen Y, Yang C, Dai Q, et al. Gold-nanosphere mitigates osteoporosis through regulating TMAO metabolism in a gut microbiota-dependent manner. J Nanobiotechnol 2023; 21: 25.
- 19. Moldovan F, Moldovan L. A modeling study for hip fracture rates in Romania. J Clin Med 2025; 14: 3162.
- 20. Guzon-Illescas O, Perez Fernandez E, Crespí Villarias N, et al. Mortality after osteoporotic hip fracture: incidence, trends, and associated factors. J Orthop Surg Res 2019; 14: 20306.
- 21. Nauth A, Creek AT, Zellar A, et al. Fracture fixation in the operative management of hip fractures (FAITH): an international, multicentre, randomised controlled trial. Lancet 2017; 389: 1519–1527.
- 22. Cuschieri S. The STROBE guidelines. Saudi J Anaesth 2019; 13: S31–S34.
- 23. Ren C, Duan Y, Mei Y, et al. Evaluation on AGI/GPT based on the DIKWP XV: an instruction set for civil judgment document parsing. 2023.
- 24. Sarmiento LF, Lopes da Cunha P, Tabares S, et al. Decision-making under stress: a psychological and neurobiological integrative model. Brain Behav Immun Health 2024; 38: 100766.
- 25. Arnold M, Goldschmitt M, Rigotti T. Dealing with information overload: a comprehensive review. Front Psychol 2023; 14: 1122200.
- 26. Neha F, Bhati D. A survey of DeepSeek models. 2025.
- 27. Zhou M, Pan Y, Zhang Y, et al. Evaluating AI-generated patient education materials for spinal surgeries: comparative analysis of readability and DISCERN quality across ChatGPT and DeepSeek models. Int J Med Inf 2025; 198: 105871.
- 28. Yang L, Xu S, Sellergren A, et al. Advancing multimodal medical capabilities of Gemini. 2024; arXiv:2405.03162.
- 29. Saab K, Freyberg J, Park C, et al. Advancing conversational diagnostic AI with multimodal reasoning. 2025.
- 30. Shaw M, Pelecanos AM, Mudge AM. Evaluation of internal medicine physician or multidisciplinary team comanagement of surgical patients and clinical outcomes: a systematic review and meta-analysis. JAMA Netw Open 2020; 3: e204088.
- 31. Yang Y, Jin Q, Leaman R, et al. Ensuring safety and trust: analyzing the risks of large language models in medicine. 2024.
- 32. Pressman SM, Borna S, Gomez-Cabello CA, et al. Clinical and surgical applications of large language models: a systematic review. J Clin Med 2024; 13: 3041.
- 33. Guo D, Zhu Q, Yang D, et al. DeepSeek-Coder: when the large language model meets programming–the rise of code intelligence. 2024.
- 34. Team GR, Abeyruwan S, Ainslie J, et al. Gemini robotics: bringing AI into the physical world. 2025.
- 35. Hagendorff T, Fabi S, Kosinski M. Human-like intuitive behavior and reasoning biases emerged in large language models but disappeared in ChatGPT. Nat Comput Sci 2023; 3: 833–838.