Abstract
Background
The application of artificial intelligence (AI) in medical education and patient interaction is rapidly growing. Large language models (LLMs) such as GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus have shown potential in providing relevant medical information. This study aims to evaluate and compare the performance of these LLMs in answering frequently asked questions (FAQs) about Total Knee Arthroplasty (TKA), with a specific focus on the impact of role-playing prompts.
Methods
Four leading LLMs—GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus—were evaluated using ten standardized patient inquiries related to TKA. Each model produced two distinct responses per question: one generated under zero-shot prompting (question-only), and one under role-playing prompting (instructed to simulate an experienced orthopaedic surgeon). Four orthopaedic surgeons evaluated responses for accuracy and comprehensiveness on a 5-point Likert scale, along with a binary measure for acceptability. Statistical analyses (Wilcoxon rank sum and Chi-squared tests; P < 0.05) were conducted to compare model performance.
Results
ChatGPT-4 with role-playing prompts achieved the highest scores for accuracy (3.73), comprehensiveness (4.05), and acceptability (77.5%), followed closely by ChatGPT-3.5 with role-playing prompts (3.70, 3.85, and 72.5%, respectively). Google Gemini and Claude 3 Opus demonstrated lower performance across all metrics. In between-model comparisons based on zero-shot prompting, ChatGPT-4 achieved significantly higher accuracy and comprehensiveness scores than Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002), and demonstrated significantly higher acceptability than Claude 3 Opus (P = 0.006). Within-model comparisons showed that role-playing significantly improved all metrics for ChatGPT-3.5 (P < 0.05) and acceptability for ChatGPT-4 (P = 0.033). No significant prompting effects were observed for Gemini or Claude.
Conclusions
This study demonstrates that role-playing prompts significantly enhance the performance of LLMs, particularly for ChatGPT-3.5 and ChatGPT-4, in answering FAQs related to TKA. ChatGPT-4, with role-playing prompts, showed superior performance in terms of accuracy, comprehensiveness, and acceptability. Despite occasional inaccuracies, LLMs hold promise for improving patient education and clinical decision-making in orthopaedic practice.
Clinical trial number
Not applicable.
Supplementary Information
The online version contains supplementary material available at 10.1186/s12911-025-03024-5.
Keywords: Large language models, Total knee arthroplasty, Artificial intelligence, Prompting
Background
Recent advancements in artificial intelligence (AI) and natural language processing (NLP) have substantially enhanced the ability of large language models (LLMs) to deliver precise and contextually relevant responses in healthcare. State-of-the-art models like ChatGPT, Google Gemini, and Claude are leading innovations in clinical decision support, patient education, and medical training [1–4]. As the internet continues to be a primary source of health information [5, 6], patients are increasingly relying on chatbots for guidance on preoperative preparation and answers to frequently asked questions (FAQs). This trend is particularly evident with the growing demand for total knee arthroplasty (TKA) [7], where patients seek trustworthy information to navigate their treatment options and optimize surgical outcomes.
Recent studies have highlighted ChatGPT’s potential as a supplementary tool for TKA preoperative education [8–13]. As LLMs continue to advance, researchers are comparing the effectiveness of different chatbots, such as GPT-3.5 and GPT-4 [14–17], as well as other models [18–22], in delivering accurate information. However, research specifically comparing the responses of different LLMs to TKA-related FAQs remains limited. Furthermore, concerns regarding the accuracy of information emphasize the need for cautious use of chatbots as standalone tools. The quality of chatbot responses is heavily influenced by prompt structure, which has led to advances in “prompt engineering” [23]. Specifically, role-playing prompts have emerged as a factor in improving chatbot accuracy. For example, prompting ChatGPT-3.5 to act as an herbal medicine expert increased its accuracy on herb–drug interaction queries from 60% to 100% [24]. However, studies applying role-play prompting in the medical field remain scarce.
To our knowledge, no prior studies have systematically compared the performance of multiple LLMs in addressing TKA inquiries using varied prompting methods. In response to this gap, our research employed four publicly available LLMs—GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus—with two primary objectives: (1) evaluate and compare the performance of these LLMs in terms of accuracy, comprehensiveness, and acceptability when responding to queries related to TKA; and (2) assess whether role-playing prompts, simulating a specialist’s perspective, serve as an effective prompting strategy for improving response quality. We hypothesized that role-playing prompts would substantially enhance the quality of AI-generated responses across multiple evaluative dimensions.
Methods
In this study, four leading large language models (LLMs) were evaluated by testing their proficiency in answering questions related to TKA. A set of ten questions, crafted by one of the researchers and refined by a panel to reflect common inquiries from patients in our clinical practice, was posed to each LLM (Table 1) [25]. This study was approved by the Institutional Review Board, and the requirement for patient consent was waived (Additional file 1). The ten questions were presented to the four LLMs using two different prompting methods, zero-shot and role-play, yielding a total of eighty responses. Zero-shot prompting presented the question without any persona instructions. In contrast, role-play prompting instructed the AI to simulate an experienced orthopaedic surgeon, e.g., “I want you to act as an experienced orthopaedic surgeon specializing in total knee arthroplasty for over ten years. You have published extensively on the subject, contributed to knee implant designs, and are recognized for your innovation, empathy, and patient care.” Following this prompt, responses were generated based on this specialized persona.
Table 1.
Frequently asked questions about total knee arthroplasty
| Q | Question |
|---|---|
| Q1 | What occurs during a total knee arthroplasty operation? |
| Q2 | When is it necessary for me to consider total knee arthroplasty? |
| Q3 | Are there any alternatives to total knee arthroplasty? |
| Q4 | Can the choice of implant material or design enhance my results after total knee arthroplasty? |
| Q5 | Is undergoing total knee arthroplasty considered to carry high risks? |
| Q6 | What is the expected recovery duration following total knee arthroplasty? |
| Q7 | How many years can I expect the knee arthroplasty to last? |
| Q8 | What are the common reasons for a knee arthroplasty to fail? |
| Q9 | I’ve been diagnosed with a periprosthetic infection following total knee arthroplasty. What steps should I take? |
| Q10 | Should I choose minimally invasive surgical techniques over conventional approaches when undergoing total knee arthroplasty? |
Q, question
The responses to each query were generated from March 20 to March 24, 2024. Each question was presented to each model only once under each prompting condition (zero-shot and role-playing), without any follow-up, rephrasing, or clarification, to mirror the single-turn scenario often encountered in patient inquiries. To avoid memory bias, each LLM session was reset after every question, and zero-shot and role-playing responses were generated in separate sessions.
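The study does not describe programmatic access to the models; purely to illustrate how the two prompting conditions differ in structure, the sketch below shows one way such single-turn queries could be reproduced through an API. The `openai` client, the model name, and the helper function are assumptions for illustration only; the persona text is the role-playing prompt quoted above.

```python
# Minimal sketch of the two prompting conditions (zero-shot vs. role-playing).
# Illustrative only: the openai client, model name, and helper are assumptions,
# not the workflow used in this study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ROLE_PLAY_PERSONA = (
    "I want you to act as an experienced orthopaedic surgeon specializing in "
    "total knee arthroplasty for over ten years. You have published extensively "
    "on the subject, contributed to knee implant designs, and are recognized for "
    "your innovation, empathy, and patient care."
)

def ask_once(question: str, role_play: bool, model: str = "gpt-4") -> str:
    """Send one question in a fresh, single-turn conversation and return the reply."""
    messages = []
    if role_play:
        # Role-playing condition: prepend the surgeon persona before the question.
        messages.append({"role": "system", "content": ROLE_PLAY_PERSONA})
    # Zero-shot condition: the question is sent with no persona instructions.
    messages.append({"role": "user", "content": question})
    response = client.chat.completions.create(model=model, messages=messages)
    return response.choices[0].message.content

# Example: one FAQ, answered once under each prompting condition.
q1 = "What occurs during a total knee arthroplasty operation?"
zero_shot_reply = ask_once(q1, role_play=False)
role_play_reply = ask_once(q1, role_play=True)
```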
The responses from each LLM were collected, anonymized, and formatted into uniform plain text for evaluation. Four experienced faculty members from the Department of Orthopaedic Surgery at Chang Gung Memorial Hospital, all blinded to the origin of the responses, were tasked with grading them. This blinding was a critical measure to ensure an unbiased assessment of each model’s ability to handle specialized medical queries effectively. The full list of question prompts and corresponding responses can be found in Additional file 2, and the surgeon scoring questionnaire for TKA FAQs is included as Additional file 3.
The responses to each question were compiled in a spreadsheet and sent to a panel of four orthopaedic surgeons for evaluation. The four orthopaedic surgeons who evaluated the responses were independent and had no affiliations or financial interests with the companies providing the AI programs. They assessed both the accuracy and comprehensiveness of each response using a 5-point Likert scale. This method has been employed in previous studies examining the accuracy and comprehensiveness of medical information from online resources [11, 13, 18, 26–29]. Clear instructions defining each characteristic and score were provided, as shown in Table 2. In brief, accuracy scores reflected factual correctness, ranging from significant inaccuracies (score 1) to error-free responses (score 5); comprehensiveness scores ranged from incomplete answers (score 1) to exhaustive, in-depth responses (score 5). The final score for each response was calculated as the mean of the reviewers’ scores. In cases of notable discrepancies (score differences > 2 points), the disagreement was resolved through consensus discussion, resulting in a final agreed-upon score.
Table 2.
Scoring criteria for accuracy and comprehensiveness
| Score | Accuracy: agree with | Accuracy: description | Comprehensiveness: level | Comprehensiveness: description |
|---|---|---|---|---|
| 1 | 0–25% | Responses containing significant inaccuracies that could potentially mislead patients and cause harm. | Incomplete | Responses are significantly lacking in detail and only address a minor part of the question. |
| 2 | 26–50% | Only achieves a basic level of accuracy. | Partial | Responses provide minimal and basic details, yet lack critical elements. |
| 3 | 51–75% | Responses contain potential factual inaccuracies but are unlikely to mislead or harm the patient. | Adequate | Responses offer a fair amount of detail and the main points needed to grasp the topic, yet they fall short in depth. |
| 4 | 76–99% | Nearly all key points are accurate. | Thorough | Responses encompass most of the essential aspects of the question. |
| 5 | 100% | Error-free response. | Exhaustive | Responses deliver exhaustive details and a comprehensive answer, offering deep insights into the topic. |
Acceptability, a key factor in healthcare interventions, was assessed using the method proposed by Wright [11]. This method employs a binary question for reviewers: “Would you be comfortable if this were the only information your patient received for their question?” Reviewers evaluated whether the information provided was both sufficient and safe for patients, aligning with the core principle of patient well-being [30]. For analysis, a reviewer-level approach was used, in which each rating was counted independently; the percentage of acceptability was calculated as the total number of “yes” responses divided by the total number of reviewer ratings (n = 320). In addition, evaluators were asked to comment on the responses, indicating any inaccuracies, incompleteness, or contradictions to their clinical practice. Comments could also highlight positive aspects of the answer.
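To make the scoring workflow concrete, the sketch below aggregates the four reviewers’ ratings per response, flags score discrepancies greater than two points for consensus discussion, and computes the reviewer-level acceptability percentage. The data frame layout, column names, and example values are hypothetical; only the aggregation rules come from the procedure described above.

```python
# Hypothetical reviewer-level ratings: one row per reviewer per response.
import pandas as pd

ratings = pd.DataFrame({
    "response_id":       [1, 1, 1, 1, 2, 2, 2, 2],
    "accuracy":          [4, 5, 3, 4, 2, 3, 5, 3],  # 5-point Likert (Table 2)
    "comprehensiveness": [4, 4, 4, 5, 3, 3, 4, 3],  # 5-point Likert (Table 2)
    "acceptable":        [1, 1, 0, 1, 0, 0, 1, 0],  # binary yes (1) / no (0)
})

# Final score per response = mean of the four reviewers' scores.
per_response = ratings.groupby("response_id").agg(
    mean_accuracy=("accuracy", "mean"),
    mean_comprehensiveness=("comprehensiveness", "mean"),
    accuracy_spread=("accuracy", lambda s: s.max() - s.min()),
    comprehensiveness_spread=("comprehensiveness", lambda s: s.max() - s.min()),
)

# Discrepancies > 2 points between reviewers are resolved by consensus discussion.
needs_consensus = per_response[
    (per_response["accuracy_spread"] > 2) | (per_response["comprehensiveness_spread"] > 2)
]

# Reviewer-level acceptability: "yes" ratings divided by all reviewer ratings.
acceptability_pct = 100 * ratings["acceptable"].sum() / len(ratings)

print(per_response)
print("Needs consensus:", needs_consensus.index.tolist())
print(f"Acceptability: {acceptability_pct:.1f}%")
```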
Data analyses
Continuous data were presented as means and ranges (minimum, maximum). Categorical data were reported as proportions and percentages. The Wilcoxon rank sum test was used to compare continuous variables. For categorical data, Pearson Chi-squared tests were applied. The 95% confidence intervals (CIs) were calculated, with statistical significance set at a P value < 0.05. To assess inter-rater reliability, intraclass correlation coefficients (ICCs) were calculated using a two-way mixed-effects model [ICC(3, k), consistency] [31]. Statistical analyses were conducted using SPSS Statistics, Version 25.0 (IBM).
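The analyses were run in SPSS; as a rough open-source equivalent, the sketch below reproduces the three reported procedures with SciPy and pingouin. The accuracy arrays and the rater data are hypothetical placeholders, while the acceptability counts are the ChatGPT-3.5 values reported in Table 3.

```python
# Open-source stand-ins for the SPSS analyses (SciPy + pingouin); illustrative only.
import numpy as np
import pandas as pd
import pingouin as pg
from scipy import stats

# 1) Wilcoxon rank sum test for a continuous metric across prompting conditions
#    (hypothetical accuracy ratings for one model).
zero_shot = np.array([3, 4, 3, 2, 4, 3, 3, 4, 3, 4])
role_play = np.array([4, 4, 3, 3, 5, 4, 3, 4, 4, 4])
stat, p_ranksum = stats.ranksums(zero_shot, role_play)

# 2) Pearson chi-squared test for acceptability counts
#    (ChatGPT-3.5 counts from Table 3: 18/40 zero-shot vs. 29/40 role-playing).
counts = np.array([[18, 22],   # zero-shot: acceptable, not acceptable
                   [29, 11]])  # role-playing: acceptable, not acceptable
chi2, p_chi2, dof, expected = stats.chi2_contingency(counts, correction=False)

# 3) Inter-rater reliability: two-way mixed-effects, consistency ICC(3, k)
#    (hypothetical long-format ratings: 10 responses x 4 raters).
rng = np.random.default_rng(0)
long = pd.DataFrame({
    "response": np.repeat(np.arange(10), 4),
    "rater": np.tile(["R1", "R2", "R3", "R4"], 10),
    "score": rng.integers(2, 6, size=40),
})
icc_table = pg.intraclass_corr(data=long, targets="response", raters="rater", ratings="score")
icc3k = icc_table.loc[icc_table["Type"] == "ICC3k", "ICC"].iloc[0]

print(f"rank sum P = {p_ranksum:.3f}, chi-squared P = {p_chi2:.3f}, ICC(3,k) = {icc3k:.3f}")
```

Note that SPSS may apply a continuity correction to 2 × 2 tables depending on settings; `correction=False` returns the uncorrected Pearson statistic.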
Results
The analysis of four LLMs yielded a mean accuracy score of 3.38 (95% CI 2.67–4.09) across all responses, indicating alignment with 51–75% of the information presented. The mean comprehensiveness score was 3.48 (95% CI 2.65–4.31), surpassing the ‘adequate level’ threshold (Table 2). Among all responses assessed by the four reviewers, 48.8% (156/320) were deemed acceptable.
Table 3 demonstrates that ChatGPT-4 with role-playing prompts achieved the highest mean scores for accuracy (3.73) and comprehensiveness (4.05), as well as the highest acceptability rate (77.5%) among the four LLMs tested. ChatGPT-3.5 with role-playing prompts followed closely, with mean scores of 3.70 for accuracy and 3.85 for comprehensiveness, and an acceptability rate of 72.5%. Google Gemini and Claude 3 Opus showed lower performance across all metrics, with and without role-playing prompts.
Table 3.
Within-model comparisons of large language models performance in accuracy, comprehensiveness, and acceptability between zero-shot and role-playing prompts
| LLM | Metric | Zero-shot, mean (minimum, maximum) | Role-playing, mean (minimum, maximum) | P value |
|---|---|---|---|---|
| ChatGPT-3.5 | Accuracy | 3.33 (2–5) | 3.70 (2–5) | *0.020* a |
| | Comprehensiveness | 3.43 (2–5) | 3.85 (2–5) | *0.026* |
| | Acceptability (%) | 18/40 (45) | 29/40 (72.5) | *0.012* |
| ChatGPT-4 | Accuracy | 3.45 (2–5) | 3.73 (2–5) | 0.058 |
| | Comprehensiveness | 3.68 (2–5) | 4.05 (2–5) | 0.053 |
| | Acceptability (%) | 22/40 (55) | 31/40 (77.5) | *0.033* |
| Google Gemini | Accuracy | 3.13 (2–5) | 3.25 (2–5) | 0.425 |
| | Comprehensiveness | 3.18 (2–5) | 3.18 (2–5) | 1.000 |
| | Acceptability (%) | 16/40 (40) | 16/40 (40) | 1.000 |
| Claude 3 Opus | Accuracy | 3.10 (2–5) | 3.40 (2–5) | 0.056 |
| | Comprehensiveness | 3.13 (2–5) | 3.35 (2–5) | 0.117 |
| | Acceptability (%) | 10/40 (25) | 14/40 (35) | 0.329 |
Continuous data are presented as mean (minimum, maximum) and categorical data as number (percent)
LLM, large language model
a Statistically significant values are italicized
Table 4 summarizes the statistical comparisons between models using responses generated under zero-shot prompting. ChatGPT-4 demonstrated significantly higher accuracy and comprehensiveness scores compared to both Google Gemini (P = 0.031 and P = 0.009, respectively) and Claude 3 Opus (P = 0.019 and P = 0.002, respectively). Additionally, ChatGPT-4 had a significantly higher acceptability rate compared to Claude 3 Opus (P = 0.006). No other between-model differences in acceptability reached statistical significance.
Table 4.
Between-model comparisons of accuracy, comprehensiveness, and acceptability for responses generated under zero-shot prompting
| LLM comparison | Accuracy, P value | Comprehensiveness, P value | Acceptability, P value |
|---|---|---|---|
| ChatGPT-3.5 vs. ChatGPT-4 | 0.430 | 0.201 | 0.371 |
| ChatGPT-3.5 vs. Google Gemini | 0.199 | 0.173 | 0.651 |
| ChatGPT-3.5 vs. Claude 3 Opus | 0.145 | 0.079 | 0.061 |
| ChatGPT-4 vs. Google Gemini | *0.031* a | *0.009* | 0.179 |
| ChatGPT-4 vs. Claude 3 Opus | *0.019* | *0.002* | *0.006* |
| Google Gemini vs. Claude 3 Opus | 0.862 | 0.756 | 0.152 |
LLM, large language model
a Statistically significant values are italicized
When prompted to act as an experienced orthopaedic surgeon specializing in TKA, role-playing significantly improved the performance of ChatGPT-3.5 across all three metrics: accuracy (P = 0.02), comprehensiveness (P = 0.026), and acceptability (P = 0.012) (Table 3). Role-playing prompts also significantly enhanced the acceptability rate of ChatGPT-4 (P = 0.033). However, no significant improvements were observed for Google Gemini or Claude 3 Opus when using role-playing prompts.
The overall ICC for accuracy and comprehensiveness ratings across all LLMs and prompting types was 0.632, indicating moderate agreement among reviewers [31]. This level of consistency supports the reliability of the expert evaluations used in this study.
Discussion
The aim of this study was to evaluate and compare the performance of four large language models (GPT-3.5, GPT-4, Google Gemini, and Claude 3 Opus) in answering frequently asked questions about TKA, with a specific focus on the impact of role-playing prompts. We hypothesized that role-playing prompts would enhance the accuracy, comprehensiveness, and acceptability of AI responses to TKA-related FAQs. Our major findings support this hypothesis: (1) among the four LLMs tested, ChatGPT-4 with role-playing prompts achieved the best performance in answering TKA queries, followed closely by ChatGPT-3.5 with role-playing prompts; and (2) role-playing prompts significantly improved ChatGPT-3.5’s performance across all metrics and increased ChatGPT-4’s acceptability.
In the current study, a clear disconnect was observed between the mean accuracy and comprehensiveness scores and the overall acceptability rate across the four large language models. While the average accuracy and comprehensiveness ratings were 3.38 and 3.48, respectively, only 48.8% (156/320) of responses were judged acceptable. This discrepancy likely reflects the more stringent threshold applied to acceptability, which required responses to be both factually accurate and sufficiently comprehensive for use as standalone patient information. In contrast, Likert-scale ratings allow for partial credit, whereby responses with moderate scores may still be deemed unacceptable if they omit essential clinical details or contain ambiguities.
Prior research has demonstrated ChatGPT’s capability to answer TKA queries with above-average quality [9–11, 13], reaching a performance level comparable to that of arthroplasty-trained nurses [8]. Studies have also highlighted its ability to generate clinically relevant responses [10, 13]. However, the present study found that despite a mean accuracy rating above three (indicating agreement with more than 50% of the content) and adequate comprehensiveness, ChatGPT-3.5’s overall performance did not match the standards reported in similar studies [10, 11, 13]. Specifically, only 45% of its responses were deemed acceptable, compared with 59.2% in a prior study [11]. Upon reviewing the scoring results, responses to Questions 7 and 9 were uniformly considered unacceptable owing to the lack of probabilistic data and insufficient surgical details for periprosthetic joint infection, respectively.
Between-model comparisons revealed that ChatGPT-4 outperformed GPT-3.5 across key metrics (accuracy, comprehensiveness, and acceptability), although the differences did not reach statistical significance, particularly when compared with GPT-3.5 under role-playing prompts (Tables 3 and 4). In the Polish Medical Final Examination, GPT-4 achieved significantly higher mean accuracies than GPT-3.5, passing all versions of the exam [16]. Similarly, GPT-4 demonstrated superior performance in the American Academy of Ophthalmology Basic and Clinical Science Course self-assessment test [15], the Japanese Medical Licensing Examination [14], and Radiology Diagnosis Please cases [17]. These findings highlight the rapid advancement of language model capabilities and the potential of GPT-4 to surpass its predecessor, GPT-3.5, across various medical domains and question formats.
Additionally, we found ChatGPT-4 demonstrated significantly greater accuracy and comprehensiveness compared to Google Gemini and Claude 3 Opus (Table 4). Recent research has evaluated the comparative performance of various LLMs across multiple medical domains. In the field of dentistry, ChatGPT-4 outperformed ChatGPT-3.5, Bing Chat, and Google Bard in addressing clinically relevant questions [19]. Similarly, for myopia care-related queries, ChatGPT-4 exhibited superior accuracy compared to GPT-3.5 and Google Bard, while maintaining comparable comprehensiveness [18]. Furthermore, in surgical planning for ophthalmology, ChatGPT models outperformed Google Gemini in terms of the Global Quality Score [21].
ChatGPT-4’s performance has also been evaluated in standardized medical testing, with the model achieving a perfect score (100%) on the National Board of Medical Examiners exam, surpassing GPT-3.5, Claude, and Google Bard [22]. A systematic analysis published in Nature Communications further highlighted ChatGPT-4’s superior performance over both GPT-3.5 and traditional Google searches across various medical scenarios. However, the open-source LLM Llama 2 was found to perform slightly less effectively compared to its counterparts [20]. These findings underscore ChatGPT-4’s advanced capabilities and its potential to enhance clinical decision-making and medical education.
Prompt engineering is crucial for optimizing the accuracy and informativeness of AI-generated responses in medical education [23, 32]. Our study found that role-playing prompts tended to improve LLM performance in answering queries about TKA, significantly improving ChatGPT-3.5’s performance across all metrics and boosting ChatGPT-4’s acceptability (Table 3). These findings are supported by other research: a study from Thailand demonstrated that when ChatGPT-3.5 was prompted to assume the role of an herbalist, the accuracy of its responses increased by 8.98% compared with zero-shot prompts; specifically, this role-playing approach raised accuracy on herb–drug interaction queries from 60% to 100% [24]. Moreover, another study confirmed that role-play prompting surpassed standard zero-shot approaches across various reasoning benchmarks, effectively serving as an implicit trigger for the chain-of-thought process [33]. These results affirm the efficacy of role-play prompting in enhancing LLM reasoning capabilities, emphasizing its value in medical education and other fields.
Despite the potential of AI-driven responses, LLMs such as ChatGPT have raised concerns regarding reliability and consistency when applied to TKA, primarily due to limitations in training datasets. Studies have revealed reliability issues, including the fabrication of citations and references to non-existent studies or journals [9, 12]. These inaccuracies, known as “hallucinations,” pose risks of spreading misinformation, potentially leading to an AI-driven “infodemic” that could threaten public health [34]. Moreover, ChatGPT’s response consistency to repeated queries on TKA has been debated. Some studies highlighted inconsistencies, noting inferior consistency scores [10], while others found no significant differences in accuracy or comprehensiveness upon repeated questioning [11]. In addition to these concerns, recent findings also suggest that role-play prompting is a double-edged phenomenon: while it improves the contextual accuracy and relevance of LLM responses, it may also amplify biases and generate harmful outputs. This trade-off is particularly concerning in high-stakes fields such as healthcare, where misaligned content could undermine trust and reinforce inequities [35].
While many of the LLMs’ responses demonstrated scientific accuracy and relevance, these qualities were not consistently sufficient to meet clinical acceptability standards. Table 5 outlines several critical issues that limited their overall utility. For instance, Google Gemini overstated the effectiveness of topical medications and the role of arthroscopic surgery in osteoarthritis management, contradicting established guidelines. Claude 3 Opus also provided inaccurate information regarding implants and preoperative antibiotics, which are particularly relevant in patients who are septic or too unstable to wait for culture data. As a result, both models exhibited low acceptability rates (≤ 40%), with most inaccuracies observed in responses generated under zero-shot prompts rather than role-playing prompts. Additionally, responses on implant design often lacked completeness, and most responses omitted probabilistic data. A recurring issue with role-playing prompts was the use of the first-person perspective, which may be perceived as less objective and can introduce personal bias. We believe additional biases include the reinforcement of existing prejudices in the training data and the tendency to provide overly confident answers without sufficient evidence.
Table 5.
Evaluators’ comments on large language model responses (exact copies)
| LLM | Prompting | Comments |
|---|---|---|
| ChatGPT-3.5 | Zero-shot | - Material choice impact survival, as well as the occurrence of TKA complications, such as PJI, loosening, instability. (Q7) |
| | Role-playing | - Very friendly and easy to understand. (Q2)<br>- Lack of Vitamin-E poly, rotating platform, and fixation method information. (Q4) |
| ChatGPT-4 | Role-playing | - The question is about what happens “during” total knee arthroplasty. The answer gives too much information about pre-OP and post-OP, which seems redundant for this specific question. (Q1)<br>- Lack of CR/PS design, medial pivot design. (Q4)<br>- “innovations” are not accurate, for example, MIS approach does not guarantee to lessen tissue damage and speed up recovery. (Q5)<br>- No first-person perspective. (Q7) |
| Google Gemini | Zero-shot | - Too easy and lack of professional knowledge. (Q2)<br>- Many misleading information (Q3)<br>- Topical medication is not evidence based. Should specify that it may be effective only in superficial tendinitis, not osteoarthritis.<br>- Arthroscopic surgery is not effective in OA knee, but may be effective in symptoms associated with synovitis, loose body, meniscus symptoms.<br>- Lack in information about UKA.<br>- Should provide approximate incidence rate of each complication. Lack of instability. (Q5) |
| | Role-playing | - I don’t think it’s suitable to use a first-person perspective in an AI-generated response. (Q2) |
| Claude 3 Opus | Zero-shot | - Difficult for patients to understand. (Q2)<br>- Should be no ceramic on ceramic design, and patient-specific implant. (Q4)<br>- Better give probabilities. (Q8) |
| | Role-playing | - Short but well-constructed. (Q2)<br>- Point by point response is better understood. (Q5)<br>- No first-person perspective. (Q7)<br>- Wrong about preoperative antibiotics. (Q9) |
LLM, large language model; TKA, total knee arthroplasty; PJI, periprosthetic joint infection; Q, question; OP, operation; CR, cruciate-retaining; PS, posterior-stabilized; OA, osteoarthritis; UKA, unicompartmental knee arthroplasty
This study has several noteworthy limitations. First, the questions posed may lack comprehensiveness and were not systematically categorized, which could limit the evaluation’s scope and obscure potential pitfalls of the LLMs. Second, the LLMs utilized in this study did not have access to up-to-date external information sources, potentially compromising the recency and accuracy of their responses. Third, the specific wording of questions might have significantly influenced the results, highlighting the sensitivity of LLM responses to prompt variations. Moreover, the evaluation relied on only four orthopaedic surgeons applying subjective criteria, which could introduce bias and affect the reliability of the findings. Additionally, while role-playing prompts improved expert-rated quality, it remains uncertain whether the increased technical detail enhances patient comprehension. This study focused on clinical accuracy and relevance as evaluated by orthopaedic specialists and did not include a formal readability analysis. While tools such as the Flesch-Kincaid Grade Level may provide valuable insight, assessing patient comprehension represents a distinct dimension that warrants dedicated investigation; subsequent work should examine whether increased technical detail enhances or hinders patient understanding. Lastly, the cost of some advanced LLMs may limit their accessibility and practical application in resource-constrained settings, affecting the generalizability of the results.
Future research should focus on developing more comprehensive evaluation methods and addressing the inherent weaknesses of LLMs. Engaging medical professionals with up-to-date resources and expanding evaluations to include a diverse group of experts will be crucial. Based on our findings, we believe that using advanced large language models such as ChatGPT-4 with role-playing prompts may reduce the risk of misinformation compared with unverified online sources. These models can streamline pre- and postoperative patient education, reducing consultation times and helping ensure patients are well informed. By handling routine patient inquiries, LLMs can alleviate cognitive load, allowing surgeons to focus on critical clinical activities.
Conclusions
Role-playing prompts significantly enhanced response quality from ChatGPT-4 and GPT-3.5, yielding more contextually rich and detailed answers. ChatGPT-4, particularly with role-playing, demonstrated superior accuracy, comprehensiveness, and acceptability compared to other models. However, occasional inaccuracies and incomplete information persist across models, highlighting the need for further refinement to ensure consistent reliability. These findings underscore the potential of LLMs, especially with role-specific prompts, to streamline patient education, reduce consultation times, and ensure patients are well-informed, thereby easing cognitive demands on surgeons and allowing them to concentrate on essential clinical tasks.
Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary Material 1: Additional file 1.pdf: Institutional Review Board certification
Supplementary Material 2: Additional file 2.docx: Detailed Responses from Four Large Language Models to Total Knee Arthroplasty Queries
Supplementary Material 3: Additional file 3.docx: Surgeon Scoring Questionnaire for Total Knee Arthroplasty Frequently Asked Questions
Acknowledgements
Not applicable.
Abbreviations
- AI
Artificial Intelligence
- CIs
Confidence Intervals
- CR
Cruciate-Retaining
- FAQ
Frequently Asked Questions
- IRB
Institutional Review Board
- LLM
Large Language Model
- MIS
Minimally Invasive Surgery
- NLP
Natural Language Processing
- OA
Osteoarthritis
- OP
Operation
- PJI
Periprosthetic Joint Infection
- PS
Posterior-Stabilized
- Q
Question
- TKA
Total Knee Arthroplasty
- UKA
Unicompartmental Knee Arthroplasty
Author contributions
All authors contributed to the conception and design of the study. C.P.Y. and Y.C.L. formulated the idea and designed the study. Y.C.C. collected responses from the LLMs. C.P.Y., Y.C.L., H.S., and S.H.L. evaluated and scored the responses. Y.C.C., S.H. Lin, and S.C.F. conducted the analysis. The first draft of the article was written by Y.C.C. C.C.H. critically reviewed the manuscript. All authors have read and approved the submitted version.
Funding
Not applicable.
Data availability
Responses from large language models and the questionnaire are provided in Additional files 2 and 3. Scoring data are available from the corresponding author upon reasonable request.
Declarations
Ethics approval and consent to participate
This study was approved by the Institutional Review Board (IRB) of Chang Gung Medical Foundation (IRB No.: 202500062B0, Additional file 1) and adhered to the ethical principles outlined in the Declaration of Helsinki. As the study evaluated large language model responses without involving human participants, patient data, or identifiable information, the IRB waived the requirement for patient consent. Informed consent was obtained from the orthopaedic surgeons who evaluated the responses, as approved by the IRB.
Consent for publication
Not applicable. This study did not involve human participants or data requiring consent for publication.
Declaration of Generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used ChatGPT in order to improve readability of the manuscript. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Competing interests
The authors declare no competing interests.
Footnotes
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cheng-Pang Yang and Yu-Chih Lin contributed equally to this manuscript.
Contributor Information
Cheng-Pang Yang, Email: ronnie80097@gmail.com.
Yu-Chih Lin, Email: B101092127@tmu.edu.tw.
References
- 1. OpenAI. Introducing ChatGPT. https://openai.com/blog/chatgpt. Accessed 6 April 2024.
- 2. Pichai S, Hassabis D. Introducing Gemini: our largest and most capable AI model. https://blog.google/technology/ai/google-gemini-ai/#sundar-note. Accessed 6 April 2024.
- 3. Anthropic. Introducing Claude. https://www.anthropic.com/news/introducing-claude. Accessed 6 April 2024.
- 4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. 2023;29(8):1930–40.
- 5. Finney Rutten LJ, Blake KD, Greenberg-Worisek AJ, Allen SV, Moser RP, Hesse BW. Online health information seeking among US adults: measuring progress toward a Healthy People 2020 objective. Public Health Rep. 2019;134(6):617–25.
- 6. Calixte R, Rivera A, Oridota O, Beauchamp W, Camacho-Rivera M. Social and demographic patterns of health-related internet use among adults in the United States: a secondary data analysis of the Health Information National Trends Survey. Int J Environ Res Public Health. 2020;17(18).
- 7. Singh JA, Yu S, Chen L, Cleveland JD. Rates of total joint replacement in the United States: future projections to 2020–2040 using the National Inpatient Sample. J Rheumatol. 2019;46(9):1134–40.
- 8. Bains SS, Dubin JA, Hameed D, Sax OC, Douglas S, Mont M, et al. Use and application of large language models for patient questions following total knee arthroplasty. J Arthroplasty. 2024. 10.1016/j.arth.2024.03.017.
- 9. Kienzle A, Niemann M, Meller S, Gwinner C. ChatGPT may offer an adequate substitute for informed consent to patients prior to total knee arthroplasty-yet caution is needed. J Pers Med. 2024;14(1).
- 10. Magruder ML, Rodriguez AN, Wong JCJ, Erez O, Piuzzi NS, Scuderi GR, et al. Assessing ability for ChatGPT to answer total knee arthroplasty-related questions. J Arthroplasty. 2024. 10.1016/j.arth.2024.02.023.
- 11. Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, et al. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open. 2024;5(2):139–46.
- 12. Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat generative pretrained transformer (ChatGPT) and Bard: artificial intelligence does not yet provide clinically supported answers for hip and knee osteoarthritis. J Arthroplasty. 2024;39(5):1184–90.
- 13. Zhang S, Liau ZQG, Tan KLM, Chua WL. Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement. Knee Surg Relat Res. 2024;36(1):15.
- 14. Takagi S, Watari T, Erabi A, Sakaguchi K. Performance of GPT-3.5 and GPT-4 on the Japanese medical licensing examination: comparison study. JMIR Med Educ. 2023;9:e48002.
- 15. Taloni A, Borselli M, Scarsi V, Rossi C, Coco G, Scorcia V, et al. Comparative performance of humans versus GPT-4.0 and GPT-3.5 in the self-assessment program of American Academy of Ophthalmology. Sci Rep. 2023;13(1):18562.
- 16. Rosol M, Gasior JS, Laba J, Korzeniewski K, Mlynczak M. Evaluation of the performance of GPT-3.5 and GPT-4 on the Polish medical final examination. Sci Rep. 2023;13(1):20512.
- 17. Li D, Gupta K, Bhaduri M, Sathiadoss P, Bhatnagar S, Chong J. Comparing GPT-3.5 and GPT-4 accuracy and drift in radiology Diagnosis Please cases. Radiology. 2024;310(1):e232411.
- 18. Lim ZW, Pushpanathan K, Yew SME, Lai Y, Sun CH, Lam JSH, et al. Benchmarking large language models’ performances for myopia care: a comparative analysis of ChatGPT-3.5, ChatGPT-4.0, and Google Bard. EBioMedicine. 2023;95:104770.
- 19. Giannakopoulos K, Kavadella A, Aaqel Salim A, Stamatopoulos V, Kaklamanos EG. Evaluation of the performance of generative AI large language models ChatGPT, Google Bard, and Microsoft Bing Chat in supporting evidence-based dentistry: comparative mixed methods study. J Med Internet Res. 2023;25:e51580.
- 20. Sandmann S, Riepenhausen S, Plagwitz L, Varghese J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun. 2024;15(1):2050.
- 21. Carla MM, Gambini G, Baldascino A, Giannuzzi F, Boselli F, Crincoli E, et al. Exploring AI-chatbots’ capability to suggest surgical planning in ophthalmology: ChatGPT versus Google Gemini analysis of retinal detachment cases. Br J Ophthalmol. 2024. 10.1136/bjo-2023-325143.
- 22. Abbas A, Rehman MS, Rehman SS. Comparing the performance of popular large language models on the National Board of Medical Examiners sample questions. Cureus. 2024;16(3):e55991.
- 23. Heston T, Khun C. Prompt engineering in medical education. Int Med Educ. 2023;2(3):198–205.
- 24. Poonprasartporn A, Uttaranakorn P, Chuaybamroong R. Comparing the accuracy and referencing of ChatGPT’s responses to herbal medicine queries: a zero-shot versus roleplay prompting approach. Thai J Pharm Sci. 2023;47(4).
- 25. Chang Gung Memorial Hospital. Frequently asked questions about joint replacement. https://www1.cgmh.org.tw/intr/intr2/c3270/faq_joint.html. Accessed 20 March 2024.
- 26. Dy CJ, Taylor SA, Patel RM, Kitay A, Roberts TR, Daluiski A. The effect of search term on the quality and accuracy of online information regarding distal radius fractures. J Hand Surg Am. 2012;37(9):1881–7.
- 27. Garcia GH, Taylor SA, Dy CJ, Christ A, Patel RM, Dines JS. Online resources for shoulder instability: what are patients reading? J Bone Joint Surg Am. 2014;96(20):e177.
- 28. Mathur S, Shanti N, Brkaric M, Sood V, Kubeck J, Paulino C, et al. Surfing for scoliosis: the quality of information available on the internet. Spine. 2005;30(23):2695–700.
- 29. Wang D, Jayakar RG, Leong NL, Leathers MP, Williams RJ, Jones KJ. Evaluation of the quality, accuracy, and readability of online patient resources for the management of articular cartilage defects. Cartilage. 2017;8(2):112–8.
- 30. Sekhon M, Cartwright M, Francis JJ. Acceptability of healthcare interventions: an overview of reviews and development of a theoretical framework. BMC Health Serv Res. 2017;17(1):88.
- 31. Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155–63.
- 32. OpenAI. Prompt engineering. https://platform.openai.com/docs/guides/prompt-engineering. Accessed 12 May 2024.
- 33. Kong A, Zhao S, Chen H, Li Q, Qin Y, Sun R, et al. Better zero-shot reasoning with role-play prompting. arXiv preprint arXiv:2308.07702. 2023.
- 34. De Angelis L, Baglivo F, Arzilli G, Privitera GP, Ferragina P, Tozzi AE, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health. 2023;11:1166120.
- 35. Zhao J, Qian Z, Cao J, Wang Y, Ding Y. Role-play paradox in large language models: reasoning performance gains and ethical dilemmas. arXiv preprint arXiv:2409.13979. 2024.
