ABSTRACT
Introduction
The effectiveness of ChatGPT responses to common living kidney donation (LKD) queries remains unclear.
Methods
We surveyed nephrologists and living kidney donors/candidates to evaluate ChatGPT‐3.5's accuracy, comprehensiveness, and clarity in answering common donation questions in English and French. Ratings used a 5‐point Likert scale, with percentage agreement and modified Fleiss’ Kappa measuring inter‐rater consistency.
Results
The evaluation of ChatGPT‐3.5's responses varied between nephrologists and kidney donors/candidates. Nephrologists showed low to moderate percentage agreement for English responses (27%–59%) and poor agreement for French responses (9%–45%). Kidney donors/candidates exhibited high agreement for English responses (86%–100%) but lower agreement for French responses (0%–77%). Inter‐rater agreement among nephrologists was substantial for both English (Kappa 0.74, 95% CI: 0.67, 0.79, p < 0.0001) and French (Kappa 0.70, 95% CI: 0.64, 0.77, p < 0.0001). In contrast, inter‐rater agreement was poor among donors/candidates for both English (Kappa −0.10, 95% CI: −0.14, −0.07, p = 0.99) and French (Kappa −0.03, 95% CI: −0.07, 0, p = 0.81).
Conclusion
ChatGPT 3.5's responses to common LKD queries demonstrated limited agreement among nephrologists and kidney donors/donor candidates, highlighting its lack of reliability as a supplement to existing educational materials for living kidney donor programs in English and French.
1. Introduction
Although ChatGPT is used extensively by the general public, with 300 million weekly active users as of December 2024, its effectiveness in providing medical information to the public is unclear [1, 2, 3]. ChatGPT is a type of artificial intelligence, trained on a large database of internet text to generate human‐like responses [4]. The use of ChatGPT may boost health literacy by making critical health information easily accessible to the public. However, its effectiveness in providing specific living kidney donation (LKD) education is unknown. In addition, while many patients turn to ChatGPT for medical inquiries across various fields, including cardiovascular health, surgical procedures, and organ transplantation, the literature has shown inconsistencies in its responses [4, 5, 6, 7, 8]. Despite generally providing accurate responses to patient queries, ChatGPT can produce inaccuracies, particularly when addressing complex medical questions [4, 6]. Additionally, studies suggest that patients and others without a healthcare background may perceive ChatGPT's ability to provide correct and harmless medical information, and rate its reliability in patient care, differently from those with a healthcare background [9, 10].
Accurate and comprehensive LKD education is essential for obtaining informed consent and for bridging the knowledge gap for donors [11]. Various resources, including center‐based, home‐based, and technology‐based modalities, have been implemented across transplant centers to improve the stagnant rates of LKD, but the utility of artificial intelligence, a simple and inexpensive tool, is unclear [12]. It is vital that we also ensure optimal public education about LKD in both of Canada's official languages, English and French. Therefore, this study aims to determine whether ChatGPT is a reliable LKD resource in English and French.
2. Methods
We prompted ChatGPT‐3.5 (OpenAI, 2023) to provide a list of frequently asked questions about LKD. From the resulting 43 questions, we identified the 22 most pertinent by research team consensus (Table S1). We then posed each of these questions to ChatGPT‐3.5 in English and recorded the responses. We posed the same questions in French and recorded the responses in French. We prompted the model five times to ensure reproducibility (Table S2).
We created a survey in Microsoft Forms to evaluate the quality of information provided by ChatGPT‐3.5 by grading the accuracy, comprehensiveness, and clarity of each of the responses in English and French. We pilot tested this survey with one nephrologist for readability. We first emailed all five local, bilingual (French/English) kidney transplant/nephrology specialists, inviting them to participate. They assessed response appropriateness using a 5‐point Likert scale for accuracy, comprehensiveness, and clarity in English and French. We also evaluated whether responses were felt to be superior in English or French (Table 1A,B). To minimize bias, participants were unaware of the LKD information source, and survey completion implied consent.
TABLE 1A.
Likert scale for assessment of English and French responses to ChatGPT‐3.5 queries.
Rating | Accuracy | Comprehensiveness | Clarity |
---|---|---|---|
1 | Very accurate | Very comprehensive | Very clear |
2 | Accurate | Comprehensive | Clear |
3 | Somewhat inaccurate | Somewhat incomprehensive | Somewhat unclear |
4 | Inaccurate | Incomprehensive | Unclear |
5 | Very inaccurate | Very incomprehensive | Very unclear |
TABLE 1B.
Likert scale for assessment of language of ChatGPT‐3.5 responses.
Rating | Language |
---|---|
1 | English response superior overall |
2 | French response superior overall |
3 | No difference between English and French responses |
Next, we recruited consecutive bilingual living kidney donors or potential donors during their nephrology clinic visits. After obtaining verbal consent, we emailed them the same survey that had been administered to the nephrologists. While our goal was to recruit five donors or donor candidates to match our nephrologist population, we included six, as we received two final responses on the same day. Given the highly specialized nature of transplant knowledge in this sub‐domain, increasing the sample size was unlikely to yield a meaningful improvement in inter‐rater agreement. All recruitment took place between October 21, 2024 and December 20, 2024.
We measured the percentage agreement for ChatGPT responses deemed “very accurate or accurate”, “very comprehensive or comprehensive”, and “very clear or clear” by all nephrologists or kidney donors/donor candidates. High agreement was defined as greater than 90% [13]. We also measured multi‐rater agreement between reviewers using Fleiss’ Kappa, for both the nephrologists and the kidney donors/donor candidates [14]. Fleiss’ Kappa values are reported for overall agreement, as well as for the individual categories of accuracy, comprehensiveness, and clarity. Kappa values are interpreted as follows: ≤0.00 indicates poor agreement, 0.01–0.20 represents slight agreement, 0.21–0.40 denotes fair agreement, 0.41–0.60 signifies moderate agreement, 0.61–0.80 indicates substantial agreement, and 0.81–1.00 reflects almost perfect agreement [15]. Both percentage agreement and Fleiss’ Kappa were evaluated to provide a more comprehensive and accurate assessment of inter‐rater reliability [16]. To assess the readability of each ChatGPT output in English, we applied the Flesch Reading Ease (FRE) and Flesch–Kincaid Reading Grade Level (FKRGL) formulas. Microsoft Word was used to calculate the scores. Higher FRE scores (90–100) and lower FKRGL scores (less than 12.0) indicate easier readability.
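To make the Kappa statistic and its interpretation bands concrete, the following is a minimal Python sketch of the standard (unmodified) Fleiss' Kappa together with the Landis and Koch bands listed above. It is an illustration only and not the analysis code used in this study: the modification applied to Fleiss' Kappa in the reported results is not specified here, and the function names (`fleiss_kappa`, `interpret_kappa`) and the toy ratings are hypothetical.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Standard (unmodified) Fleiss' Kappa.

    ratings: list of items; each item is a list with one label per rater.
    categories: every label a rater could assign (e.g. Likert points 1-5).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    # counts[i][j] = number of raters who placed item i in category j
    counts = []
    for item in ratings:
        tally = Counter(item)
        counts.append([tally[c] for c in categories])

    # Observed agreement: mean share of agreeing rater pairs per item
    p_items = [
        (sum(n * n for n in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_items) / n_items

    # Chance-expected agreement from the overall category proportions
    total = n_items * n_raters
    p_cat = [sum(row[j] for row in counts) / total for j in range(len(categories))]
    p_expected = sum(p * p for p in p_cat)

    if p_expected == 1.0:
        return float("nan")  # every rating in one category: Kappa is undefined
    return (p_observed - p_expected) / (1.0 - p_expected)


def interpret_kappa(kappa):
    """Landis and Koch bands as listed in the Methods."""
    if kappa <= 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


likert = [1, 2, 3, 4, 5]
# Toy data (hypothetical): 10 items, 5 raters, unanimous except one rating
near_unanimous = [[1, 1, 1, 1, 1]] * 9 + [[1, 1, 1, 1, 2]]
kappa = fleiss_kappa(near_unanimous, likert)
print(round(kappa, 3), interpret_kappa(kappa))
```

In this toy example, nine of the ten items are rated identically by all raters, yet Kappa is slightly below zero (about −0.02). Because almost every rating falls in a single category, the chance‐expected agreement is nearly as high as the observed agreement. This is one reason percentage agreement and Kappa are reported side by side.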
The study was approved by the Ottawa Health Science Network Research Ethics Board (20240578‐01H) and the study protocol was posted to Open Science Framework (https://osf.io/9zxk3).
3. Results
3.1. Assessment of the Accuracy, Comprehensiveness, and Clarity of the Information by Nephrologists and Kidney Donors/Donor Candidates
Of the five invited nephrologists, all participated in the survey, along with six of the 18 invited kidney donors or donor candidates. Among the respondents, seven identified French as their first language. Table 2 presents a summary of the accuracy, comprehensiveness, and clarity ratings for each ChatGPT‐3.5 response, along with a language comparison, among nephrologists. The percentage agreement among the five nephrologists for responses rated as “very accurate/accurate,” “very comprehensive/comprehensive,” and “very clear/clear” was 50.0%, 59.1%, and 27.3%, respectively, for English responses, and 45.5%, 9.1%, and 22.7%, respectively, for French responses. Similarly, among the six kidney donors/donor candidates who completed the survey, the agreement on the accuracy, comprehensiveness, and clarity of ChatGPT‐3.5's responses also varied between English and French. For English responses, the percentage agreement for ratings of “very accurate/accurate,” “very comprehensive/comprehensive,” and “very clear/clear” by all six kidney donors/donor candidates was 90.9%, 86.4%, and 100%, respectively. In contrast, for French responses, the agreement among donors/donor candidates was 0%, 77.3%, and 77.3%, respectively (Table 3). Taken together, these results show that percentage agreement was high only among donors/donor candidates, and only for the accuracy and clarity of English responses.
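The percentage agreement figures above appear to correspond to the share of the 22 questions for which every rater in a group chose one of the top two ratings. That reading is an assumption inferred from the per‐question counts in Table 2, not a stated formula. The short Python sketch below, with the hypothetical helper `unanimous_agreement`, reproduces the 50.0% reported for nephrologists' accuracy ratings of the English responses under that assumption.

```python
def unanimous_agreement(counts, n_raters):
    """Percentage of questions on which all raters chose a top-two rating.

    counts: for each question, how many raters chose one of the top two ratings.
    n_raters: number of raters in the group.
    """
    if not counts:
        raise ValueError("at least one question is required")
    unanimous = sum(1 for c in counts if c == n_raters)
    return 100.0 * unanimous / len(counts)


# Nephrologists' "very accurate or accurate" counts for the English
# responses to Q1-Q22, transcribed from Table 2.
english_accuracy = [4, 5, 4, 3, 5, 5, 5, 4, 4, 5, 5,
                    5, 4, 5, 5, 4, 5, 4, 4, 3, 5, 3]

HIGH_AGREEMENT_THRESHOLD = 90.0  # "high agreement" cut-off from the Methods

pct = unanimous_agreement(english_accuracy, n_raters=5)
print(f"{pct:.1f}%")                   # 50.0%
print(pct > HIGH_AGREEMENT_THRESHOLD)  # False
```

Eleven of the 22 questions received a top‐two accuracy rating from all five nephrologists, which gives 50.0% and falls well below the 90% threshold for high agreement.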
TABLE 2.
Evaluation of ChatGPT‐3.5 responses to common living kidney donation‐related questions by five nephrologists.
Question (English/French) | Reviewers rating the response very accurate or accurate: English | Reviewers rating the response very accurate or accurate: French | Reviewers rating the response very comprehensive or comprehensive: English | Reviewers rating the response very comprehensive or comprehensive: French | Reviewers rating the response very clear or clear: English | Reviewers rating the response very clear or clear: French | Reviewers judging the response superior in accuracy, comprehensiveness and clarity: English | Reviewers judging the response superior in accuracy, comprehensiveness and clarity: French | Reviewers judging the responses equal |
---|---|---|---|---|---|---|---|---|---|
Q1. How do I initiate the process of becoming a living kidney donor?/ Comment puis‐je initier le processus de devenir un donneur vivant de rein? | 4/5 | 4/5 | 5/5 | 3/5 | 3/5 | 3/5 | 2 | 1 | 2 |
Q2. What tests are involved in the living kidney donor evaluation process?/ Quels tests sont inclus dans le processus d'évaluation du donneur vivant de rein? | 5/5 | 5/5 | 4/5 | 5/5 | 3/5 | 4/5 | 1 | 3 | 1 |
Q3. How long does it take to undergo the evaluation process for a living kidney donor?/ Combien de temps prend le processus d'évaluation pour devenir donneur vivant de rein? | 4/5 | 5/5 | 4/5 | 5/5 | 4/5 | 4/5 | 0 | 4 | 1 |
Q4. What is the eligibility criteria for becoming a living kidney donor?/ Quels sont les critères d'éligibilité pour devenir donneur vivant de rein? | 3/5 | 3/5 | 4/5 | 3/5 | 4/5 | 4/5 | 2 | 3 | 0 |
Q5. How long is the recovery period after living kidney donation surgery?/Quelle est la durée de la période de récupération après la chirurgie de don de rein vivant? | 5/5 | 3/5 | 5/5 | 2/5 | 4/5 | 2/5 | 3 | 1 | 1 |
Q6. Will I experience pain during or after the kidney donation surgery?/Vais‐je ressentir de la douleur pendant ou après la chirurgie de don de rein? | 5/5 | 5/5 | 5/5 | 3/5 | 5/5 | 4/5 | 2 | 1 | 2 |
Q7. What are the potential risks and complications associated with living kidney donation?/Quels sont les risques et complications potentiels associés au don de rein vivant? | 5/5 | 5/5 | 4/5 | 2/5 | 4/5 | 3/5 | 2 | 1 | 2 |
Q8. What are the risks and success rates associated with living kidney transplantation?/Quels sont les risques et taux de réussite associés à la transplantation rénale de donneur vivant? | 4/5 | 5/5 | 3/5 | 3/5 | 3/5 | 4/5 | 2 | 1 | 2 |
Q9. How long does the entire donation process typically take from start to finish?/Combien de temps dure généralement l'ensemble du processus de don, du début à la fin? | 4/5 | 4/5 | 4/5 | 4/5 | 3/5 | 4/5 | 1 | 1 | 3 |
Q10. What impact does living kidney donation have on long‐term health and life expectancy?/Quel impact le don de rein vivant a‐t‐il sur la santé à long terme et l'espérance de vie? | 5/5 | 2/5 | 5/5 | 3/5 | 5/5 | 2/5 | 3 | 0 | 2 |
Q11. What support services are available for living kidney donors?/Quels services de soutien sont disponibles pour les donneurs de rein vivant? | 5/5 | 4/5 | 5/5 | 3/5 | 5/5 | 5/5 | 2 | 1 | 2 |
Q12. Can I change my mind about donating at any point in the process?/Puis‐je changer d'avis à n'importe quel moment du processus de don? | 5/5 | 5/5 | 5/5 | 2/5 | 5/5 | 4/5 | 4 | 0 | 1 |
Q13. Are there restrictions on activities or lifestyle changes after living kidney donation?/Y a‐t‐il des restrictions sur les activités ou des changements de mode de vie après le don de rein vivant? | 4/5 | 5/5 | 4/5 | 2/5 | 4/5 | 4/5 | 2 | 1 | 2 |
Q14. Are there financial costs associated with living kidney donation, and who covers them?/Y a‐t‐il des coûts financiers associés au don de rein vivant, et qui les prend en charge? | 5/5 | 4/5 | 5/5 | 2/5 | 5/5 | 5/5 | 4 | 0 | 1 |
Q15. Can I donate if I have a pre‐existing medical condition, and how does it affect eligibility?/Puis‐je faire un don si j'ai une condition médicale préexistante, et comment cela affecte‐t‐il l'éligibilité? | 5/5 | 4/5 | 4/5 | 2/5 | 3/5 | 3/5 | 4 | 1 | 0 |
Q16. How often will I need to follow up with medical professionals after the living donor kidney transplant?/À quelle fréquence devrai‐je faire des suivis avec des professionnels de la santé après la transplantation rénale du donneur vivant? | 4/5 | 2/5 | 5/5 | 3/5 | 3/5 | 1/5 | 2 | 0 | 3 |
Q17. What are the psychological aspects of living kidney donation, and is counseling provided?/Quels sont les aspects psychologiques du don de rein vivant, et des séances de counseling sont‐elles prévues? | 5/5 | 5/5 | 5/5 | 3/5 | 5/5 | 4/5 | 2 | 1 | 2 |
Q18. How is confidentiality of the donor‐recipient information maintained?/ Comment la confidentialité des informations entre le donneur et le receveur est‐elle préservée? | 4/5 | 4/5 | 5/5 | 4/5 | 4/5 | 5/5 | 2 | 0 | 3 |
Q19. How does the matching process work for living kidney donors and recipients?/ Comment fonctionne le processus d'appariement pour les donneurs de rein vivant et les receveurs? | 4/5 | 4/5 | 5/5 | 2/5 | 4/5 | 4/5 | 3 | 0 | 2 |
Q20. Can I donate a kidney to a stranger or someone not directly known to me?/ Puis‐je faire don d'un rein à un inconnu ou à quelqu'un que je ne connais pas directement? | 3/5 | 5/5 | 5/5 | 4/5 | 3/5 | 5/5 | 1 | 3 | 1 |
Q21. Can family and friends be tests to see if they are a match for living kidney donation?/ Les membres de la famille et les amis peuvent‐ils être testés pour voir s'ils sont compatibles pour le don de rein vivant? | 5/5 | 5/5 | 5/5 | 4/5 | 4/5 | 5/5 | 2 | 1 | 2 |
Q22. Can I still donate if I have a different blood type than the recipient?/ Puis‐je quand même faire un don si j'ai un groupe sanguin différent du receveur? | 3/5 | 4/5 | 3/5 | 3/5 | 3/5 | 3/5 | 0 | 1 | 4 |
TABLE 3.
Evaluation of ChatGPT‐3.5 responses to common living kidney donation‐related questions by six kidney donors/donor candidates.
Question (English/French) | Reviewers rating the response very accurate or accurate: English | Reviewers rating the response very accurate or accurate: French | Reviewers rating the response very comprehensive or comprehensive: English | Reviewers rating the response very comprehensive or comprehensive: French | Reviewers rating the response very clear or clear: English | Reviewers rating the response very clear or clear: French | Reviewers judging the response superior in accuracy, comprehensiveness and clarity: English | Reviewers judging the response superior in accuracy, comprehensiveness and clarity: French | Reviewers judging the responses equal |
---|---|---|---|---|---|---|---|---|---|
Q1. How do I initiate the process of becoming a living kidney donor?/ Comment puis‐je initier le processus de devenir un donneur vivant de rein? | 6/6 | 4/6 | 6/6 | 6/6 | 6/6 | 5/6 | 4 | 0 | 2 |
Q2. What tests are involved in the living kidney donor evaluation process?/ Quels tests sont inclus dans le processus d'évaluation du donneur vivant de rein? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 5/6 | 4 | 0 | 2 |
Q3. How long does it take to undergo the evaluation process for a living kidney donor?/ Combien de temps prend le processus d'évaluation pour devenir donneur vivant de rein? | 6/6 | 4/6 | 6/6 | 5/6 | 6/6 | 6/6 | 2 | 1 | 3 |
Q4. What is the eligibility criteria for becoming a living kidney donor?/ Quels sont les critères d'éligibilité pour devenir donneur vivant de rein? | 6/6 | 4/6 | 6/6 | 5/6 | 6/6 | 5/6 | 4 | 0 | 2 |
Q5. How long is the recovery period after living kidney donation surgery?/ Quelle est la durée de la période de récupération après la chirurgie de don de rein vivant? | 5/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 4 | 1 | 1 |
Q6. Will I experience pain during or after the kidney donation surgery?/ Vais‐je ressentir de la douleur pendant ou après la chirurgie de don de rein? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 3 | 0 | 3 |
Q7. What are the potential risks and complications associated with living kidney donation?/ Quels sont les risques et complications potentiels associés au don de rein vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 3 | 0 | 3 |
Q8. What are the risks and success rates associated with living kidney transplantation?/ Quels sont les risques et taux de réussite associés à la transplantation rénale de donneur vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 3 | 0 | 3 |
Q9. How long does the entire donation process typically take from start to finish?/ Combien de temps dure généralement l'ensemble du processus de don, du début à la fin? | 5/6 | 3/6 | 5/6 | 5/6 | 6/6 | 6/6 | 3 | 0 | 3 |
Q10. What impact does living kidney donation have on long‐term health and life expectancy?/ Quel impact le don de rein vivant a‐t‐il sur la santé à long terme et l'espérance de vie? | 6/6 | 3/6 | 6/6 | 4/6 | 6/6 | 4/6 | 3 | 0 | 3 |
Q11. What support services are available for living kidney donors?/ Quels services de soutien sont disponibles pour les donneurs de rein vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 0 | 4 |
Q12. Can I change my mind about donating at any point in the process?/ Puis‐je changer d'avis à n'importe quel moment du processus de don? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 1 | 0 | 5 |
Q13. Are there restrictions on activities or lifestyle changes after living kidney donation?/ Y a‐t‐il des restrictions sur les activités ou des changements de mode de vie après le don de rein vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 0 | 4 |
Q14. Are there financial costs associated with living kidney donation, and who covers them?/ Y a‐t‐il des coûts financiers associés au don de rein vivant, et qui les prend en charge? | 6/6 | 4/6 | 6/6 | 6/6 | 6/6 | 6/6 | 3 | 1 | 2 |
Q15. Can I donate if I have a pre‐existing medical condition, and how does it affect eligibility?/ Puis‐je faire un don si j'ai une condition médicale préexistante, et comment cela affecte‐t‐il l'éligibilité? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 0 | 4 |
Q16. How often will I need to follow up with medical professionals after the living donor kidney transplant?/ À quelle fréquence devrai‐je faire des suivis avec des professionnels de la santé après la transplantation rénale du donneur vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 5/6 | 2 | 0 | 4 |
Q17. What are the psychological aspects of living kidney donation, and is counseling provided?/Quels sont les aspects psychologiques du don de rein vivant, et des séances de counseling sont‐elles prévues? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 0 | 4 |
Q18. How is confidentiality of the donor‐recipient information maintained?/Comment la confidentialité des informations entre le donneur et le receveur est‐elle préservée? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 1 | 0 | 5 |
Q19. How does the matching process work for living kidney donors and recipients?/Comment fonctionne le processus d'appariement pour les donneurs de rein vivant et les receveurs? | 6/6 | 5/6 | 5/6 | 5/6 | 6/6 | 6/6 | 2 | 0 | 4 |
Q20. Can I donate a kidney to a stranger or someone not directly known to me?/Puis‐je faire don d'un rein à un inconnu ou à quelqu'un que je ne connais pas directement? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 1 | 3 |
Q21. Can family and friends be tests to see if they are a match for living kidney donation?/Les membres de la famille et les amis peuvent‐ils être testés pour voir s'ils sont compatibles pour le don de rein vivant? | 6/6 | 5/6 | 6/6 | 6/6 | 6/6 | 6/6 | 2 | 1 | 3 |
Q22. Can I still donate if I have a different blood type than the recipient?/Puis‐je quand même faire un don si j'ai un groupe sanguin différent du receveur? | 6/6 | 5/6 | 5/6 | 6/6 | 6/6 | 6/6 | 2 | 1 | 3 |
3.2. Inter‐Rater Agreement Among Nephrologists and Kidney Donors/Candidates
Overall inter‐rater agreement for nephrologists' ratings was substantial and statistically significant for English and French responses (modified Fleiss’ Kappa 0.74, 95% CI: 0.67, 0.79, p < 0.0001, and 0.70, 95% CI: 0.64, 0.77, p < 0.0001, respectively), suggesting that the agreement observed is greater than what would be expected by chance alone. For donor/donor candidate ratings, the modified Fleiss’ Kappa was −0.10 (95% CI: −0.14, −0.07, p = 0.99) for English responses and −0.03 (95% CI: −0.07, −0.00, p = 0.81) for French responses, suggesting poor inter‐rater agreement. Table 4 shows the modified Fleiss’ Kappa values for inter‐rater agreement for each language, by accuracy, comprehensiveness, and clarity. There was statistically significant, though slight, inter‐rater agreement among donors/donor candidates for accuracy of French responses (modified Fleiss’ Kappa 0.13, 95% CI: 0.08, 0.16, p = 0.02). Both nephrologists and donors/donor candidates showed poor agreement on whether one language had superior responses (modified Fleiss’ Kappa −0.19, 95% CI: −0.38, −0.02, p = 0.02 for nephrologists, and −0.20, 95% CI: −0.21, −0.18, p = 1.00 for donors/donor candidates).
TABLE 4.
Modified Fleiss’ Kappa values for nephrologists’ and kidney donors’/donor candidates’ ratings of the accuracy, comprehensiveness and clarity of English and French ChatGPT responses.
Interrater agreement amongst nephrologists
Language | Category | Fleiss’ Kappa (95% CI) | p value |
---|---|---|---|
English | Accuracy | −0.13 (−0.19, −0.08) | 0.97 |
English | Comprehensiveness | −0.005 (−0.04, 0.04) | 0.55 |
English | Clarity | −0.14 (−0.19, −0.10) | 0.99 |
English | Overall (a) | 0.74 (0.67, 0.79) | <0.0001 |
French | Accuracy | −0.12 (−0.19, −0.05) | 0.93 |
French | Comprehensiveness | 0.02 (−0.02, 0.07) | 0.34 |
French | Clarity | −0.06 (−0.38, −0.03) | 0.99 |
French | Overall (a) | 0.70 (0.64, 0.77) | <0.0001 |
Interrater agreement amongst kidney donors/donor candidates
Language | Category | Fleiss’ Kappa (95% CI) | p value |
---|---|---|---|
English | Accuracy | −0.19 (−0.23, −0.15) | 0.99 |
English | Comprehensiveness | −0.20 (−0.23, −0.16) | 0.99 |
English | Clarity | −0.18 (−0.20, −0.17) | 1.00 |
English | Overall (a) | −0.10 (−0.14, −0.07) | 0.99 |
French | Accuracy | 0.13 (0.08, 0.16) | 0.015 |
French | Comprehensiveness | −0.16 (−0.20, −0.13) | 0.98 |
French | Clarity | −0.20 (−0.24, −0.16) | 0.99 |
French | Overall (a) | −0.03 (−0.07, 0) | 0.81 |
(a) Overall refers to inter‐rater agreement across all three categories of accuracy, comprehensiveness, and clarity.
3.3. Readability of ChatGPT Responses
The median (IQR) FRE readability score was 29.3 (24.5, 34.0) and the median (IQR) FKRGL score was 13.4 (12.4, 14.0), indicating that the responses were difficult to read, at approximately the level of a college graduate.
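For readers unfamiliar with the two readability indices, the following minimal Python sketch shows the standard published Flesch Reading Ease and Flesch–Kincaid Grade Level formulas and the conventional Flesch difficulty bands. The scores in this study were calculated by Microsoft Word, whose internal word, sentence, and syllable counting may differ. The function names and the toy counts below are hypothetical and are not taken from the ChatGPT responses.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words, sentences, syllables):
    """Flesch-Kincaid Grade Level: approximate US school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def reading_ease_band(score):
    """Conventional Flesch difficulty bands for a Reading Ease score."""
    if score >= 90:
        return "very easy"
    if score >= 80:
        return "easy"
    if score >= 70:
        return "fairly easy"
    if score >= 60:
        return "standard"
    if score >= 50:
        return "fairly difficult"
    if score >= 30:
        return "difficult"
    return "very difficult"


# Toy passage (hypothetical counts): 120 words, 5 sentences, 210 syllables
fre = flesch_reading_ease(words=120, sentences=5, syllables=210)
fkgl = flesch_kincaid_grade(words=120, sentences=5, syllables=210)
print(round(fre, 1), round(fkgl, 1), reading_ease_band(fre))

# The study's median FRE of 29.3 falls in the lowest band
print(reading_ease_band(29.3))
```

Both formulas depend only on average sentence length and average syllables per word, so long sentences and polysyllabic medical terms lower the FRE score and raise the grade level.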
4. Discussion
Our study shows low to moderate percentage agreement for the highest two rating categories of responses, with substantial overall interrater agreement among nephrologists' ratings of all ChatGPT responses to common LKD queries in both English and French. These findings suggest that nephrologists did not consistently agree on the highest rating categories (very accurate/accurate, very comprehensive/comprehensive, very clear/clear), although there was substantial overall agreement across all response categories. Among kidney donors/donor candidates, there was high percentage agreement for high accuracy and high clarity in English responses only, but poor interrater agreement across all responses overall. This indicates limited consensus on the highest rating categories (very accurate/accurate, very comprehensive/comprehensive, very clear/clear) and a lack of consistency across all response options. The responses exceeded the sixth‐grade reading level typically recommended for patient‐facing materials. Reviewers showed poor agreement on whether English or French responses were superior, with lower inter‐rater agreement for French responses. Overall, these results suggest that ChatGPT 3.5 responses to common LKD questions may not be suitable for use in English and French public education.
Our findings align with prior research examining ChatGPT's appropriateness for patient inquiries. Responses to physician nephrology self‐testing questions were below the passing threshold, and responses to kidney transplantation queries have been deemed inaccurate [4, 17]. Another study showed that medical professionals deemed 81% of responses to asthma queries to have generally high accuracy, but had concerns about the lack of depth in certain responses and concluded that ChatGPT was poorly suited as a substitute for professional medical advice [18]. Although expert reviewers deemed 85% of its responses to hypertension queries “appropriate”, with answers in English superior to those in Japanese, and found its responses to liver transplantation (LT) queries acceptable, we examined the accuracy, comprehensiveness, and clarity of all responses and required high levels of both percentage agreement and interrater agreement to draw conclusions about the suitability of ChatGPT 3.5 for providing living kidney donor education [5, 6].
There is limited literature exploring patients' perspectives on using ChatGPT for health‐related information. Armbruster et al. compared patient and physician ratings of responses generated by ChatGPT versus an expert panel for health‐related questions. Both groups found ChatGPT's responses more useful than those from the expert panel [9]. However, while physicians rated potentially harmful responses lower, patients consistently rated ChatGPT's responses more positively. These findings suggest that patients may have difficulty recognizing harmful or misleading advice, which could result in missed diagnostic or therapeutic opportunities. Similarly, in our study, both donors and donor candidates largely agreed on whether ChatGPT 3.5's responses were accurate, comprehensive, and clear in English, but showed poor agreement in French, with low interrater consistency across all responses. This may indicate gaps in patients' ability to identify inaccurate or misleading information. Alternatively, poor agreement on whether English or French responses were superior may reflect the model's limited multilingual training. This may have contributed to reduced clarity, accuracy, and consistency in French outputs. These findings underscore the need to evaluate and optimize model performance across languages to ensure equitable patient communication. Future work could explore more structured, context‐aware prompt design to support patients in crafting effective queries and receiving responses linked to credible sources. Enhancing the clarity and reliability of French‐language outputs is also an important direction for improving patient education.
Despite the strengths of having five bilingual nephrologists and six bilingual patients, exploring ChatGPT's ability to provide medical information in both English and French, and measuring both percentage agreement and inter‐rater agreement using Fleiss’ Kappa, our study has limitations. Newer ChatGPT versions may yield different results, affecting reproducibility. ChatGPT is not specifically designed for medical information provision. Other large language models, such as Gemini, may perform better in French; Gemini version 1.5 has been shown to outperform ChatGPT 3.5 in answering pathology exam questions in French [19]. The selected questions may also not encompass all pertinent LKD‐related inquiries. Sample size and chance alone may explain why the percentage agreement was higher than inter‐rater agreement. We did not measure the literacy level of study participants, which may have influenced the ratings of the responses but reflects real‐world scenarios. With these limitations, transplant programs may be reluctant to use ChatGPT as an education tool due to low or moderate agreement on the highest rating categories for the responses, and the poor readability.
ChatGPT did not provide highly accurate, comprehensive, and clear information in response to prompting questions related to LKD in English or French, highlighting substantial gaps and the need to approach its medical advice with caution. Therefore, it may not be a suitable complement to existing kidney donor program educational materials.
Author Contributions
Ann Bugeja conceived the idea, designed the analysis, and supervised the project. Ann Bugeja, Ria Singla, and Januvi Jegatheswaran designed the survey data tool. Ria Singla collected the data. Ann Bugeja, Ria Singla, and Taddele Kibret analyzed the data. Ria Singla and Sumiya Lodhi wrote the manuscript with support from Ann Bugeja. Tamara Glavinovic, Jolanta Karpinski, Kevin Burns, David Massicotte‐Azarniouch, Rinu Pazhekattu, and Manish M Sood provided logical data interpretation and helped with presenting results. All authors provided critical feedback and helped shape the research, analysis, and manuscript.
Conflicts of Interest
There were no conflicts.
Supporting information
Supporting File: ctr70303‐sup‐0001‐SuppMat.pdf
Acknowledgments
We thank the participants in this study for their time and effort.
Singla R., Lodhi S., Kibret T., et al. “Accuracy, Clarity, and Comprehensiveness of ChatGPT Outputs for Commonly Asked Questions About Living Kidney Donation.” Clinical Transplantation 39, no. 9 (2025): e70303. 10.1111/ctr.70303
Funding: There was no funding for this work.
Data Availability Statement
The data that supports the findings of this study are available in the supplementary material of this article.
References
- 1. OpenAI . ChatGPT. (March 8 version). Published online 2024. Accessed March 8, 2024. https://openai.com/chatgpt.
- 2. Sallam M., “ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns,” Healthcare 11, no. 6 (2023): 887, 10.3390/healthcare11060887.
- 3. Homolak J., “Opportunities and Risks of ChatGPT in Medicine, Science, and Academic Publishing: A Modern Promethean Dilemma,” Croatian Medical Journal 64, no. 1 (2023): 1–3, 10.3325/cmj.2023.64.1.
- 4. Rawashdeh B., Kim J., AlRyalat S. A., Prasad R., and Cooper M., “ChatGPT and Artificial Intelligence in Transplantation Research: Is It Always Correct?,” Cureus 15, no. 7 (2023): e42150, 10.7759/cureus.42150.
- 5. Yano Y., Nishiyama A., Suzuki Y., et al., “Relevance of ChatGPT's Responses to Common Hypertension‐Related Patient Inquiries,” Hypertension 81, no. 1 (2024): e1–e4, 10.1161/HYPERTENSIONAHA.123.22084.
- 6. Endo Y., Sasaki K., Moazzam Z., et al., “Quality of ChatGPT Responses to Questions Related to Liver Transplantation,” Journal of Gastrointestinal Surgery 27, no. 8 (2023): 1716–1719, 10.1007/s11605-023-05714-9.
- 7. Ayo‐Ajibola O., Davis R. J., Lin M. E., Riddell J., and Kravitz R. L., “Characterizing the Adoption and Experiences of Users of Artificial Intelligence–Generated Health Information in the United States: Cross‐Sectional Questionnaire Study,” Journal of Medical Internet Research 26 (2024): e55138, 10.2196/55138.
- 8. Mika A. P., Martin J. R., Engstrom S. M., Polkowski G. G., and Wilson J. M., “Assessing ChatGPT Responses to Common Patient Questions Regarding Total Hip Arthroplasty,” Journal of Bone and Joint Surgery 105, no. 19 (2023): 1519–1526, 10.2106/JBJS.23.00209.
- 9. Armbruster J., Bussmann F., Rothhaas C., Titze N., Grützner P. A., and Freischmidt H., “Doctor ChatGPT, “Can You Help Me?” The Patient's Perspective: Cross‐Sectional Study,” Journal of Medical Internet Research 26 (2024): e58831, 10.2196/58831.
- 10. Chen S. Y., Kuo H. Y., and Chang S. H., “Perceptions of ChatGPT in Healthcare: Usefulness, Trust, and Risk,” Frontiers in Public Health 12 (2024): 1457131, 10.3389/fpubh.2024.1457131.
- 11. Waterman A. D., Morgievich M., Cohen D. J., et al., “Living Donor Kidney Transplantation: Improving Education Outside of Transplant Centers About Live Donor Transplantation—Recommendations From a Consensus Conference,” Clinical Journal of the American Society of Nephrology 10, no. 9 (2015): 1659–1669, 10.2215/CJN.00950115.
- 12. Hunt H. F., Rodrigue J. R., Dew M. A., et al., “Strategies for Increasing Knowledge, Communication, and Access to Living Donor Transplantation: An Evidence Review to Inform Patient Education,” Current Transplantation Reports 5, no. 1 (2018): 27–44, 10.1007/s40472-018-0181-1.
- 13. Kottner J., Audigé L., Brorson S., et al., “Guidelines for Reporting Reliability and Agreement Studies (GRRAS) Were Proposed,” Journal of Clinical Epidemiology 64, no. 1 (2011): 96–106, 10.1016/j.jclinepi.2010.03.002.
- 14. Nelson K. P. and Edwards D., “Measures of Agreement Between Many Raters for Ordinal Classifications,” Statistics in Medicine 34, no. 23 (2015): 3116–3132, 10.1002/sim.6546.
- 15. Landis J. R. and Koch G. G., “The Measurement of Observer Agreement for Categorical Data,” Biometrics 33, no. 1 (1977): 159–174.
- 16. McHugh M. L., “Interrater Reliability: The Kappa Statistic,” Biochemia Medica (Zagreb) 22, no. 3 (2012): 276–282.
- 17. Miao J., Thongprayoon C., Garcia Valencia O. A., et al., “Performance of ChatGPT on Nephrology Test Questions,” CJASN 19, no. 1 (2024): 35–43, 10.2215/CJN.0000000000000330.
- 18. Høj S., Thomsen S. F., Ulrik C. S., Meteran H., Sigsgaard T., and Meteran H., “Evaluating the Scientific Reliability of ChatGPT as a Source of Information on Asthma,” Journal of Allergy and Clinical Immunology: Global 3, no. 4 (2024): 100330, 10.1016/j.jacig.2024.100330.
- 19. Tarris G. and Martin L., “Performance Assessment of ChatGPT 4, ChatGPT 3.5, Gemini Advanced Pro 1.5 and Bard 2.0 to Problem Solving in Pathology in French Language,” Digital Health 11 (2025): 11, 10.1177/20552076241310630.