Digital Health. 2025 Jun 2;11:20552076251342067. doi: 10.1177/20552076251342067

Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Shih-Yi Lin 1,2,*, Chang-Cheng Jiang 1,2,*, Kin-Man Law 3,*, Pei-Chun Yeh 4, Min-Kuang Tsai 5, Chu-Lin Chou 6, I-Kuan Wang 1,2, I-Wen Ting 1,2, Yu-Wei Chen 5, Che-Yi Chou 7, Ming-Han Hsieh 1,2, Heng-Chih Pan 8, Sung-Lin Hsieh 1,2, Chien-Hua Chiu 9, Pei-Wen Lee 10, Yu-Cyuan Hong 1,2, Ying-Yu Hsu 11, Huey-Liang Kuo 1,2, Shu-Woei Ju 1,2, Chia-Hung Kao 1,4,12,13,
PMCID: PMC12134521  PMID: 40469778

Abstract

Importance

This study compares the responses of four AI models to common nephrology-related questions encountered in clinical settings.

Objective

To evaluate generative AI models in enhancing nephrology patient communication and education.

Design

Generative AI in Nephrology

Setting

In a study conducted from December 8–12, 2023, and October 21–23, 2024, IT engineers evaluated GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 for nephrology patient communication and education, querying each with 21 nephrology questions and three renal biopsy reports, repeated for consistency.

Intervention(s) (for clinical trials) or Exposure(s) (for observational studies)

None.

Main Outcome(s) and Measure(s)

Fifteen nephrologists and one nephrology researcher assessed responses for Appropriateness, Helpfulness, Consistency, and human-like empathy, using a 4-point rating scale (1–4). Shapiro–Wilk and Mann–Whitney U tests with Holm correction, along with TF-IDF, BertScore, and ROUGE, were used. The study compared the performance of GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 nephrology-related questions.

Results

GPT-4o consistently achieved high scores in Appropriateness (3.39 ± 0.7) and Helpfulness (3.24 ± 0.73), while PaLM 2 demonstrated the highest consistency score (3.0 ± 0.86). In empathy, GPT-4 achieved the highest overall score (80.73%), excelling in patient-centric scenarios, followed by GPT-4o (76.56%). PaLM 2 showed competitive empathy in specific cases, despite scoring lower in consistency and Appropriateness.

For Kidney-Related Queries, GPT-4o excelled in relevance metrics, achieving the highest BertScore (0.57) and ROUGE 1-word score (0.54). Gemini 1.0 Ultra led in generating coherent responses for Renal Biopsy Reports, with the highest TF-IDF (0.56) and ROUGE longest-similar-sentence score (0.47). All 101 references provided by GPT-4 were accurate.

Conclusions and Relevance

GPT-4o emerged as the most accurate and consistent model across most evaluation categories, while GPT-4 demonstrated superior empathy and balanced performance. PaLM 2 and Gemini 1.0 Ultra showed strengths in specific areas, highlighting the potential for tailored applications of generative AI in nephrology clinical practice.

Keywords: Generative AI, nephrology, GPT-4, GPT-4o, Gemini 1.0 Ultra, PaLM 2

Key points

  • Question: Many patients have expressed confusion about information on the pre-End Stage Kidney Disease (pre-ESKD) program, struggling to understand it due to its rigid and generic presentation, which does not address individual needs effectively.

  • Findings: Generative AI models like GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 demonstrate potential in aiding clinical renal services by providing appropriate and empathetic responses to patient questions.

  • Meaning: Exploration of utilizing LLMs in clinical renal practice holds great potential and importance.

Introduction

Taiwan is globally recognized for its advancements in digital innovation, with a report at the start of 2023 indicating that there were 21.68 million internet users in the country, translating to an impressive internet penetration rate of 90.7%. 1 Despite this digital prowess and the abundant availability of health-related information online, a significant challenge persists for the general public and patients in terms of interpreting this data.2–5 This difficulty often stems from limited medical knowledge, leading to confusion and potentially misleading conclusions. 6 While many hospitals offer health education on their websites, the content is often too technical for the general public to easily understand. Additionally, the Taiwan National Health Insurance Administration launched the pre-end stage kidney disease (pre-ESKD) program in November 2006, specifically targeting high-risk individuals at stages 3B–5 of chronic kidney disease (CKD). This initiative, along with increased government efforts to educate the public about the symptoms of kidney damage, has raised awareness. However, this issue is particularly noticeable in our clinical practice with renal patients. Many have expressed confusion about the information on these websites, struggling to understand it due to the rigid and generic presentation, which does not address individual needs effectively.7,8

Thus, in response to this information gap, the advent of Generative AI technology in early 2022 offers a promising solution. 9 This technology claims to provide tailored answers and mimic diverse roles based on user inputs, potentially bridging the gap between complex medical information and patient comprehension in a manner that is both accessible and personalized to individual needs. 10 The integration of LLMs in nephrology shows both potential and limitations across clinical decision support, patient education, and administrative tasks. A 2023 study on ChatGPT's performance in nephrology test questions revealed variability in accuracy and consistency, indicating the need for further validation. 11 A systematic review by Unger et al. highlighted LLMs’ potential in streamlining workflows, predicting disease progression, interpreting laboratory data, managing renal diets, and enhancing patient education. 12 For instance, GPT-4 demonstrated a 90–94% accuracy in managing continuous renal replacement therapy alarms, thereby reducing ICU alarm fatigue. 12 In another study, Miao et al. addressed challenges like hallucinations—where LLMs produce incorrect or irrelevant outputs—by integrating Retrieval-Augmented Generation (RAG) strategies. They customized a ChatGPT model aligned with the KDIGO 2023 guidelines for CKD, showcasing its potential in providing specialized, accurate medical advice. 13 Pham et al. evaluated ChatGPT-4's accuracy in triaging nephrology patient messages, finding a 93% correctness rate. This underscores AI's potential to enhance operational efficiency and patient care in outpatient clinics. 14 However, a comparative study by Wu et al. revealed that while GPT-4 answered 73.3% of nephrology multiple-choice questions correctly, open-source LLMs like Koala 7B and Falcon 7B lagged, with correctness rates between 17.1% and 25.5%. This indicates that not all LLMs are equally adept in specialized medical domains. 15 Collectively, these studies illustrate the promising applications of LLMs in nephrology, while also emphasizing the need for continuous evaluation and customization to ensure accuracy and reliability in clinical settings.

Therefore, we conducted a study to evaluate the effectiveness of Generative AI as a supportive tool for medical inquiries related to nephrology, particularly in the context of increasing outpatient visits in Taiwan. This study examines whether AI can enhance patient communication and education by assisting medical professionals in responding to common nephrology-related questions encountered in clinical settings.

The study compares the responses of four AI models—GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2—in terms of their ability to provide accurate, understandable, and empathetic explanations. Additionally, we assess the capability of these AI systems to interpret and explain renal biopsy reports to patients and their families, covering three distinct types of biopsy results.

This research is designed with the primary goal of supporting individual physicians by helping them bridge communication gaps with patients. However, if the study demonstrates promising results, it could also inform institutional applications, such as hospital-wide AI integration in nephrology outpatient clinics and dialysis centers. In such settings, AI could assist in standardizing patient education, improving workflow efficiency, and supplementing physician–patient interactions, ultimately enhancing the quality of care on a larger scale.

Methods

The study included 15 nephrologists and one nephrology researcher, with participants representing a diverse range of healthcare facilities, from dialysis-focused primary care clinics to medical centers across northern, central, and southern Taiwan. Institutions involved were Taipei Medical University Hospital, Keelung Chang Gung Memorial Hospital, China Medical University Hospital, Asia University Hospital, Kaohsiung Chang Gung Memorial Hospital, and Hongde Clinic. The 15 nephrologists had received their nephrology fellowship training at various major medical centers, including National Taiwan University Hospital, Tri-Service General Hospital, Taipei Veterans General Hospital, China Medical University Hospital, Linkou Chang Gung Memorial Hospital, and Kaohsiung Chang Gung Memorial Hospital. The current study explored the potential of integrating AI platforms, specifically GPT-4 and GPT-4o (OpenAI), PaLM 2 (accessed via Google Bard), and Gemini 1.0 Ultra (Google). It is important to note that Bard is a conversational AI service developed by Google and not a model itself. While Bard was initially powered by LaMDA, it was updated in May 2023 to use PaLM 2 as its underlying model. Therefore, the responses generated via Bard in this study reflect the performance of PaLM 2. At the time of our evaluation (December 2023), Bard was confirmed to be operating on the PaLM 2 model, based on official statements from Google. All models were accessed through their respective official graphical user interfaces (GUIs), simulating typical end-user interactions. No APIs or automated scripts were used. Specifically, Gemini 1.0 Ultra was accessed via the official Gemini service interface provided by Google.

A total of 21 nephrology-related questions and 3 renal biopsy reports were curated by experienced nephrologists to reflect common inquiries in nephrology services. The 21 questions were divided into three distinct categories: 7 questions focusing on dialysis-related issues (Q1–Q7), 7 addressing kidney examinations (Q8–Q14), and another 7 pertaining to general kidney-related issues (Q15–Q21). Additionally, the results of three renal biopsy reports were analyzed, revealing diagnoses of IgA nephropathy, focal segmental glomerulosclerosis, and membranous nephropathy, in that order (Q22–Q24) (Table 1 and Appendix 1).

Table 1.

Comprehensive performance evaluation of AI models (GPT-4, Gemini 1.0 Ultra, PaLM 2, GPT-4o) on kidney health and reports.

Category No. Question
Dialysis-Related Issues (Q1–Q7) Q1 Does kidney dialysis hurt?
Q2 What preparations do I need to make before each hemodialysis session?
Q3 What should I pay attention to between dialysis sessions?
Q4 I just had surgery for an arteriovenous fistula for dialysis, what should I be aware of?
Q5 As a dialysis patient, what foods should I avoid?
Q6 What are the complications associated with kidney dialysis?
Q7 If my kidney function is poor, do I have to undergo dialysis, or are there other options?
Kidney Examinations (Q8–Q14) Q8 What should I prepare for before undergoing a kidney ultrasound?
Q9 Why is a kidney ultrasound needed to assess kidney function?
Q10 Is there a difference between the first urine in the morning and urine at other times when testing a single urine sample?
Q11 What does the presence of blood in the urine mean?
Q12 How accurate is a kidney ultrasound in diagnosing kidney cancer?
Q13 What should I be aware of before undergoing a kidney biopsy?
Q14 What should I pay attention to after having a kidney biopsy?
Kidney-Related Queries (Q15–Q21) Q15 If my urine is foamy, does it mean my kidneys are failing?
Q16 If my urine smells bad, does it indicate a problem with my kidneys?
Q17 What foods should I eat to improve my kidney function?
Q18 Are there any medications that can enhance my kidney function?
Q19 My estimated glomerular filtration rate is 65 mL/min/1.73 m²; a friend told me this means I have stage 2 chronic kidney disease, is that true?
Q20 If I suspect poor kidney function, what tests can I undergo?
Q21 What information should I prepare for the doctor before visiting a nephrology clinic?
Renal Biopsy Reports (Q22–Q24) Q22 Full text in Appendix 1
Q23 Full text in Appendix 1
Q24 Full text in Appendix 1

The renal biopsy reports presented in this study were constructed using fictitious data. These reports were created based on templates from our institution, ensuring they reflected realistic clinical scenarios. Our study recorded all responses generated by GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 without applying any filtering criteria. Every response was documented as output by the AI models to ensure an unbiased evaluation of their performance. This approach allows for a comprehensive analysis of the accuracy, consistency, and Appropriateness of AI-generated responses in nephrology-related inquiries.

The nephrologist team comprised 15 nephrologists and one nephrology researcher, each with extensive experience in clinical nephrology care and practice durations ranging from 5 to more than 20 years.

Data collection

From December 8, 2023, to December 12, 2023, GPT-4, Gemini 1.0 Ultra, and PaLM 2 were evaluated. Additionally, from October 21, 2024, to October 23, 2024, GPT-4o was assessed. During these periods, a team of IT engineers queried the AI models using a predetermined set of nephrology-related questions and renal biopsy reports. Each question and report was entered individually and repeated three times consecutively to evaluate the consistency of the responses. The outputs from GPT-4, Gemini 1.0 Ultra, PaLM 2, and GPT-4o for all 21 questions and 3 biopsy reports were then compiled, anonymized, and prepared for further evaluation.

To provide an objective evaluation, we employed analytical tools including Term Frequency–Inverse Document Frequency (TF-IDF), 16 BertScore, 16 and ROUGE (Recall-Oriented Understudy for Gisting Evaluation) 17 for 1-word, 2-word, and longest-similar-sentence scores. These were used to compare the performance of the four language models (GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2) in terms of consistency across four categories pertaining to renal issues and examinations.

After evaluating four AI models on Appropriateness, Helpfulness, and Human-Like Empathy, the model with the highest overall ratings was identified as the top performer. This model was then asked to provide references for each question posed during the evaluation.

Evaluation process

An interactive website was developed to present these responses in a questionnaire format. Fifteen nephrologists and one nephrology researcher, blinded to the origin of the responses, were tasked with evaluating them. They rated each response on Appropriateness, Helpfulness, and Consistency using a 4-point scale (1 indicating the least and 4 the highest score). Helpfulness measures how much a tool aids clinicians in tasks like diagnosis and treatment planning, leading to efficiency and improved accuracy. Consistency ensures reliable performance across different scenarios, enhancing clinical reliability and reducing the need for repeated checks. Appropriateness assesses whether the tool suits the specific clinical context and patient population, ensuring relevance and precision in decision-making. Additionally, they evaluated human-like empathy through a binary categorization method, labeling each response as either demonstrating no human-like empathy or showcasing human-like empathy. The evaluation criteria and categories were directly based on a previous study which assessed the performance of GPT-4 in providing patient information in the context of PET/CT imaging. 18 From November 5, 2024, to November 10, 2024, an IT engineer was responsible for verifying the accuracy of the references provided by the top-performing generative AI model. The verification focused solely on the existence of these sources, without assessing whether the content of the references directly correlated with the responses and questions posed.

Blinding and anonymity

The study was conducted in a double-blind manner. The IT engineers could not identify the nephrologists participating in the study, as the nephrologists used code names for their website interactions. Additionally, the nephrologists were unaware of their colleagues’ participation and could not access each other's ratings. They were also blinded to the source of the responses, not knowing whether a response came from GPT-4, Gemini 1.0 Ultra, PaLM 2, or GPT-4o.

Data analysis

Each question and report was individually input and queried three times across four AI platforms: GPT-4, Gemini 1.0 Ultra, PaLM 2, and GPT-4o. This process resulted in a total of 288 responses (24 items × 3 repetitions × 4 models). Subsequently, these responses were evaluated based on four criteria: Appropriateness, Helpfulness, Consistency, and Human-Like Empathy.

The IT engineers collected and analyzed the data, calculating the final results and relevant statistics. For the evaluation of Appropriateness, Helpfulness, and Consistency, we calculated the mean and standard deviation for each parameter. In the case of Human-Like Empathy, a categorical variable, we determined the percentage distribution. Additionally, we assigned a scoring system for Human-Like Empathy: “no human-like empathy” was allocated zero points, and “human-like empathy” was assigned one point. The average of these scores was also computed. Statistical analyses were conducted to assess differences in model performance across evaluation dimensions. The Shapiro–Wilk test was first applied to evaluate whether the data followed a normal distribution. Since the data did not meet the assumption of normality, the Kruskal–Wallis H test, a non-parametric method, was employed to determine whether there were significant differences in scores across the models. For questions showing significant differences in the Kruskal–Wallis H test, pairwise comparisons between models were performed using the Mann–Whitney U test. To address the risk of Type I error due to multiple comparisons, Holm's correction was applied to adjust the p-values. Holm's method was chosen as it provides a more flexible approach than Bonferroni correction, maintaining greater statistical power while effectively controlling the family-wise error rate. This analysis framework ensured the robust identification of significant differences while minimizing the risk of false positives. A p-value of less than 0.05 was considered statistically significant.
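The multiple-comparison step described above can be sketched in a few lines. The following is a minimal illustration of Holm's step-down correction, using hypothetical raw p-values for the six pairwise comparisons of four models; the study's actual p-values came from Mann–Whitney U tests and are not reproduced here.

```python
def holm_adjust(p_values):
    """Holm step-down adjustment controlling the family-wise error rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending raw p
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # Multiply the (rank+1)-th smallest p-value by (m - rank), cap at 1,
        # and enforce monotonicity so adjusted p-values never decrease.
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical raw p-values for the six pairwise model comparisons
raw = [0.004, 0.030, 0.012, 0.200, 0.045, 0.620]
print(holm_adjust(raw))  # smallest p is multiplied by 6, next by 5, and so on
```

Because only the smallest p-value receives the full Bonferroni factor, Holm's method rejects at least as many hypotheses as Bonferroni while still controlling the family-wise error rate, which is the greater-power property noted above.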

TF-IDF calculates the importance of a word in a document, with Term Frequency (TF) counting the occurrences of a word, normalized by the document's length, and Inverse Document Frequency (IDF) assessing the rarity of the word across all documents using a logarithmic scale. The TF-IDF score is then obtained by multiplying TF with IDF. BertScore measures the similarity between the generated responses and a reference text by computing the cosine similarity of their word embeddings, with the final score being the average of these similarities. For ROUGE, which is used for evaluating summaries or translations, we applied ROUGE-1 and ROUGE-2 to measure the overlap of single- and two-word phrases, respectively, between the AI text and reference texts, while ROUGE-L assesses the longest matching sequence of words. These methods employ a scale ranging from 0 to 1, where 0 indicates no similarity and 1 represents identical or maximum similarity.
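As an illustration of the ROUGE scores described above, the sketch below implements ROUGE-1, ROUGE-2, and ROUGE-L (F1 variants) from scratch on a hypothetical answer–reference pair. It is a simplified, whitespace-tokenized version for exposition, not the evaluation tooling used in the study; TF-IDF and BertScore, which additionally require corpus statistics and BERT embeddings, are omitted.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N F1: overlap of n-grams between candidate and reference."""
    def ngrams(tokens, n):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    cand, ref = ngrams(candidate.split(), n), ngrams(reference.split(), n)
    overlap = sum((cand & ref).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

def rouge_l(candidate, reference):
    """ROUGE-L F1 based on the longest common subsequence (LCS) of words."""
    a, b = candidate.split(), reference.split()
    # Classic dynamic-programming LCS length
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, wa in enumerate(a):
        for j, wb in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if wa == wb else max(dp[i][j + 1], dp[i + 1][j])
    lcs = dp[-1][-1]
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(a), lcs / len(b)
    return 2 * precision * recall / (precision + recall)

# Hypothetical AI answer vs. a reference explanation (not from the study data)
cand = "a kidney biopsy samples tissue to diagnose glomerular disease"
ref = "a kidney biopsy removes tissue to diagnose kidney disease"
print(rouge_n(cand, ref, 1), rouge_n(cand, ref, 2), rouge_l(cand, ref))
```

All three functions return values on the 0–1 scale described above, where 0 indicates no overlap and 1 indicates identical texts.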

Ethical considerations

This study did not involve human participants or patient data. The evaluation was conducted using standardized nephrology-related questions and renal biopsy reports. Additionally, all biopsy reports were fully de-identified, and the study received Institutional Review Board (IRB) approval under CMUH108-REC2-167, confirming that ethical considerations were addressed appropriately.

The study has been reported in line with the STROCSS criteria. 16

Results

Table 2 presents the evaluation results from 15 nephrologists and one nephrology researcher, comparing the outputs of GPT-4, Gemini 1.0 Ultra, PaLM 2, and GPT-4o across all 21 questions and the three renal biopsy reports. Regarding the overall performance in Appropriateness, the average scores (Mean ± SD) were as follows: GPT-4 scored 3.18 ± 0.86, GPT-4o scored 3.39 ± 0.7, Gemini 1.0 Ultra 3.02 ± 0.95, and PaLM 2 3.09 ± 0.92. In terms of Helpfulness, the respective average scores were 3.1 ± 0.87 for GPT-4, 3.24 ± 0.73 for GPT-4o, 2.98 ± 0.92 for Gemini 1.0 Ultra, and 3.01 ± 0.9 for PaLM 2. Finally, assessing Consistency across the four models for each question, the average scores were 1.62 ± 0.85 for GPT-4, 1.5 ± 0.82 for GPT-4o, 1.53 ± 0.83 for Gemini 1.0 Ultra, and 3.0 ± 0.86 for PaLM 2. These results demonstrate nuanced differences in the performance of each AI model across the evaluated criteria. GPT-4o displayed relatively consistent scores across all categories, and Gemini 1.0 Ultra showed a comparable level of performance, with scores slightly lower than GPT-4o but more uniform across categories. In contrast, PaLM 2 achieved a markedly higher average Consistency score of 3.0 ± 0.86, a notable difference from the other models.

Table 2.

Comparative analysis of AI models across different categories.

Categories No./models Appropriateness (Avg ± SD) Helpfulness (Avg ± SD) Consistency (Avg ± SD)
GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2 GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2 GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2
Q1 3.29 ± 0.97 3.38 ± 0.64 3.17 ± 0.86 3.21 ± 0.94 3.12 ± 1.04 3.19 ± 0.67 3.17 ± 0.75 3.15 ± 0.95 1.88 ± 1.15 1.81 ± 0.98 1.31 ± 0.6 3.25 ± 0.86
Q2 3.54 ± 0.77 3.65 ± 0.48 3.02 ± 1.0 2.92 ± 0.94 3.48 ± 0.82 3.38 ± 0.64 3.12 ± 0.96 2.98 ± 0.91 1.25 ± 0.77 1.19 ± 0.54 1.44 ± 0.81 3.0 ± 0.82
Q3 3.4 ± 0.82 3.54 ± 0.65 3.0 ± 1.03 3.1 ± 0.93 3.33 ± 0.88 3.38 ± 0.64 3.02 ± 0.98 3.04 ± 0.94 1.38 ± 0.72 1.38 ± 0.62 2.12 ± 1.26 3.31 ± 0.7
Q4 3.35 ± 0.76 2.94 ± 1.16 3.25 ± 0.84 3.06 ± 1.06 3.33 ± 0.81 2.75 ± 1.19 3.15 ± 0.87 3.06 ± 0.93 1.25 ± 0.58 1.69 ± 1.25 1.44 ± 0.81 3.06 ± 0.93
Q5 3.27 ± 0.74 3.29 ± 0.82 2.88 ± 1.08 2.98 ± 1.06 3.15 ± 0.82 3.21 ± 0.71 2.81 ± 1.0 2.92 ± 0.94 1.31 ± 0.79 1.44 ± 0.89 1.69 ± 1.01 2.81 ± 1.05
Q6 3.25 ± 0.73 3.27 ± 0.89 2.83 ± 1.0 3.17 ± 0.91 3.1 ± 0.83 3.15 ± 0.8 2.77 ± 1.02 3.1 ± 0.88 1.31 ± 0.79 1.5 ± 0.73 1.44 ± 1.03 3.06 ± 0.85
Q7 3.12 ± 0.79 3.27 ± 0.82 2.85 ± 0.95 3.17 ± 0.91 3.02 ± 0.86 3.0 ± 0.97 2.81 ± 0.87 2.96 ± 0.94 1.88 ± 1.09 1.5 ± 0.73 2.12 ± 1.09 3.0 ± 0.82
Q8 3.0 ± 0.92 3.35 ± 0.67 2.71 ± 1.03 2.96 ± 1.05 2.98 ± 0.89 3.31 ± 0.66 2.6 ± 1.09 3.0 ± 0.95 1.69 ± 0.95 1.25 ± 0.58 1.44 ± 0.89 3.19 ± 0.83
Q9 3.38 ± 0.73 3.44 ± 0.65 3.27 ± 0.79 3.1 ± 0.78 3.31 ± 0.69 3.33 ± 0.66 3.15 ± 0.82 3.02 ± 0.89 1.69 ± 1.01 1.56 ± 0.81 1.44 ± 0.63 3.06 ± 0.85
Q10 3.27 ± 0.87 3.29 ± 0.71 2.96 ± 0.85 3.02 ± 0.81 3.1 ± 0.88 3.23 ± 0.63 2.88 ± 0.87 2.79 ± 0.87 1.62 ± 0.89 1.56 ± 0.73 1.56 ± 0.89 2.88 ± 0.72
Q11 3.21 ± 0.71 3.23 ± 0.72 3.17 ± 0.91 3.1 ± 0.86 2.94 ± 0.78 3.08 ± 0.77 2.94 ± 0.98 3.04 ± 0.87 1.88 ± 1.02 1.12 ± 0.5 1.81 ± 1.11 3.0 ± 0.89
Q12 3.21 ± 0.8 3.42 ± 0.61 3.1 ± 0.9 3.27 ± 0.82 3.12 ± 0.87 3.29 ± 0.74 3.17 ± 0.78 3.06 ± 0.84 1.5 ± 0.89 1.81 ± 1.17 1.56 ± 0.96 3.0 ± 0.82
Q13 3.06 ± 1.02 3.42 ± 0.58 2.9 ± 0.99 3.15 ± 0.82 2.85 ± 1.13 3.38 ± 0.67 3.06 ± 0.7 3.15 ± 0.77 2.0 ± 0.0 1.81 ± 1.05 1.31 ± 0.87 3.12 ± 0.72
Q14 3.31 ± 0.95 3.58 ± 0.65 2.92 ± 1.13 3.21 ± 0.87 3.17 ± 0.95 3.44 ± 0.71 2.98 ± 1.06 3.06 ± 0.89 2.06 ± 1.29 1.62 ± 1.09 1.56 ± 0.96 3.06 ± 0.85
Q15 3.09 ± 0.93 3.53 ± 0.62 3.03 ± 1.0 3.0 ± 0.88 2.81 ± 1.0 3.31 ± 0.64 3.06 ± 0.98 3.0 ± 0.76 1.81 ± 0.91 1.38 ± 0.62 1.69 ± 1.01 2.94 ± 0.77
Q16 3.02 ± 0.79 3.5 ± 0.67 3.19 ± 0.92 3.05 ± 0.88 2.8 ± 0.89 3.27 ± 0.74 3.11 ± 0.89 2.91 ± 0.87 1.56 ± 0.89 1.19 ± 0.54 1.19 ± 0.54 2.94 ± 0.85
Q17 3.06 ± 0.93 3.33 ± 0.72 3.04 ± 0.92 3.1 ± 0.78 3.06 ± 0.84 3.23 ± 0.78 3.1 ± 0.86 2.96 ± 0.8 1.62 ± 0.72 1.44 ± 0.89 1.25 ± 0.45 2.94 ± 0.77
Q18 2.98 ± 1.02 3.65 ± 0.56 2.67 ± 1.12 2.96 ± 1.01 3.12 ± 0.94 3.52 ± 0.65 2.75 ± 1.08 3.0 ± 0.97 1.25 ± 0.58 1.31 ± 0.6 1.31 ± 0.6 3.06 ± 0.93
Q19 3.06 ± 0.95 2.73 ± 1.2 2.73 ± 1.01 2.62 ± 1.12 3.0 ± 0.92 2.79 ± 1.03 2.77 ± 0.97 2.69 ± 0.99 1.69 ± 0.87 1.75 ± 0.93 1.94 ± 1.0 2.25 ± 1.0
Q20 3.46 ± 0.77 3.67 ± 0.52 3.1 ± 1.04 3.21 ± 0.97 3.44 ± 0.77 3.4 ± 0.71 3.08 ± 0.92 3.19 ± 0.87 1.38 ± 0.62 1.62 ± 0.89 1.62 ± 0.96 3.12 ± 0.89
Q21 3.5 ± 0.8 3.65 ± 0.56 3.27 ± 0.84 3.33 ± 0.83 3.44 ± 0.87 3.46 ± 0.54 3.1 ± 0.86 3.21 ± 0.92 1.19 ± 0.54 1.31 ± 0.87 1.62 ± 0.72 3.25 ± 0.86
Q22 2.77 ± 1.04 3.44 ± 0.65 3.27 ± 0.82 3.25 ± 0.84 2.69 ± 0.93 3.23 ± 0.69 3.04 ± 0.9 3.02 ± 0.96 2.19 ± 1.22 1.69 ± 0.95 1.12 ± 0.34 2.88 ± 1.02
Q23 2.83 ± 1.0 3.33 ± 0.69 3.06 ± 0.93 3.06 ± 0.98 2.94 ± 0.84 3.1 ± 0.72 2.88 ± 0.96 2.96 ± 0.94 1.69 ± 1.01 1.44 ± 0.63 1.56 ± 0.89 2.69 ± 1.08
Q24 3.0 ± 0.77 3.4 ± 0.64 3.1 ± 0.83 3.12 ± 1.0 3.0 ± 0.71 3.25 ± 0.6 2.98 ± 0.91 3.06 ± 0.91 1.69 ± 1.01 1.62 ± 1.02 1.25 ± 0.58 3.19 ± 0.75
Mean ± SD (Average) 3.18 ± 0.86 3.39 ± 0.7 3.02 ± 0.95 3.09 ± 0.92 3.1 ± 0.87 3.24 ± 0.73 2.98 ± 0.92 3.01 ± 0.9 1.62 ± 0.85 1.5 ± 0.82 1.53 ± 0.83 3.0 ± 0.86

Table 3 shows a further analysis of the performance of the four generative AI models across the question categories. In the category of Dialysis-Related Issues (Q1–Q7), performance was assessed on Appropriateness, Helpfulness, and Consistency. GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 scored an average of 3.32 ± 0.80, 3.33 ± 0.78, 3.00 ± 0.96, and 3.09 ± 0.96, respectively, in Appropriateness. GPT-4 scored highest in Helpfulness (3.22 ± 0.87). In terms of Consistency, PaLM 2 scored highest (3.07 ± 0.86).

Table 3.

Comparative performance of AI models on kidney-related queries.

Category Model Appropriateness (Avg ± SD) Helpfulness (Avg ± SD) Consistency (Avg ± SD)
Dialysis-Related Issues (Q1–Q7) GPT-4 3.32 ± 0.80 3.22 ± 0.87 1.46 ± 0.84
GPT-4o 3.33 ± 0.78 3.15 ± 0.80 1.50 ± 0.82
Gemini 1.0 Ultra 3.00 ± 0.96 2.98 ± 0.92 1.65 ± 0.95
PaLM 2 3.09 ± 0.96 3.03 ± 0.93 3.07 ± 0.86
Kidney Examinations (Q8–Q14) GPT-4 3.21 ± 0.86 3.07 ± 0.88 1.78 ± 0.86
GPT-4o 3.39 ± 0.66 3.29 ± 0.69 1.54 ± 0.85
Gemini 1.0 Ultra 3.00 ± 0.94 2.97 ± 0.90 1.53 ± 0.90
PaLM 2 3.12 ± 0.86 3.02 ± 0.87 3.04 ± 0.81
Kidney-Related Queries (Q15–Q21) GPT-4 3.17 ± 0.88 3.10 ± 0.89 1.50 ± 0.73
GPT-4o 3.44 ± 0.69 3.28 ± 0.73 1.43 ± 0.76
Gemini 1.0 Ultra 3.00 ± 0.98 3.00 ± 0.94 1.52 ± 0.75
PaLM 2 3.04 ± 0.92 2.99 ± 0.88 2.93 ± 0.87
Renal Biopsy Reports (Q22–Q24) GPT-4 2.87 ± 0.93 2.88 ± 0.83 1.85 ± 1.08
GPT-4o 3.39 ± 0.66 3.19 ± 0.67 1.58 ± 0.87
Gemini 1.0 Ultra 3.15 ± 0.86 2.97 ± 0.92 1.31 ± 0.60
PaLM 2 3.15 ± 0.94 3.01 ± 0.94 2.92 ± 0.95

Performance metrics are presented as average scores (Avg) with standard deviations (SD) for three categories: Appropriateness, Helpfulness, and Consistency.

In the Kidney Examinations category (Q8–Q14), GPT-4o achieved a score of 3.39 ± 0.66 in Appropriateness, and also outperformed the others in Helpfulness with a score of 3.29 ± 0.69. PaLM 2 led in Consistency, scoring 3.04 ± 0.81.

For Kidney-Related Queries (Q15–Q21), GPT-4o registered the highest scores in Appropriateness (3.44 ± 0.69) and Helpfulness (3.28 ± 0.73), while PaLM 2 led in Consistency (2.93 ± 0.87). For Renal Biopsy Reports (Q22–Q24), GPT-4o scored highest in Appropriateness (3.39 ± 0.66) and Helpfulness (3.19 ± 0.67), and PaLM 2 led in Consistency (2.92 ± 0.95).

Table 4 presents the Shapiro–Wilk test results for assessing the normality of score distributions across three evaluation categories—Appropriateness, Helpfulness, and Inconsistency—for each question (Q1–Q24) and each language model (GPT-4, GPT-4o, Gemini 1.0 Ultra, PaLM 2). For the Appropriateness assessment, GPT-4 demonstrated normal or near-normal distribution in eight questions (Q6, Q7, Q8, Q13, Q17, Q22, Q23, and Q24), with the highest p-value observed for Q22 (p = 0.828), indicating the closest fit to a normal distribution. GPT-4o, by contrast, showed normal or near-normal distribution in only two questions (Q1 and Q19), suggesting a limited alignment with normality for its data. Gemini 1.0 Ultra exhibited normal or near-normal distributions in 10 questions (Q3, Q6, Q7, Q8, Q10, Q12, Q13, Q19, Q23, and Q24), while PaLM 2 achieved normal or near-normal distributions in nine questions (Q2, Q5, Q9, Q10, Q12, Q16, Q19, Q22, and Q23), with notable consistency in Q16 and Q22.

Table 4.

Shapiro–Wilk normality test results for Appropriateness, Helpfulness, and Inconsistency evaluations across different LLM models (GPT-4, GPT-4o, Gemini 1.0 Ultra, PaLM 2) for each question.

Categories Appropriateness
Statistics Statistic Degrees of freedom Significance
No./model GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2 GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2 GPT-4 GPT-4o Gemini 1.0 Ultra PaLM 2
Q1 0.844 0.9 0.829 0.831 16 16 16 16 0.011 0.079 0.007 0.007
Q2 0.69 0.716 0.884 0.903 16 16 16 16 0.000 0.045 0.045 0.089
Q3 0.757 0.799 0.906 0.885 16 16 16 16 0.001 0.003 0.099 0.047
Q4 0.793 0.814 0.864 0.865 16 16 16 16 0.002 0.004 0.022 0.023
Q5 0.837 0.796 0.834 0.898 16 16 16 16 0.009 0.002 0.008 0.075
Q6 0.915 0.828 0.892 0.868 16 16 16 16 0.140 0.007 0.059 0.025
Q7 0.941 0.833 0.918 0.846 16 16 16 16 0.362 0.008 0.158 0.012
Q8 0.924 0.885 0.933 0.856 16 16 16 16 0.195 0.046 0.269 0.017
Q9 0.864 0.839 0.781 0.904 16 16 16 16 0.022 0.009 0.002 0.093
Q10 0.886 0.844 0.897 0.894 16 16 16 16 0.049 0.011 0.072 0.064
Q11 0.874 0.842 0.85 0.866 16 16 16 16 0.031 0.01 0.013 0.024
Q12 0.834 0.879 0.897 0.889 16 16 16 16 0.008 0.038 0.072 0.054
Q13 0.943 0.856 0.893 0.877 16 16 16 16 0.383 0.017 0.062 0.035
Q14 0.782 0.74 0.844 0.846 16 16 16 16 0.002 0.000 0.011 0.012
Q15 0.797 0.741 0.865 0.901 16 16 16 16 0.003 0.000 0.023 0.118
Q16 0.869 0.788 0.831 0.918 16 16 16 16 0.027 0.002 0.007 0.154
Q17 0.889 0.875 0.869 0.876 16 16 16 16 0.053 0.032 0.027 0.033
Q18 0.85 0.695 0.866 0.831 16 16 16 16 0.014 0.000 0.024 0.007
Q19 0.877 0.939 0.945 0.946 16 16 16 16 0.034 0.336 0.421 0.431
Q20 0.766 0.701 0.816 0.814 16 16 16 16 0.001 0.000 0.004 0.004
Q21 0.668 0.682 0.814 0.779 16 16 16 16 0.000 0.000 0.004 0.001
Q22 0.969 0.823 0.775 0.794 16 16 16 16 0.828 0.006 0.001 0.002
Q23 0.933 0.86 0.893 0.931 16 16 16 16 0.269 0.019 0.062 0.257
Q24 0.887 0.857 0.898 0.841 16 16 16 16 0.05 0.017 0.074 0.01
Helpfulness
Q1 0.876 0.935 0.884 0.854 16 16 16 16 0.034 0.297 0.045 0.016
Q2 0.690 0.838 0.792 0.928 16 16 16 16 0.000 0.009 0.002 0.227
Q3 0.790 0.864 0.894 0.910 16 16 16 16 0.002 0.022 0.064 0.115
Q4 0.745 0.837 0.882 0.833 16 16 16 16 0.001 0.009 0.042 0.008
Q5 0.843 0.840 0.875 0.900 16 16 16 16 0.011 0.010 0.032 0.082
Q6 0.910 0.880 0.911 0.880 16 16 16 16 0.118 0.039 0.123 0.039
Q7 0.897 0.910 0.850 0.923 16 16 16 16 0.071 0.114 0.013 0.188
Q8 0.884 0.931 0.957 0.887 16 16 16 16 0.045 0.255 0.602 0.050
Q9 0.857 0.839 0.821 0.867 16 16 16 16 0.018 0.009 0.005 0.024
Q10 0.888 0.860 0.901 0.898 16 16 16 16 0.051 0.019 0.083 0.074
Q11 0.880 0.894 0.896 0.924 16 16 16 16 0.039 0.065 0.071 0.194
Q12 0.831 0.801 0.900 0.887 16 16 16 16 0.007 0.003 0.081 0.050
Q13 0.915 0.920 0.890 0.897 16 16 16 16 0.142 0.168 0.055 0.071
Q14 0.849 0.843 0.849 0.855 16 16 16 16 0.013 0.011 0.013 0.016
Q15 0.932 0.878 0.848 0.892 16 16 16 16 0.266 0.037 0.013 0.060
Q16 0.966 0.915 0.867 0.918 16 16 16 16 0.770 0.140 0.024 0.154
Q17 0.927 0.841 0.894 0.890 16 16 16 16 0.221 0.010 0.064 0.056
Q18 0.845 0.772 0.887 0.908 16 16 16 16 0.011 0.001 0.050 0.108
Q19 0.867 0.947 0.952 0.873 16 16 16 16 0.024 0.449 0.527 0.030
Q20 0.764 0.824 0.870 0.859 16 16 16 16 0.001 0.006 0.027 0.019
Q21 0.729 0.676 0.919 0.833 16 16 16 16 0.000 0.000 0.162 0.008
Q22 0.946 0.881 0.902 0.895 16 16 16 16 0.426 0.041 0.085 0.067
Q23 0.910 0.883 0.898 0.937 16 16 16 16 0.115 0.043 0.074 0.312
Q24 0.889 0.808 0.929 0.891 16 16 16 16 0.053 0.003 0.231 0.058
Inconsistency
Q1 0.749 0.796 0.587 0.778 16 16 16 16 0.001 0.002 0.000 0.001
Q2 0.379 0.405 0.597 0.825 16 16 16 16 0.000 0.000 0.000 0.006
Q3 0.577 0.648 0.775 0.788 16 16 16 16 0.000 0.000 0.001 0.002
Q4 0.507 0.568 0.564 0.741 16 16 16 16 0.000 0.000 0.000 0.000
Q5 0.466 0.575 0.710 0.872 16 16 16 16 0.000 0.000 0.000 0.029
Q6 0.466 0.697 0.477 0.832 16 16 16 16 0.000 0.000 0.000 0.008
Q7 0.780 0.697 0.851 0.825 16 16 16 16 0.001 0.000 0.014 0.006
Q8 0.751 0.507 0.575 0.787 16 16 16 16 0.001 0.000 0.000 0.002
Q9 0.710 0.678 0.695 0.832 16 16 16 16 0.000 0.000 0.000 0.008
Q10 0.670 0.738 0.688 0.814 16 16 16 16 0.000 0.000 0.000 0.004
Q11 0.781 0.273 0.743 0.859 16 16 16 16 0.002 0.000 0.001 0.019
Q12 0.637 0.701 0.653 0.825 16 16 16 16 0.000 0.000 0.000 0.006
Q13 0.649 0.757 0.414 0.814 16 16 16 16 0.001 0.001 0.000 0.004
Q14 0.745 0.631 0.653 0.832 16 16 16 16 0.001 0.000 0.000 0.008
Q15 0.738 0.648 0.710 0.819 16 16 16 16 0.000 0.000 0.000 0.005
Q16 0.688 0.405 0.405 0.796 16 16 16 16 0.000 0.000 0.000 0.002
Q17 0.768 0.575 0.546 0.803 16 16 16 16 0.001 0.000 0.000 0.003
Q18 0.507 0.587 0.587 0.848 16 16 16 16 0.000 0.000 0.000 0.013
Q19 0.718 0.786 0.828 0.833 16 16 16 16 0.000 0.002 0.006 0.008
Q20 0.648 0.670 0.707 0.827 16 16 16 16 0.000 0.000 0.000 0.006
Q21 0.405 0.414 0.768 0.778 16 16 16 16 0.000 0.000 0.001 0.001
Q22 0.808 0.751 0.398 0.862 16 16 16 16 0.003 0.001 0.000 0.021
Q23 0.710 0.695 0.688 0.869 16 16 16 16 0.000 0.000 0.000 0.026
Q24 0.710 0.655 0.507 0.809 16 16 16 16 0.000 0.000 0.000 0.004

For the Helpfulness evaluation, GPT-4 exhibited normal or near-normal distributions in 10 questions (Q6, Q7, Q10, Q13, Q15, Q16, Q17, Q22, Q23, and Q24). GPT-4o displayed normality in seven questions (Q1, Q7, Q8, Q11, Q13, Q16, and Q19). Gemini 1.0 Ultra showed broad coverage in this category, with 13 questions exhibiting normal or near-normal distributions (Q3, Q6, Q8, Q10, Q11, Q12, Q13, Q17, Q18, Q19, Q21, Q23, and Q24), including the highest p-value for Q8 (p = 0.602). PaLM 2 showed the broadest coverage, with 15 questions achieving normal or near-normal distributions (Q2, Q3, Q5, Q7, Q8, Q10, Q11, Q13, Q15, Q16, Q17, Q18, Q22, Q23, and Q24).

In contrast, the Inconsistency evaluation revealed significant deviations from normality across all models, with only a few questions approaching normal distributions. However, none met the assumptions of a standard normal distribution. These findings suggest that inconsistency scores are heavily influenced by variations in response consistency across models, resulting in strongly asymmetric or anomalous distribution patterns.
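The per-question normality screening above can be sketched with a Shapiro–Wilk test applied to each model's 16 evaluator ratings, as in the following minimal example. The rating vectors here are hypothetical placeholders, not the study's actual scores.

```python
# Illustrative sketch of the per-question Shapiro-Wilk normality screening.
# The 16 ratings per model are hypothetical placeholders, not study data.
from scipy.stats import shapiro

ratings = {  # hypothetical 1-4 Likert ratings from 16 evaluators for one question
    "GPT-4":            [3, 4, 3, 3, 2, 4, 3, 3, 4, 3, 2, 3, 4, 3, 3, 2],
    "GPT-4o":           [4, 4, 3, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 3],
    "Gemini 1.0 Ultra": [3, 3, 2, 3, 4, 3, 3, 2, 3, 3, 4, 3, 3, 2, 3, 3],
    "PaLM 2":           [3, 2, 3, 3, 3, 4, 2, 3, 3, 3, 3, 4, 2, 3, 3, 3],
}

for model, scores in ratings.items():
    w, p = shapiro(scores)  # W statistic and its p-value, as tabulated above
    verdict = "near-normal" if p > 0.05 else "non-normal"
    print(f"{model}: W={w:.3f}, p={p:.3f} ({verdict})")
```

A p-value above the 0.05 threshold (as for Q8 under Gemini 1.0 Ultra, p = 0.602) indicates no detectable departure from normality, which is what the text above refers to as a "normal or near-normal distribution".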

Table 5 presents the performance of the generative AI models on human-like empathy. The comparative analysis of human-like empathy demonstrated in responses from GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 across 24 questions revealed notable variations in performance among the models. On average, GPT-4 achieved the highest empathy score (80.73%), followed by GPT-4o (76.56%), PaLM 2 (75.96%), and Gemini 1.0 Ultra (68.58%). GPT-4 consistently outperformed the other models in most questions, with standout performances in Q2 (95.83%), Q16 (89.58%), and Q24 (87.50%), reflecting its strong ability to generate empathetic responses across varied scenarios.

Table 5.

Comparative analysis of human-like empathy demonstration in responses by AI models across 24 questions.

Question GPT-4%(Count) GPT-4o %(Count) Gemini 1.0 Ultra %(Count) PaLM 2%(Count)
Q1 77.08% (37) 75.00% (36) 87.50% (42) 79.17% (38)
Q2 95.83% (46) 79.17% (38) 72.92% (35) 87.50% (42)
Q3 85.42% (41) 70.83% (34) 75.00% (36) 85.42% (41)
Q4 85.42% (41) 62.50% (30) 62.50% (30) 91.67% (44)
Q5 85.42% (41) 79.17% (38) 47.92% (23) 70.83% (34)
Q6 77.08% (37) 75.00% (36) 47.92% (23) 70.83% (34)
Q7 87.50% (42) 79.17% (38) 64.58% (31) 70.83% (34)
Q8 83.33% (40) 83.33% (40) 77.08% (37) 79.17% (38)
Q9 64.58% (31) 64.58% (31) 60.42% (29) 60.42% (29)
Q10 75.00% (36) 72.92% (35) 70.83% (34) 66.67% (32)
Q11 62.50% (30) 66.67% (32) 77.08% (37) 68.75% (33)
Q12 70.83% (34) 72.92% (35) 81.25% (39) 70.83% (34)
Q13 70.83% (34) 85.42% (41) 75.00% (36) 75.00% (36)
Q14 83.33% (40) 83.33% (40) 70.83% (34) 81.25% (39)
Q15 85.42% (41) 75.00% (36) 81.25% (39) 87.50% (42)
Q16 89.58% (43) 75.00% (36) 85.42% (41) 81.25% (39)
Q17 85.42% (41) 79.17% (38) 58.33% (28) 60.42% (29)
Q18 85.42% (41) 70.83% (34) 43.75% (21) 77.08% (37)
Q19 77.08% (37) 83.33% (40) 75.00% (36) 75.00% (36)
Q20 79.17% (38) 75.00% (36) 43.75% (21) 68.75% (33)
Q21 85.42% (41) 87.50% (42) 64.58% (31) 83.33% (40)
Q22 72.92% (35) 79.17% (38) 79.17% (38) 75.00% (36)
Q23 85.42% (41) 79.17% (38) 58.33% (28) 79.17% (38)
Q24 87.50% (42) 83.33% (40) 85.42% (41) 77.08% (37)
Average 80.73% (39) 76.56% (37) 68.58% (33) 75.96% (36)

GPT-4o demonstrated robust performance in questions such as Q13 (85.42%) and Q21 (87.50%), where it surpassed GPT-4 in demonstrating empathy. PaLM 2 also showed competitive performance, excelling in Q4 (91.67%) and Q15 (87.50%), highlighting its capability to match or even exceed GPT-4 in specific contexts. However, Gemini lagged behind in overall empathy, particularly in questions like Q5 (47.92%) and Q6 (47.92%), where its scores were significantly lower compared to the other models.

Despite these variations, all models showed comparable performance in certain questions, such as Q8 and Q24, where empathy scores were closely aligned across models. This indicates that while some generative AI models consistently excel in demonstrating empathy, the context and nature of the questions also play a significant role in their relative performance. Overall, GPT-4 emerged as the most consistent and empathetic model, while GPT-4o and PaLM 2 showed specific strengths, and Gemini 1.0 Ultra demonstrated room for improvement in empathetic communication.
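The percentages in Table 5 follow directly from the raw counts, assuming each question received 48 empathy judgments (16 evaluators × 3 repeated responses, consistent with the counts shown); a minimal sketch:

```python
# Convert raw empathy counts to the percentages shown in Table 5,
# assuming 48 ratings per question (16 evaluators x 3 repetitions).
N_RATINGS = 16 * 3

def empathy_pct(count: int) -> float:
    """Share of ratings judged 'human-like empathy present', as a percent."""
    return round(100 * count / N_RATINGS, 2)

# Q1 counts from Table 5: GPT-4 37, GPT-4o 36, Gemini 42, PaLM 2 38
for model, count in [("GPT-4", 37), ("GPT-4o", 36),
                     ("Gemini 1.0 Ultra", 42), ("PaLM 2", 38)]:
    print(f"{model}: {empathy_pct(count)}%")  # matches 77.08, 75.0, 87.5, 79.17
```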

The Kruskal–Wallis test was used to evaluate the performance of the four large language models (LLMs; GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2) across three evaluation categories: Appropriateness, Helpfulness, and Inconsistency, measured across multiple questions (Q1–Q24) (Table 6). In the Appropriateness category, GPT-4o consistently demonstrated higher mean ranks across most questions, with notable performance in Q20 (mean rank 3.667) and Q21 (mean rank 3.646). PaLM 2 also showed competitive results, particularly in Q21 (mean rank 3.333). GPT-4 and Gemini 1.0 Ultra achieved moderate performance in this category, with occasional peaks, such as Q10 for GPT-4 (mean rank 3.271) and Q13 for Gemini 1.0 Ultra (mean rank 3.271). For Helpfulness, GPT-4o again displayed strong results, leading in questions such as Q18 (mean rank 3.521) and Q20 (mean rank 3.396). PaLM 2 performed consistently across the questions, with mean ranks generally above 3.0. Gemini 1.0 Ultra and GPT-4 ranked similarly, with comparable performance in questions like Q10 and Q13. In the Inconsistency category, PaLM 2 had the highest scores in several questions, reflecting less variability in its outputs, as seen in Q1 (mean rank 3.25) and Q21 (mean rank 3.25). GPT-4o demonstrated lower ranks for inconsistency, with standout performance in Q2 (mean rank 1.188) and Q16 (mean rank 1.188), indicating fewer inconsistencies. GPT-4 and Gemini 1.0 Ultra showed moderate inconsistency rankings, often lower than PaLM 2 but higher than GPT-4o.

Table 6.

Kruskal–Wallis test mean ranks for different LLM models (GPT-4, GPT-4o, Gemini 1.0 Ultra, PaLM 2) performance across evaluation categories.

Categories Appropriate (N = 16) Helpful (N = 16) Inconsistent (N = 16)
No. Model Mean rank Mean rank Mean rank
Q1 GPT-4 3.292 3.125 1.875
GPT-4o 3.375 3.188 1.813
Gemini 1.0 Ultra 3.167 3.167 1.313
PaLM 2 3.208 3.146 3.25
Q2 GPT-4 3.542 3.479 1.25
GPT-4o 3.646 3.375 1.188
Gemini 1.0 Ultra 3.021 3.125 1.438
PaLM 2 2.917 2.979 3
Q3 GPT-4 3.396 3.333 1.375
GPT-4o 3.542 3.375 1.375
Gemini 1.0 Ultra 3 3.021 2.125
PaLM 2 3.104 3.042 3.313
Q4 GPT-4 3.354 3.333 1.25
GPT-4o 2.938 2.75 1.688
Gemini 1.0 Ultra 3.25 3.146 1.438
PaLM 2 3.063 3.063 3.063
Q5 GPT-4 3.271 3.146 1.313
GPT-4o 3.292 3.208 1.438
Gemini 1.0 Ultra 2.875 2.812 1.688
PaLM 2 2.979 2.917 2.813
Q6 GPT-4 3.25 3.104 1.313
GPT-4o 3.271 3.146 1.5
Gemini 1.0 Ultra 2.833 2.771 1.438
PaLM 2 3.167 3.104 3.063
Q7 GPT-4 3.125 3.021 1.875
GPT-4o 3.271 3 1.5
Gemini 1.0 Ultra 2.854 2.813 2.125
PaLM 2 3.167 2.958 3
Q8 GPT-4 3 2.979 1.688
GPT-4o 3.354 3.313 1.25
Gemini 1.0 Ultra 2.708 2.604 1.438
PaLM 2 2.958 3 3.188
Q9 GPT-4 3.375 3.312 1.688
GPT-4o 3.438 3.333 1.563
Gemini 1.0 Ultra 3.271 3.146 1.438
PaLM 2 3.104 3.021 3.063
Q10 GPT-4 3.271 3.104 1.625
GPT-4o 3.292 3.229 1.563
Gemini 1.0 Ultra 2.958 2.875 1.563
PaLM 2 3.021 2.792 2.875
Q11 GPT-4 3.208 2.937 1.875
GPT-4o 3.229 3.083 1.125
Gemini 1.0 Ultra 3.167 2.938 1.813
PaLM 2 3.104 3.042 3
Q12 GPT-4 3.208 3.125 1.5
GPT-4o 3.417 3.292 1.813
Gemini 1.0 Ultra 3.104 3.167 1.563
PaLM 2 3.271 3.062 3
Q13 GPT-4 3.063 2.854 2
GPT-4o 3.417 3.375 1.813
Gemini 1.0 Ultra 2.896 3.063 1.313
PaLM 2 3.146 3.146 3.125
Q14 GPT-4 3.313 3.167 2.063
GPT-4o 3.583 3.437 1.625
Gemini 1.0 Ultra 2.917 2.979 1.563
PaLM 2 3.208 3.063 3.063
Q15 GPT-4 3.094 2.813 1.813
GPT-4o 3.531 3.313 1.375
Gemini 1.0 Ultra 3.031 3.063 1.688
PaLM 2 3 3 2.938
Q16 GPT-4 3.016 2.797 1.563
GPT-4o 3.5 3.266 1.188
Gemini 1.0 Ultra 3.188 3.109 1.188
PaLM 2 3.047 2.906 2.938
Q17 GPT-4 3.062 3.062 1.625
GPT-4o 3.333 3.229 1.438
Gemini 1.0 Ultra 3.042 3.104 1.25
PaLM 2 3.104 2.958 2.938
Q18 GPT-4 2.979 3.125 1.25
GPT-4o 3.646 3.521 1.313
Gemini 1.0 Ultra 2.667 2.75 1.313
PaLM 2 2.958 3 3.063
Q19 GPT-4 3.062 3 1.688
GPT-4o 2.729 2.792 1.75
Gemini 1.0 Ultra 2.729 2.771 1.938
PaLM 2 2.625 2.688 2.25
Q20 GPT-4 3.458 3.437 1.375
GPT-4o 3.667 3.396 1.625
Gemini 1.0 Ultra 3.104 3.083 1.625
PaLM 2 3.208 3.188 3.125
Q21 GPT-4 3.5 3.437 1.188
GPT-4o 3.646 3.458 1.313
Gemini 1.0 Ultra 3.271 3.104 1.625
PaLM 2 3.333 3.208 3.25
Q22 GPT-4 2.771 2.688 2.188
GPT-4o 3.438 3.229 1.688
Gemini 1.0 Ultra 3.271 3.042 1.125
PaLM 2 3.25 3.021 2.875
Q23 GPT-4 2.833 2.937 1.688
GPT-4o 3.333 3.104 1.438
Gemini 1.0 Ultra 3.063 2.875 1.563
PaLM 2 3.063 2.958 2.688
Q24 GPT-4 3 3 1.688
GPT-4o 3.396 3.25 1.625
Gemini 1.0 Ultra 3.104 2.979 1.25
PaLM 2 3.125 3.062 3.188

Table 7 provides the Kruskal–Wallis test results comparing the performance of the four language models (GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2) across the evaluation criteria (Appropriateness, Helpfulness, and Inconsistency). The evaluation covers each question (Q1–Q24) to assess performance differences among the models.

Table 7.

Kruskal–Wallis test results for model comparisons across different evaluation criteria (appropriate, helpful, inconsistent).

Appropriate Helpful Inconsistent
No. Kruskal–Wallis H df Asymp. Sig. Kruskal–Wallis H df Asymp. Sig. Kruskal–Wallis H df Asymp. Sig.
Q1 0.650 3 0.885 0.207 3 0.977 23.750 3 0.000**
Q2 9.851 3 0.020* 4.533 3 0.209 33.166 3 0.000**
Q3 4.539 3 0.209 3.567 3 0.312 29.271 3 0.000**
Q4 0.778 3 0.855 2.248 3 0.523 23.425 3 0.000**
Q5 2.112 3 0.550 2.290 3 0.514 20.040 3 0.000**
Q6 2.040 3 0.564 1.643 3 0.650 27.232 3 0.000**
Q7 2.305 3 0.511 0.758 3 0.860 16.946 3 0.001**
Q8 5.229 3 0.156 5.742 3 0.125 28.683 3 0.000**
Q9 1.919 3 0.589 1.492 3 0.684 23.092 3 0.000**
Q10 3.268 3 0.352 3.467 3 0.325 21.349 3 0.000**
Q11 0.427 3 0.935 0.731 3 0.866 25.021 3 0.000**
Q12 1.006 3 0.800 1.302 3 0.729 19.007 3 0.000**
Q13 3.144 3 0.370 4.506 3 0.212 31.557 3 0.000**
Q14 3.266 3 0.352 2.237 3 0.525 16.641 3 0.001**
Q15 4.543 3 0.209 3.623 3 0.305 21.664 3 0.000**
Q16 4.127 3 0.248 4.027 3 0.259 34.996 3 0.000**
Q17 1.197 3 0.754 1.724 3 0.632 28.234 3 0.000**
Q18 8.064 3 0.045* 5.156 3 0.161 32.453 3 0.000**
Q19 2.860 3 0.414 1.333 3 0.721 3.300 3 0.348
Q20 4.736 3 0.192 2.722 3 0.437 24.350 3 0.000**
Q21 2.719 3 0.437 3.796 3 0.284 33.826 3 0.000**
Q22 7.590 3 0.055 6.002 3 0.112 21.292 3 0.000**
Q23 3.121 3 0.373 1.103 3 0.776 13.532 3 0.004*
Q24 2.400 3 0.494 0.663 3 0.882 28.138 3 0.000**

Df: degrees of freedom; Asymp Sig: asymptotic significance (p-value).

* p < 0.05.

** p < 0.001.

The Kruskal–Wallis H test revealed significant differences among the models in specific evaluation dimensions. For Appropriateness, significant differences were observed in Q2 (H = 9.85, p = 0.0199) and Q18 (H = 8.06, p = 0.0447), whereas the Helpfulness dimension showed no significant differences for any question (all p > 0.05). In contrast, the Inconsistency dimension revealed significant differences for most questions (p < 0.05), with Q19 the sole exception. Post-hoc analysis using the Mann–Whitney U test with Holm correction further explored these differences. In the Appropriateness dimension, the differences in Q2 and Q18 identified by the Kruskal–Wallis test did not survive correction, suggesting they were not statistically robust. In the Inconsistency dimension, GPT-4, GPT-4o, and Gemini 1.0 Ultra showed similar response patterns: no significant differences were observed between GPT-4 and GPT-4o, or between GPT-4o and Gemini 1.0 Ultra, in most questions, and any initially significant results, such as for Q11, lost significance after correction. PaLM 2, by contrast, exhibited significant differences from the other models across most questions, particularly Q1–Q18, Q20, Q21, and Q24, even after Holm correction, with Q19 again the exception. Overall, while differences among models were limited in the Appropriateness and Helpfulness dimensions, the Inconsistency dimension revealed substantial distinctions, underscoring the notably divergent response patterns of PaLM 2 compared with GPT-4, GPT-4o, and Gemini 1.0 Ultra.
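The omnibus-then-post-hoc procedure described above can be sketched as follows; the score vectors are hypothetical placeholders, and the Holm step-down is implemented by hand for transparency.

```python
# Sketch of the omnibus Kruskal-Wallis test followed by pairwise
# Mann-Whitney U tests with a Holm step-down correction.
# The four score vectors are hypothetical placeholders, not study data.
from itertools import combinations
from scipy.stats import kruskal, mannwhitneyu

scores = {
    "GPT-4":            [3, 3, 4, 3, 2, 3, 4, 3, 3, 2, 3, 4, 3, 3, 3, 2],
    "GPT-4o":           [4, 4, 3, 4, 4, 3, 4, 4, 3, 4, 4, 4, 3, 4, 4, 3],
    "Gemini 1.0 Ultra": [3, 2, 3, 3, 3, 2, 3, 3, 4, 3, 3, 2, 3, 3, 3, 3],
    "PaLM 2":           [2, 2, 3, 2, 3, 2, 2, 3, 2, 3, 2, 2, 3, 2, 2, 3],
}

h, p = kruskal(*scores.values())  # omnibus test across the four models
print(f"Kruskal-Wallis H={h:.3f}, p={p:.4f}")

# Post-hoc pairwise comparisons (typically reported only when the
# omnibus test is significant).
pairs = list(combinations(scores, 2))
raw = [mannwhitneyu(scores[a], scores[b], alternative="two-sided").pvalue
       for a, b in pairs]

# Holm step-down: sort p-values ascending, multiply the k-th smallest by
# (m - k), and enforce monotonicity of the adjusted values.
order = sorted(range(len(raw)), key=raw.__getitem__)
adjusted, running_max = [0.0] * len(raw), 0.0
for rank, idx in enumerate(order):
    running_max = max(running_max, (len(raw) - rank) * raw[idx])
    adjusted[idx] = min(1.0, running_max)

for (a, b), p_adj in zip(pairs, adjusted):
    print(f"{a} vs {b}: Holm-adjusted p={p_adj:.4f}")
```

With four models there are six pairwise comparisons, so the smallest raw p-value is multiplied by 6, the next by 5, and so on, mirroring the correction applied in the study.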

Table 8 shows the comparative analysis of language models for kidney-related content. The comparative analysis of the language models (GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2) across four kidney-related content categories—Dialysis-Related Issues (Q1–Q7), Kidney Examinations (Q8–Q14), Kidney-Related Queries (Q15–Q21), and Renal Biopsy Reports (Q22–Q24)—revealed distinct performance patterns. For Dialysis-Related Issues, GPT-4o achieved the highest BertScore (0.54 ± 0.02) and ROUGE (one word) score (0.50 ± 0.02), while PaLM 2 led in TF-IDF (0.54 ± 0.10) and ROUGE (longest similar sentences) score (0.38 ± 0.09). Notably, Gemini 1.0 Ultra showed competitive ROUGE (two words) performance (0.28 ± 0.11), highlighting its strength in multi-word content generation. In the Kidney Examinations category, PaLM 2 exhibited the highest TF-IDF (0.60 ± 0.12) and strong ROUGE (longest similar sentences) performance (0.36 ± 0.03), while GPT-4o maintained superiority in BertScore (0.54 ± 0.03) and ROUGE (one word) (0.49 ± 0.03). Gemini 1.0 Ultra's performance in this category was more variable, with lower scores in ROUGE (one word) and ROUGE (two words), but its TF-IDF (0.49 ± 0.10) was comparable to GPT-4 and GPT-4o. For Kidney-Related Queries, GPT-4o again demonstrated strong performance, achieving the highest BertScore (0.57 ± 0.06) and ROUGE (one word) (0.54 ± 0.06), while PaLM 2 performed well in ROUGE (two words) (0.31 ± 0.05). GPT-4 displayed consistent performance across all metrics, particularly in ROUGE (one word) (0.54 ± 0.08). Gemini 1.0 Ultra exhibited moderate performance, with slightly lower scores in multi-word ROUGE metrics. In the Renal Biopsy Reports category, Gemini 1.0 Ultra stood out with the highest scores in TF-IDF (0.56 ± 0.13), ROUGE (longest similar sentences) (0.47 ± 0.10), and ROUGE (two words) (0.38 ± 0.14), reflecting its ability to generate coherent and contextually relevant outputs for complex content.
GPT-4 and GPT-4o performed consistently in ROUGE (one word) and BertScore metrics, with GPT-4o achieving a BertScore of 0.46 ± 0.05. PaLM 2, however, showed the weakest performance in this category, with lower scores across most metrics, particularly in ROUGE (longest similar sentences) (0.15 ± 0.02).

Table 8.

Comparative analysis of language models in kidney-related content.

Category Model TF-IDF (key words) BertScore ROUGE (one word) ROUGE (two words) ROUGE (longest similar sentences)
Dialysis-Related Issues (Q1–Q7) GPT-4 0.49 ± 0.07 0.51 ± 0.04 0.49 ± 0.05 0.23 ± 0.04 0.34 ± 0.03
Gemini 1.0 Ultra 0.44 ± 0.12 0.52 ± 0.11 0.23 ± 0.04 0.28 ± 0.11 0.40 ± 0.11
PaLM 2 0.54 ± 0.10 0.51 ± 0.07 0.34 ± 0.03 0.32 ± 0.10 0.38 ± 0.09
GPT-4o 0.48 ± 0.06 0.54 ± 0.02 0.50 ± 0.02 0.23 ± 0.02 0.37 ± 0.03
Kidney Examinations (Q8–Q14) GPT-4 0.46 ± 0.13 0.51 ± 0.06 0.48 ± 0.07 0.22 ± 0.05 0.34 ± 0.05
Gemini 1.0 Ultra 0.49 ± 0.10 0.49 ± 0.10 0.22 ± 0.05 0.22 ± 0.12 0.35 ± 0.11
PaLM 2 0.60 ± 0.12 0.50 ± 0.03 0.34 ± 0.05 0.28 ± 0.04 0.36 ± 0.03
GPT-4o 0.46 ± 0.10 0.54 ± 0.03 0.49 ± 0.03 0.22 ± 0.03 0.35 ± 0.04
Kidney-Related Queries (Q15–Q21) GPT-4 0.55 ± 0.07 0.54 ± 0.05 0.54 ± 0.08 0.27 ± 0.06 0.38 ± 0.06
Gemini 1.0 Ultra 0.46 ± 0.14 0.53 ± 0.16 0.27 ± 0.06 0.30 ± 0.19 0.39 ± 0.17
PaLM 2 0.49 ± 0.09 0.53 ± 0.07 0.38 ± 0.06 0.31 ± 0.05 0.39 ± 0.07
GPT-4o 0.56 ± 0.08 0.57 ± 0.06 0.54 ± 0.06 0.28 ± 0.07 0.40 ± 0.08
Renal Biopsy Reports (Q22–Q24) GPT-4 0.54 ± 0.03 0.51 ± 0.03 0.49 ± 0.03 0.22 ± 0.00 0.34 ± 0.03
Gemini 1.0 Ultra 0.56 ± 0.13 0.55 ± 0.11 0.22 ± 0.00 0.38 ± 0.14 0.47 ± 0.10
PaLM 2 0.40 ± 0.09 0.44 ± 0.05 0.34 ± 0.03 0.15 ± 0.02 0.27 ± 0.03
GPT-4o 0.45 ± 0.08 0.46 ± 0.05 0.46 ± 0.05 0.18 ± 0.04 0.26 ± 0.03

Overall, GPT-4o consistently demonstrated high accuracy and relevance across most categories, particularly excelling in single-word metrics, while PaLM 2 showed strength in multi-word and TF-IDF metrics for certain content types. Gemini 1.0 Ultra's performance was variable but excelled in generating contextually rich outputs for renal biopsy-related tasks. GPT-4 provided balanced and reliable results across categories, maintaining competitive scores throughout.
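Two of the objective metrics used in Table 8 can be illustrated with minimal pure-Python sketches: ROUGE-1 recall (single-word overlap) and a smoothed TF-IDF cosine similarity. These are simplified stand-ins for the study's actual scoring pipeline, and the texts below are toy examples.

```python
# Minimal pure-Python sketches of two of the overlap metrics used above:
# ROUGE-1 recall and TF-IDF cosine similarity. Texts are toy examples.
import math
from collections import Counter

def rouge1_recall(candidate: str, reference: str) -> float:
    """Unigram overlap between candidate and reference, relative to reference length."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum(min(n, cand[tok]) for tok, n in ref.items())
    return overlap / max(1, sum(ref.values()))

def tfidf_cosine(doc_a: str, doc_b: str) -> float:
    """Cosine similarity of smoothed TF-IDF vectors over a two-document corpus."""
    docs = [doc_a.lower().split(), doc_b.lower().split()]
    vocab = set(docs[0]) | set(docs[1])
    # Smoothed IDF keeps weights positive even for tokens in both documents.
    idf = {t: math.log(2 / (1 + sum(t in d for d in docs))) + 1 for t in vocab}
    vecs = [{t: d.count(t) * idf[t] for t in vocab} for d in docs]
    dot = sum(vecs[0][t] * vecs[1][t] for t in vocab)
    norms = [math.sqrt(sum(v[t] ** 2 for t in vocab)) for v in vecs]
    return dot / (norms[0] * norms[1]) if all(norms) else 0.0

reference = "iga nephropathy with mesangial proliferation and mild fibrosis"
candidate = "biopsy shows iga nephropathy with mild mesangial proliferation"
print(f"ROUGE-1 recall: {rouge1_recall(candidate, reference):.2f}")
print(f"TF-IDF cosine:  {tfidf_cosine(candidate, reference):.2f}")
```

ROUGE (one word) rewards shared vocabulary regardless of order, whereas TF-IDF weighting down-ranks tokens common to both texts, which is why the two metrics can rank the same model pair differently, as seen in Table 8.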

Regarding the accuracy of references provided by GPT-4 across the 24 questions, a total of 131 references were initially given. After removal of 30 duplicates, 101 unique references remained: 4 from books, 86 from medical and clinical journals, and 11 from websites. Upon manual verification, all 101 references were confirmed to be valid and existing, yielding a 100% accuracy rate for the references provided by GPT-4. Responses by GPT-4, Gemini 1.0 Ultra, PaLM 2, and GPT-4o are included in Appendices 1, 2, 3, and 4, respectively. The detailed references provided by GPT-4, along with website links, are in Appendix 5.

Discussion

This is the first study to involve sixteen nephrologists and nephrology researchers, each with over five years of experience, from nephrology clinics to medical centers across Taiwan, to evaluate the feasibility of incorporating generative AI in clinical renal settings. Notably, as Gemini 1.0 Ultra and PaLM 2 have since been merged into a single platform, this study uniquely captures and compares the distinct characteristics of these generative AI tools prior to their combination, enabling a meaningful discussion of their differential capabilities and applications.

GPT-4o demonstrated relatively higher performance scores in the categories of Helpfulness and Appropriateness, though its Consistency scores were lower. These metrics—Helpfulness, Consistency, and Appropriateness—are crucial in clinical practice as they directly impact time efficiency and accuracy. High scores in these areas can streamline workflows, reduce redundancies, and enhance diagnostic and treatment precision. In this study, we evaluated GPT-4o across various domains of nephrology clinical services, including Dialysis-Related Issues (Q1–Q7), Kidney Examinations (Q8–Q14), Kidney-Related Queries (Q15–Q21), and Renal Biopsy Report Interpretation (Q22–Q24). GPT-4o consistently outperformed the other models in subjective assessments by nephrologists, highlighting its potential as a decision-support tool in clinical renal settings. When integrated into an electronic health record (EHR) system, 19 GPT-4o could provide timely, more evidence-based recommendations while saving time. This preliminary study aims to develop a communication tool that facilitates more efficient and effective interactions between clinicians and patients: saving time while delivering information that aligns with what clinicians intend to convey and is easily understood by patients. Our results appear promising, suggesting that such a tool could help achieve these objectives. However, further research focusing on real-world implementation and practical application is necessary to validate these findings and assess the tool's impact in clinical settings.

Human-like empathy, a crucial metric in healthcare, plays a significant role in enhancing patient satisfaction by fostering a sense of being understood. While empathy may not directly save time, it minimizes the need for prolonged face-to-face interactions to address patient concerns, thereby contributing to more efficient and satisfactory clinical encounters. LLMs, particularly ChatGPT, have demonstrated the ability to exhibit human-like empathy in healthcare settings, a finding that aligns with our study results.20,21 Qian et al. have proposed methods to further enhance empathetic responses in these models, while Elyoseph et al. highlighted ChatGPT's superior emotional awareness compared to humans, showcasing its ability to effectively understand and conceptualize emotions.20,22 Considering that training clinicians to demonstrate human-like empathy is time-intensive and can be limited by individual personality traits, 21 the observed empathy scores of at least 76% for GPT-4 and GPT-4o in our study suggest that generative AI could serve as a valuable tool for providing empathetic responses to clinical renal questions. This capability has the potential to augment patient interactions in nephrology, improving communication and fostering trust between clinicians and patients.

Our study demonstrated no significant differences in response patterns among GPT-4, GPT-4o, and Gemini 1.0 Ultra for most questions. However, critical distinctions arise in their data sourcing and internet access. PaLM 2, integrated into Gemini Pro, leverages Google's extensive search network to provide real-time internet connectivity, allowing it to deliver up-to-date insights on current events and topics. 23 This real-time access offers a significant advantage for PaLM 2 and Gemini 1.0 Ultra. In contrast, GPT-4 operates without direct internet connectivity, relying on training data current only through 2021 at the time of our study, whereas GPT-4o can incorporate more recent information. 24 Consequently, GPT-4's responses may not include the most recent information, a notable limitation for addressing current topics. In the context of our renal clinical study, where the queries and renal biopsy results did not involve novel exams or treatments, GPT-4, GPT-4o, and Gemini 1.0 Ultra exhibited comparable performance levels, while PaLM 2's performance was significantly lower. Although PaLM 2 has since been integrated into Gemini 1.0 Ultra, our findings highlight the importance of pre-evaluating the performance of LLMs across various aspects before incorporating them into clinical settings.

We further contextualized the results by question category to provide a comprehensive evaluation of generative AI model performance in clinical nephrology scenarios. The normality results highlight key differences in the models' abilities to produce consistent distributions of scores across the evaluation categories—Appropriateness, Helpfulness, and Inconsistency. For Appropriateness, GPT-4 demonstrated near-normal distributions across questions addressing dialysis-related issues (e.g. Q6), kidney examinations (e.g. Q8 and Q13), and renal biopsy reports (e.g. Q22–Q24). Gemini 1.0 Ultra performed well across multiple categories but had stronger coverage in kidney examination-related questions (e.g. Q3, Q8, and Q12). PaLM 2 achieved normality in questions like Q16 and Q22, showing strengths in kidney-related queries and biopsy discussions. These results suggest a nuanced distribution of strengths among the models, with GPT-4 generally excelling in breadth and PaLM 2 showing situational advantages. In the Helpfulness evaluation, Gemini 1.0 Ultra achieved near-normal distributions in kidney examinations and dietary or medication-related queries (e.g. Q6, Q10, and Q18), while PaLM 2 exhibited the broadest coverage, particularly in dialysis-related questions such as Q2 and Q7, as well as in patient-facing issues like diet (Q5) and kidney symptom interpretation (Q15). GPT-4 also performed well, particularly in providing recommendations across a range of topics, but GPT-4o lagged, reflecting less consistency in aligning its responses with clinical expectations. Empathy scores further illuminated these findings. GPT-4 consistently demonstrated the highest empathy levels, excelling in patient-facing questions such as the implications of symptoms (Q16) and post-biopsy care (Q24). PaLM 2 showed competitive empathy in dietary and symptom-related questions (e.g. Q5 and Q15), while GPT-4o had standout performances in specialized areas, such as the evaluation of CKD (e.g. Q19). Gemini 1.0 Ultra, however, lagged in empathy, particularly in addressing patient fears and practical concerns (e.g. Q5 and Q6), highlighting an area for model improvement. These findings underline the need for tailored applications of generative AI models in clinical settings. While GPT-4 performs consistently well across categories, PaLM 2 and GPT-4o demonstrate contextual strengths in specific queries, suggesting that a hybrid approach may maximize utility. Meanwhile, the results for Gemini 1.0 Ultra indicate room for refinement in both consistency and empathy. By aligning model performance with the nature of clinical questions, healthcare professionals can better leverage AI tools for nephrology and other specialties, keeping patient care and communication at the forefront. Future work should aim to optimize these models for both general clinical use and scenario-specific applications. 25

The consistency evaluation, subjectively assessed by nephrologists, was further validated using objective metrics—TF-IDF, BertScore, and ROUGE—which revealed nuanced differences in the capabilities of the evaluated models. Given the complexity of explaining renal biopsy reports to patients compared with more directly targeted questions, our focus shifted to this domain. Gemini 1.0 Ultra demonstrated notable strengths in summarizing intricate content, while GPT-4 excelled in accuracy and relevance, as evidenced by superior TF-IDF and BertScore results.

In our study, we employed a methodology where each nephrology-related question and renal biopsy report was submitted to the AI models three times consecutively. This approach was carefully chosen to assess the consistency and reliability of the responses generated by each model. Since LLMs operate as probabilistic systems, their responses can vary even when presented with the same input multiple times. By repeating each query three times, we aimed to detect any variability and evaluate the stability of the models’ outputs over repeated interactions. This aligns with existing research, such as the study by Shaier et al., 26 which examined the consistency of LLM-generated responses to health-related questions across different languages, highlighting the significance of repeated testing in understanding AI behavior. Beyond consistency, this three-repetition approach was selected as a practical and statistically meaningful strategy for evaluating AI performance. While larger sample sizes might offer more precision, repeating each query three times provides a balance between feasibility and statistical relevance. Prior research, such as Funk et al., 27 investigated ChatGPT's response consistency by administering medical examination questions three times, demonstrating that this approach effectively identifies inconsistencies while maintaining efficiency in data collection and analysis. In medical practice, the reliability of AI-generated responses is critical, particularly when AI is used to support physician–patient communication and patient education. If an AI system provides widely varying responses to identical queries, it raises concerns about its dependability in real-world clinical settings. The study by Yun et al. 28 examined AI's role in questionnaire-based interactions, emphasizing that maintaining consistency in AI-generated responses is essential for ensuring trust and reliability in healthcare applications. 
Our methodology allows us to determine whether AI-generated information remains stable across repeated interactions, ensuring that these models can serve as reliable tools in patient communication and medical education. Further, we asked repeated questions within the same session to allow the model to utilize in-context learning, as this approach more accurately reflects real-world clinical conditions. In actual medical practice, patients and physicians engage in continuous conversations where prior context influences subsequent discussions. By maintaining the same session, we aimed to assess the model's ability to adapt, retain relevant information, and provide contextually appropriate responses, simulating how AI might function as a supportive tool in clinical nephrology settings.
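The three-repetition protocol can be approximated programmatically: submit the same prompt three times within one session and score pairwise lexical agreement among the responses. In this sketch, `query_model` is a hypothetical placeholder for a real LLM API call, and Jaccard token overlap is an illustrative proxy for consistency, not the nephrologists' subjective rating.

```python
# Sketch of the three-repetition consistency check. query_model is a
# hypothetical placeholder for a real LLM API call; Jaccard token overlap
# serves as a crude stand-in for the nephrologists' subjective rating.
from itertools import combinations

def query_model(prompt: str, repetition: int) -> str:
    # Hypothetical: in practice this would call the model's API within
    # a single session. Canned responses simulate three repeated answers.
    canned = [
        "hemodialysis usually runs three times weekly for four hours",
        "hemodialysis is typically done three times a week, four hours each",
        "most patients receive hemodialysis three times weekly",
    ]
    return canned[repetition]

def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (1.0 = identical vocabulary)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

responses = [query_model("How often is hemodialysis performed?", i)
             for i in range(3)]
pairwise = [jaccard(a, b) for a, b in combinations(responses, 2)]
consistency = sum(pairwise) / len(pairwise)  # mean pairwise agreement
print(f"Mean pairwise Jaccard consistency: {consistency:.2f}")
```

Three repetitions yield three pairwise comparisons per question; averaging them gives a single stability score that can flag prompts where a model's answers drift between repetitions.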

These findings underscore the unique strengths of each model. GPT-4 and GPT-4o stand out for their detailed and precise responses, making them suitable for scenarios requiring high accuracy. Gemini 1.0 Ultra is effective for generating coherent extended narratives, particularly for complex and nuanced topics like renal biopsy interpretation. PaLM 2, on the other hand, shows strength in keyword-focused and concise responses, making it effective for straightforward, targeted queries. This analysis highlights the importance of selecting AI models based on specific clinical demands to ensure optimal application and effectiveness in patient communication and care.

It is notable that all references provided by ChatGPT were verified as existing, contrasting with previous studies that cautioned against potential fake references generated by ChatGPT.29–32 Our study suggests improved reliability in GPT-4's references, which may be attributable to our IT engineer's effective prompting techniques. Future investigations could delve deeper into the relevance and accuracy of the references in relation to the provided information.

We observed that some responses generated by AI, which appeared relevant to Kidney-Related Queries, were mistakenly presented in the context of Renal Biopsy Reports. This phenomenon highlights the importance of having experienced clinicians supervise AI-generated content, as these responses can be a mix of fact and fiction, delivered with an air of authority. Additionally, while PaLM 2 scored relatively low across the categories of Appropriateness, Helpfulness, and Consistency as evaluated by experienced nephrologists, its empathy scores were notably higher, particularly in Q4 (91.67%) and Q15 (87.50%). However, PaLM 2's performance in terms of Appropriateness, Helpfulness, and Consistency for Q4 and Q8 did not align with its high empathy scores, pointing to a peculiar phenomenon: responses that are less accurate or inconsistent may still come across as empathetic and persuasive. This suggests that PaLM 2's inconsistency could reflect a manifestation of generative AI hallucinations, providing varied and sometimes incorrect answers to the same question. Such findings underscore the need for careful scrutiny of AI outputs, especially when addressing complex or sensitive clinical topics. Challenges like hallucinations, sycophancy, contextual misunderstandings, and limitations with complex numerical data reinforce the importance of human oversight and the integration of specialized systems. 33 Developing in-house or domain-specific models and aligning AI outputs with clinical guidelines can improve reliability. Future efforts should focus on enhancing contextual understanding and precision to optimize AI's role in clinical care.

We envisioned integrating generative AI into clinical nephrology practice as a supportive tool rather than a direct decision-making system. At the point of care, AI could assist in streamlining pre-consultation procedures by gathering patient histories, summarizing key concerns, and structuring responses for clinicians to review. This would enhance workflow efficiency by reducing administrative burdens and allowing physicians to focus on critical clinical decision-making. However, generative AI would not replace established clinical guidelines or protocols but rather complement existing procedures by improving patient engagement and comprehension.

Because we use generative AI as a supportive tool in clinical nephrology, its application is strictly limited to providing general knowledge to patients with kidney disease, covering aspects such as dietary habits, important precautions, caregiving methods, and explanations of kidney biopsy data. These topics are widely recognized and standardized within Taiwan's nephrology practice, making them non-sensitive and generalizable and ensuring that no confidential patient information is involved in AI interactions.

We acknowledge concerns that data input into LLMs could be used beyond regulatory boundaries for model training. However, our study design minimizes this risk by restricting AI applications to non-personalized patient education rather than clinical decision-making. When explaining kidney biopsy data, the AI is only used to describe general pathological findings, such as common classifications of glomerular diseases, their clinical significance, and basic interpretations that align with nephrology textbooks and established guidelines. The AI does not process individual patient biopsy reports, ensuring that no sensitive medical records are exposed or stored externally.

Furthermore, compliance with hospital privacy policies and Taiwan's national healthcare regulations ensures that all AI interactions remain within ethical and legal boundaries. If future AI implementation expands to include patient-specific discussions, we will explore the use of on-premise AI models or privacy-enhanced AI solutions that prevent external data exposure. By structuring our study to focus strictly on general nephrology education, including kidney biopsy explanations, we effectively eliminate privacy risks while maximizing AI's benefits in patient communication and understanding.

If the study results are promising, we aim to implement AI in outpatient settings (general nephrology and dialysis units), where it could provide educational materials tailored to patients’ conditions, facilitate treatment adherence, and help triage common concerns before physician visits. Future integration would require institutional validation, regulatory oversight, and alignment with clinical nephrology protocols to ensure safety, accuracy, and ethical application in patient care.

Several limitations of this study should be acknowledged. First, the scoring was subjective, conducted by 16 experienced nephrologists and nephrology researchers from institutions across Taiwan, ranging from dialysis clinics to medical centers, who received their fellowship training at diverse institutions. While subjective evaluation aligns with the study's aim of assessing the suitability of generative AI for nephrology clinical settings, it reflects the perspectives of nephrologists rather than patients. This preliminary study excluded patient input because responses that seem satisfactory to patients may not necessarily be accurate or clinically appropriate. The assessment of human-like empathy is particularly subjective: although scoring followed a structured and standardized framework, the interpretation of empathy may still vary among evaluators, and the absence of objective metrics, such as linguistic or affective analysis, limits the precision with which empathy can be quantified. Future research should incorporate objective measures alongside expert evaluations to enhance the reliability and validity of empathy assessments of AI-generated responses. Second, this study is an initial evaluation and has not been tested in real-world clinical environments; it therefore does not address potential challenges that may arise when generative AI is applied in actual clinical settings. The lack of a comprehensive world model in current AI systems further limits contextual appropriateness, emphasizing the need for ongoing improvements in contextual understanding. Third, the study's assumption that general-purpose LLMs must be used warrants reconsideration. Alternatives, such as developing in-house models tailored to specific medical domains or enhancing existing LLMs with domain-specific training, may yield more reliable and customized solutions.

Fourth, the study used a limited number of renal biopsy reports, focusing primarily on commonly encountered cases. While this approach was suitable for a preliminary evaluation of generative AI performance, it may not fully capture the variability and complexity of real-world renal biopsy scenarios, and the lack of standardized analysis across a broader dataset of renal biopsy reports limits the generalizability of the findings. Future research should address these challenges by incorporating real-world testing, patient perspectives, more robust methodologies, larger sample sizes, and more dialysis-related questions (e.g. "What do I need to pay attention to after dialysis?") to optimize AI's role in nephrology clinical care. Regulatory compliance is another key factor: AI use must align with hospital policies, national regulations, and international standards (GDPR, HIPAA, WHO). Taiwan supports AI-driven medical advancements, allowing multicenter trials while enforcing data protection laws. Fifth, the questionnaire used has not been formally validated. As a pilot investigation, this study primarily aimed to explore preliminary findings rather than establish definitive conclusions; although the questionnaire was developed from relevant literature and expert consensus to ensure content validity, it has not undergone formal psychometric validation, such as reliability and validity testing. Future studies should conduct a structured validation process, including pilot testing with a larger sample, to ensure the robustness and generalizability of the findings.
Finally, to address privacy concerns, we limited AI applications to general nephrology education, including dietary guidance, caregiving methods, and kidney biopsy explanations based on established references, avoiding patient-specific data. By focusing on non-sensitive educational content, we ensure compliance and effectiveness in enhancing communication in nephrology practice.

Conclusion

This study highlights both the potential and the limitations of generative AI models in nephrology clinical settings. While GPT-4 and GPT-4o demonstrated strengths in accuracy and relevance, and PaLM 2 showed notable empathy, inconsistencies and hallucinations underline the need for expert supervision. Subjective evaluations by experienced nephrologists provided valuable insights, but real-world testing and patient perspectives remain crucial for further validation. Generative AI can enhance patient education, communication, and workflow efficiency, but it must remain a supportive tool rather than a decision-maker. By assisting in gathering patient histories and summarizing key concerns, AI can streamline outpatient and dialysis care, allowing physicians to focus on complex cases while ensuring accuracy, safety, and ethical compliance.

LLMs are evolving rapidly, and integrating these constantly changing models into clinical practice presents a significant challenge. This study evaluates the capabilities and performance of LLMs as a supportive bridge between patients and doctors rather than as a decision-making tool. Because the focus is on facilitating communication and patient education rather than clinical judgment, the findings remain relevant despite ongoing technological advancement. If the current LLMs in this study prove effective in this supportive role, future iterations, likely to be even more advanced and refined, should be better suited to it still.

Supplemental Material

sj-docx-1-dhj-10.1177_20552076251342067 - Supplemental material for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Supplemental material, sj-docx-1-dhj-10.1177_20552076251342067 for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation by Shih-Yi Lin, Chang-Cheng Jiang, Kin-Man Law, Pei-Chun Yeh, Min-Kuang Tsai, Chu-Lin Chou, I-Kuan Wang, I-Wen Ting, Yu-Wei Chen, Che-Yi Chou, Ming-Han Hsieh, Heng-Chih Pan, Sung-Lin Hsieh, Chien-Hua Chiu, Pei-Wen Lee, Yu-Cyuan Hong, Ying-Yu Hsu, Huey-Liang Kuo, Shu-Woei Ju and Chia-Hung Kao in DIGITAL HEALTH

sj-docx-2-dhj-10.1177_20552076251342067 - Supplemental material for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Supplemental material, sj-docx-2-dhj-10.1177_20552076251342067 for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation by Shih-Yi Lin, Chang-Cheng Jiang, Kin-Man Law, Pei-Chun Yeh, Min-Kuang Tsai, Chu-Lin Chou, I-Kuan Wang, I-Wen Ting, Yu-Wei Chen, Che-Yi Chou, Ming-Han Hsieh, Heng-Chih Pan, Sung-Lin Hsieh, Chien-Hua Chiu, Pei-Wen Lee, Yu-Cyuan Hong, Ying-Yu Hsu, Huey-Liang Kuo, Shu-Woei Ju and Chia-Hung Kao in DIGITAL HEALTH

sj-docx-3-dhj-10.1177_20552076251342067 - Supplemental material for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Supplemental material, sj-docx-3-dhj-10.1177_20552076251342067 for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation by Shih-Yi Lin, Chang-Cheng Jiang, Kin-Man Law, Pei-Chun Yeh, Min-Kuang Tsai, Chu-Lin Chou, I-Kuan Wang, I-Wen Ting, Yu-Wei Chen, Che-Yi Chou, Ming-Han Hsieh, Heng-Chih Pan, Sung-Lin Hsieh, Chien-Hua Chiu, Pei-Wen Lee, Yu-Cyuan Hong, Ying-Yu Hsu, Huey-Liang Kuo, Shu-Woei Ju and Chia-Hung Kao in DIGITAL HEALTH

sj-docx-4-dhj-10.1177_20552076251342067 - Supplemental material for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Supplemental material, sj-docx-4-dhj-10.1177_20552076251342067 for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation by Shih-Yi Lin, Chang-Cheng Jiang, Kin-Man Law, Pei-Chun Yeh, Min-Kuang Tsai, Chu-Lin Chou, I-Kuan Wang, I-Wen Ting, Yu-Wei Chen, Che-Yi Chou, Ming-Han Hsieh, Heng-Chih Pan, Sung-Lin Hsieh, Chien-Hua Chiu, Pei-Wen Lee, Yu-Cyuan Hong, Ying-Yu Hsu, Huey-Liang Kuo, Shu-Woei Ju and Chia-Hung Kao in DIGITAL HEALTH

sj-docx-5-dhj-10.1177_20552076251342067 - Supplemental material for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation

Supplemental material, sj-docx-5-dhj-10.1177_20552076251342067 for Evaluation of generative AI assistance in clinical nephrology: Assessing GPT-4, GPT-4o, Gemini 1.0 Ultra, and PaLM 2 in patient interaction and renal biopsy interpretation by Shih-Yi Lin, Chang-Cheng Jiang, Kin-Man Law, Pei-Chun Yeh, Min-Kuang Tsai, Chu-Lin Chou, I-Kuan Wang, I-Wen Ting, Yu-Wei Chen, Che-Yi Chou, Ming-Han Hsieh, Heng-Chih Pan, Sung-Lin Hsieh, Chien-Hua Chiu, Pei-Wen Lee, Yu-Cyuan Hong, Ying-Yu Hsu, Huey-Liang Kuo, Shu-Woei Ju and Chia-Hung Kao in DIGITAL HEALTH

Author contributions: Conceptualization, methodology, project administration, and supervision were done by S.-Y. L. and C.-H. K. Data curation and investigation were done by S.-Y. L.; C.-C. J.; K.-M. L.; P.-C. Y.; H.-L. K.; S.-W. J. Formal analysis was accomplished by K.-M. L.; P.-C. Y. Software was provided by K.-M. L.; P.-C. Y. Writing—original draft, reviewing, and editing were done by all authors.

Funding: The authors received no financial support for the research, authorship, and/or publication of this article.

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

The datasets generated and analyzed during this study, including the responses from AI platforms (GPT-4, GPT-4o, Gemini 1.0 Ultra, PaLM 2) to nephrology-related questions and renal biopsy report interpretations, are not publicly available due to ethical and privacy considerations. These considerations pertain to the protection of proprietary information inherent in the AI platforms' responses and the sensitivity of medical data. This study did not involve human or animal subjects, and thus no ethical approval was required. The study protocol adhered to the guidelines established by the journal and was also designed to safeguard the confidentiality and integrity of the proprietary AI methodologies and to ensure the privacy of the hypothetical patient data embedded in the biopsy reports.

For researchers interested in accessing the data within the constraints of these ethical considerations, requests can be made to the corresponding author. Upon request, and where feasible, the research team is prepared to provide summaries of the AI responses and statistical analyses performed.

For further inquiries or requests for data access, please contact the corresponding author.

Presentation: Will be presented as a poster at the Internal Medicine and Nephrology Annual Meeting.

Supplemental material: Supplemental material for this article is available online.

References

  • 1. DataReportal. Digital 2023: Taiwan. https://datareportal.com/reports/digital-2023-taiwan.
  • 2.Chi C, Lee JL, Schoon R. Assessing health information technology in a national health care system—an example from Taiwan. Adv Health Care Manag 2012; 12: 75–109. [DOI] [PubMed] [Google Scholar]
  • 3.Liu P, Yeh LL, Wang JY, et al. Relationship between levels of digital health literacy based on the Taiwan digital health literacy assessment and accurate assessment of online health information: cross-sectional questionnaire study. J Med Internet Res 2020; 22: e19767. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4.Hsu YC, Chen TJ, Chu FY, et al. Official websites of local health centers in Taiwan: a nationwide study. Int J Environ Res Public Health 2019; 16: 399. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Chiang HY, Liang LY, Lin CC, et al. Electronic medical record-based deep data cleaning and phenotyping improve the diagnostic validity and mortality assessment of infective endocarditis: medical big data initiative of CMUH. Biomedicine (Taipei) 2021; 11: 59–67. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Hsu LL. An exploratory study of Taiwanese consumers’ experiences of using health-related websites. J Nurs Res 2005; 13: 129–140. [DOI] [PubMed] [Google Scholar]
  • 7.Desai D. Website personalization: strategy for user experience design & development. Turk J Comput Math Educ (TURCOMAT) 2021; 12: 3516–3523. [Google Scholar]
  • 8.Li C, Wang J, Zhang Y, et al. The good, the bad, and why: unveiling emotions in generative AI. arXiv preprint arXiv:2312.11111 (2023).
  • 9.Epstein Z, Hertzmann A, Investigators of Human Creativity , et al. Art and the science of generative AI. Science 2023; 380: 1110–1111. [DOI] [PubMed] [Google Scholar]
  • 10.Jo A. The promise and peril of generative AI. Nature 2023; 614: 214–216. [PubMed] [Google Scholar]
  • 11.Miao J, Thongprayoon C, Garcia Valencia OA, et al. Performance of ChatGPT on nephrology test questions. Clin J Am Soc Nephrol 2024; 19: 35–43. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Unger Z, Soffer S, Efros O, et al. Clinical applications and limitations of large language models in nephrology: a systematic review. medRxiv 2024; 2024.10.30.24316199. doi: 10.1101/2024.10.30.24316199. [DOI]
  • 13.Miao J, Thongprayoon C, Suppadungsuk S, et al. Integrating retrieval-augmented generation with large language models in nephrology: advancing practical applications. Medicina (Kaunas) 2024; 60: 445. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Pham JH, Thongprayoon C, Miao J, et al. Large language model triaging of simulated nephrology patient inbox messages. Front Artif Intell 2024; 7: 1452469. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15.Wu S, Koo M, Blum L, et al. A comparative study of open-source large language models, GPT-4 and Claude 2: multiple-choice test taking in nephrology. arXiv:2308.04709 (2023).
  • 16.Qaiser S, Ali R. Text mining: use of TF-IDF to examine the relevance of words to documents. Int J Comput Appl 2018; 181: 25–29. [Google Scholar]
  • 17.Lin CY. ROUGE: a package for automatic evaluation of summaries. In: Text summarization branches out. Barcelona: Association for Computational Linguistics, 2004, pp. 74–81.
  • 18.Rogasch JMM, Metzger G, Preisler M, et al. ChatGPT: can you prepare my patients for [18F]FDG PET/CT and explain my reports? J Nucl Med 2023; 64: 1876–1879. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Baxter SL, Longhurst CA, Millen M, et al. Generative artificial intelligence responses to patient messages in the electronic health record: early lessons learned. JAMIA Open 2024; 7: ooae028. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Sorin V, Brin D, Barash Y, et al. Large language models (LLMs) and empathy—a systematic review. medRxiv 2023; 2023.08.07.23293769.
  • 21.Patel S, Pelletier-Bui A, Smith S, et al. Curricula for empathy and compassion training in medical education: a systematic review. PLoS One 2019; 14: e0221412. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Elyoseph Z, Hadar-Shoval D, Asraf K, et al. ChatGPT outperforms humans in emotional awareness evaluations. Front Psychol 2023; 14: 1199058. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Pichai S. An important next step on our AI journey. https://blog.google/technology/ai/bard-google-ai-search-updates/.
  • 24.Introducing ChatGPT. https://openai.com/blog/chatgpt.
  • 25.Salazar GZ, Zúñiga D, Vindel CL, et al. Efficacy of AI chats to determine an emergency: a comparison between OpenAI’s ChatGPT, Google Bard, and Microsoft Bing AI chat. Cureus 2023; 15: e45473. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Shaier S, Sanz-Guerrero M, von der Wense K. Asking again and again: exploring LLM robustness to repeated questions. arXiv:2412.07923 (2024).
  • 27.Funk PF, Hoch CC, Knoedler S, et al. ChatGPT’s response consistency: a study on repeated queries of medical examination questions. Eur J Investig Health Psychol Educ 2024; 14: 657–668. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Yun HS, Arjmand M, Sherlock P, et al. Keeping users engaged during repeated administration of the same questionnaire: using large language models to reliably diversify questions. arXiv:2311.12707 (2023).
  • 29.Sanchez-Ramos L, Lin L, Romero R. Beware of references when using ChatGPT as a source of information to write scientific articles. Am J Obstet Gynecol 2023; 229: 356–357. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 30.Gravel J, D’Amours-Gravel M, Osmanlliu E. Learning to fake it: limited responses and fabricated references provided by ChatGPT for medical questions. Mayo Clin Proc: Digit Health 2023; 1: 226–234. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31.Sallam M. The utility of ChatGPT as an example of large language models in healthcare education, research and practice: systematic review on the future perspectives and potential limitations. Healthcare 2023; 11: 887. [PubMed] [Google Scholar]
  • 32.Choudhary OP, Saini J, Challana A. ChatGPT for veterinary anatomy education: an overview of the prospects and drawbacks. Int J Morphol 2023; 41: 1198–1202. [Google Scholar]
  • 33.Deng C, Duan Y, Jin X, et al. Deconstructing the ethics of large language models from long-standing issues to new-emerging dilemmas. arXiv:2406.05392 (2024).



Articles from Digital Health are provided here courtesy of SAGE Publications
