
Table 2.

Comparing prompt A and prompt B: evaluating the quality and understandability of patient education materials generated by large language models

LLM                       Prompt      Median (range)    N     Mean rank    Sum of ranks    p value*

Quality (DISCERN)
  Gemini                  Prompt A    43 (43–43)        20    16.52        330             0.021
                          Prompt B    49 (37–57)        20    24.51        490
  ChatGPT-3.5             Prompt A    52.72 (51–53)     20    24.18        483.52          0.013
                          Prompt B    51 (49–53)        20    16.83        336.54
  ChatGPT-4o (o1 Preview) Prompt A    52.81 (49–53)     20    25.38        507.53          0.001
                          Prompt B    50 (41–53)        20    15.63        312.52

Understandability (PEMAT %)
  Gemini                  Prompt A    75 (75–83.3)      20    20.53        410             1.00
                          Prompt B    75 (75–83.3)      20    20.52        410
  ChatGPT-3.5             Prompt A    75 (75–83.3)      20    19           380             0.389
                          Prompt B    83.31 (75–83.3)   20    22           440
  ChatGPT-4o (o1 Preview) Prompt A    75 (75–83.3)      20    20           400             0.771
                          Prompt B    83.32 (75–83.3)   20    21           420

LLM large language model, PEMAT Patient Education Materials Assessment Tool

*Mann–Whitney U test conducted between prompt A and prompt B (significance at p < 0.05)
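
For reference, the minimal sketch below shows how a comparison of this kind could be run in Python with scipy.stats.mannwhitneyu (two-sided Mann–Whitney U test, significance at p < 0.05). The DISCERN score lists are illustrative placeholders with n = 20 per prompt, not the study's raw ratings.

# Minimal sketch: Mann–Whitney U test comparing DISCERN scores for
# prompt A vs. prompt B of a single LLM. The score lists are placeholder
# values (n = 20 per prompt), not the data behind Table 2.
from scipy.stats import mannwhitneyu

prompt_a_scores = [43] * 20  # placeholder: all ratings identical, median 43
prompt_b_scores = [37, 41, 44, 46, 47, 48, 48, 49, 49, 49,
                   50, 50, 51, 52, 53, 54, 55, 56, 56, 57]  # placeholder ratings

# Two-sided test; significance threshold p < 0.05 as in the table footnote
u_stat, p_value = mannwhitneyu(prompt_a_scores, prompt_b_scores,
                               alternative="two-sided")
print(f"U = {u_stat:.1f}, p = {p_value:.4f}")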