Table 2. Comparing prompt A and prompt B: quality and understandability of patient education materials generated by large language models
| LLM | Prompt | Median (range) | N | Mean rank | Sum of ranks | p value* |
|---|---|---|---|---|---|---|
| Quality (DISCERN) | | | | | | |
| Gemini | Prompt A | 43 (43–43) | 20 | 16.5 | 330 | 0.021 |
| | Prompt B | 49 (37–57) | 20 | 24.5 | 490 | |
| ChatGPT-3.5 | Prompt A | 52.72 (51–53) | 20 | 24.18 | 483.52 | 0.013 |
| | Prompt B | 51 (49–53) | 20 | 16.83 | 336.54 | |
| ChatGPT-4o (o1 Preview) | Prompt A | 52.81 (49–53) | 20 | 25.38 | 507.53 | 0.001 |
| | Prompt B | 50 (41–53) | 20 | 15.63 | 312.52 | |
| Understandability (PEMAT, %) | | | | | | |
| Gemini | Prompt A | 75 (75–83.3) | 20 | 20.5 | 410 | 1.00 |
| | Prompt B | 75 (75–83.3) | 20 | 20.5 | 410 | |
| ChatGPT-3.5 | Prompt A | 75 (75–83.3) | 20 | 19 | 380 | 0.389 |
| | Prompt B | 83.3 (75–83.3) | 20 | 22 | 440 | |
| ChatGPT-4o (o1 Preview) | Prompt A | 75 (75–83.3) | 20 | 20 | 400 | 0.771 |
| | Prompt B | 83.3 (75–83.3) | 20 | 21 | 420 | |
LLM, large language model; PEMAT, Patient Education Materials Assessment Tool
*Mann–Whitney U test comparing prompt A with prompt B (significance at p < 0.05)
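
For readers who want to reproduce this style of analysis, the minimal sketch below shows how the table's columns relate: a two-sided Mann–Whitney U test between two groups of 20 scores, plus the pooled mean and sum of ranks (which, for 40 observations, must total 40 × 41 / 2 = 820). The scores in the sketch are illustrative placeholders, not the study's raw data.

```python
# Minimal sketch of the between-prompt comparison in Table 2.
# The DISCERN totals below are illustrative placeholders, NOT the study's data.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(0)
prompt_a = rng.integers(43, 54, size=20)  # hypothetical DISCERN totals, prompt A
prompt_b = rng.integers(37, 58, size=20)  # hypothetical DISCERN totals, prompt B

# Two-sided Mann-Whitney U test, as named in the table's footnote.
u_stat, p_value = mannwhitneyu(prompt_a, prompt_b, alternative="two-sided")

# "Mean rank" / "Sum of ranks" columns: rank the pooled 40 scores (ties get
# average ranks), then split the ranks back into the two prompt groups.
ranks = rankdata(np.concatenate([prompt_a, prompt_b]))
sum_a, sum_b = ranks[:20].sum(), ranks[20:].sum()

print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
print(f"Mean rank A = {sum_a / 20:.2f}, B = {sum_b / 20:.2f}")
print(f"Sum of ranks: {sum_a:.0f} + {sum_b:.0f} = {sum_a + sum_b:.0f}")  # always 820
```

The rank-sum identity in the last line is what makes the table internally checkable: each group's mean rank times N must equal its sum of ranks, and the two sums must add to 820.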