Table 2. Comparing prompt A and prompt B: quality and understandability of patient education materials generated by large language models
| LLM | Prompt | Median (range) | N | Mean rank | Sum of ranks | p value* |
|---|---|---|---|---|---|---|
| Quality (DISCERN) | | | | | | |
| Gemini | Prompt A | 43 (43–43) | 20 | 16.5 | 330 | 0.021 |
| | Prompt B | 49 (37–57) | 20 | 24.5 | 490 | |
| ChatGPT-3.5 | Prompt A | 52.72 (51–53) | 20 | 24.18 | 483.52 | 0.013 |
| | Prompt B | 51 (49–53) | 20 | 16.83 | 336.54 | |
| ChatGPT-4o (o1 Preview) | Prompt A | 52.81 (49–53) | 20 | 25.38 | 507.53 | 0.001 |
| | Prompt B | 50 (41–53) | 20 | 15.63 | 312.52 | |
| Understandability (PEMAT, %) | | | | | | |
| Gemini | Prompt A | 75 (75–83.3) | 20 | 20.5 | 410 | 1.00 |
| | Prompt B | 75 (75–83.3) | 20 | 20.5 | 410 | |
| ChatGPT-3.5 | Prompt A | 75 (75–83.3) | 20 | 19 | 380 | 0.389 |
| | Prompt B | 83.3 (75–83.3) | 20 | 22 | 440 | |
| ChatGPT-4o (o1 Preview) | Prompt A | 75 (75–83.3) | 20 | 20 | 400 | 0.771 |
| | Prompt B | 83.3 (75–83.3) | 20 | 21 | 420 | |
LLM, large language model; PEMAT, Patient Education Materials Assessment Tool
*Mann–Whitney U test comparing prompt A with prompt B (significance at p < 0.05)
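
For readers who want to reproduce this style of analysis, the minimal sketch below shows how the table's columns relate: a two-sided Mann–Whitney U test between two groups of 20 scores, plus the pooled mean and sum of ranks (which, for 40 observations, must total 40 × 41 / 2 = 820). The scores in the sketch are illustrative placeholders, not the study's raw data.

```python
# Minimal sketch of the between-prompt comparison in Table 2.
# The DISCERN totals below are illustrative placeholders, NOT the study's data.
import numpy as np
from scipy.stats import mannwhitneyu, rankdata

rng = np.random.default_rng(0)
prompt_a = rng.integers(43, 54, size=20)  # hypothetical DISCERN totals, prompt A
prompt_b = rng.integers(37, 58, size=20)  # hypothetical DISCERN totals, prompt B

# Two-sided Mann-Whitney U test, as named in the table's footnote.
u_stat, p_value = mannwhitneyu(prompt_a, prompt_b, alternative="two-sided")

# "Mean rank" / "Sum of ranks" columns: rank the pooled 40 scores (ties get
# average ranks), then split the ranks back into the two prompt groups.
ranks = rankdata(np.concatenate([prompt_a, prompt_b]))
sum_a, sum_b = ranks[:20].sum(), ranks[20:].sum()

print(f"U = {u_stat:.1f}, p = {p_value:.3f}")
print(f"Mean rank A = {sum_a / 20:.2f}, B = {sum_b / 20:.2f}")
print(f"Sum of ranks: {sum_a:.0f} + {sum_b:.0f} = {sum_a + sum_b:.0f}")  # always 820
```

The rank-sum identity in the last line is what makes the table internally checkable: each group's mean rank times N must equal its sum of ranks, and the two sums must add to 820.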