2025 Apr 21;14(6):1281–1295. doi: 10.1007/s40123-025-01142-x

Table 3.

Performance of LLMs based on readability metrics on prompt A vs prompt B

| Readability metric | ChatGPT-3.5: Prompt A | ChatGPT-3.5: Prompt B | ChatGPT-3.5: p value (A vs B) | ChatGPT-4o (01 preview): Prompt A | ChatGPT-4o (01 preview): Prompt B | ChatGPT-4o (01 preview): p value (A vs B) | Google Gemini: Prompt A | Google Gemini: Prompt B | Google Gemini: p value (A vs B) |
|---|---|---|---|---|---|---|---|---|---|
| Syllables | 789.7 (137.5) | 562.4 (76.9) | < 0.001 | 861.9 (145.8) | 534.4 (80) | < 0.001 | 558.9 (109.2) | 407.5 (107.8) | < 0.001 |
| Words | 500.2 (96.9) | 381.6 (51.2) | < 0.001 | 564.5 (91.8) | 370.5 (51.6) | < 0.001 | 326.6 (60.2) | 272.2 (60.7) | < 0.001 |
| 3+ syllable words | 37.3 (6.3) | 18.4 (4.1) | < 0.001 | 36.7 (10.8) | 15.35 (5) | < 0.001 | 35.2 (10.9) | 17.6 (7.3) | < 0.001 |
| Sentences | 27.5 (5.6) | 27.3 (3.7) | < 0.001 | 31.7 (4.9) | 24.5 (4.3) | < 0.001 | 19.7 (4.2) | 17.6 (4.9) | < 0.001 |
| SMOG Readability Score | 8.03 (0.6) | 5.9 (0.4) | < 0.001 | 7.7 (0.7) | 5.8 (0.7) | < 0.001 | 9.07 (0.62) | 6.7 (0.8) | < 0.001 |
| Flesch-Kincaid Grade Level | 7.7 (0.5) | 6.2 (0.3) | < 0.001 | 7.5 (0.58) | 6 (0.6) | < 0.001 | 8.5 (0.7) | 7.0 (0.6) | < 0.001 |

LLM large language model, SMOG Simple Measure of Gobbledygook
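
For readers unfamiliar with the two grade-level scores in Table 3, the sketch below shows how they are conventionally derived from the same raw counts the table reports (words, sentences, syllables, and words of three or more syllables). It uses the standard published formulas for the Flesch-Kincaid Grade Level and SMOG; the function names are illustrative, and this is not necessarily the tool or exact procedure used in the study. Because the study's scores were presumably computed per response and then summarized, applying these formulas to the summary counts in the table should not be expected to reproduce the reported scores.

```python
import math


def flesch_kincaid_grade(words: float, sentences: float, syllables: float) -> float:
    """Flesch-Kincaid Grade Level from raw text counts (standard formula)."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def smog_grade(polysyllables: float, sentences: float) -> float:
    """SMOG grade from the count of 3+ syllable words (standard formula).

    The polysyllable count is normalized to a 30-sentence sample.
    """
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


if __name__ == "__main__":
    # Hypothetical counts for a single response (not taken from the study).
    words, sentences, syllables, polysyllables = 100, 10, 150, 12
    print(round(flesch_kincaid_grade(words, sentences, syllables), 2))
    print(round(smog_grade(polysyllables, sentences), 2))
```

Both formulas reward shorter sentences and fewer long words, which is consistent with the direction of change between prompt A and prompt B in the table: lower syllable, word, and 3+ syllable word counts accompany lower grade-level scores.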