Table 3.
Performance of LLMs based on readability metrics on prompt A vs prompt B
| Readability metrics | ChatGPT-3.5, Prompt A | ChatGPT-3.5, Prompt B | ChatGPT-3.5, p value (A vs B) | ChatGPT-4o (01 preview), Prompt A | ChatGPT-4o (01 preview), Prompt B | ChatGPT-4o (01 preview), p value (A vs B) | Google Gemini, Prompt A | Google Gemini, Prompt B | Google Gemini, p value (A vs B) |
|---|---|---|---|---|---|---|---|---|---|
| Syllables | 789.7 (137.5) | 562.4 (76.9) | < 0.001 | 861.9 (145.8) | 534.4 (80) | < 0.001 | 558.9 (109.2) | 407.5 (107.8) | < 0.001 |
| Words | 500.2 (96.9) | 381.6 (51.2) | < 0.001 | 564.5 (91.8) | 370.5 (51.6) | < 0.001 | 326.6 (60.2) | 272.2 (60.7) | < 0.001 |
| 3+ syllable words | 37.3 (6.3) | 18.4 (4.1) | < 0.001 | 36.7 (10.8) | 15.35 (5) | < 0.001 | 35.2 (10.9) | 17.6 (7.3) | < 0.001 |
| Sentences | 27.5 (5.6) | 27.3 (3.7) | < 0.001 | 31.7 (4.9) | 24.5 (4.3) | < 0.001 | 19.7 (4.2) | 17.6 (4.9) | < 0.001 |
| SMOG Readability Score | 8.03 (0.6) | 5.9 (0.4) | < 0.001 | 7.7 (0.7) | 5.8 (0.7) | < 0.001 | 9.07 (0.62) | 6.7 (0.8) | < 0.001 |
| Flesch-Kincaid Grade Level | 7.7 (0.5) | 6.2 (0.3) | < 0.001 | 7.5 (0.58) | 6 (0.6) | < 0.001 | 8.5 (0.7) | 7.0 (0.6) | < 0.001 |
LLM: large language model; SMOG: Simple Measure of Gobbledygook
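
The table does not state which tool produced the SMOG and Flesch-Kincaid values. As a rough illustration of how the count rows (sentences, words, syllables, 3+ syllable words) relate to the two grade scores, the sketch below uses the standard published formulas. The function names are hypothetical, and the study's software may count syllables or sentences differently, so scores computed from the mean counts above need not match the reported mean scores.

```python
import math


def smog_grade(polysyllables: float, sentences: float) -> float:
    """Standard SMOG grade from the count of 3+ syllable words and sentences."""
    if sentences <= 0:
        raise ValueError("sentences must be positive")
    # Polysyllable count is normalized to a 30-sentence sample.
    return 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291


def flesch_kincaid_grade(words: float, sentences: float, syllables: float) -> float:
    """Standard Flesch-Kincaid Grade Level from word, sentence, and syllable counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("words and sentences must be positive")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


if __name__ == "__main__":
    # Hypothetical counts for a single response, not taken from the table.
    print(round(smog_grade(polysyllables=18, sentences=25), 2))
    print(round(flesch_kincaid_grade(words=380, sentences=25, syllables=560), 2))
```

Both formulas reward shorter sentences and fewer long words, which is consistent with the direction of the Prompt A to Prompt B changes in the table.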