BMJ Open. 2024 Mar 14;14(3):e076484. doi: 10.1136/bmjopen-2023-076484

Table 2.

Mean readability scores for ChatGPT- and GPT-3-derived responses (SD = standard deviation) and p values from paired t-tests comparing the mean difference. Statistical significance was defined as p<0.05.

Metric | Mean ChatGPT response (SD) | Mean GPT-3 response (SD) | P value (significance)
Flesch-Kincaid Grade Level | 8.77 (0.918) | 8.47 (0.982) | 0.4023 (NS)
Flesch Reading Ease | 58.2 (4.00) | 59.3 (6.98) | 0.4861 (NS)
SMOG Index | 11.6 (0.755) | 11.4 (1.01) | 0.5870 (NS)
General Public Reach (%) | 80.3 (4.20) | 81.2 (5.78) | 0.6218 (NS)
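For reference, the readability metrics reported in table 2 can be computed programmatically. The sketch below is illustrative only: it assumes the third-party Python textstat package, and the sample letter text is a placeholder rather than any text generated in the study.

```python
# Minimal sketch of computing the readability metrics in table 2.
# Assumes the third-party "textstat" package; the letter below is a
# placeholder, not study data.
import textstat

letter = (
    "Thank you for attending the clinic today. Your knee X-ray shows "
    "mild osteoarthritis. We recommend physiotherapy and simple pain "
    "relief. Please contact us if your symptoms get worse."
)

# Flesch-Kincaid Grade Level: US school grade needed to understand the text
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(letter))

# Flesch Reading Ease: 0-100 scale, higher scores indicate easier reading
print("Flesch Reading Ease:", textstat.flesch_reading_ease(letter))

# SMOG Index: estimated years of education needed to comprehend the text
print("SMOG Index:", textstat.smog_index(letter))
```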

ChatGPT had a mean accuracy of 8.7/10 (SD 0.60) according to independent ratings from three senior orthopaedic clinicians, compared with a mean accuracy of 7.3/10 (SD 1.41) for GPT-3-generated clinical letters. This difference was statistically significant (p=0.024) (figure 1A).
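The table 2 caption describes comparisons using a paired t-test, that is, the two models' outputs are scored on the same set of items and the per-item differences are tested. A minimal sketch of such a test using scipy.stats.ttest_rel is shown below; the rating lists are hypothetical placeholders, not the study's data, and the test shown is not necessarily the exact procedure used for the accuracy comparison above.

```python
# Minimal sketch of a paired (related-samples) t-test, as described for the
# readability comparison in table 2. Ratings below are illustrative only.
from scipy.stats import ttest_rel

# Hypothetical accuracy ratings (out of 10) for the same set of letters,
# one score per letter under each model
chatgpt_scores = [9, 8, 9, 9, 8, 9, 8, 9, 9, 8]
gpt3_scores = [7, 8, 6, 8, 7, 9, 6, 7, 8, 7]

# Paired test: each letter is rated under both conditions
t_stat, p_value = ttest_rel(chatgpt_scores, gpt3_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A p value below 0.05 would be reported as statistically significant,
# as with the accuracy difference (p=0.024) reported in the study.
```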

GPT, Generative Pre-trained Transformer; NS, not significant; SMOG, Simple Measure of Gobbledygook.