BMJ Open. 2024 Mar 14;14(3):e076484. doi: 10.1136/bmjopen-2023-076484

Table 2.

Mean readability scores for ChatGPT- and GPT-3-derived responses (SD = standard deviation) and p values from paired t-tests comparing the mean difference. Statistical significance was defined as p<0.05.

Metric | Mean ChatGPT response (SD) | Mean GPT-3 response (SD) | P value (significance)
Flesch-Kincaid Grade Level | 8.77 (0.918) | 8.47 (0.982) | 0.4023 (NS)
Flesch Reading Ease | 58.2 (4.00) | 59.3 (6.98) | 0.4861 (NS)
SMOG Index | 11.6 (0.755) | 11.4 (1.01) | 0.5870 (NS)
General Public Reach (%) | 80.3 (4.20) | 81.2 (5.78) | 0.6218 (NS)
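For reference, the readability metrics reported in table 2 can be computed programmatically. The sketch below is illustrative only: it assumes the third-party Python textstat package, and the sample letter text is a placeholder rather than any text generated in the study.

```python
# Minimal sketch of computing the readability metrics in table 2.
# Assumes the third-party "textstat" package; the letter below is a
# placeholder, not study data.
import textstat

letter = (
    "Thank you for attending the clinic today. Your knee X-ray shows "
    "mild osteoarthritis. We recommend physiotherapy and simple pain "
    "relief. Please contact us if your symptoms get worse."
)

# Flesch-Kincaid Grade Level: US school grade needed to understand the text
print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(letter))

# Flesch Reading Ease: 0-100 scale, higher scores indicate easier reading
print("Flesch Reading Ease:", textstat.flesch_reading_ease(letter))

# SMOG Index: estimated years of education needed to comprehend the text
print("SMOG Index:", textstat.smog_index(letter))
```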

ChatGPT had a mean accuracy of 8.7/10 (SD 0.60) according to independent ratings from three senior orthopaedic clinicians, compared with a mean accuracy of 7.3/10 (SD 1.41) for GPT-3-generated clinical letters. This difference was statistically significant (p=0.024) (figure 1A).
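The table 2 caption describes comparisons using a paired t-test, that is, the two models' outputs are scored on the same set of items and the per-item differences are tested. A minimal sketch of such a test using scipy.stats.ttest_rel is shown below; the rating lists are hypothetical placeholders, not the study's data, and the test shown is not necessarily the exact procedure used for the accuracy comparison above.

```python
# Minimal sketch of a paired (related-samples) t-test, as described for the
# readability comparison in table 2. Ratings below are illustrative only.
from scipy.stats import ttest_rel

# Hypothetical accuracy ratings (out of 10) for the same set of letters,
# one score per letter under each model
chatgpt_scores = [9, 8, 9, 9, 8, 9, 8, 9, 9, 8]
gpt3_scores = [7, 8, 6, 8, 7, 9, 6, 7, 8, 7]

# Paired test: each letter is rated under both conditions
t_stat, p_value = ttest_rel(chatgpt_scores, gpt3_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")

# A p value below 0.05 would be reported as statistically significant,
# as with the accuracy difference (p=0.024) reported in the study.
```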

GPT, Generative Pre-trained Transformer; NS, not significant; SMOG, Simple Measure of Gobbledygook.