Table 2.
Mean readability scores for ChatGPT- and GPT-3-derived responses (SD = standard deviation), with p values from paired t-tests comparing the mean differences. Statistical significance was defined as p<0.05.
| Metric | Mean ChatGPT response (SD) | Mean GPT-3 response (SD) | P value (significance) |
|---|---|---|---|
| Flesch-Kincaid Grade Level | 8.77 (0.918) | 8.47 (0.982) | 0.4023 (NS) |
| Flesch Reading Ease | 58.2 (4.00) | 59.3 (6.98) | 0.4861 (NS) |
| SMOG Index | 11.6 (0.755) | 11.4 (1.01) | 0.5870 (NS) |
| General Public Reach (%) | 80.3 (4.20) | 81.2 (5.78) | 0.6218 (NS) |
GPT, Generative Pre-trained Transformer; NS, not significant; SMOG, Simple Measure of Gobbledygook.

ChatGPT had a mean accuracy of 8.7/10 (SD 0.60) according to independent ratings from three senior orthopaedic clinicians, compared with a mean accuracy of 7.3/10 (SD 1.41) for GPT-3-generated clinical letters. This difference was statistically significant (p=0.024) (figure 1A).
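For reference, two of the readability metrics reported in Table 2 follow standard published formulas based on word, sentence, and syllable counts. The sketch below illustrates those formulas; it is an assumption that the study's counts were obtained with an automated text-analysis tool, so the counts here are passed in directly rather than extracted from text.

```python
# Sketch of the standard Flesch formulas underlying two metrics in Table 2.
# Counts (words, sentences, syllables) are supplied directly; in practice
# they would come from a text-analysis tool.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    # Flesch Reading Ease: higher scores indicate easier text
    # (scores around 60-70 correspond to "plain English").
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    # Flesch-Kincaid Grade Level: approximate US school grade
    # required to understand the text.
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Example: a 100-word passage with 5 sentences and 150 syllables.
fre = flesch_reading_ease(100, 5, 150)    # about 59.6, comparable to Table 2
fkgl = flesch_kincaid_grade(100, 5, 150)  # about 9.9
```

Note that the scores in Table 2 (Flesch Reading Ease near 58-59, grade level near 8.5-8.8) correspond to text readable by a broad adult audience on these scales.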