Table 3.
Readability and interrater reliability scores.
| | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 |
|---|---|---|---|---|
| Average Flesch–Kincaid score | 13.2 ± 2.2 | 8.1 ± 1.9 | 15.4 ± 2.8 | 17.3 ± 2.3 |
| Clinical accuracy^a | 0.141 (0.002 to 0.279) | 0.185 (0.046 to 0.323) | 0.141 (0.002 to 0.279) | 0.176 (0.038 to 0.315) |
| Relevance^a | — | −0.020 (−0.159 to 0.118) | N/A^b | N/A^b |
| Percent agreement (CA) | 97% | 92% | 97% | 85% |
| Percent agreement (relevance) | — | 98% | 100% | 100% |
| Evaluator 1 grading (CA) | 100% | 95% | 95% | 100% |
| Evaluator 2 grading (CA) | 95% | 95% | 100% | 100% |
| Evaluator 3 grading (CA) | 95% | 90% | 95% | 85% |
| Evaluator 4 grading (CA) | 100% | 85% | 95% | 65% |
| Evaluator 5 grading (CA) | 95% | 95% | 100% | 75% |
^a Fleiss kappa (95% confidence interval).
^b All ratings were the same, so no Fleiss kappa could be calculated. CA = clinical accuracy.
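For readers who want to reproduce this kind of interrater analysis, the sketch below shows one way the Fleiss kappa, a 95% confidence interval, and pairwise percent agreement could be computed. It is illustrative only, not the study's code: the `ratings` matrix is hypothetical, the statsmodels implementation of Fleiss kappa is assumed, and the bootstrap over responses is one common way to obtain the interval (the table does not state how its intervals were derived).

```python
# Illustrative sketch only (not the study's actual code): Fleiss kappa,
# a bootstrap 95% CI, and average pairwise percent agreement for one
# prompt's clinical-accuracy ratings. The `ratings` matrix is hypothetical.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

rng = np.random.default_rng(0)

# Rows = graded responses, columns = the 5 evaluators; 1 = accurate, 0 = not.
ratings = np.array([
    [1, 1, 1, 1, 1],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0],
    [1, 1, 1, 0, 1],
    [1, 1, 1, 1, 1],
])

def kappa_of(r):
    # Collapse rater labels into a subjects x categories count table.
    table, _ = aggregate_raters(r)
    with np.errstate(invalid="ignore"):  # degenerate resamples yield nan
        return fleiss_kappa(table, method="fleiss")

kappa = kappa_of(ratings)

# Bootstrap the 95% CI by resampling responses with replacement.
boot = [kappa_of(ratings[rng.integers(0, len(ratings), len(ratings))])
        for _ in range(2000)]
ci_low, ci_high = np.nanpercentile(boot, [2.5, 97.5])

# Percent agreement: mean proportion of agreeing evaluator pairs per response.
table, _ = aggregate_raters(ratings)
n_raters = ratings.shape[1]
pairs = n_raters * (n_raters - 1) / 2
percent_agreement = ((table * (table - 1)).sum(axis=1) / 2 / pairs).mean()

print(f"kappa = {kappa:.3f} (95% CI {ci_low:.3f} to {ci_high:.3f}); "
      f"agreement = {percent_agreement:.0%}")
```

Relevance ratings (and readability scores) could be handled the same way; where every evaluator gave the same rating on every response, kappa is undefined, which is why those cells are marked N/A above.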