Table 7.
Genant fracture grading in the lumbar spine on the test dataset: accuracy and sensitivity of the AI compared with human labels, estimated with the bootstrap method
| Degree of lumbar fracture | Accuracy: mean, % | Accuracy: 95% CI, % | Accuracy: p value | Sensitivity: mean, % | Sensitivity: 95% CI, % | Sensitivity: p value |
| Grade 1 | 84 | 83.55-84.23 | < 0.001 | 78 | 77.39-78.53 | < 0.001 |
| Grade 2 | 95 | 94.74-95.14 | < 0.001 | 99 | 98.41-100.0 | < 0.001 |
| Grade 3 | 94 | 93.50-93.97 | < 0.001 | 97 | 97.12-97.57 | < 0.001 |
A total of 141 fractured lumbar vertebrae were included in the test dataset: Grade 1 (n = 50), Grade 2 (n = 54), and Grade 3 (n = 37).
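The table reports bootstrap means and 95% confidence intervals but does not describe the resampling procedure. As an illustration only, the sketch below shows one common variant, the percentile bootstrap, applied to accuracy and sensitivity for a single grade treated as a binary (grade vs. not grade) problem. The function names, the number of resamples, the seed, and the toy labels are all assumptions made for this example; they are not the study's data or its exact method.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of cases where the prediction matches the human label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(y_true, y_pred):
    """TP / (TP + FN); returns None if the sample has no positive cases."""
    positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    if not positives:
        return None
    return sum(positives) / len(positives)


def bootstrap_ci(y_true, y_pred, metric, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample cases with replacement, recompute the
    metric each time, and read the CI off the sorted resampled values."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        value = metric([y_true[i] for i in idx], [y_pred[i] for i in idx])
        if value is not None:  # skip resamples where the metric is undefined
            values.append(value)
    values.sort()
    lower = values[int((alpha / 2) * len(values))]
    upper = values[min(len(values) - 1, int((1 - alpha / 2) * len(values)))]
    mean = sum(values) / len(values)
    return mean, lower, upper


# Toy labels (hypothetical, NOT the study data): 1 = vertebra has the grade
# of interest, 0 = any other grade. 50 positives and 91 negatives.
y_true = [1] * 50 + [0] * 91
# Hypothetical predictions: 40 of 50 positives and 80 of 91 negatives correct.
y_pred = [1] * 40 + [0] * 10 + [0] * 80 + [1] * 11

acc_mean, acc_lo, acc_hi = bootstrap_ci(y_true, y_pred, accuracy)
sens_mean, sens_lo, sens_hi = bootstrap_ci(y_true, y_pred, sensitivity)

print(f"accuracy:    {100 * acc_mean:.1f}% (95% CI {100 * acc_lo:.1f}-{100 * acc_hi:.1f})")
print(f"sensitivity: {100 * sens_mean:.1f}% (95% CI {100 * sens_lo:.1f}-{100 * sens_hi:.1f})")
```

Resampling whole cases keeps each prediction paired with its human label, so the interval reflects case-to-case variability. The width of the interval depends on the sample size and on which bootstrap variant is used, so the output of this sketch is not expected to match the table.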