Table 4. Performance of independent and AI-assisted clinicians on the internal, external, and prospective test sets.
Data set | Clinician | Accuracy, independent | Accuracy, AI-assisted | Accuracy, P value | Sensitivity, independent | Sensitivity, AI-assisted | Sensitivity, P value | Specificity, independent | Specificity, AI-assisted | Specificity, P value
---|---|---|---|---|---|---|---|---|---|---
Internal test set | Rad1 | 72.2 (60.4–82.1) | 81.9 (71.1–90.0) | 0.0003 | 66.7 (51.0–80.0) | 77.8 (62.9–88.8) | 0.0007 | 81.5 (62.0–93.7) | 88.9 (70.8–97.6) | 0.007
Internal test set | Rad2 | 75.0 (63.4–84.5) | 86.1 (75.9–93.1) | 0.0002 | 68.9 (53.4–81.8) | 82.2 (68.0–92.0) | 0.0006 | 85.2 (66.3–95.8) | 92.6 (75.7–99.1) | 0.042
Internal test set | Rad3 | 83.3 (70.0–90.1) | 88.9 (79.3–95.1) | 0.008 | 80.0 (65.4–90.4) | 86.7 (73.2–95.0) | 0.032 | 85.0 (62.1–96.8) | 92.6 (75.7–99.1) | 0.008
Internal test set | Rad4 | 88.9 (79.3–95.1) | 94.4 (86.4–98.5) | 0.033 | 86.7 (73.2–94.9) | 93.3 (81.7–98.6) | 0.047 | 92.6 (75.7–99.1) | 96.3 (81.0–99.9) | 0.17
External test set | Rad1 | 66.7 (52.5–78.9) | 74.1 (60.3–85.0) | 0.0004 | 68.0 (46.5–85.1) | 69.0 (49.2–84.7) | 0.008 | 65.5 (45.7–82.1) | 80.0 (59.3–93.2) | 0.0006
External test set | Rad2 | 70.3 (56.4–82.0) | 77.8 (64.4–88.0) | 0.0002 | 80.0 (59.3–93.2) | 84.0 (63.9–95.5) | 0.01 | 62.1 (42.3–79.3) | 72.4 (52.7–87.3) | 0.009
External test set | Rad3 | 77.8 (64.4–88.0) | 85.2 (72.9–93.4) | 0.008 | 84.0 (63.9–95.5) | 92.0 (74.0–99.0) | 0.023 | 72.4 (52.8–87.3) | 79.3 (60.3–92.0) | 0.027
External test set | Rad4 | 79.6 (66.5–89.4) | 90.7 (79.7–96.9) | 0.039 | 84.0 (63.9–95.5) | 96.0 (79.6–99.9) | 0.28 | 75.9 (56.5–89.7) | 86.2 (68.3–96.1) | 0.041
Prospective test set | Rad2 | 67.8 (56.9–77.4) | 79.3 (69.3–87.3) | 0.016 | 75.5 (61.7–86.2) | 81.1 (68.0–90.6) | 0.022 | 55.9 (37.9–72.8) | 76.5 (58.8–89.3) | 0.031
Prospective test set | Rad3 | 75.9 (65.5–84.4) | 82.8 (73.2–90.0) | 0.007 | 84.9 (72.4–93.3) | 84.9 (72.4–93.3) | 0.040 | 61.8 (43.6–77.8) | 79.4 (62.1–91.3) | 0.19
Prospective test set | Rheu1 | 76.2 (68.0–86.3) | 83.9 (74.5–91.0) | 0.006 | 77.2 (65.9–89.2) | 84.9 (72.4–93.3) | 0.006 | 76.5 (58.8–89.3) | 83.4 (65.5–93.2) | 0.005
Prospective test set | Rheu2 | 86.2 (77.1–92.7) | 88.5 (79.9–96.9) | 0.044 | 88.7 (77.0–95.7) | 90.6 (79.3–96.9) | 0.53 | 82.4 (65.5–93.2) | 85.3 (68.9–95.0) | 0.36
Accuracy, sensitivity, and specificity are expressed as percentages. Values in parentheses are 95% confidence intervals. P<0.05 was considered significant. Rad1 and Rad2 are junior radiologists; Rad3 and Rad4 are senior radiologists. Rheu1 and Rheu2 are a junior and a senior rheumatologist, respectively. AI, artificial intelligence.
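
For readers who want to see how percentages and intervals of this kind are typically derived, the following is a minimal sketch in Python. It assumes a Wilson score interval; the table does not state which interval method was used, so the bounds it produces need not match the reported ones. The function names `wilson_interval` and `diagnostic_metrics` and the counts in the example are hypothetical and are not taken from the study.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion, returned in percent.

    Assumption: the table does not say which interval method was used;
    Wilson is shown here only because it has a simple closed form.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return 100 * (centre - half), 100 * (centre + half)


def diagnostic_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, and specificity (percent), each with its interval."""
    positives = tp + fn   # cases that truly have the condition
    negatives = tn + fp   # cases that truly do not
    total = positives + negatives
    return {
        "accuracy": (100 * (tp + tn) / total, wilson_interval(tp + tn, total)),
        "sensitivity": (100 * tp / positives, wilson_interval(tp, positives)),
        "specificity": (100 * tn / negatives, wilson_interval(tn, negatives)),
    }


# Hypothetical confusion-matrix counts, for illustration only.
metrics = diagnostic_metrics(tp=30, fn=15, tn=22, fp=5)
for name, (value, (low, high)) in metrics.items():
    print(f"{name}: {value:.1f} ({low:.1f}-{high:.1f})")
```

Sensitivity and specificity are computed on the positive and negative cases separately, so their intervals rest on smaller denominators than the accuracy interval and are correspondingly wider.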