TABLE 4.
Model | AUROC (95% CI) | p-value | Sensitivity, % (95% CI) | Specificity, % (95% CI) | Accuracy, % (95% CI) | PPV, % (95% CI) | NPV, % (95% CI) |
Internal validation | |||||||
3D multi-task DL | 0.892 (0.860–0.924) | \ | 79.6 (71.8–92.2) | 86.7 (72.6–94.1) | 81.9 (77.1–87.4) | 93.3 (88.6–96.7) | 64.6 (57.5–80.2) |
3D single-task DL | 0.873 (0.839–0.906) | 0.39 | 79.6 (74.0–88.1) | 83.7 (73.3–89.6) | 80.8 (76.9–84.8) | 91.8 (88.2–94.7) | 63.4 (57.7–72.6) |
2D multi-task DL | 0.861 (0.823–0.900) | 0.22 | 81.4 (74.9–88.4) | 83.7 (74.8–90.4) | 81.9 (78.0–85.9) | 92.1 (88.9–94.9) | 65.5 (59.0–73.9) |
External testing 1 | |||||||
3D multi-task DL | 0.885 (0.855–0.915) | \ | 83.8 (74.4–93.8) | 81.5 (69.3–90.2) | 83.1 (79.1–86.6) | 88.2 (83.2–92.9) | 75.5 (67.3–87.5) |
3D single-task DL | 0.851 (0.818–0.884) | 0.13 | 81.5 (67.7–87.4) | 78.1 (70.7–89.8) | 80.0 (75.1–83.3) | 86.0 (82.6–91.6) | 71.6 (62.1–78.0) |
2D multi-task DL | 0.818 (0.781–0.855) | 0.006 | 83.5 (73.8–92.9) | 67.3 (54.6–76.6) | 77.3 (72.8–80.9) | 80.7 (76.8–84.9) | 71.0 (62.6–83.1) |
External testing 2 | |||||||
3D multi-task DL | 0.855 (0.811–0.899) | \ | 83.7 (68.9–91.9) | 76.5 (66.7–87.9) | 79.8 (74.9–84.6) | 78.3 (72.5–86.5) | 81.8 (72.9–89.5) |
3D single-task DL | 0.806 (0.755–0.858) | 0.16 | 65.2 (45.9–83.0) | 86.4 (67.4–99.2) | 74.9 (70.4–79.4) | 82.8 (71.2–98.4) | 70.8 (63.7–80.3) |
2D multi-task DL | 0.799 (0.747–0.852) | 0.12 | 69.6 (54.1–87.4) | 81.1 (60.6–92.4) | 74.9 (70.0–79.4) | 78.5 (68.1–89.0) | 72.3 (65.6–83.3) |
External testing 3 | |||||||
3D multi-task DL | 0.886 (0.856–0.916) | \ | 78.3 (72.8–84.6) | 88.1 (79.7–94.1) | 80.6 (76.3–84.7) | 95.6 (93.0–97.8) | 54.7 (49.1–61.9) |
3D single-task DL | 0.831 (0.795–0.868) | 0.03 | 65.8 (60.4–72.4) | 95.3 (89.6–99.1) | 72.7 (68.5–77.0) | 97.9 (95.6–99.6) | 45.7 (42.1–50.3) |
2D multi-task DL | 0.850 (0.812–0.887) | 0.15 | 72.4 (64.1–81.2) | 88.7 (78.3–95.3) | 76.2 (70.5–81.4) | 95.4 (92.1–97.9) | 49.2 (43.4–56.8) |
External testing 4 | |||||||
3D multi-task DL | 0.866 (0.843–0.888) | \ | 68.4 (62.9–77.0) | 95.0 (86.2–97.7) | 79.5 (77.1–82.0) | 94.7 (88.0–97.6) | 69.1 (66.0–74.0) |
3D single-task DL | 0.849 (0.825–0.873) | 0.34 | 70.6 (62.6–79.8) | 84.7 (74.9–91.2) | 76.5 (73.6–79.2) | 85.9 (80.6–90.7) | 68.3 (64.1–74.0) |
2D multi-task DL | 0.859 (0.836–0.883) | 0.70 | 72.8 (66.1–86.2) | 85.5 (70.9–90.7) | 78.2 (75.5–81.1) | 86.8 (79.7–90.7) | 70.2 (66.3–79.7) |
External testing 5 | |||||||
3D multi-task DL | 0.875 (0.854–0.896) | \ | 84.1 (70.7–90.2) | 76.5 (69.3–88.3) | 79.8 (77.0–82.2) | 73.2 (68.7–82.4) | 86.2 (79.6–90.4) |
3D single-task DL | 0.886 (0.866–0.907) | 0.18 | 83.2 (78.0–88.9) | 83.8 (77.5–88.3) | 83.4 (80.9–85.6) | 79.6 (74.6–84.1) | 86.7 (83.7–90.3) |
2D multi-task DL | 0.862 (0.840–0.884) | 0.39 | 85.2 (69.0–90.4) | 72.3 (65.5–87.2) | 78.1 (75.4–80.7) | 70.3 (66.3–80.9) | 86.3 (78.4–90.3) |
p-values in bold indicate a statistically significant difference in AUROC.
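
For readers who want to relate the threshold-based columns to one another, the sketch below computes sensitivity, specificity, accuracy, PPV, and NPV from the four cells of a confusion matrix using their standard definitions. The function name `diagnostic_metrics` and the example counts are hypothetical illustrations; they are not taken from the study, and the sketch does not reproduce how the confidence intervals or p-values in the table were obtained.

```python
def diagnostic_metrics(tp, fp, tn, fn):
    """Return standard diagnostic metrics, in percent, from confusion-matrix counts.

    tp, fp, tn, fn: true positives, false positives, true negatives, false negatives.
    """
    total = tp + fp + tn + fn
    return {
        "sensitivity": 100 * tp / (tp + fn),   # positives correctly identified
        "specificity": 100 * tn / (tn + fp),   # negatives correctly identified
        "accuracy": 100 * (tp + tn) / total,   # all correct calls
        "ppv": 100 * tp / (tp + fp),           # positive calls that are correct
        "npv": 100 * tn / (tn + fn),           # negative calls that are correct
    }


# Hypothetical counts for illustration only (not from the study).
metrics = diagnostic_metrics(tp=80, fp=10, tn=90, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

Because PPV and NPV depend on the proportion of positive cases in a cohort, they can differ widely between cohorts even when sensitivity and specificity are similar.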