2024 Dec 12;15:298. doi: 10.1186/s13244-024-01872-9

Table 2.

Comparison of diagnostic performance of the combined model, CNN model, FIB-4, APRI, and radiologists for cirrhosis on the internal and external testing datasets

| Metric | Combined model | CNN model | FIB-4 (a) | APRI (a) | Radiologist 1 | Radiologist 2 |
|---|---|---|---|---|---|---|
| Internal testing dataset 1 | | | | | | |
| Cut-off (b) | > 0.53405 | > 0.54343 | > 2.26352 | > 0.54094 | / | / |
| AUC | 0.89 (0.81–0.95) | 0.87 (0.78–0.93) | 0.74 (0.64–0.82) | 0.71 (0.61–0.80) | 0.74 (0.64–0.83) | 0.78 (0.69–0.86) |
| p-value (c) | / | 0.36 | 0.008 | 0.003 | 0.006 | 0.04 |
| Sensitivity | 90% (80%–96%) | 87% (76%–94%) | 69% (56%–79%) | 72% (59%–82%) | 82% (71%–90%) | 72% (59%–82%) |
| p-value (c) | / | 0.50 | 0.001 | 0.02 | 0.23 | 0.004 |
| Specificity | 81% (62%–94%) | 74% (54%–89%) | 67% (46%–83%) | 63% (42%–81%) | 67% (46%–83%) | 85% (66%–96%) |
| p-value (c) | / | 0.50 | 0.29 | 0.18 | 0.34 | 1.00 |
| PPV | 92% (84%–96%) | 89% (81%–94%) | 84% (75%–90%) | 83% (74%–89%) | 86% (78%–91%) | 92% (83%–97%) |
| p-value (c) | / | 0.12 | 0.04 | 0.03 | 0.15 | 1.00 |
| NPV | 76% (60%–87%) | 69% (54%–81%) | 46% (35%–57%) | 47% (36%–59%) | 60% (46%–73%) | 55% (45%–65%) |
| p-value (c) | / | 0.07 | < 0.001 | 0.002 | 0.06 | 0.006 |
| Accuracy | 87% (79%–93%) | 83% (74%–90%) | 68% (58%–77%) | 69% (59%–78%) | 78% (68%–86%) | 76% (66%–84%) |
| p-value (c) | / | 0.13 | < 0.001 | 0.003 | 0.08 | 0.03 |
| True positive | 60 | 58 | 46 | 48 | 55 | 48 |
| False positive | 5 | 7 | 9 | 10 | 9 | 4 |
| False negative | 7 | 9 | 21 | 19 | 12 | 19 |
| True negative | 22 | 20 | 18 | 17 | 18 | 23 |
| Internal testing dataset 2 | | | | | | |
| Cut-off (b) | > 0.53405 | > 0.54343 | > 2.26352 | > 0.54094 | / | / |
| AUC | 0.88 (0.83–0.91) | 0.85 (0.80–0.89) | 0.71 (0.65–0.76) | 0.67 (0.61–0.73) | 0.74 (0.68–0.79) | 0.81 (0.76–0.85) |
| p-value (c) | / | 0.03 | < 0.001 | < 0.001 | < 0.001 | 0.01 |
| Sensitivity | 87% (81%–91%) | 81% (75%–87%) | 64% (56%–71%) | 59% (52%–67%) | 74% (67%–81%) | 71% (64%–78%) |
| p-value (c) | / | 0.01 | < 0.001 | < 0.001 | 0.002 | < 0.001 |
| Specificity | 71% (60%–80%) | 79% (69%–87%) | 64% (53%–74%) | 62% (51%–72%) | 73% (63%–82%) | 91% (82%–96%) |
| p-value (c) | / | 0.09 | 0.36 | 0.18 | 0.86 | 0.002 |
| PPV | 86% (82%–90%) | 89% (84%–92%) | 79% (73%–83%) | 76% (71%–81%) | 85% (80%–89%) | 94% (89%–97%) |
| p-value (c) | / | 0.13 | 0.01 | 0.001 | 0.78 | 0.008 |
| NPV | 72% (63%–79%) | 67% (59%–73%) | 46% (40%–52%) | 42% (36%–48%) | 58% (51%–64%) | 60% (54%–66%) |
| p-value (c) | / | 0.09 | < 0.001 | < 0.001 | 0.006 | 0.02 |
| Accuracy | 82% (76%–86%) | 80% (75%–85%) | 64% (58%–70%) | 60% (54%–66%) | 74% (68%–79%) | 77% (72%–82%) |
| p-value (c) | / | 0.70 | < 0.001 | < 0.001 | 0.03 | 0.27 |
| True positive | 156 | 146 | 115 | 107 | 134 | 128 |
| False positive | 25 | 18 | 31 | 33 | 23 | 8 |
| False negative | 24 | 34 | 65 | 73 | 46 | 52 |
| True negative | 61 | 68 | 55 | 53 | 63 | 78 |
| External testing dataset | | | | | | |
| Cut-off (b) | > 0.58556 | > 0.57295 | > 3.03248 | > 1.03125 | / | / |
| AUC | 0.86 (0.78–0.91) | 0.81 (0.73–0.88) | 0.69 (0.59–0.77) | 0.67 (0.58–0.76) | 0.73 (0.64–0.81) | 0.71 (0.61–0.79) |
| p-value (c) | / | 0.02 | 0.001 | < 0.001 | 0.02 | 0.006 |
| Sensitivity | 84% (73%–91%) | 77% (66%–86%) | 62% (50%–80%) | 65% (53%–76%) | 73% (61%–83%) | 70% (59%–80%) |
| p-value (c) | / | 0.13 | 0.003 | 0.007 | 0.10 | 0.05 |
| Specificity | 73% (57%–86%) | 68% (52%–82%) | 66% (49%–80%) | 54% (37%–69%) | 73% (57%–86%) | 71% (54%–84%) |
| p-value (c) | / | 0.73 | 0.58 | 0.08 | 1.00 | 1.00 |
| PPV | 85% (77%–90%) | 81% (73%–87%) | 77% (67%–84%) | 72% (64%–79%) | 83% (74%–89%) | 81% (72%–88%) |
| p-value (c) | / | 0.30 | 0.08 | 0.005 | 0.67 | 0.48 |
| NPV | 71% (59%–81%) | 62% (51%–72%) | 49% (40%–58%) | 46% (36%–56%) | 60% (50%–69%) | 57% (47%–66%) |
| p-value (c) | / | 0.045 | 0.001 | < 0.001 | 0.08 | 0.04 |
| Accuracy | 80% (72%–87%) | 74% (65%–82%) | 63% (54%–72%) | 61% (51%–70%) | 73% (64%–81%) | 70% (61%–79%) |
| p-value (c) | / | 0.12 | 0.003 | < 0.001 | 0.20 | 0.11 |
| True positive | 62 | 57 | 46 | 48 | 54 | 52 |
| False positive | 11 | 13 | 14 | 19 | 11 | 12 |
| False negative | 12 | 17 | 28 | 26 | 20 | 22 |
| True negative | 30 | 28 | 27 | 22 | 30 | 29 |

Data in parentheses are 95% confidence intervals.
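The point estimates in the table follow directly from the true/false positive and negative counts reported for each dataset. As a minimal sketch, the snippet below recomputes them for the combined model on internal testing dataset 1; the class and function names are illustrative and not taken from the paper, and the confidence intervals are not reproduced because the interval method is not stated in the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts from a 2x2 diagnostic table (hypothetical helper type)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def diagnostic_metrics(c: ConfusionCounts) -> dict:
    """Return the point estimates reported in the table as fractions."""
    total = c.tp + c.fp + c.fn + c.tn
    return {
        "sensitivity": c.tp / (c.tp + c.fn),
        "specificity": c.tn / (c.tn + c.fp),
        "ppv": c.tp / (c.tp + c.fp),
        "npv": c.tn / (c.tn + c.fn),
        "accuracy": (c.tp + c.tn) / total,
    }


# Combined model, internal testing dataset 1 (counts copied from the table).
combined_internal_1 = ConfusionCounts(tp=60, fp=5, fn=7, tn=22)

for name, value in diagnostic_metrics(combined_internal_1).items():
    print(f"{name}: {value:.0%}")
```

Rounded to whole percentages, these values agree with the combined-model column for internal testing dataset 1 (90%, 81%, 92%, 76%, and 87%).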

APRI aminotransferase-to-platelet ratio index, AUC area under the receiver operating characteristic curve, CNN convolutional neural network, FIB-4 fibrosis-4 index, NPV negative predictive value, PPV positive predictive value

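a The calculation formulas were as follows [9, 10]: $\text{FIB-4} = \dfrac{\text{age [years]} \times \text{AST [U/L]}}{\text{platelet count } [10^{9}/\text{L}] \times \sqrt{\text{ALT [U/L]}}}$; $\text{APRI} = \dfrac{\text{AST} / \text{upper limit of normal}}{\text{platelet count } [10^{9}/\text{L}]} \times 100$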
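The two serum indices in footnote a can be sketched in a few lines. The function and parameter names are illustrative, and the patient values in the usage example are hypothetical, chosen only to make the arithmetic easy to check; they do not come from the study.

```python
import math


def fib4(age_years: float, ast_u_per_l: float, alt_u_per_l: float,
         platelets_1e9_per_l: float) -> float:
    """FIB-4 = (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (
        platelets_1e9_per_l * math.sqrt(alt_u_per_l)
    )


def apri(ast_u_per_l: float, ast_upper_limit_u_per_l: float,
         platelets_1e9_per_l: float) -> float:
    """APRI = ((AST / upper limit of normal) / platelet count) x 100."""
    return (ast_u_per_l / ast_upper_limit_u_per_l) / platelets_1e9_per_l * 100


# Hypothetical patient: 50 years old, AST 80 U/L, ALT 64 U/L,
# platelets 100 x 10^9/L, and an assumed AST upper limit of normal of 40 U/L.
print(fib4(50, 80, 64, 100))  # 4000 / (100 * 8) = 5.0
print(apri(80, 40, 100))      # (2 / 100) * 100 = 2.0
```

The upper limit of normal for AST depends on the laboratory, so it is passed in as a parameter rather than fixed.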

b Cut-off values were selected based on the receiver operating characteristic curve and the Youden index in the training dataset. For the combined model and the CNN model, the cut-offs are thresholds applied to the model output
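Selecting a cut-off by the Youden index means choosing the threshold that maximizes J = sensitivity + specificity − 1 on the training data. The sketch below shows one simple way to do this; it assumes that each observed score is a candidate threshold and that a case is called positive when its score is strictly greater than the threshold (matching the ">" convention in the table). The function name and the toy data are hypothetical, and the software actually used by the authors may enumerate thresholds differently.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J for the rule score > cutoff.

    labels: 1 for cirrhosis, 0 for no cirrhosis.
    Ties are resolved in favor of the lowest cutoff.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")

    best_cutoff, best_j = None, float("-inf")
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s > cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s <= cutoff)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy training data (hypothetical model outputs and reference labels).
train_scores = [0.10, 0.20, 0.35, 0.40, 0.60, 0.70, 0.80, 0.90]
train_labels = [0, 0, 0, 1, 0, 1, 1, 1]

cutoff, j = youden_cutoff(train_scores, train_labels)
print(cutoff, j)  # 0.35 0.75
```

The chosen cut-off is then held fixed and applied unchanged to the testing data.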

c p-values were calculated in comparison with the combined model. AUCs were compared using the DeLong test. PPVs and NPVs were compared using the weighted generalized score test proposed by Kosinski, while sensitivities, specificities, and accuracies were compared using McNemar's test
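McNemar's test compares two methods evaluated on the same patients by looking only at the discordant pairs, that is, cases one method classified correctly and the other did not. The sketch below uses the exact (binomial) form of the test; whether the authors used the exact or the chi-square form is not stated, so this is an assumption. The discordant counts in the example are hypothetical, because the table reports only marginal counts per method and not the paired cross-classification.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: cases method A got right and method B got wrong.
    c: cases method A got wrong and method B got right.
    Under the null hypothesis, min(b, c) ~ Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant pairs among patients with cirrhosis:
# 5 detected only by method A, 1 detected only by method B.
print(mcnemar_exact(5, 1))  # 0.21875
```

To compare sensitivities, the pairs are restricted to patients with cirrhosis; for specificities, to patients without cirrhosis; for accuracy, all patients are used.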