Table 2.
Comparison of diagnostic performance of the combined model, CNN model, FIB-4, APRI, and radiologists for cirrhosis on the internal and external testing datasets
| | Combined model | CNN model | FIB-4<sup>a</sup> | APRI<sup>a</sup> | Radiologist 1 | Radiologist 2 |
|---|---|---|---|---|---|---|
| Internal testing dataset 1 | | | | | | |
| Cut-off<sup>b</sup> | > 0.53405 | > 0.54343 | > 2.26352 | > 0.54094 | / | / |
| AUC | 0.89 (0.81–0.95) | 0.87 (0.78–0.93) | 0.74 (0.64–0.82) | 0.71 (0.61–0.80) | 0.74 (0.64–0.83) | 0.78 (0.69–0.86) |
| p-value<sup>c</sup> | / | 0.36 | 0.008 | 0.003 | 0.006 | 0.04 |
| Sensitivity | 90% (80%–96%) | 87% (76%–94%) | 69% (56%–79%) | 72% (59%–82%) | 82% (71%–90%) | 72% (59%–82%) |
| p-value<sup>c</sup> | / | 0.50 | 0.001 | 0.02 | 0.23 | 0.004 |
| Specificity | 81% (62%–94%) | 74% (54%–89%) | 67% (46%–83%) | 63% (42%–81%) | 67% (46%–83%) | 85% (66%–96%) |
| p-value<sup>c</sup> | / | 0.50 | 0.29 | 0.18 | 0.34 | 1.00 |
| PPV | 92% (84%–96%) | 89% (81%–94%) | 84% (75%–90%) | 83% (74%–89%) | 86% (78%–91%) | 92% (83%–97%) |
| p-value<sup>c</sup> | / | 0.12 | 0.04 | 0.03 | 0.15 | 1.00 |
| NPV | 76% (60%–87%) | 69% (54%–81%) | 46% (35%–57%) | 47% (36%–59%) | 60% (46%–73%) | 55% (45%–65%) |
| p-value<sup>c</sup> | / | 0.07 | < 0.001 | 0.002 | 0.06 | 0.006 |
| Accuracy | 87% (79%–93%) | 83% (74%–90%) | 68% (58%–77%) | 69% (59%–78%) | 78% (68%–86%) | 76% (66%–84%) |
| p-value<sup>c</sup> | / | 0.13 | < 0.001 | 0.003 | 0.08 | 0.03 |
| True positive | 60 | 58 | 46 | 48 | 55 | 48 |
| False positive | 5 | 7 | 9 | 10 | 9 | 4 |
| False negative | 7 | 9 | 21 | 19 | 12 | 19 |
| True negative | 22 | 20 | 18 | 17 | 18 | 23 |
| Internal testing dataset 2 | | | | | | |
| Cut-off<sup>b</sup> | > 0.53405 | > 0.54343 | > 2.26352 | > 0.54094 | / | / |
| AUC | 0.88 (0.83–0.91) | 0.85 (0.80–0.89) | 0.71 (0.65–0.76) | 0.67 (0.61–0.73) | 0.74 (0.68–0.79) | 0.81 (0.76–0.85) |
| p-value<sup>c</sup> | / | 0.03 | < 0.001 | < 0.001 | < 0.001 | 0.01 |
| Sensitivity | 87% (81%–91%) | 81% (75%–87%) | 64% (56%–71%) | 59% (52%–67%) | 74% (67%–81%) | 71% (64%–78%) |
| p-value<sup>c</sup> | / | 0.01 | < 0.001 | < 0.001 | 0.002 | < 0.001 |
| Specificity | 71% (60%–80%) | 79% (69%–87%) | 64% (53%–74%) | 62% (51%–72%) | 73% (63%–82%) | 91% (82%–96%) |
| p-value<sup>c</sup> | / | 0.09 | 0.36 | 0.18 | 0.86 | 0.002 |
| PPV | 86% (82%–90%) | 89% (84%–92%) | 79% (73%–83%) | 76% (71%–81%) | 85% (80%–89%) | 94% (89%–97%) |
| p-value<sup>c</sup> | / | 0.13 | 0.01 | 0.001 | 0.78 | 0.008 |
| NPV | 72% (63%–79%) | 67% (59%–73%) | 46% (40%–52%) | 42% (36%–48%) | 58% (51%–64%) | 60% (54%–66%) |
| p-value<sup>c</sup> | / | 0.09 | < 0.001 | < 0.001 | 0.006 | 0.02 |
| Accuracy | 82% (76%–86%) | 80% (75%–85%) | 64% (58%–70%) | 60% (54%–66%) | 74% (68%–79%) | 77% (72%–82%) |
| p-value<sup>c</sup> | / | 0.70 | < 0.001 | < 0.001 | 0.03 | 0.27 |
| True positive | 156 | 146 | 115 | 107 | 134 | 128 |
| False positive | 25 | 18 | 31 | 33 | 23 | 8 |
| False negative | 24 | 34 | 65 | 73 | 46 | 52 |
| True negative | 61 | 68 | 55 | 53 | 63 | 78 |
| External testing dataset | | | | | | |
| Cut-off<sup>b</sup> | > 0.58556 | > 0.57295 | > 3.03248 | > 1.03125 | / | / |
| AUC | 0.86 (0.78–0.91) | 0.81 (0.73–0.88) | 0.69 (0.59–0.77) | 0.67 (0.58–0.76) | 0.73 (0.64–0.81) | 0.71 (0.61–0.79) |
| p-value<sup>c</sup> | / | 0.02 | 0.001 | < 0.001 | 0.02 | 0.006 |
| Sensitivity | 84% (73%–91%) | 77% (66%–86%) | 62% (50%–80%) | 65% (53%–76%) | 73% (61%–83%) | 70% (59%–80%) |
| p-value<sup>c</sup> | / | 0.13 | 0.003 | 0.007 | 0.10 | 0.05 |
| Specificity | 73% (57%–86%) | 68% (52%–82%) | 66% (49%–80%) | 54% (37%–69%) | 73% (57%–86%) | 71% (54%–84%) |
| p-value<sup>c</sup> | / | 0.73 | 0.58 | 0.08 | 1.00 | 1.00 |
| PPV | 85% (77%–90%) | 81% (73%–87%) | 77% (67%–84%) | 72% (64%–79%) | 83% (74%–89%) | 81% (72%–88%) |
| p-value<sup>c</sup> | / | 0.30 | 0.08 | 0.005 | 0.67 | 0.48 |
| NPV | 71% (59%–81%) | 62% (51%–72%) | 49% (40%–58%) | 46% (36%–56%) | 60% (50%–69%) | 57% (47%–66%) |
| p-value<sup>c</sup> | / | 0.045 | 0.001 | < 0.001 | 0.08 | 0.04 |
| Accuracy | 80% (72%–87%) | 74% (65%–82%) | 63% (54%–72%) | 61% (51%–70%) | 73% (64%–81%) | 70% (61%–79%) |
| p-value<sup>c</sup> | / | 0.12 | 0.003 | < 0.001 | 0.20 | 0.11 |
| True positive | 62 | 57 | 46 | 48 | 54 | 52 |
| False positive | 11 | 13 | 14 | 19 | 11 | 12 |
| False negative | 12 | 17 | 28 | 26 | 20 | 22 |
| True negative | 30 | 28 | 27 | 22 | 30 | 29 |
Data in parentheses are 95% confidence intervals
APRI aminotransferase-to-platelet ratio index, AUC area under the receiver operating characteristic curve, CNN convolutional neural network, FIB-4 fibrosis-4 index, NPV negative predictive value, PPV positive predictive value
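a The calculation formulas were as follows: $\text{FIB-4} = (\text{age [years]} \times \text{AST [U/L]}) / (\text{platelet count } [10^{9}/\text{L}] \times \sqrt{\text{ALT [U/L]}})$; $\text{APRI} = ((\text{AST} / \text{upper limit of normal}) / \text{platelet count } [10^{9}/\text{L}]) \times 100$ [9, 10]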
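As an illustration of footnote a, the two serum indices can be computed as in the minimal sketch below. The function names, argument names, and patient values are hypothetical and chosen only for illustration; the only values taken from the table are the internal-dataset cut-offs.

```python
import math

# Cut-offs reported in the table for the internal testing datasets.
FIB4_CUTOFF_INTERNAL = 2.26352
APRI_CUTOFF_INTERNAL = 0.54094


def fib4(age_years, ast_u_per_l, alt_u_per_l, platelets_1e9_per_l):
    """FIB-4 = (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_per_l) / (
        platelets_1e9_per_l * math.sqrt(alt_u_per_l)
    )


def apri(ast_u_per_l, ast_upper_limit_normal, platelets_1e9_per_l):
    """APRI = ((AST / upper limit of normal) / platelet count) x 100."""
    return (ast_u_per_l / ast_upper_limit_normal) / platelets_1e9_per_l * 100


# Hypothetical patient, not taken from the study data.
patient_fib4 = fib4(age_years=60, ast_u_per_l=80, alt_u_per_l=64,
                    platelets_1e9_per_l=100)
patient_apri = apri(ast_u_per_l=80, ast_upper_limit_normal=40,
                    platelets_1e9_per_l=100)

# The table lists cut-offs as "> value", so a strict comparison is used.
fib4_positive = patient_fib4 > FIB4_CUTOFF_INTERNAL
apri_positive = patient_apri > APRI_CUTOFF_INTERNAL
```

The upper limit of normal for AST is laboratory-specific, so it is passed in as an argument rather than fixed in the function.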
b Cut-off values were selected in the training dataset using the receiver operating characteristic curve and the Youden index. For the combined model and the CNN model, the cut-offs are thresholds on the model output
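The Youden-index selection described in footnote b can be sketched as follows. This is an illustrative assumption about the procedure, not the authors' code: it treats each observed score as a candidate threshold, classifies a case as positive when its score is strictly greater than the threshold (matching the "> cut-off" notation in the table), and keeps the threshold that maximizes J = sensitivity + specificity - 1. The scores and labels are made up.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing J = sensitivity + specificity - 1.

    A case is called positive when score > cutoff. Labels are 1 for
    disease present and 0 for disease absent.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cutoff, best_j = None, -1.0
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s <= cutoff and y == 0)
        j = tp / positives + tn / negatives - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Hypothetical training scores and labels, for illustration only.
train_scores = [0.10, 0.35, 0.40, 0.55, 0.62, 0.80, 0.90]
train_labels = [0, 0, 1, 0, 1, 1, 1]

cutoff, j_statistic = youden_cutoff(train_scores, train_labels)
```

Because the threshold is fixed on the training data, it can then be applied unchanged to the testing datasets.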
c p-values were calculated in comparison to the combined model. AUCs were compared using the DeLong test. PPVs and NPVs were compared using the weighted generalized score test proposed by Kosinski, while sensitivities, specificities, and accuracies were compared using McNemar’s test
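The point estimates in the table follow directly from the four confusion counts, and McNemar's test in footnote c operates on the paired discordant counts between two readers or models. The sketch below recomputes the combined model's metrics on internal testing dataset 1 from the counts in the table, and shows an exact two-sided McNemar test. The footnote does not say whether the exact or the asymptotic form was used, so the exact form here is an assumption; the discordant counts are hypothetical because the table does not report paired data.

```python
from math import comb


def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV, and accuracy from counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


def mcnemar_exact(b, c):
    """Exact two-sided McNemar p-value.

    b and c are the discordant pair counts: cases classified correctly
    by one method but not the other, in each direction.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Combined model, internal testing dataset 1 (counts from the table).
combined = diagnostic_metrics(tp=60, fp=5, fn=7, tn=22)
combined_pct = {name: round(value * 100) for name, value in combined.items()}

# Hypothetical discordant counts, for illustration only.
p_value = mcnemar_exact(b=2, c=10)
```

Rounded to whole percentages, `combined_pct` reproduces the 90%, 81%, 92%, 76%, and 87% reported for the combined model in that block of the table.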