2024 Dec 12;15:298. doi: 10.1186/s13244-024-01872-9

Table 2.

Comparison of diagnostic performance of the combined model, CNN model, FIB-4, APRI, and radiologists for cirrhosis on the internal and external testing datasets

	Combined model	CNN model	FIB-4^a	APRI^a	Radiologist 1	Radiologist 2
Internal testing dataset 1
Cut-off^b	> 0.53405	> 0.54343	> 2.26352	> 0.54094	/	/
AUC	0.89 (0.81–0.95)	0.87 (0.78–0.93)	0.74 (0.64–0.82)	0.71 (0.61–0.80)	0.74 (0.64–0.83)	0.78 (0.69–0.86)
p-value^c	/	0.36	0.008	0.003	0.006	0.04
Sensitivity	90% (80%–96%)	87% (76%–94%)	69% (56%–79%)	72% (59%–82%)	82% (71%–90%)	72% (59%–82%)
p-value^c	/	0.50	0.001	0.02	0.23	0.004
Specificity	81% (62%–94%)	74% (54%–89%)	67% (46%–83%)	63% (42%–81%)	67% (46%–83%)	85% (66%–96%)
p-value^c	/	0.50	0.29	0.18	0.34	1.00
PPV	92% (84%–96%)	89% (81%–94%)	84% (75%–90%)	83% (74%–89%)	86% (78%–91%)	92% (83%–97%)
p-value^c	/	0.12	0.04	0.03	0.15	1.00
NPV	76% (60%–87%)	69% (54%–81%)	46% (35%–57%)	47% (36%–59%)	60% (46%–73%)	55% (45%–65%)
p-value^c	/	0.07	< 0.001	0.002	0.06	0.006
Accuracy	87% (79%–93%)	83% (74%–90%)	68% (58%–77%)	69% (59%–78%)	78% (68%–86%)	76% (66%–84%)
p-value^c	/	0.13	< 0.001	0.003	0.08	0.03
True positive	60	58	46	48	55	48
False positive	5	7	9	10	9	4
False negative	7	9	21	19	12	19
True negative	22	20	18	17	18	23
Internal testing dataset 2
Cut-off^b	> 0.53405	> 0.54343	> 2.26352	> 0.54094	/	/
AUC	0.88 (0.83–0.91)	0.85 (0.80–0.89)	0.71 (0.65–0.76)	0.67 (0.61–0.73)	0.74 (0.68–0.79)	0.81 (0.76–0.85)
p-value^c	/	0.03	< 0.001	< 0.001	< 0.001	0.01
Sensitivity	87% (81%–91%)	81% (75%–87%)	64% (56%–71%)	59% (52%–67%)	74% (67%–81%)	71% (64%–78%)
p-value^c	/	0.01	< 0.001	< 0.001	0.002	< 0.001
Specificity	71% (60%–80%)	79% (69%–87%)	64% (53%–74%)	62% (51%–72%)	73% (63%–82%)	91% (82%–96%)
p-value^c	/	0.09	0.36	0.18	0.86	0.002
PPV	86% (82%–90%)	89% (84%–92%)	79% (73%–83%)	76% (71%–81%)	85% (80%–89%)	94% (89%–97%)
p-value^c	/	0.13	0.01	0.001	0.78	0.008
NPV	72% (63%–79%)	67% (59%–73%)	46% (40%–52%)	42% (36%–48%)	58% (51%–64%)	60% (54%–66%)
p-value^c	/	0.09	< 0.001	< 0.001	0.006	0.02
Accuracy	82% (76%–86%)	80% (75%–85%)	64% (58%–70%)	60% (54%–66%)	74% (68%–79%)	77% (72%–82%)
p-value^c	/	0.70	< 0.001	< 0.001	0.03	0.27
True positive	156	146	115	107	134	128
False positive	25	18	31	33	23	8
False negative	24	34	65	73	46	52
True negative	61	68	55	53	63	78
External testing dataset
Cut-off^b	> 0.58556	> 0.57295	> 3.03248	> 1.03125	/	/
AUC	0.86 (0.78–0.91)	0.81 (0.73–0.88)	0.69 (0.59–0.77)	0.67 (0.58–0.76)	0.73 (0.64–0.81)	0.71 (0.61–0.79)
p-value^c	/	0.02	0.001	< 0.001	0.02	0.006
Sensitivity	84% (73%–91%)	77% (66%–86%)	62% (50%–80%)	65% (53%–76%)	73% (61%–83%)	70% (59%–80%)
p-value^c	/	0.13	0.003	0.007	0.10	0.05
Specificity	73% (57%–86%)	68% (52%–82%)	66% (49%–80%)	54% (37%–69%)	73% (57%–86%)	71% (54%–84%)
p-value^c	/	0.73	0.58	0.08	1.00	1.00
PPV	85% (77%–90%)	81% (73%–87%)	77% (67%–84%)	72% (64%–79%)	83% (74%–89%)	81% (72%–88%)
p-value^c	/	0.30	0.08	0.005	0.67	0.48
NPV	71% (59%–81%)	62% (51%–72%)	49% (40%–58%)	46% (36%–56%)	60% (50%–69%)	57% (47%–66%)
p-value^c	/	0.045	0.001	< 0.001	0.08	0.04
Accuracy	80% (72%–87%)	74% (65%–82%)	63% (54%–72%)	61% (51%–70%)	73% (64%–81%)	70% (61%–79%)
p-value^c	/	0.12	0.003	< 0.001	0.20	0.11
True positive	62	57	46	48	54	52
False positive	11	13	14	19	11	12
False negative	12	17	28	26	20	22
True negative	30	28	27	22	30	29

Data in parentheses are 95% confidence intervals

APRI aspartate aminotransferase-to-platelet ratio index, AUC area under the receiver operating characteristic curve, CNN convolutional neural network, FIB-4 fibrosis-4 index, NPV negative predictive value, PPV positive predictive value
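The sensitivity, specificity, PPV, NPV, and accuracy in each block can be recomputed from the true/false positive/negative counts listed in the same block. The snippet below is a minimal sketch of that arithmetic; the function name `diagnostic_metrics` is illustrative and not from the article, and the confidence intervals are not reproduced because the interval method is not stated in the table.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Return standard diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),   # true positives among all diseased
        "specificity": tn / (tn + fp),   # true negatives among all non-diseased
        "ppv": tp / (tp + fp),           # true positives among positive calls
        "npv": tn / (tn + fn),           # true negatives among negative calls
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }


# Counts for the combined model on internal testing dataset 1 (from the table).
metrics = diagnostic_metrics(tp=60, fp=5, fn=7, tn=22)
for name, value in metrics.items():
    print(f"{name}: {value:.0%}")
# sensitivity: 90%
# specificity: 81%
# ppv: 92%
# npv: 76%
# accuracy: 87%
```

The rounded values match the point estimates reported for the combined model on internal testing dataset 1.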

^a The calculation formulas were as follows: FIB-4 = (age [years] × AST [U/L]) / (platelet count [10⁹/L] × (ALT [U/L])^(1/2)); APRI = ((AST [U/L] / upper limit of normal [U/L]) / platelet count [10⁹/L]) × 100 [9, 10]
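The two serum indices in footnote a can be written as short functions. This is a sketch only: the function and parameter names are illustrative, the patient values are hypothetical, and the AST upper limit of normal of 40 U/L is an assumption (the table does not state the value used).

```python
import math


def fib4(age_years, ast_u_l, alt_u_l, platelets_1e9_l):
    """FIB-4 = (age x AST) / (platelet count x sqrt(ALT))."""
    return (age_years * ast_u_l) / (platelets_1e9_l * math.sqrt(alt_u_l))


def apri(ast_u_l, ast_uln_u_l, platelets_1e9_l):
    """APRI = ((AST / upper limit of normal) / platelet count) x 100."""
    return (ast_u_l / ast_uln_u_l) / platelets_1e9_l * 100


# Hypothetical patient; the ULN of 40 U/L is an assumed laboratory value.
fib4_score = fib4(age_years=60, ast_u_l=80, alt_u_l=64, platelets_1e9_l=100)
apri_score = apri(ast_u_l=80, ast_uln_u_l=40, platelets_1e9_l=100)

# Cut-offs used for the internal testing datasets (from the table).
print(f"FIB-4 = {fib4_score:.2f}, positive: {fib4_score > 2.26352}")
print(f"APRI  = {apri_score:.2f}, positive: {apri_score > 0.54094}")
```

With these hypothetical inputs both indices exceed the internal-dataset cut-offs, so the patient would be classified as positive for cirrhosis by either index.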

^b Cut-off values were selected in the training dataset using the receiver operating characteristic curve and the Youden index. For the combined and CNN models, the cut-offs are thresholds on the respective model outputs
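Selecting a cut-off by the Youden index means choosing the threshold that maximizes J = sensitivity + specificity − 1 on the training data. The following is a minimal sketch with toy scores and labels (not data from the study); `youden_cutoff` is an illustrative name, and a score is called positive when it is strictly greater than the cut-off, matching the ">" convention in the table.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, J) maximizing Youden's J, with 'score > cutoff' as positive."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best_cutoff, best_j = None, -1.0
    # Every observed score is a candidate threshold.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > cutoff and y == 0)
        sensitivity = tp / positives
        specificity = (negatives - fp) / negatives
        j = sensitivity + specificity - 1
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy training data: 1 = cirrhosis, 0 = no cirrhosis.
train_scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
train_labels = [0, 0, 1, 0, 1, 1, 1, 1]

cutoff, j = youden_cutoff(train_scores, train_labels)
print(f"cut-off > {cutoff}, Youden J = {j:.2f}")
```

The chosen threshold is then held fixed and applied unchanged to the testing datasets, as the table does with the training-derived cut-offs.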

^c p-values were calculated in comparison to the combined model. AUCs were compared using the DeLong test. PPVs and NPVs were compared using the weighted generalized score test proposed by Kosinski, while sensitivities, specificities, and accuracies were compared using McNemar’s test
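McNemar's test compares two paired classifiers using only the discordant cases, that is, cases one method classified correctly and the other did not. The sketch below implements the exact (binomial) form of the test. Two assumptions apply: the table does not say whether the exact or the chi-square version was used, and the discordant counts here are hypothetical because the table reports only marginal counts, not the paired cross-classification.

```python
from math import comb


def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: cases where method A is correct and method B is wrong.
    c: cases where method A is wrong and method B is correct.
    Under the null hypothesis, b follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts among the diseased patients
# (a comparison of sensitivities between two methods).
p_value = mcnemar_exact(b=10, c=2)
print(f"p = {p_value:.3f}")  # p = 0.039
```

For sensitivity the test is restricted to patients with cirrhosis, for specificity to patients without it, and for accuracy it uses all patients.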