2024 Jul 2;27(5):1088–1099. doi: 10.1007/s10120-024-01524-3

Table 3.

Comparison of predictive accuracies between AI model and endoscopists in the test set

Prediction target	AI model (n/N)		Experts, mean (95% CI)		Novices, mean (95% CI)
Prediction target	Overall	Video	Overall	Video	Overall	Video
Undifferentiated histology
Accuracy (%)	92.7 (102/110)	100 (10/10)	71.6 (68.6–74.6)*	71.1	58.1 (53.4–62.8)**	48.8
Sensitivity (%)	87.3 (48/55)	100 (5/5)	85.8 (79.5–92.1)	88.5	60.8 (53.1–68.5)	43.1
Specificity (%)	98.2 (54/55)	100 (5/5)	65.9 (63.2–68.6)	65.8	57.4 (53.1–61.7)	48.0
PPV (%)	98.0 (48/49)	100 (5/5)	52.9 (46.8–59.2)	51.1	53.3 (42.5–64.1)	45.0
NPV (%)	88.5 (54/61)	100 (5/5)	90.3 (84.6–95.4)	91.1	62.9 (49.0–76.8)	52.5
Submucosal invasion
Accuracy (%)	87.3 (96/110)	100 (10/10)	72.6 (70.6–74.6)*	77.8	63.9 (56.4–71.4)**	58.8
Sensitivity (%)	83.3 (45/54)	100 (4/4)	70.9 (65.9–75.9)	86.2	60.8 (53.1–68.5)	78.5
Specificity (%)	91.1 (51/56)	100 (6/6)	76.5 (73.4–79.6)	70.2	73.0 (61.6–84.4)	51.8
PPV (%)	90.0 (45/50)	100 (4/4)	77.6 (71.4–83.8)	80.6	77.2 (64.0–90.4)	75.0
NPV (%)	85.0 (51/60)	100 (6/6)	67.9 (58.8–77.0)	75.9	51.0 (38.6–63.5)	47.9
Lymphovascular invasion
Accuracy (%)	76.4 (84/110)	100 (10/10)	69.7 (59.6–79.8)	72.2	64.8 (61.4–68.2)	62.5
Sensitivity (%)	30.0 (6/20)	100 (1/1)	33.7 (26.7–40.7)	33.3	28.9 (20.6–37.2)	62.5
Specificity (%)	86.7 (78/90)	100 (9/9)	86.7 (85.2–88.2)	76.5	87.5 (83.2–92.7)	62.5
PPV (%)	33.3 (6/18)	100 (1/1)	46.9 (31.5–62.3)	8.7	40.0 (16.9–63.1)	3.8
NPV (%)	84.8 (78/92)	100 (9/9)	68.4 (56.7–73.7)	91.8	68.1 (48.8–87.4)	96.0
Lymph node metastasis
Accuracy (%)	87.7 (71/81)	62.5 (5/8)	72.3 (65.8–78.8)*	68.1	67.7 (49.8–85.6)	65.6
Sensitivity (%)	41.7 (5/12)	33.3 (1/3)	17.6 (11.5–23.7)	22.2	14.8 (5.2–24.4)	29.2
Specificity (%)	95.4 (66/69)	80.0 (4/5)	85.3 (72.3–98.4)	95.6	83.4 (79.6–87.2)	87.5
PPV (%)	62.5 (5/8)	50 (1/2)	19.7 (3.9–35.5)	46.3	10.3 (2.6–18.0)	43.8
NPV (%)	90.4 (66/73)	66.7 (4/6)	81.8 (72.7–90.9)	67.9	76.2 (58–95)	70.5

AI artificial intelligence, CI confidence interval, n number of correct answers, N number of questions

* P < 0.05 when accuracy was compared with that of the AI system using McNemar’s test

** P < 0.05 when the mean accuracy was compared with that of the experts using the Mann–Whitney U test
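
The percentages in the AI model columns follow directly from the n/N fractions shown beside them. As a minimal sketch (not the authors' code), the Python snippet below recomputes the five metrics for the overall undifferentiated histology row. The 2×2 counts are an assumption inferred from the reported fractions (sensitivity 48/55, specificity 54/55, PPV 48/49, NPV 54/61), and the names `Counts` and `diagnostic_metrics` are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Cells of a 2x2 confusion matrix (hypothetical helper type)."""
    tp: int  # true positives
    fn: int  # false negatives
    tn: int  # true negatives
    fp: int  # false positives


def diagnostic_metrics(c: Counts) -> dict:
    """Return accuracy, sensitivity, specificity, PPV and NPV in percent."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": 100 * (c.tp + c.tn) / total,
        "sensitivity": 100 * c.tp / (c.tp + c.fn),
        "specificity": 100 * c.tn / (c.tn + c.fp),
        "ppv": 100 * c.tp / (c.tp + c.fp),
        "npv": 100 * c.tn / (c.tn + c.fn),
    }


# Assumed counts, inferred from the AI model's overall fractions for
# undifferentiated histology: 48/55, 54/55, 48/49 and 54/61.
histology = Counts(tp=48, fn=7, tn=54, fp=1)
metrics = diagnostic_metrics(histology)

for name, value in metrics.items():
    print(f"{name}: {value:.1f}")
```

With these counts the rounded values match the first block of the table (92.7, 87.3, 98.2, 98.0, 88.5). The same function can be applied to the other AI model rows by reading the counts off their fractions.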