2023 Mar 21;239(4):499–513. doi: 10.1159/000530225

Table 3.

Measures of output and performance for AI models included in the review

| Reference | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
| --- | --- | --- | --- | --- |
| Binary classification models | | | | |
| Piccolo et al. (2002) [23] | n/a | 92 | 74 | n/a |
| Iyatomi et al. (2008) [24] | n/a | 86 | 86 | 0.93 |
| Chang et al. (2013) [25] | 91 | 86 | 88 | 0.95 |
| Chen et al. (2016) [26] | 91 | 90 | 92 | n/a |
| Yang et al. (2017) [27] | 99.7 | 100 | 99 | n/a |
| Yu et al. (2018) [29] | 82 | 93 | 72 | 0.80 |
| Cho et al. (2020) [33] | n/a | Dataset 1: 76; Dataset 2: 70 | Dataset 1: 80; Dataset 2: 76 | Dataset 1: 0.83; Dataset 2: 0.77 |
| Huang et al. (2020) [37] | 86 | n/a | n/a | 0.92 |
| Han et al. (2020) [35] | n/a | 77 | 91 | 0.91 |
| Fujisawa et al. (2019) [31] | 93 | 96 | 90 | n/a |
| Jinnai et al. (2019) [38] | 92 | 83 | 95 | n/a |
| Han et al. (2020) [36] | n/a | n/a | n/a | Edinburgh dataset: 0.93; SNU dataset: 0.94 |
| Han et al. (2020) [34] | n/a | Top 1: 63 | Top 1: 90 | 0.86 |
| Li et al. (2020) [44] | 86 | 75 | 93 | n/a |
| Wang et al. (2020) [40] | 77 | n/a | n/a | n/a |
| Multiclass classification models | | | | |
| Han et al. (2018) [28] | n/a | ASAN dataset: 86; Edinburgh dataset: 85 | ASAN dataset: 86; Edinburgh dataset: 81 | |
| Zhang et al. (2018) [30] | Dataset A: 87; Dataset B: 87 | n/a | n/a | |
| Fujisawa et al. (2019) [31] | 77 | n/a | n/a | |
| Jinnai et al. (2019) [38] | 87 | 86 | 87 | |
| Liu et al. (2020) (26-classification model) [39] | Top 1: 71; Top 3: 93 | Top 1: 58; Top 3: 88 | n/a | |
| Han et al. (2020) [36] | Top 1: Edinburgh dataset 57, SNU dataset 45; Top 3: Edinburgh dataset 84, SNU dataset 69; Top 5: Edinburgh dataset 92, SNU dataset 78 | n/a | n/a | |
| Han et al. (2020) [34] | Top 1: 43; Top 3: 62 | n/a | n/a | |
| Li et al. (2020) [44] | 73 | n/a | n/a | |
| Wang et al. (2020) [40] | 82 | n/a | n/a | |
| Minagawa et al. (2021) [42] | 90 | n/a | n/a | |
| Yang et al. (2021) [43] | Algorithm A: 88; Algorithm B: 77; Algorithm C: 90; Algorithm D: 87 | Algorithm A: 83; Algorithm B: 63; Algorithm C: 81; Algorithm D: 80 | Algorithm A: 98; Algorithm B: 90; Algorithm C: 99; Algorithm D: 98 | |
| Huang et al. (2021) [37] | 5 class (KCGMH dataset): 72; 7 class (HAM10000 dataset): 86 | n/a | n/a | |
| Risk categorical classification | | | | |
| Zhao et al. (2019) [32] | 83 | Benign: 93; Low risk: 85; High risk: 86 | Benign: 88; Low risk: 85; High risk: 91 | Benign: 0.96; Low risk: 0.92; High risk: 0.95 |

Top: top-n accuracy is the proportion of cases in which the correct diagnosis is among the n highest-probability predictions output by the model.

For example, top-3 accuracy counts a case as correct if any of the model's 3 highest-probability predictions matches the expected answer.
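
The top-n definition above can be illustrated with a minimal Python sketch. It is not taken from any of the reviewed studies: the function name `top_n_accuracy`, the class labels, and the probabilities are hypothetical, and the model output is assumed to be one dictionary of class probabilities per case.

```python
def top_n_accuracy(probabilities, true_labels, n):
    """Fraction of cases whose true label is among the n highest-probability predictions.

    probabilities: list of dicts mapping class label -> predicted probability (assumed format)
    true_labels:   list of the expected label for each case
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if len(probabilities) != len(true_labels):
        raise ValueError("probabilities and true_labels must have the same length")
    if not true_labels:
        raise ValueError("at least one case is required")

    hits = 0
    for probs, truth in zip(probabilities, true_labels):
        # Rank class labels from most to least probable and keep the first n.
        ranked = sorted(probs, key=probs.get, reverse=True)
        if truth in ranked[:n]:
            hits += 1
    return hits / len(true_labels)


# Hypothetical outputs for three lesions (illustrative values, not from the review).
predictions = [
    {"melanoma": 0.60, "nevus": 0.30, "bcc": 0.10},
    {"melanoma": 0.20, "nevus": 0.30, "bcc": 0.50},
    {"melanoma": 0.45, "nevus": 0.15, "bcc": 0.40},
]
truths = ["melanoma", "nevus", "bcc"]

top1 = top_n_accuracy(predictions, truths, 1)
top2 = top_n_accuracy(predictions, truths, 2)
print(f"top-1: {top1:.2f}, top-2: {top2:.2f}")
```

In this toy example only the first case is correct at top 1, while all three are correct at top 2, which is why top-3 and top-5 figures in the table are always at least as high as the corresponding top-1 figures.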

AUC, area under the curve.