Table 3.
Measures of output and performance for AI models included in the review
Reference | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC |
---|---|---|---|---|
Binary classification models | ||||
Piccolo et al. (2002) [23] | n/a | 92 | 74 | n/a |
Iyatomi et al. (2008) [24] | n/a | 86 | 86 | 0.93 |
Chang et al. (2013) [25] | 91 | 86 | 88 | 0.95 |
Chen et al. (2016) [26] | 91 | 90 | 92 | n/a |
Yang et al. (2017) [27] | 99.7 | 100 | 99 | n/a |
Yu et al. (2018) [29] | 82 | 93 | 72 | 0.80 |
Cho et al. (2020) [33] | n/a | Dataset 1: 76; Dataset 2: 70 | Dataset 1: 80; Dataset 2: 76 | Dataset 1: 0.83; Dataset 2: 0.77 |
Huang et al. (2020) [37] | 86 | n/a | n/a | 0.92 |
Han et al. (2020) [35] | n/a | 77 | 91 | 0.91 |
Fujisawa et al. (2019) [31] | 93 | 96 | 90 | n/a |
Jinnai et al. (2019) [38] | 92 | 83 | 95 | n/a |
Han et al. (2020) [36] | n/a | n/a | n/a | Edinburgh dataset: 0.93; SNU dataset: 0.94 |
Han et al. (2020) [34] | n/a | Top 1: 63 | Top 1: 90 | 0.86 |
Li et al. (2020) [44] | 86 | 75 | 93 | n/a |
Wang et al. (2020) [40] | 77 | n/a | n/a | n/a |
Multiclass classification models | ||||
Han et al. (2018) [28] | n/a | ASAN dataset: 86; Edinburgh dataset: 85 | ASAN dataset: 86; Edinburgh dataset: 81 | |
Zhang et al. (2018) [30] | Dataset A: 87; Dataset B: 87 | n/a | n/a | |
Fujisawa et al. (2019) [31] | 77 | n/a | n/a | |
Jinnai et al. (2019) [38] | 87 | 86 | 87 | |
Liu et al. (2020) (26-classification model) [39] | Top 1: 71; Top 3: 93 | Top 1: 58; Top 3: 88 | n/a | |
Han et al. (2020) [36] | Top 1: Edinburgh dataset 57, SNU dataset 45; Top 3: Edinburgh dataset 84, SNU dataset 69; Top 5: Edinburgh dataset 92, SNU dataset 78 | n/a | n/a | |
Han et al. (2020) [34] | Top 1: 43; Top 3: 62 | n/a | n/a | |
Li et al. (2020) [44] | 73 | n/a | n/a | |
Wang et al. (2020) [40] | 82 | n/a | n/a | |
Minagawa et al. (2021) [42] | 90 | n/a | n/a | |
Yang et al. (2021) [43] | Algorithm A: 88; Algorithm B: 77; Algorithm C: 90; Algorithm D: 87 | Algorithm A: 83; Algorithm B: 63; Algorithm C: 81; Algorithm D: 80 | Algorithm A: 98; Algorithm B: 90; Algorithm C: 99; Algorithm D: 98 | |
Huang et al. (2021) [37] | 5 class (KCGMH dataset): 72; 7 class (HAM10000 dataset): 86 | n/a | n/a | |
Risk categorical classification | ||||
Zhao et al. (2019) [32] | 83 | Benign: 93; Low risk: 85; High risk: 86 | Benign: 88; Low risk: 85; High risk: 91 | Benign: 0.96; Low risk: 0.92; High risk: 0.95 |
Top n: top-n accuracy is the proportion of cases in which the correct diagnosis is among the n highest-probability predictions output by the model.
For example, under top-3 accuracy a case counts as correct if any of the model's 3 highest-probability predictions matches the expected answer.
AUC, area under the curve.
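The top-n accuracy defined in the footnote above can be illustrated with a minimal Python sketch. The function name `top_n_accuracy` and the probabilities and labels in the example are hypothetical, chosen only for illustration; they are not taken from any of the reviewed studies, and the individual papers may have computed the metric differently (for instance, in how ties are handled).

```python
from typing import Sequence


def top_n_accuracy(
    probabilities: Sequence[Sequence[float]],
    labels: Sequence[int],
    n: int,
) -> float:
    """Fraction of cases whose true class is among the n highest-probability classes."""
    if len(probabilities) != len(labels):
        raise ValueError("probabilities and labels must have the same length")
    if not labels:
        raise ValueError("at least one case is required")

    hits = 0
    for probs, true_label in zip(probabilities, labels):
        # Class indices ordered from highest to lowest predicted probability.
        ranked = sorted(range(len(probs)), key=lambda c: probs[c], reverse=True)
        if true_label in ranked[:n]:
            hits += 1
    return hits / len(labels)


# Hypothetical example: 4 cases, 4 diagnostic classes (indices 0 to 3).
example_probs = [
    [0.70, 0.20, 0.05, 0.05],  # true class 0 is ranked 1st
    [0.10, 0.30, 0.40, 0.20],  # true class 1 is ranked 2nd
    [0.50, 0.25, 0.15, 0.10],  # true class 3 is ranked 4th
    [0.05, 0.15, 0.20, 0.60],  # true class 2 is ranked 2nd
]
example_labels = [0, 1, 3, 2]

top1 = top_n_accuracy(example_probs, example_labels, n=1)  # 0.25
top3 = top_n_accuracy(example_probs, example_labels, n=3)  # 0.75
print(f"Top 1: {top1:.0%}  Top 3: {top3:.0%}")
```

In this toy example only one of four cases is correct at top 1, while three of four are correct at top 3. The same pattern appears in the table, where top-3 and top-5 figures are higher than the corresponding top-1 figures.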