2024 Jun 15;485(6):1095–1105. doi: 10.1007/s00428-024-03841-5

Table 3.

AI(H) performance on inflammation-focused tasks, H&E slides

| Model | Overall accuracy (%)*, test | Overall accuracy (%)*, training | Macro-precision (%), test | Macro-precision (%), training | Macro-sensitivity (macro-recall) (%), test | Macro-sensitivity (macro-recall) (%), training |
|---|---|---|---|---|---|---|
| Tissue detection | 99.4 | 99.9 | 99.7 | 98.9 | 99.3 | 99.5 |
| Microanatomy (portal area, lobular area, central vein) | 88.0 | 97.5 | 94.2 | 98.3 | 93.7 | 96.6 |
| Necro-inflammation (focal necrosis, interface hepatitis, confluent necrosis, pericentral necrosis, bridging necrosis, panacinar necrosis) | 83.9 | 98.2 | 49.7 | 81.0 | 37.2 | 94.5 |
| Portal inflammation | 79.2 | 78.5 | 88.4 | 99.7 | 79.2 | 79.9 |
| Immune cells (lymphocytes, plasma cells, macrophages, eosinophils, neutrophils) | 72.4 | 83.6 | 86.9 | 91.8 | 85.2 | 91.8 |
| Bile duct damage | 81.7 | 90.3 | 91.3 | 95.4 | 90.3 | 95.0 |

*Overall accuracy is a standalone metric that measures how well a machine-learning model performs in multiclass classification. It is the ratio of correct predictions to all predictions; for example, for a three-category (A, B, and C) classification task, overall accuracy is the sum of the correct predictions for categories A, B, and C divided by the grand total of predictions.
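As a minimal sketch of the overall-accuracy calculation described in this footnote, the Python snippet below uses a hypothetical three-category confusion matrix. The counts and the names `confusion` and `overall_accuracy` are illustrative assumptions and are not taken from the study.

```python
# Hypothetical 3-category confusion matrix (rows = true category,
# columns = predicted category). Counts are illustrative only and
# do not come from the study.
confusion = [
    [50, 3, 2],   # true A
    [4, 40, 6],   # true B
    [1, 5, 39],   # true C
]


def overall_accuracy(matrix):
    """Correct predictions (the diagonal) divided by the grand total."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


accuracy = overall_accuracy(confusion)
print(f"Overall accuracy: {accuracy:.1%}")
```

Here the diagonal sums to 129 correct predictions out of 150 in total, so the sketch reports an overall accuracy of 86.0%.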

Precision and sensitivity (also called recall) are paired metrics (which means that they cannot be used individually) that measure how well machine-learning models perform in classification tasks. In binary classification, precision is calculated as TP/(TP + FP), and sensitivity is calculated as TP/(TP + FN). In multiclass classification, each category in turn is treated as the positive class, with all other categories combined as the negative class, which yields one binary classification per category. Macro-precision and macro-sensitivity are the arithmetic means (averages) of the individual binary precisions and of the individual binary sensitivities, respectively.
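The one-versus-rest averaging described in this footnote can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the confusion matrix is hypothetical, the function names are invented for the example, and returning 0.0 when a denominator is zero is one common convention that the source does not specify.

```python
# Hypothetical 3-category confusion matrix (rows = true category,
# columns = predicted category). Counts are illustrative only.
confusion = [
    [50, 3, 2],   # true A
    [4, 40, 6],   # true B
    [1, 5, 39],   # true C
]


def one_vs_rest_counts(matrix, k):
    """Return (TP, FP, FN) when category k is the positive class."""
    n = len(matrix)
    tp = matrix[k][k]
    fp = sum(matrix[i][k] for i in range(n) if i != k)  # predicted k, truly other
    fn = sum(matrix[k][j] for j in range(n) if j != k)  # truly k, predicted other
    return tp, fp, fn


def safe_ratio(numerator, denominator):
    # Assumption: report 0.0 when the denominator is zero.
    return numerator / denominator if denominator else 0.0


def macro_precision_and_sensitivity(matrix):
    """Arithmetic means of per-category precision and sensitivity."""
    precisions, sensitivities = [], []
    for k in range(len(matrix)):
        tp, fp, fn = one_vs_rest_counts(matrix, k)
        precisions.append(safe_ratio(tp, tp + fp))      # TP / (TP + FP)
        sensitivities.append(safe_ratio(tp, tp + fn))   # TP / (TP + FN)
    n = len(matrix)
    return sum(precisions) / n, sum(sensitivities) / n


macro_p, macro_s = macro_precision_and_sensitivity(confusion)
print(f"Macro-precision:   {macro_p:.1%}")
print(f"Macro-sensitivity: {macro_s:.1%}")
```

Because every category contributes equally to a macro average regardless of how many samples it has, a rare category that is classified poorly can pull macro-precision or macro-sensitivity well below overall accuracy.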

AI(H), artificial intelligence for hepatitis; FP, false positive; FN, false negative; H&E, hematoxylin and eosin; TP, true positive