2024 Jun 15;485(6):1095–1105. doi: 10.1007/s00428-024-03841-5

Table 3.

AI(H) performance on inflammation-focused tasks, H&E slides

Model	Overall accuracy (%)^*		Macro-precision (%)^†		Macro-sensitivity (macro-recall) (%)^†
	Test	Training	Test	Training	Test	Training
Tissue detection	99.4	99.9	99.7	98.9	99.3	99.5
Microanatomy (portal area, lobular area, central vein)	88.0	97.5	94.2	98.3	93.7	96.6
Necro-inflammation (focal necrosis, interface hepatitis, confluent necrosis, pericentral necrosis, bridging necrosis, panacinar necrosis)	83.9	98.2	49.7	81.0	37.2	94.5
Portal inflammation	79.2	78.5	88.4	99.7	79.2	79.9
Immune cells (lymphocytes, plasma cells, macrophages, eosinophils, neutrophils)	72.4	83.6	86.9	91.8	85.2	91.8
Bile duct damage	81.7	90.3	91.3	95.4	90.3	95.0

^*Overall accuracy is a standalone metric that measures how well a machine-learning model performs in multiclass classification. It is the proportion of correct predictions; for example, for a three-category classification task (categories A, B, and C), overall accuracy is the sum of correct predictions for categories A, B, and C divided by the total number of predictions.
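The overall accuracy calculation described in this footnote can be sketched in a few lines of Python. This is an illustrative sketch only: the function name `overall_accuracy`, the confusion-matrix layout (rows are true categories, columns are predicted categories), and the counts are all assumptions made for the example, not code or data from the study.

```python
def overall_accuracy(confusion):
    """Return correct predictions divided by all predictions.

    confusion[i][j] is the number of samples whose true category is i
    and whose predicted category is j (an assumed layout).
    """
    # Correct predictions sit on the diagonal of the matrix.
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    if total == 0:
        raise ValueError("confusion matrix contains no predictions")
    return correct / total


# Hypothetical counts for categories A, B, and C (not data from the study).
confusion = [
    [50, 3, 2],   # true A
    [5, 40, 5],   # true B
    [0, 10, 35],  # true C
]

print(f"overall accuracy: {overall_accuracy(confusion):.1%}")
```

With these hypothetical counts, 125 of 150 predictions are correct, so the sketch reports an overall accuracy of 83.3%.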

^†Precision and sensitivity (also called recall) are paired metrics (that is, they are not meant to be used individually) that measure how well a machine-learning model performs in classification tasks. In binary classification, precision is calculated as TP/(TP + FP), and sensitivity is calculated as TP/(TP + FN). In multiclass classification, each category in turn is treated as the positive class, with all other categories combined as the negative class, which yields one binary classification per category. Macro-precision and macro-sensitivity are the arithmetic means (averages) of the individual binary precisions and of the individual binary sensitivities, respectively.
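The one-versus-rest averaging described in this footnote could be computed along the following lines. This is a minimal sketch under stated assumptions: the function name `macro_precision_and_sensitivity`, the confusion-matrix layout (rows are true categories, columns are predicted categories), and the counts are hypothetical, and a category with an undefined ratio (zero denominator) is scored as 0.0, which is a convention chosen here rather than one stated in the source.

```python
def macro_precision_and_sensitivity(confusion):
    """Return (macro-precision, macro-sensitivity) for a square matrix.

    confusion[i][j] is the number of samples whose true category is i
    and whose predicted category is j (an assumed layout).
    """
    n = len(confusion)
    precisions, sensitivities = [], []
    for k in range(n):
        # Treat category k as positive and all other categories as negative.
        tp = confusion[k][k]
        fp = sum(confusion[i][k] for i in range(n)) - tp  # predicted k, truly other
        fn = sum(confusion[k]) - tp                       # truly k, predicted other
        # Assumption: an undefined ratio (zero denominator) counts as 0.0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        sensitivities.append(tp / (tp + fn) if tp + fn else 0.0)
    # Macro averaging: unweighted mean over the per-category values.
    return sum(precisions) / n, sum(sensitivities) / n


# Hypothetical counts for categories A, B, and C (not data from the study).
example_counts = [
    [50, 3, 2],   # true A
    [5, 40, 5],   # true B
    [0, 10, 35],  # true C
]

macro_p, macro_s = macro_precision_and_sensitivity(example_counts)
print(f"macro-precision: {macro_p:.1%}, macro-sensitivity: {macro_s:.1%}")
```

Because the mean is unweighted, every category contributes equally regardless of how many samples it has, so a rare category with poor sensitivity lowers macro-sensitivity more than it lowers overall accuracy.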

AI(H), artificial intelligence for hepatitis; FP, false positive; FN, false negative; H&E, hematoxylin and eosin; TP, true positive