Table 3.
AI(H) performance on inflammation-focused tasks, H&E slides
| Model | Overall accuracy (%)*, test | Overall accuracy (%)*, training | Macro-precision (%)†, test | Macro-precision (%)†, training | Macro-sensitivity (macro recall) (%)†, test | Macro-sensitivity (macro recall) (%)†, training |
|---|---|---|---|---|---|---|
| Tissue detection | 99.4 | 99.9 | 99.7 | 98.9 | 99.3 | 99.5 |
| Microanatomy (portal area, lobular area, central vein) | 88.0 | 97.5 | 94.2 | 98.3 | 93.7 | 96.6 |
| Necro-inflammation (focal necrosis, interface hepatitis, confluent necrosis, pericentral necrosis, bridging necrosis, panacinar necrosis) | 83.9 | 98.2 | 49.7 | 81.0 | 37.2 | 94.5 |
| Portal inflammation | 79.2 | 78.5 | 88.4 | 99.7 | 79.2 | 79.9 |
| Immune cells (lymphocytes, plasma cells, macrophages, eosinophils, neutrophils) | 72.4 | 83.6 | 86.9 | 91.8 | 85.2 | 91.8 |
| Bile duct damage | 81.7 | 90.3 | 91.3 | 95.4 | 90.3 | 95.0 |
*Overall accuracy is a standalone metric that measures how well machine-learning models perform in multiclass classification. It denotes the proportion of correct predictions; for example, for a three-category (categories A, B, and C) classification task, overall accuracy is calculated as the sum of correct predictions on categories A, B, and C divided by the total number of predictions.
†Precision and sensitivity (also called recall) are paired metrics (which means that they cannot be used individually) that measure how well machine-learning models perform in classification tasks. In binary classification, precision is calculated as TP/(TP + FP), and sensitivity is calculated as TP/(TP + FN). In multiclass classification, each category in turn forms the positive class (with all other categories combined as the negative class), which yields several binary classifications. Macro-precision and macro-sensitivity are the arithmetic means (averages) of the individual binary precisions and of the individual binary sensitivities, respectively.
AI(H), artificial intelligence for hepatitis; FP, false positive; FN, false negative; H&E, hematoxylin and eosin; TP, true positive
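
The three metrics defined in the footnotes can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes a confusion matrix whose rows are true categories and whose columns are predicted categories, and it assumes that a category with a zero denominator contributes 0 to the macro average. The function names and the matrix `example_cm` are hypothetical, and the counts are made up for illustration, not taken from the study.

```python
from typing import List

Matrix = List[List[int]]  # rows = true category, columns = predicted category (assumed layout)


def overall_accuracy(cm: Matrix) -> float:
    """Correct predictions (the diagonal) divided by the total number of predictions."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return correct / total


def macro_precision(cm: Matrix) -> float:
    """Arithmetic mean of the one-vs-rest precisions, TP / (TP + FP)."""
    n = len(cm)
    values = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # TP + FP (column sum)
        # Assumption: a category that is never predicted contributes 0.
        values.append(tp / predicted_k if predicted_k else 0.0)
    return sum(values) / n


def macro_sensitivity(cm: Matrix) -> float:
    """Arithmetic mean of the one-vs-rest sensitivities, TP / (TP + FN)."""
    n = len(cm)
    values = []
    for k in range(n):
        tp = cm[k][k]
        actual_k = sum(cm[k])  # TP + FN (row sum)
        # Assumption: a category with no true cases contributes 0.
        values.append(tp / actual_k if actual_k else 0.0)
    return sum(values) / n


# Hypothetical three-category example (categories A, B, and C).
example_cm = [
    [50, 5, 5],    # true A
    [10, 30, 10],  # true B
    [0, 5, 5],     # true C
]

print(f"Overall accuracy:  {overall_accuracy(example_cm):.3f}")
print(f"Macro-precision:   {macro_precision(example_cm):.3f}")
print(f"Macro-sensitivity: {macro_sensitivity(example_cm):.3f}")
```

Because the macro averages weight every category equally regardless of how many cases it contains, they can differ noticeably from overall accuracy when categories are of unequal size, as they do in this made-up example.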