Table 2.
Overall mean performance of each task’s classifiers, measured as mean balanced accuracy and evaluated at the tile level and the slide level.
| Task<sup>a</sup> | ResNet50 performance (mean balanced accuracy) | |
| | Tile level | Slide level (95% CI)<sup>b</sup> |
| 1: Patient age | 76.2% | 87.5% |
| 2: Slide preparation date | | |
| Data set 1: 2015 versus 2017 | 54.1% | 56.1% (52.7% to 59.5%) |
| Data set 1: 2016 versus 2018 | 56.5% | 63.2% (53.4% to 73.0%) |
| Data set 2: 2014 versus 2016 | 69.0% | 82.0% (76.4% to 87.6%) |
| Data set 2: 2015 versus 2017 | 66.6% | 83.5% (80.9% to 86.1%) |
| Data set 2: 2016 versus 2018 | 52.7% | 56.7% (52.6% to 60.7%) |
| 3: Slide origin | 94.2% | 97.9% (97.3% to 98.5%) |
| 4: Scanner type | 100% | 100% |
<sup>a</sup>Test sets for each task had a minimum of 10 slides per class.
<sup>b</sup>Confidence intervals are shown for the decisive criterion (slide level) and are omitted for tasks where no variation at the slide level was observed.
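For readers unfamiliar with the metric, balanced accuracy is the mean of the per-class recall, so each class contributes equally regardless of how many samples it has. The following is a minimal sketch of that definition in Python. The helper `normal_ci` is a hypothetical illustration that assumes a normal-approximation interval over repeated runs; the table does not state how its confidence intervals were computed, and the function names and example labels are not taken from the study.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, stdev


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so each class counts equally."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return mean(correct[c] / total[c] for c in total)


def normal_ci(scores, z=1.96):
    """Assumed method: normal-approximation 95% interval around the
    mean of scores from repeated runs (needs at least two scores)."""
    m = mean(scores)
    half_width = z * stdev(scores) / sqrt(len(scores))
    return m - half_width, m + half_width


# Hypothetical labels: class "a" recall is 2/3, class "b" recall is 1/1.
example_true = ["a", "a", "a", "b"]
example_pred = ["a", "a", "b", "b"]
example_score = balanced_accuracy(example_true, example_pred)
```

In this toy example plain accuracy would be 3/4, whereas balanced accuracy averages the two recalls (2/3 and 1) and so weights the minority class equally.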