[Preprint]. 2023 Mar 3:rs.3.rs-2526701. [Version 1] doi: 10.21203/rs.3.rs-2526701/v1

TABLE 2:

Repeatability analysis showing quadratic weighted kappa (QWK) summary statistics: mean with standard deviation (SD), median with interquartile range (IQR), and adjusted linear regression (LR) β value, for the design choices within each design choice category of our automated visual evaluation (AVE) classifier. Rows shaded in salmon indicate design choices filtered out at this stage due to poor repeatability.

Design Choice Category	Design Choice	QWK Mean (SD)	QWK Median (IQR)	Adjusted LR β
Architecture	densenet121	0.743 (0.062)	0.748 (0.719 - 0.786)	−0.016
	resnest50	0.675 (0.069)	0.649 (0.630 - 0.743)	−0.083**
	resnet50	0.752 (0.048)	0.760 (0.736 - 0.776)	−0.018
	SWT	0.743 (0.079)	0.748 (0.671 - 0.815)	ref
Loss Function	Cross Entropy	0.725 (0.069)	0.738 (0.671 - 0.771)	−0.039**
	Focal	0.717 (0.070)	0.730 (0.654 - 0.773)	−0.078**
	QWK	0.779 (0.042)	0.782 (0.752 - 0.809)	ref
	CORAL	0.678 (0.056)	0.649 (0.636 - 0.729)	−0.069**
Balancing strategy	Balanced loss	0.703 (0.107)	0.751 (0.647 - 0.769)	−0.053**
	Balanced sampling	0.729 (0.057)	0.735 (0.675 - 0.781)	−0.046**
	Remove controls	0.775 (0.054)	0.777 (0.744 - 0.809)	ref
	Sampling 1:1:2	0.744 (0.055)	0.758 (0.728 - 0.783)	−0.042**
	Sampling 1:1:4	0.776 (0.033)	0.772 (0.752 - 0.798)	−0.026
	Sampling 2:1:1	0.764 (0.017)	0.762 (0.750 - 0.778)	−0.045
	None	0.706 (0.069)	0.721 (0.638 - 0.749)	−0.019
Dropout	No Dropout	0.663 (0.072)	0.649 (0.620 - 0.723)	−0.088**
	Train Dropout only	0.725 (0.058)	0.738 (0.681 - 0.759)	−0.035**
	Monte Carlo Dropout	0.760 (0.059)	0.772 (0.733 - 0.802)	ref
Multilevel Ground Truth	3 level all patients	0.740 (0.068)	0.752 (0.719 - 0.780)	ref
	3 level subsets	0.707 (0.070)	0.709 (0.637 - 0.778)	−0.026**
	5 level all patients	0.705 (0.064)	0.721 (0.650 - 0.748)	−0.025

SWT: Swin Transformer; CORAL: CORAL (consistent rank logits) loss, as described in the METHODS section; ref: reference category.
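
For readers who want to see how summary statistics of the kind reported in Table 2 can be obtained, the following is a minimal sketch, not the authors' pipeline. It assumes that QWK is computed between two equal-length sets of ordinal class predictions (for example, predictions on repeat images of the same patient) and that the per-model QWK values are then summarized by mean, SD, median and IQR. The function names `quadratic_weighted_kappa` and `summarize` are hypothetical, the labels used to exercise them are illustrative and not taken from the study, and the adjusted LR β column is not reproduced here. scikit-learn's `cohen_kappa_score` with `weights="quadratic"` is an established alternative for the QWK step.

```python
from statistics import mean, median, quantiles, stdev


def quadratic_weighted_kappa(a, b, n_classes):
    """QWK between two equal-length lists of integer labels in 0..n_classes-1."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(a)

    # Observed confusion matrix: observed[i][j] counts pairs labelled (i, j).
    observed = [[0] * n_classes for _ in range(n_classes)]
    for i, j in zip(a, b):
        observed[i][j] += 1

    row_totals = [sum(r) for r in observed]
    col_totals = [sum(observed[i][j] for i in range(n_classes))
                  for j in range(n_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            # Quadratic weight: penalty grows with squared class distance.
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two label sets.
            weighted_expected += w * row_totals[i] * col_totals[j] / n

    if weighted_expected == 0:
        # Degenerate case: both label sets use one and the same class.
        return 1.0
    return 1.0 - weighted_observed / weighted_expected


def summarize(qwk_values):
    """Mean, SD, median and IQR bounds of a list of per-model QWK values."""
    q1, _, q3 = quantiles(qwk_values, n=4, method="inclusive")
    return {
        "mean": mean(qwk_values),
        "sd": stdev(qwk_values),      # sample standard deviation
        "median": median(qwk_values),
        "iqr": (q1, q3),
    }


# Illustrative labels only (3 ordinal classes); not data from the study.
first_read = [0, 0, 1, 1, 2, 2]
second_read = [0, 1, 1, 1, 2, 2]
kappa = quadratic_weighted_kappa(first_read, second_read, n_classes=3)

# Illustrative per-model QWK values for one hypothetical design choice.
summary = summarize([0.70, 0.74, 0.78, 0.76])
```

Here `kappa` holds the agreement between the two illustrative label sets, and `summary` holds the statistics that would populate one row of a table like the one above. The quartiles use the `"inclusive"` method of `statistics.quantiles`; other quartile conventions give slightly different IQR bounds.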