. 2020 Nov 22;62:103121. doi: 10.1016/j.ebiom.2020.103121

Table 4.

Comparison of model performance with that of subspecialists and junior radiologists evaluating uncropped images from the external testing data, overall and stratified by age group. For Cohen's kappa scores and categorical accuracy, 95% confidence intervals were generated using 10,000 bootstrap samples. Permutation tests with 10,000 iterations were used to calculate p-values.

Subgroup	Reader	Accuracy	Cohen's κ (95% CI)	Difference in κ, rater minus model (95% CI)	p-value
Total (n = 291)	Model	73•4%	0•560 (0•481, 0•639)
	Rater 1	69•3%	0•483 (0•394, 0•567)	-0•077 (-0•180, 0•021)	0•14
	Rater 2	73•4%	0•553 (0•468, 0•634)	-0•007 (-0•112, 0•096)	0•89
	Rater 3	73•1%	0•555 (0•472, 0•633)	-0•005 (-0•115, 0•103)	0•93
	Rater 4	67•9%	0•430 (0•340, 0•519)	-0•130 (-0•240, -0•020)	0•02
	Rater 5	63•4%	0•367 (0•285, 0•449)	-0•193 (-0•293, -0•093)	0•0005
Age (<10, n = 97)	Model	74•2%	0•383 (0•210, 0•542)
	Rater 1	79•4%	0•478 (0•278, 0•655)	0•095 (-0•128, 0•314)	0•41
	Rater 2	79•4%	0•515 (0•334, 0•678)	0•132 (-0•080, 0•343)	0•23
	Rater 3	79•4%	0•535 (0•367, 0•695)	0•152 (-0•080, 0•393)	0•25
	Rater 4	80•4%	0•448 (0•239, 0•637)	0•065 (-0•177, 0•314)	0•61
	Rater 5	69•1%	0•229 (0•064, 0•390)	-0•154 (-0•341, 0•017)	0•11
Age (10-24, n = 97)	Model	77•3%	0•630 (0•498, 0•755)
	Rater 1	70•1%	0•496 (0•336, 0•640)	-0•134 (-0•311, 0•038)	0•13
	Rater 2	72•2%	0•538 (0•392, 0•676)	-0•092 (-0•261, 0•075)	0•28
	Rater 3	77•3%	0•618 (0•473, 0•749)	-0•012 (-0•183, 0•156)	0•88
	Rater 4	69•1%	0•450 (0•291, 0•596)	-0•180 (-0•352, -0•011)	0•045
	Rater 5	52•6%	0•217 (0•085, 0•354)	-0•413 (-0•576, -0•246)	<1•0e-6
Age (>24, n = 97)	Model	68•8%	0•514 (0•366, 0•648)
	Rater 1	58•3%	0•386 (0•250, 0•521)	-0•128 (-0•304, 0•047)	0•15
	Rater 2	68•8%	0•526 (0•385, 0•660)	0•012 (-0•178, 0•200)	0•89
	Rater 3	62•5%	0•413 (0•263, 0•556)	-0•101 (-0•294, 0•093)	0•31
	Rater 4	54•2%	0•282 (0•132, 0•429)	-0•232 (-0•426, -0•033)	0•025
	Rater 5	68•8%	0•479 (0•345, 0•608)	-0•035 (-0•198, 0•137)	0•71

Raters 1 and 2 are subspecialists; raters 3-5 are junior radiologists.
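
The caption names two resampling procedures: bootstrap confidence intervals for Cohen's κ and permutation tests for the difference in κ between a rater and the model. The sketch below illustrates one common way to compute both. It is not the authors' code. The function names, the percentile bootstrap, the paired label-swapping permutation scheme, and the (count + 1) / (n + 1) p-value convention are all assumptions made for illustration. The difference is taken as rater minus model, which matches the sign of the values in the table.

```python
import numpy as np


def cohen_kappa(y_true, y_pred):
    """Unweighted Cohen's kappa between two label vectors."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    labels = np.union1d(y_true, y_pred)
    p_o = np.mean(y_true == y_pred)  # observed agreement
    # Chance agreement from the marginal label frequencies.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in labels)
    if p_e == 1.0:
        return 0.0  # kappa is undefined here; 0.0 is a convention chosen for this sketch
    return (p_o - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(y_true, y_pred, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for kappa, resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        stats[b] = cohen_kappa(y_true[idx], y_pred[idx])
    lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)


def permutation_test_kappa_diff(y_true, pred_model, pred_rater,
                                n_perm=10_000, seed=0):
    """Two-sided paired permutation test for kappa(rater) - kappa(model).

    Under the null hypothesis the two readers are exchangeable, so their
    predictions are swapped case by case with probability 0.5 (assumed scheme).
    """
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    pred_model, pred_rater = np.asarray(pred_model), np.asarray(pred_rater)
    n = len(y_true)
    observed = cohen_kappa(y_true, pred_rater) - cohen_kappa(y_true, pred_model)
    count = 0
    for _ in range(n_perm):
        swap = rng.random(n) < 0.5
        perm_model = np.where(swap, pred_rater, pred_model)
        perm_rater = np.where(swap, pred_model, pred_rater)
        diff = cohen_kappa(y_true, perm_rater) - cohen_kappa(y_true, perm_model)
        if abs(diff) >= abs(observed) - 1e-12:
            count += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (count + 1) / (n_perm + 1)
    return float(observed), float(p_value)
```

With this convention the smallest attainable p-value for 10,000 iterations is about 1e-4, so a reported value such as <1•0e-6 would imply a different p-value estimator than the one assumed here.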