2020 Nov 30;3(11):e2027426. doi: 10.1001/jamanetworkopen.2020.27426

Table 1. Autosegmentation Performance on 3 Head and Neck Data Sets.

Data set	Dice score, mean (SD)
	Brainstem	Mandible	Spinal cord	Left globe	Right globe	Left parotid	Right parotid	Left SMG	Right SMG
IOV-10^a
Annotator 1	89.3 (4.2)	98.6 (1.0)	92.9 (1.5)	96.4 (0.9)	96.5 (1.1)	92.7 (3.5)	92.7 (3.5)	92.3 (3.4)	92.3 (2.6)
Annotator 2	91.8 (2.0)	98.5 (0.5)	91.8 (2.3)	95.6 (1.3)	96.7 (1.1)	91.1 (4.3)	91.2 (3.7)	91.3 (4.7)	91.3 (5.4)
Annotator 3	89.6 (2.7)	96.9 (1.0)	81.9 (7.3)	96.5 (0.8)	95.7 (1.0)	88.2 (3.8)	90.1 (2.8)	91.6 (2.8)	90.3 (8.0)
Ensemble	88.5 (2.0)	97.0 (1.0)	87.7 (3.6)	94.8 (1.0)	94.5 (1.9)	88.5 (2.3)	87.8 (4.1)	87.0 (2.9)	85.1 (5.3)
Agreement between annotators, κ	0.831	0.971	0.836	0.927	0.939	0.838	0.845	0.848	0.836
Agreement between annotators and model	0.806	0.966	0.844	0.917	0.931	0.852	0.825	0.803	0.794
Main data set, ensemble^b	85.0 (3.7)	95.7 (2.3)	84.0 (3.8)	92.9 (1.6)	93.1 (1.5)	87.9 (3.8)	87.8 (4.3)	87.5 (2.3)	86.7 (3.5)
External data set, ensemble^c	84.9 (6.8)	93.8 (2.5)	80.3 (7.7)	92.7 (3.6)	93.3 (1.4)	84.3 (4.6)	84.5 (4.3)	83.3 (9.1)	78.2 (21.1)
External data set,^c Nikolov et al¹⁵	79.1 (9.6)	93.8 (1.6)	80.0 (7.8)	91.5 (2.1)	92.1 (1.9)	83.2 (5.4)	84.0 (3.7)	80.3 (7.8)	76.0 (16.5)
External data set, radiographer^c	89.5 (2.2)	93.9 (2.3)	84.0 (4.8)	92.9 (1.9)	93.0 (1.7)	86.7 (3.5)	87.0 (3.1)	83.3 (19.7)	74.9 (30.2)

Abbreviations: IOV, interobserver variability; SMG, submandibular glands.

^a The IOV-10 data set included 10 images. In the IOV study, a subset of the main data set was annotated multiple times by 2 radiation oncologists and a trained reader. Later, the proposed model was compared against each human expert. Statistical agreement between annotators, and between annotators and the model, was measured with Fleiss κ values.

^b Main data set included 20 images.

^c External data set included 26 images. For the external data set, the reference ground truth contours were delineated by an expert head and neck oncologist, and IOV between clinical experts was measured by comparing the reference contours with those produced by an experienced radiographer.¹⁵
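
For readers who want to reproduce figures of this kind, the following is a minimal sketch of the two metrics reported in Table 1: the Dice score between two segmentation masks (as a percentage, matching the table) and Fleiss κ across several raters. It is not the authors' code. The function names `dice_score` and `fleiss_kappa_binary` are hypothetical, and the sketch assumes that each voxel is treated as one subject rated as inside or outside the structure, which the table notes do not specify.

```python
import numpy as np


def dice_score(mask_a, mask_b):
    """Dice overlap between two binary masks, returned as a percentage."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    total = a.sum() + b.sum()
    if total == 0:
        # Both masks empty: treated here as perfect overlap (a convention,
        # not something stated in the source).
        return 100.0
    return 100.0 * 2.0 * np.logical_and(a, b).sum() / total


def fleiss_kappa_binary(masks):
    """Fleiss kappa for binary masks from several raters.

    Assumption: every voxel is one subject and every rater assigns it
    one of two categories (background or structure).
    """
    ratings = np.stack([np.asarray(m, dtype=bool).ravel() for m in masks])
    n_raters = ratings.shape[0]
    if n_raters < 2:
        raise ValueError("Fleiss kappa needs at least 2 raters")

    # counts[i, j] = number of raters assigning voxel i to category j
    positive = ratings.sum(axis=0)
    counts = np.stack([n_raters - positive, positive], axis=1).astype(float)

    # Overall proportion of assignments falling in each category
    p_j = counts.sum(axis=0) / counts.sum()
    # Per-voxel agreement among raters
    p_i = ((counts ** 2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))

    p_bar = p_i.mean()          # mean observed agreement
    p_e = (p_j ** 2).sum()      # agreement expected by chance
    if p_e == 1.0:
        # Every rater used a single category everywhere; kappa is undefined,
        # and this sketch returns 1.0 by convention.
        return 1.0
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    # Toy 1D "masks" standing in for 3D contours
    reader_1 = [1, 1, 1, 0, 0, 0]
    reader_2 = [1, 1, 0, 0, 0, 0]
    reader_3 = [1, 1, 1, 1, 0, 0]
    print("Dice (reader 1 vs reader 2):", dice_score(reader_1, reader_2))
    print("Fleiss kappa (3 readers):",
          fleiss_kappa_binary([reader_1, reader_2, reader_3]))
```

Under these assumptions, the "agreement between annotators" row would correspond to calling `fleiss_kappa_binary` on the three annotators' masks, and the "annotators and model" row to adding the ensemble mask as a further rater.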