Figure 9.

University College London Hospitals (UCLH) test set: quantitative performance of the model in comparison with radiographers. (a) The model achieves a surface Dice similarity coefficient similar to humans in all 21 organs at risk (on the UCLH held out test set) when compared with the gold standard for each organ at an organ-specific tolerance τ. Blue: our model; green: radiographers. (b) Performance difference between the model and the radiographers. Each blue dot represents a model-radiographer pair. The gray area highlights nonsubstantial differences (−5% to +5%). The box extends from the lower to upper quartile values of the data, with a line at the median. The whiskers indicate most extreme, nonoutlier data points. Where data lie outside, an IQR of 1.5 is represented as a circular flier. The notches represent the 95% CI around the median. DSC: Dice similarity coefficient; UCLH: University College London Hospitals.