Figure 8.
Effect of IAR-based label refinement on the agreement (κ) between CNN classifier outputs and the three original human annotation tracks (gray lines), evaluated on held-out test data and plotted as a function of the number of IAR iterations. The blue line shows the mean pairwise agreement between the classifier and the three human annotations; the dotted black line shows Fleiss’ κ across all human annotators. Refining the training labels clearly improves classifier agreement with the human annotations, which reaches the human-to-human agreement level.
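As an illustration of the two agreement measures plotted here, the following is a minimal sketch, not the authors' implementation. It assumes that the pairwise κ is Cohen's κ (the caption does not say which variant was used) and that each annotation track is a sequence of categorical labels of equal length. The function names `cohen_kappa`, `fleiss_kappa`, and `mean_pairwise_kappa` are hypothetical.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa between two label sequences of equal length."""
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    po = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from the two raters' marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[k] * cb[k] for k in ca) / (n * n)
    # Kappa is undefined when chance agreement is 1 (both raters constant).
    return float("nan") if pe == 1.0 else (po - pe) / (1.0 - pe)


def fleiss_kappa(tracks):
    """Fleiss' kappa across several raters; tracks[r][i] is rater r's label for item i."""
    n_raters = len(tracks)
    items = list(zip(*tracks))  # one tuple of labels per item
    n_items = len(items)
    counts = [Counter(item) for item in items]
    # Mean per-item agreement among rater pairs.
    p_bar = sum(
        (sum(c * c for c in cnt.values()) - n_raters) / (n_raters * (n_raters - 1))
        for cnt in counts
    ) / n_items
    # Chance agreement from the pooled label proportions.
    totals = Counter()
    for cnt in counts:
        totals.update(cnt)
    pe = sum((t / (n_items * n_raters)) ** 2 for t in totals.values())
    return float("nan") if pe == 1.0 else (p_bar - pe) / (1.0 - pe)


def mean_pairwise_kappa(pred, tracks):
    """Mean Cohen's kappa between classifier output and each human track (assumed meaning of the blue line)."""
    return sum(cohen_kappa(pred, t) for t in tracks) / len(tracks)
```

Under these assumptions, `mean_pairwise_kappa(classifier_labels, human_tracks)` would be computed once per IAR iteration, while `fleiss_kappa(human_tracks)` gives a single reference value that does not depend on the iteration.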
