
Table 5. Examples of reported inter-annotator disagreement in the included publications.

Please see each included publication for further details on corpus quality.

| Publication | Type of agreement measure | Score, or range between worst and best class |
| --- | --- | --- |
| 43 | Average accuracy between annotators | Range: 0.62 to 0.70 |
| 48 | Agreement rate | 80% |
| 65 | Cohen's Kappa | 0.84 overall, down to 0.59 for the worst class |
| 104 | Cohen's Kappa | Range: 0.41 to 0.71 |
| 75 | Inter-annotation recall | Range: 0.38 to 0.86 |
| 55 | Cohen's Kappa between experts | Range: 0.50 to 0.59 |
| 55 | Macro-averaged worker vs. aggregation precision, recall, and F1 (see publication for full scores) | Range: 0.39 to 0.70 |
| 116 (describes only the PECODR corpus creation; excluded from the review) | Initial agreement between annotators | Range: 85% to 87% |
| 52 | Average and range of agreement | 62%; range: 41% to 71% |
| 58 | Average sentences labelled per abstract, expert vs. student | 1.9 vs. 4.2 |
| 58 | Cohen's Kappa, expert vs. student | 0.42 |
| 61 | Agreement; Cohen's Kappa | 86%; 0.76 |
| 38 | MASI (Measuring Agreement on Set-Valued Items) at article/selection level; Krippendorff's alpha at class level | MASI 0.6, range 0.5 to 0.89; Krippendorff's alpha 0.53 for I and 0.57 for O, ranging from 0.06 to 0.96 across all classes |
| 35 | F1, strict vs. relaxed, at the beginning and end of the annotation phase | Strict 85.6% vs. relaxed 93.9% at the end; the relaxed score increased from 86% at the beginning of the annotation phase to 93.9% at the end |
| 36 | Fleiss' Kappa on 47 abstracts for outcomes and on 30 for relation extraction | Outcomes 0.81; relations 0.62 to 0.72 |
| 63 | B3, MUC, and Constrained Entity-Alignment F-Measure (CEAFe) scores | B3 0.40; MUC 0.46; CEAFe 0.42 |
| 51 | Kappa for entities; F1 for complex entities with sub-classes or relations | Kappa range: 0.68 to 0.74; complex entities 0.81 |
| 37 | Cohen's Kappa of their EBM-NLP adaptation vs. the original dataset | Between 0.53 for P and 0.69 for O |
| 171 | Fleiss' Kappa for expert annotators; percentage of exact overlaps | Fleiss' Kappa 0.77; exact match 92.4% of the time |
| 150 | Mean inter-rater reliability (F1) | Entities: mean 0.86, range 0.72 to 0.92; dependencies: 0.69 |
| 145 | Cohen's Kappa before and after the annotation guideline and scope were redefined for re-annotating EBM-NLP | 0.3 before vs. 0.74 after |
| 179 | Inter-rater reliability | Combined 0.74, range 0.7 to 0.8 |
| 143 | Document-level Cohen's Kappa range; span F1 range; span-level F1 | Document level: 0.74 to 0.83; span F1: 0.92 to 0.95; span-level F1: 0.9 to 0.94 |
| 149 | Randolph's Kappa | PICO range on 15 texts: 0.56 (P entity) to 0.8 (I entity); EvidenceInference corpus: 0.47 |
| 162 | Cohen's Kappa; token-level F1 | Kappa 0.81; F1 0.88 |
| 156 | Cohen's Kappa | 0.8 |
| 134 | Pairwise F1 | 78% |
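
To make the Kappa-style measures in Table 5 concrete, the sketch below computes Cohen's Kappa and Fleiss' Kappa for a small, invented set of PICO sentence labels. It is a minimal illustration only: the toy labels and annotator names are assumptions, it does not reproduce any score from the reviewed publications, and it simply uses the off-the-shelf implementations in scikit-learn and statsmodels.

```python
# Minimal sketch: chance-corrected agreement on toy annotations.
# The labels below are invented for illustration and are NOT taken
# from any of the corpora reviewed in this article.
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Three annotators assigning PICO classes to the same eight sentences.
annotator_a = ["P", "I", "O", "O", "P", "I", "O", "P"]
annotator_b = ["P", "I", "O", "I", "P", "I", "O", "O"]
annotator_c = ["P", "I", "O", "O", "P", "O", "O", "P"]

# Cohen's Kappa: agreement between exactly two annotators, corrected for chance.
print("Cohen's Kappa (A vs. B):", round(cohen_kappa_score(annotator_a, annotator_b), 2))

# Fleiss' Kappa generalises to more than two annotators. It expects a
# subjects-by-categories count table, which aggregate_raters builds from a
# subjects-by-raters label matrix.
ratings = list(zip(annotator_a, annotator_b, annotator_c))
counts, _categories = aggregate_raters(ratings)
print("Fleiss' Kappa (A, B, C):", round(fleiss_kappa(counts), 2))
```

By the commonly cited Landis and Koch convention, Kappa values between roughly 0.61 and 0.80 are often read as substantial agreement, which is the range into which most of the Kappa scores reported in Table 5 fall.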