Table 5. Examples of reported inter-annotator agreement in the included publications.
Please see each included publication for further details on corpus quality.
| Publication | Agreement measure | Score, or range from worst to best class |
|---|---|---|
| 43 | Average accuracy between annotators | Range: 0.62-0.70 |
| 48 | Agreement rate | 80% |
| 65 | Cohen’s kappa | 0.84 overall, down to 0.59 for worst class |
| 104 | Cohen’s kappa | Range: 0.41-0.71 |
| 75 | Inter-annotation recall | Range: 0.38-0.86 |
| 55 | Cohen’s kappa between experts | Range: 0.5-0.59 |
| 55 | Macro-averaged worker vs. aggregation precision, recall, F1 (see publication for full scores) | Range: 0.39-0.70 |
| 116 (describes PECODR corpus creation only; excluded from review) | Initial agreement between annotators | Range: 85-87% |
| 52 | Average and range of agreement | 62%, range 41-71% |
| 58 | Avg. sentences labelled by expert vs. student per abstract | 1.9 vs. 4.2 |
| 58 | Cohen’s kappa, expert vs. student | 0.42 |
| 61 | Agreement; Cohen’s kappa | 86%; 0.76 |
| 38 | MASI (Measuring Agreement on Set-valued Items) at article/selection level; Krippendorff’s alpha at class level | MASI 0.6 (range 0.5-0.89); Krippendorff’s alpha 0.53 for I and 0.57 for O, ranging from 0.06 to 0.96 across all classes |
| 35 | Strict vs. relaxed F1 at the beginning and end of the annotation phase | Strict 85.6% vs. relaxed 93.9% at the end; the relaxed score rose from 86% at the beginning to 93.9% at the end |
| 36 | Fleiss’ kappa on 47 abstracts for outcomes and on 30 for relation extraction | Outcomes 0.81; relations 0.62-0.72 |
| 63 | B3, MUC, and Constrained Entity-Alignment F-Measure (CEAFe) scores | B3 0.40; MUC 0.46; CEAFe 0.42 |
| 51 | Kappa for entities and F1 for complex entities with sub-classes or relations | Kappa range 0.68-0.74; complex entities 0.81 |
| 37 | Cohen’s kappa of their EBM-NLP adaptation vs. the original dataset | Range: 0.53 (P) to 0.69 (O) |
| 171 | Fleiss’ kappa for expert annotators; percentage of exact overlaps | Fleiss’ kappa 0.77; exact match 92.4% of the time |
| 150 | Mean inter-rater reliability F1 | Entities: mean 0.86 (range 0.72-0.92); dependencies: 0.69 |
| 145 | Cohen’s kappa before and after the annotation guidelines and scope were redefined for re-annotating EBM-NLP | 0.3 before vs. 0.74 after |
| 179 | Inter-rater reliability | Combined 0.74, range 0.7-0.8 |
| 143 | Document-level Cohen’s kappa range, span F1 range, span-level F1 range | Document level 0.74-0.83; span F1 0.92-0.95; span-level F1 0.9-0.94 |
| 149 | Randolph’s kappa range over PICO classes on 15 texts | 0.56 (P entity) to 0.8 (I entity); EvidenceInference corpus 0.47 |
| 162 | Cohen’s kappa, token-level F1 | Kappa 0.81, F1 0.88 |
| 156 | Cohen’s kappa | 0.8 |
| 134 | Pairwise F1 | 78% |
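
Most of the values in Table 5 are chance-corrected agreement coefficients. As a rough, illustrative sketch of how such coefficients are typically obtained (not the procedure used by any of the included publications), the Python snippet below computes Cohen’s kappa for two annotators and Fleiss’ kappa for three annotators using scikit-learn and statsmodels; the sentence-level labels are invented toy data, not drawn from any reviewed corpus.

```python
# Minimal sketch of how chance-corrected agreement scores like those in Table 5
# are typically computed. The labels below are toy data, not from any reviewed corpus.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy sentence-level PICO labels from two annotators (N = no PICO element).
annotator_a = ["P", "P", "I", "O", "N", "I", "P", "N", "O", "I"]
annotator_b = ["P", "I", "I", "O", "N", "I", "P", "O", "O", "I"]

# Cohen's kappa: pairwise agreement corrected for chance (e.g. refs 65, 104, 162).
print(f"Cohen's kappa (A vs. B): {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Fleiss' kappa: generalisation to more than two raters (e.g. refs 36, 171).
annotator_c = ["P", "P", "I", "O", "N", "I", "I", "N", "O", "I"]
ratings = np.array([annotator_a, annotator_b, annotator_c]).T  # (n_items, n_raters)
counts, _categories = aggregate_raters(ratings)                # (n_items, n_categories)
print(f"Fleiss' kappa (A, B, C): {fleiss_kappa(counts):.2f}")
```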