Table 5. Examples of reported inter-annotator agreement in the included publications.
Please see each included publication for further details on corpus quality.
| Publication | Agreement measure | Score, or range from worst to best class |
|---|---|---|
| 43 | Average accuracy between annotators | Range: 0.62-0.70 |
| 48 | Agreement rate | 80% |
| 65 | Cohen’s kappa | 0.84 overall, down to 0.59 for worst class |
| 104 | Cohen’s kappa | Range: 0.41-0.71 |
| 75 | Inter-annotation recall | Range: 0.38-0.86 |
| 55 | Cohen’s kappa between experts | Range: 0.5-0.59 |
| 55 | Macro-averaged worker vs. aggregation precision, recall, F1 (see publication for full scores) | Range: 0.39-0.70 |
| 116 (describes PECODR corpus creation only; excluded from review) | Initial agreement between annotators | Range: 85-87% |
| 52 | Average and range of agreement | 62%, range 41-71% |
| 58 | Avg. sentences labelled by expert vs. student per abstract | 1.9 vs. 4.2 |
| 58 | Cohen’s kappa, expert vs. student | 0.42 |
| 61 | Agreement; Cohen’s kappa | 86%; 0.76 |
| 38 | MASI (Measuring Agreement on Set-valued Items) at article/selection level; Krippendorff’s alpha at class level | MASI 0.6 (range 0.5-0.89); Krippendorff’s alpha 0.53 for I and 0.57 for O, ranging from 0.06 to 0.96 across all classes |
| 35 | Strict vs. relaxed F1 at the beginning and end of the annotation phase | Strict 85.6% vs. relaxed 93.9% at the end; the relaxed score rose from 86% at the beginning to 93.9% at the end |
| 36 | Fleiss’ kappa on 47 abstracts for outcomes and on 30 for relation extraction | Outcomes 0.81; relations 0.62-0.72 |
| 63 | B3, MUC, and Constrained Entity-Alignment F-Measure (CEAFe) scores | B3 0.40; MUC 0.46; CEAFe 0.42 |
| 51 | Kappa for entities and F1 for complex entities with sub-classes or relations | Kappa range 0.68-0.74; complex entities 0.81 |
| 37 | Cohen’s kappa of their EBM-NLP adaptation vs. the original dataset | Range: 0.53 (P) to 0.69 (O) |
| 171 | Fleiss’ kappa for expert annotators; percentage of exact overlaps | Fleiss’ kappa 0.77; exact match 92.4% of the time |
| 150 | Mean inter-rater reliability F1 | Entities: mean 0.86 (range 0.72-0.92); dependencies: 0.69 |
| 145 | Cohen’s kappa before and after the annotation guidelines and scope were redefined for re-annotating EBM-NLP | 0.3 before vs. 0.74 after |
| 179 | Inter-rater reliability | Combined 0.74, range 0.7-0.8 |
| 143 | Document-level Cohen’s kappa range, span F1 range, span-level F1 range | Document level 0.74-0.83; span F1 0.92-0.95; span-level F1 0.9-0.94 |
| 149 | Randolph’s kappa range over PICO classes on 15 texts | 0.56 (P entity) to 0.8 (I entity); EvidenceInference corpus 0.47 |
| 162 | Cohen’s kappa, token-level F1 | Kappa 0.81, F1 0.88 |
| 156 | Cohen’s kappa | 0.8 |
| 134 | Pairwise F1 | 78% |
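
Most of the values in Table 5 are chance-corrected agreement coefficients. As a rough, illustrative sketch of how such coefficients are typically obtained (not the procedure used by any of the included publications), the Python snippet below computes Cohen’s kappa for two annotators and Fleiss’ kappa for three annotators using scikit-learn and statsmodels; the sentence-level labels are invented toy data, not drawn from any reviewed corpus.

```python
# Minimal sketch of how chance-corrected agreement scores like those in Table 5
# are typically computed. The labels below are toy data, not from any reviewed corpus.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy sentence-level PICO labels from two annotators (N = no PICO element).
annotator_a = ["P", "P", "I", "O", "N", "I", "P", "N", "O", "I"]
annotator_b = ["P", "I", "I", "O", "N", "I", "P", "O", "O", "I"]

# Cohen's kappa: pairwise agreement corrected for chance (e.g. refs 65, 104, 162).
print(f"Cohen's kappa (A vs. B): {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Fleiss' kappa: generalisation to more than two raters (e.g. refs 36, 171).
annotator_c = ["P", "P", "I", "O", "N", "I", "I", "N", "O", "I"]
ratings = np.array([annotator_a, annotator_b, annotator_c]).T  # (n_items, n_raters)
counts, _categories = aggregate_raters(ratings)                # (n_items, n_categories)
print(f"Fleiss' kappa (A, B, C): {fleiss_kappa(counts):.2f}")
```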