Table 4.
Biomedical IE results for Task 1: aggregating sequential crowd labels to induce consensus labels. Rows 1–3 are non-sequential baselines. Results are averaged over 100 bootstrap resamples. Because this dataset has fewer gold labels for evaluation, we also report the standard deviation of F1 (std).
| Method | Precision | Recall | F1 | std |
|---|---|---|---|---|
| Majority Vote | 91.89 | 48.03 | 63.03 | 2.6 |
| MACE | 45.01 | 88.49 | 59.63 | 1.7 |
| Dawid-Skene | 77.85 | 66.77 | 71.84 | 1.7 |
| Dawid-Skene then HMM | 72.49 | 58.77 | 64.86 | 2.0 |
| ID HMM (Huang et al., 2015) | 78.99 | 68.10 | 73.11 | 1.9 |
| HMM-Crowd | 72.81 | 75.14 | 73.93 | 1.8 |
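The bootstrap procedure behind the averaged scores and the std column can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: it assumes binary per-item labels, resampling of individual items with replacement, and the sample standard deviation, none of which the caption specifies. The function names `precision_recall_f1` and `bootstrap_scores` are hypothetical.

```python
import random
import statistics


def precision_recall_f1(gold, pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for g, p in zip(gold, pred) if g and p)
    fp = sum(1 for g, p in zip(gold, pred) if not g and p)
    fn = sum(1 for g, p in zip(gold, pred) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_scores(gold, pred, n_resamples=100, seed=0):
    """Average P/R/F1 over bootstrap resamples and report the std of F1.

    Assumption: the resampling unit is a single labeled item; the paper
    may resample a different unit (for example, documents or spans).
    """
    rng = random.Random(seed)
    n = len(gold)
    scores = []
    for _ in range(n_resamples):
        # Draw n indices with replacement, then score that resample.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(
            precision_recall_f1([gold[i] for i in idx], [pred[i] for i in idx])
        )
    precisions, recalls, f1s = zip(*scores)
    return {
        "precision": statistics.mean(precisions),
        "recall": statistics.mean(recalls),
        "f1": statistics.mean(f1s),
        "std": statistics.stdev(f1s),  # sample standard deviation of F1
    }


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    gold = [1, 0, 1, 1, 0, 0, 1, 0]
    pred = [1, 0, 0, 1, 1, 0, 1, 0]
    print(bootstrap_scores(gold, pred))
```

Because each column is averaged separately over resamples, the reported F1 need not equal the harmonic mean of the reported precision and recall.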