Table 3.
NER results on Task 2: predicting label sequences on unannotated text with models trained on crowd labels. Rows 1–4 train the predictive model directly on individual crowd labels, while Rows 5–8 first aggregate the crowd labels and then train the model on the induced consensus labels. The last row gives an upper bound obtained by training on gold labels. LSTM-Crowd and LSTM-Crowd-cat are described in Section 3.
| Method | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| CRF-MA (Rodrigues et al., 2014) | 49.40 | 85.60 | 62.60 |
| LSTM (Lample et al., 2016) | 83.19 | 57.12 | 67.73 |
| LSTM-Crowd | 82.38 | 62.10 | 70.82 |
| LSTM-Crowd-cat | 79.61 | 62.87 | 70.26 |
| Majority Vote then CRF | 45.50 | 80.90 | 58.20 |
| Dawid-Skene then LSTM | 72.30 | 61.17 | 66.27 |
| HMM-Crowd then CRF | 77.40 | 61.40 | 68.50 |
| HMM-Crowd then LSTM | 76.19 | 66.24 | 70.87 |
| LSTM on Gold Labels (upper bound) | 85.27 | 83.19 | 84.22 |
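
As a rough illustration of the aggregate-then-train pipeline in Rows 5–8, the sketch below shows per-token majority voting, the simplest aggregator and the one underlying the "Majority Vote then CRF" baseline. The function name and the assumption that all annotators' label sequences are aligned to the same tokens are illustrative choices, not the paper's actual preprocessing.

```python
from collections import Counter

def majority_vote(annotations):
    """Induce consensus labels by per-token majority vote.

    annotations: list of label sequences, one per annotator, all aligned
    to the same token sequence (an assumed input format). Tokens an
    annotator did not label can be given as None and are ignored.
    Returns a single consensus label sequence.
    """
    consensus = []
    for token_labels in zip(*annotations):
        votes = Counter(label for label in token_labels if label is not None)
        consensus.append(votes.most_common(1)[0][0])
    return consensus

# Example: three annotators labeling the same four-token sentence.
crowd = [
    ["B-PER", "I-PER", "O", "O"],
    ["B-PER", "O",     "O", "B-LOC"],
    ["B-PER", "I-PER", "O", "B-LOC"],
]
print(majority_vote(crowd))  # ['B-PER', 'I-PER', 'O', 'B-LOC']
```

The consensus sequence produced this way would then be treated as ordinary training data for the downstream CRF or LSTM tagger; the probabilistic aggregators (Dawid-Skene, HMM-Crowd) replace the simple vote with annotator-reliability models but feed the tagger in the same way.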