Author manuscript; available in PMC: 2017 Oct 30.

Published in final edited form as: Proc Conf Assoc Comput Linguist Meet. 2017;2017:299–309. doi: 10.18653/v1/P17-1028

Table 4.

Biomedical IE results for Task 1: aggregating sequential crowd labels to induce consensus labels. Rows 1–3 are non-sequential baselines. Results are averaged over 100 bootstrap re-samples. Because this dataset has fewer gold labels for evaluation, we also report the standard deviation of F1 (std).

Method	Precision	Recall	F1	std
Majority Vote	91.89	48.03	63.03	2.6
MACE	45.01	88.49	59.63	1.7
Dawid-Skene	77.85	66.77	71.84	1.7

Dawid-Skene then HMM	72.49	58.77	64.86	2.0
ID HMM (Huang et al., 2015)	78.99	68.10	73.11	1.9
HMM-Crowd	72.81	75.14	73.93	1.8
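
The caption states that results are averaged over 100 bootstrap re-samples and that the standard deviation of F1 is reported. A minimal sketch of such a procedure follows. It assumes binary per-item labels and resampling at the item level; the function names `prf1` and `bootstrap_f1` are hypothetical, and the paper's actual evaluation unit and scoring may differ.

```python
import random
import statistics


def prf1(gold, pred):
    """Precision, recall, and F1 for binary labels (1 = positive)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def bootstrap_f1(gold, pred, n_resamples=100, seed=0):
    """Mean and sample standard deviation of F1 over bootstrap re-samples.

    Assumption: items are resampled with replacement, and each re-sample
    has the same size as the original evaluation set.
    """
    rng = random.Random(seed)
    n = len(gold)
    f1_scores = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        _, _, f1 = prf1([gold[i] for i in idx], [pred[i] for i in idx])
        f1_scores.append(f1)
    return statistics.mean(f1_scores), statistics.stdev(f1_scores)
```

With a small gold set, individual re-samples can differ noticeably from one another, which is why the spread of F1 is informative alongside the mean.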