Author manuscript; available in PMC: 2011 Jul 20.
Published in final edited form as: AMIA Annu Symp Proc. 2010;2010:1316.

Using Amazon’s Mechanical Turk for Annotating Medical Named Entities

Meliha Yetisgen-Yildiz 1, Imre Solti 1, Fei Xia 2
PMCID: PMC3140100  NIHMSID: NIHMS251727  PMID: 21785667

Abstract

Amazon’s Mechanical Turk (AMT) service is becoming increasingly popular in Natural Language Processing (NLP) research. In this poster, we report our findings on using AMT to annotate biomedical text extracted from clinical trial descriptions with three entity types: medical condition, medication, and laboratory test. We also describe our observations about the AMT workers’ annotations.

Introduction

The manual construction of annotated corpora is extremely expensive in terms of both time and money. Snow et al. (2008) demonstrated the potential power of Amazon’s Mechanical Turk (AMT) service for creating large-scale annotated corpora for natural language tasks in a cheap and fast way [1]. We piloted the feasibility of using AMT for medical text annotation with 100 clinical trial announcements downloaded from the ClinicalTrials.gov website.

Annotation Performance

To make the annotation task more convenient for AMT workers, we used a customized user interface and provided detailed annotation guidelines. Four AMT workers annotated the inclusion/exclusion sections of the 100 selected announcements. We first posted the announcements to be annotated for medical condition, next for medication, and finally for laboratory test. We measured the quality of the AMT annotations at different inter-annotator agreement levels by comparing the agreed-upon entity spans to a gold standard (GS) created manually by one of the authors, who has medical training. Agreement level k means that the annotation set includes only the spans agreed on by at least k workers. As can be seen from Table 1, the annotation performance of the non-expert AMT workers was very promising, especially for medical condition and medication.

Table 1.

Quality measurement of AMT annotations.

      Medical Condition  Medication         Laboratory Test
k     P     R     F      P     R     F      P     R     F
1     .70   .99   .79    .50   .84   .62    .42   .73   .53
2     .84   .87   .86    .79   .73   .76    .72   .65   .68
3     .89   .73   .80    .93   .45   .61    .86   .40   .54

k: Agreement level, P: Precision, R: Recall, F: F-Measure.
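
To make the evaluation concrete, the sketch below shows one way to compute the scores reported in Table 1. It assumes that each annotation is represented as a character-offset span, that spans are filtered by agreement level k, and that matches against the gold standard are exact; the function names and example offsets are illustrative and are not the scripts used in this study.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets of an annotated phrase

def spans_at_agreement_level(worker_spans: List[Set[Span]], k: int) -> Set[Span]:
    """Keep only the spans annotated by at least k workers."""
    counts: Dict[Span, int] = {}
    for spans in worker_spans:
        for span in spans:
            counts[span] = counts.get(span, 0) + 1
    return {span for span, count in counts.items() if count >= k}

def precision_recall_f(predicted: Set[Span], gold: Set[Span]) -> Tuple[float, float, float]:
    """Exact-span precision, recall, and F-measure against the gold standard (GS)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f_measure

# Hypothetical example: four workers' spans for one announcement and the GS spans.
workers = [{(0, 8), (15, 30)}, {(0, 8)}, {(0, 8), (15, 30)}, {(0, 8), (40, 52)}]
gold = {(0, 8), (15, 30)}
for k in (1, 2, 3):
    print(k, precision_recall_f(spans_at_agreement_level(workers, k), gold))
```

Exact-span matching is an assumption here; a more lenient overlap-based match would yield different numbers.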

Error Analysis

After the AMT workers completed the tasks, we analyzed their annotations in detail to identify the problematic areas. This analysis led to the following observations for each entity type.

Medical Condition

As can be seen from Table 1, at agreement level k=1 the recall was almost perfect (R=0.99). On the other hand, the precision was lower (P=0.70) because some phrases annotated by the workers (e.g., “cardiac surgery”) were not medical conditions. Such incorrect annotations indicate that the workers were confused about the definition of a medical condition.

Medication

For this task, the workers mainly failed to annotate general GS medication phrases such as “other investigational agents” and collective names for groups of medications such as “vitamins”. Such errors indicate that either our guidelines were not clear or descriptive enough, or the workers did not pay enough attention to them.

Laboratory Test

In clinical trials, laboratory tests are usually expressed as criteria with arithmetic comparators, such as “hemoglobin level of >/=9.0 gm/dL”. The workers annotated almost all phrases containing comparators (e.g., “age>50”) as laboratory tests, which resulted in very poor precision at k=1.
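
As an illustration of the pattern class involved (not part of the annotation task itself), the following sketch uses a simple regular expression to match criteria built around arithmetic comparators, regardless of whether the attribute is actually a laboratory test; the pattern and test strings are assumptions for demonstration only.

```python
import re

# Illustrative pattern for criteria with arithmetic comparators, e.g.
# "hemoglobin level of >/=9.0 gm/dL" or "age>50".
COMPARATOR = re.compile(r"""
    (?P<attribute>[A-Za-z][A-Za-z ]*?)\s*    # attribute, e.g. "hemoglobin level of" or "age"
    (?P<comparator><=|>=|</=|>/=|<|>|=)\s*   # arithmetic comparator
    (?P<value>\d+(?:\.\d+)?)                 # numeric threshold
    \s*(?P<unit>[A-Za-z/%]+)?                # optional unit, e.g. "gm/dL"
""", re.VERBOSE)

for text in ["hemoglobin level of >/=9.0 gm/dL", "age>50"]:
    match = COMPARATOR.search(text)
    if match:
        print({name: value for name, value in match.groupdict().items() if value})
```

Both laboratory criteria and non-laboratory criteria such as age limits fit this pattern, which helps explain why workers over-annotated comparator phrases.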

For both medication and laboratory test, the workers also wrongly annotated spans belonging to the other entity types. This might be a side effect of how we ordered the annotation tasks, since some workers (10 out of 72) annotated multiple entity types. Those workers might not have read the guidelines for the second and third tasks carefully because they thought they were still annotating the previous entity type.

Conclusion and Future Plans

We believe that, with careful task design, AMT is a very promising tool for annotating biomedical text. For future work, we plan to improve performance by revising the annotation guidelines, increasing the number of annotations per announcement, and preventing the same workers from annotating different entity types.

Acknowledgment

This publication was made possible in part by grant 1K99LM010227-0110.

References

1. Snow R, O'Connor B, Jurafsky D, Ng AY. Cheap and Fast - But is it Good? Evaluating Non-Expert Annotations for Natural Language Tasks. Proceedings of EMNLP; 2008.
