Table 4.
Evaluation of the current Europe PMC dictionary workflow against the human annotation.
| | Gene/Protein (Unique) | Gene/Protein (Total) | Disease (Unique) | Disease (Total) | Organism (Unique) | Organism (Total) | Overall |
|---|---|---|---|---|---|---|---|
| Correct | 2,551 | 20,832 | 1,309 | 7,518 | 1,351 | 15,353 | 43,703 |
| Added | 2,671 | 13,718 | 575 | 5,836 | 820 | 5,307 | 24,230 |
| Removed | 697 | 6,172 | 447 | 1,991 | 207 | 982 | 8,514 |
| Modified | 561 | 1,819 | 269 | 1,164 | 311 | 831 | 4,445 |

| | Gene/Protein | Disease | Organism | Overall |
|---|---|---|---|---|
| Precision | 0.72 | 0.70 | 0.89 | 0.77 |
| Recall | 0.60 | 0.56 | 0.74 | 0.64 |
| F1-score | 0.65 | 0.62 | 0.80 | 0.70 |
This table shows the number of dictionary-based Europe PMC annotations updated by human annotators. The annotators confirmed a large proportion of the Europe PMC annotations as correct, but they also added a substantial number of terms that had not previously been annotated. The Europe PMC pipeline misses a proportion of these terms because its dictionaries are outdated. The removed terms are often common English words or short acronyms. Gene/Protein (GP) terms are more likely to be removed than the other entity types, Diseases (DS) and Organisms (OG), owing to how often three-letter gene/protein acronyms occur and to their false positive rate. The “Removed” row also counts annotations to which the dictionary-based approach assigned the wrong type (WT). For example, a Disease entity wrongly tagged as a Gene/Protein (WT_GP) by the Europe PMC dictionary-based approach (annotators used the [WT_GP, DS] tag) is counted in the “Removed” cell for Gene/Proteins and in the “Added” cell for Diseases. The “Modified” row shows the number of entities that were modified or split into multiple entities (WS). The Overall column is the sum of the correctness tags (CRT), i.e. CRT, Missing (MIS) and Wrong Span (WS), which go under the Correct, Added and Modified rows, respectively. Annotations with the WT tag were split into two groups, one counted under the Removed row and the rest under the Modified row. A WT_GP tag on its own means that the annotation is a wrong Gene/Protein annotation and was removed from the annotation set, whereas the [WT_GP, DS] tag means that the annotation was not removed from the annotation set but its entity type was modified.
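The precision, recall and F1 rows can be related to the count rows with a short calculation. The sketch below is an illustration, not the authors' evaluation script. The table does not state the formulas, so the sketch assumes precision = Correct / (Correct + Removed + Modified) and recall = Correct / (Correct + Added), applied to the Total counts and the Overall column. Under these assumptions the reported precision and recall values are reproduced to two decimals, while F1 recomputed from unrounded values can differ from the reported figure in the second decimal. The names `COUNTS` and `scores` are hypothetical.

```python
# Total counts per entity type, copied from Table 4.
# The "Overall" entry uses the Overall column of the table.
COUNTS = {
    "Gene/Protein": {"correct": 20832, "added": 13718, "removed": 6172, "modified": 1819},
    "Disease": {"correct": 7518, "added": 5836, "removed": 1991, "modified": 1164},
    "Organism": {"correct": 15353, "added": 5307, "removed": 982, "modified": 831},
    "Overall": {"correct": 43703, "added": 24230, "removed": 8514, "modified": 4445},
}


def scores(c):
    """Return (precision, recall, f1) under the assumed definitions.

    Assumption (not stated in the source): removed and modified annotations
    count against precision, and added annotations count against recall.
    """
    precision = c["correct"] / (c["correct"] + c["removed"] + c["modified"])
    recall = c["correct"] / (c["correct"] + c["added"])
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Print one line per entity type, rounded to two decimals like the table.
for name, c in COUNTS.items():
    p, r, f1 = scores(c)
    print(f"{name:<13} P={p:.2f} R={r:.2f} F1={f1:.2f}")
```

Keeping the counts in a single mapping makes it straightforward to test alternative definitions, for example one in which modified annotations also count against recall.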