Table 8. Comparison of the evaluated systems: precision, recall, F-score, number of features, tagging complexity, and availability.
| System or authors | Precision | Recall | F-score | # of features | Tagging complexity | Availability |
|---|---|---|---|---|---|---|
| CRF 1 (ABNER+) | 87.30 | 80.68 | 83.86 | 171,251 | LM | N |
| CRF 2 (ABNER++) | 87.39 | 81.96 | 84.59 | 355,461 | LM | N |
| Dictionary | 90.37 | 82.40 | 86.20 | 0 | Trie | Y |
| Dictionary + CRF 2 | 90.52 | 87.63 | 89.05 | 355,609 | LM | Y |
| BANNER [26,27] | 88.66 | 84.32 | 86.43 | 500,876 | LM + POS tagger | Y |
| Ando [5] (1st in BioCreative 2) | 88.48 | 85.97 | 87.21 | -- | 2*LM + POS tagger + syntactic parser | N |
| Hsu et al. [10] | 88.95 | 87.65 | 88.30 | 8 * 5,059,368 | 8*LM + POS tagger | N |
In the sixth column, 'LM' and 'Trie' refer to the time complexity of a linear model and of a trie-based dictionary match, respectively. The 'Dictionary' method requires no features once the dictionary has been constructed. For Ando's system, the number of features is not reported in the paper [5]. Since the systems in the last two rows use classifier combination, their tagging complexities and feature counts are multiplied by the number of sub-models.
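To make the 'Trie' complexity concrete, below is a minimal sketch of token-level trie-based dictionary matching. It is illustrative only, not the implementation used in this work, and the class and function names (`TrieNode`, `build_trie`, `match`) are hypothetical. The point it demonstrates is that, once the trie is built, matching cost depends on the sentence length and the longest matching entry, not on the number of dictionary entries.

```python
class TrieNode:
    def __init__(self):
        self.children = {}      # token -> TrieNode
        self.is_entry = False   # True if a dictionary entry ends at this node


def build_trie(entries):
    """Build a token-level trie from a list of dictionary entries."""
    root = TrieNode()
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.children.setdefault(token, TrieNode())
        node.is_entry = True
    return root


def match(tokens, root):
    """Return (start, end) spans of leftmost-longest dictionary matches."""
    spans = []
    i = 0
    while i < len(tokens):
        node, j, end = root, i, -1
        # Walk the trie as far as the sentence allows, remembering
        # the longest complete entry seen so far.
        while j < len(tokens) and tokens[j] in node.children:
            node = node.children[tokens[j]]
            j += 1
            if node.is_entry:
                end = j
        if end > 0:
            spans.append((i, end))
            i = end  # resume after the matched mention
        else:
            i += 1
    return spans


if __name__ == "__main__":
    # Toy dictionary of gene/protein mentions (illustrative only).
    dictionary = ["p53", "tumor necrosis factor", "TNF alpha"]
    trie = build_trie(dictionary)
    tokens = "Expression of tumor necrosis factor and p53 was measured".split()
    print(match(tokens, trie))  # [(2, 5), (6, 7)]
```

Because lookups branch only on tokens actually present in the sentence, enlarging the dictionary does not slow tagging, which is consistent with the 'Dictionary' row reporting zero features.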