Table 3.
Overall annotation statistics comparing the existing Europe PMC dictionary-based text mining approach with human curation for the 300 selected gold-standard articles.
Statistic | Count | Dictionary-based: Gene/Protein | Dictionary-based: Disease | Dictionary-based: Organism | Dictionary-based: Total | Gold standard: Gene/Protein | Gold standard: Disease | Gold standard: Organism | Gold standard: Total |
---|---|---|---|---|---|---|---|---|---|
Annotations | Total | 28,869 | 10,515 | 18,040 | 57,425 | 36,369 | 14,518 | 21,491 | 72,378 |
Annotations | Unique | 3,419 | 1,752 | 1,700 | 6,871 | 5,600 | 2,037 | 2,347 | 9,970 |
Normalised to a DB entry | Total | — | — | — | — | 21,664 | 8,476 | 16,021 | 46,161 |
Median per article | Total | 53.5 | 19.5 | 34 | 170 | 54.5 | 16 | 30 | 192 |
Median per article | Unique | 12 | 8 | 8 | 36 | 13 | 6.5 | 8 | 44.5 |
Max annotation per article | Total | 722 | 219 | 407 | 955 | 795 | 478 | 456 | 940 |
Max annotation per article | Unique | 113 | 78 | 111 | 156 | 178 | 76 | 170 | 201 |
Overall, we have gained around 11k term annotations, with the highest gain in the Gene/Protein category. Unique term counts are based on string matching. We also report how many annotations normalise to an identifier in the databases mentioned above, rather than counting unique database identifiers.
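The counting convention described above can be illustrated with a minimal sketch. It is not the pipeline used for the table: the record layout, the `summarise` helper, the article IDs, and the `ID:…` identifiers are all hypothetical, and exact, case-sensitive string matching is an assumption.

```python
from collections import Counter

# Hypothetical annotation records (not real data):
# (article_id, category, annotated string, database identifier or None)
annotations = [
    ("A1", "Gene/Protein", "BRCA1", "ID:0001"),
    ("A1", "Gene/Protein", "BRCA1", "ID:0001"),
    ("A1", "Gene/Protein", "breast cancer 1", "ID:0001"),
    ("A1", "Disease", "breast cancer", None),
    ("A2", "Organism", "mouse", "ID:0002"),
    ("A2", "Organism", "Mus musculus", "ID:0002"),
]


def summarise(records):
    """Return total, unique-string, and normalised counts per category."""
    total = Counter()
    unique_strings = {}
    normalised = Counter()
    for _article, category, text, db_id in records:
        total[category] += 1
        # Uniqueness is decided on the annotated string (assumed exact,
        # case-sensitive match), not on the database identifier.
        unique_strings.setdefault(category, set()).add(text)
        # An annotation counts as normalised if it carries any identifier.
        if db_id is not None:
            normalised[category] += 1
    return {
        cat: {
            "total": total[cat],
            "unique": len(unique_strings[cat]),
            "normalised": normalised[cat],
        }
        for cat in total
    }


stats = summarise(annotations)
```

In this toy input the two Organism strings share one identifier but still count as two unique terms, which is the distinction the note above draws between string-based unique counts and unique identifier counts.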