Table 1.
| | Initial dataset | Dataset with PubMed abstracts | Dataset fulfilling the algorithm’s requirements* | Final dataset (ambiguous aliases excluded) |
---|---|---|---|---|
EntrezGene official symbols | 100 | 73 | 68** | 68 |
Aliases | 425 | 256 | 223 | 165 |
Abstracts in text corpus | - | 13355 | 12088 | 9005 |
Unique PubMed IDs in text corpus | - | 11022 | 10312 | 7523 |
Redundancy in text corpus (%) | - | 21 | 16.6 | 19.7 |
* The algorithm requires the official gene symbol, and at least one alias and one internal control to produce text corpora of PubMed abstracts. Additionally, the algorithm requires an informative group-specific vocabulary to pass the filters for ubiquitous terms.
** Five official gene symbols (DERL3, KCNA7, KCNJ14, MED18, and TBRV4-2) did not fulfil the algorithm’s requirements because their aliases retrieved no PubMed abstracts.