2010 Dec 22;11(Suppl 5):S3. doi: 10.1186/1471-2164-11-S5-S3

Table 1.

Dataset description

| | Initial dataset | Dataset with PubMed abstracts | Dataset fulfilling the algorithm’s requirements* | Final dataset (ambiguous aliases excluded) |
|---|---|---|---|---|
| EntrezGene official symbols | 100 | 73 | 68** | 68 |
| Aliases | 425 | 256 | 223 | 165 |
| Abstracts in text corpus | - | 13355 | 12088 | 9005 |
| Unique PubMed IDs in text corpus | - | 11022 | 10312 | 7523 |
| Redundancy in text corpus (%) | - | 21 | 16.6 | 19.7 |

* To produce text corpora of PubMed abstracts, the algorithm requires the official gene symbol, at least one alias, and one internal control. In addition, the group-specific vocabulary must be informative enough to pass the filters for ubiquitous terms.

** Five official gene symbols, namely DERL3, KCNA7, KCNJ14, MED18, and TBRV4-2, did not fulfil the algorithm’s requirements because their aliases yielded no PubMed abstracts.
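
The table does not state how the redundancy row is computed. The minimal sketch below assumes redundancy is the number of duplicate abstracts divided by the number of unique PubMed IDs; that formula and the function name `redundancy_percent` are assumptions for illustration, not the authors' definition. Under this assumption the sketch reproduces the values reported for the first and last corpora (21 and 19.7) but gives roughly 17.2 rather than the reported 16.6 for the middle one, so the authors may have used a different definition or rounding.

```python
def redundancy_percent(n_abstracts: int, n_unique_pmids: int) -> float:
    """Duplicate abstracts as a percentage of unique PubMed IDs.

    Assumed formula (not given in the source table):
        100 * (abstracts - unique IDs) / unique IDs
    """
    if n_unique_pmids <= 0:
        raise ValueError("n_unique_pmids must be positive")
    return 100.0 * (n_abstracts - n_unique_pmids) / n_unique_pmids


# (abstracts in text corpus, unique PubMed IDs) taken from Table 1
corpora = {
    "Dataset with PubMed abstracts": (13355, 11022),
    "Dataset fulfilling the algorithm's requirements": (12088, 10312),
    "Final dataset (ambiguous aliases excluded)": (9005, 7523),
}

for name, (abstracts, unique_ids) in corpora.items():
    print(f"{name}: {redundancy_percent(abstracts, unique_ids):.1f}%")
```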