2010 Dec 22;11(Suppl 5):S3. doi: 10.1186/1471-2164-11-S5-S3

Table 1.

Dataset description

	Initial dataset	Dataset with PubMed abstracts	Dataset fulfilling the algorithm’s requirements*	Final dataset (ambiguous aliases excluded)
EntrezGene official symbols	100	73	68**	68
Aliases	425	256	223	165
Abstracts in text corpus	-	13355	12088	9005
Unique PubMed IDs in text corpus	-	11022	10312	7523
Redundancy in text corpus (%)	-	21	16.6	19.7

* The algorithm requires the official gene symbol, and at least one alias and one internal control to produce text corpora of PubMed abstracts. Additionally, the algorithm requires an informative group-specific vocabulary to pass the filters for ubiquitous terms.

** Five official gene symbols, namely DERL3, KCNA7, KCNJ14, MED18, and TRBV4-2, did not fulfil the algorithm’s requirements because their aliases returned no PubMed abstracts.
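
The requirements in the footnotes can be read as a simple eligibility check per gene. The following is a minimal sketch of that reading, not the authors' implementation: the `GeneRecord` class, its field names, the `meets_requirements` function, and the example symbols are all hypothetical, and the vocabulary filter for ubiquitous terms is reduced to a precomputed boolean flag.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class GeneRecord:
    """Hypothetical container for one official gene symbol (illustration only)."""
    symbol: str
    aliases: List[str] = field(default_factory=list)
    internal_controls: List[str] = field(default_factory=list)
    # Number of PubMed abstracts retrieved per alias (assumed to be known already).
    abstracts_per_alias: Dict[str, int] = field(default_factory=dict)
    # Assumed outcome of the filter for ubiquitous terms.
    has_informative_vocabulary: bool = False


def meets_requirements(gene: GeneRecord) -> bool:
    """Return True if the gene satisfies the requirements as read from the footnotes."""
    has_symbol = bool(gene.symbol)
    # At least one alias must exist and must have retrieved at least one abstract.
    has_productive_alias = any(
        gene.abstracts_per_alias.get(alias, 0) > 0 for alias in gene.aliases
    )
    has_control = len(gene.internal_controls) >= 1
    return (
        has_symbol
        and has_productive_alias
        and has_control
        and gene.has_informative_vocabulary
    )


# Made-up records: GENE_B mirrors the excluded case, where aliases return no abstracts.
genes = [
    GeneRecord("GENE_A", ["ALIAS_A1"], ["CONTROL_A"], {"ALIAS_A1": 12}, True),
    GeneRecord("GENE_B", ["ALIAS_B1"], ["CONTROL_B"], {"ALIAS_B1": 0}, True),
    GeneRecord("GENE_C", [], ["CONTROL_C"], {}, True),
]
eligible = [g.symbol for g in genes if meets_requirements(g)]
print(eligible)
```

In this sketch a gene whose aliases retrieve no abstracts is excluded in the same way as a gene with no aliases at all, which matches the reason given for the five excluded symbols.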