2005 May 24;6(Suppl 1):S3. doi: 10.1186/1471-2105-6-S1-S3

Table 1.

GENETAG corpus statistics. The 20K sentences were split into four subsets called Train, Test, Round1 and Round2.

| Statistic | Train | Test | Round1 | Round2 | Total |
|---|---|---|---|---|---|
| Number of Sentences | 7,500 | 2,500 | 5,000 | 5,000 | 20,000 |
| Number of Words | 204,195 | 68,043 | 137,586 | 137,977 | 547,801 |
| Number of Tagged Genes = G | 8,935 | 2,987 | 5,949 | 6,125 | 23,996 |
| Total Number of Alternative Forms of Gene Names in G | 6,583 | 2,158 | 4,275 | 4,505 | 17,531 |
| Number of Gene Names in G with Alternative Forms = N | 4,675 | 1,522 | 3,057 | 3,186 | 12,440 |
| Average Number of Alternatives per Gene Name in N | 1.66 | 1.67 | 1.62 | 1.65 | 1.65 |
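
To make the relative size and density of the four subsets easier to compare, the following minimal Python sketch derives each subset's share of the corpus, its mean words per sentence, and its mean tagged genes per sentence from the sentence, word, and tagged-gene rows of Table 1. The dictionary layout and the names `SUBSETS`, `TOTAL`, `summarize`, and `column_sum` are illustrative assumptions and are not part of the GENETAG distribution; only the counts themselves come from the table.

```python
# Counts copied from Table 1 (sentence, word and tagged-gene rows only).
# The data layout and all names below are illustrative assumptions.
SUBSETS = {
    "Train":  {"sentences": 7500, "words": 204195, "genes": 8935},
    "Test":   {"sentences": 2500, "words": 68043,  "genes": 2987},
    "Round1": {"sentences": 5000, "words": 137586, "genes": 5949},
    "Round2": {"sentences": 5000, "words": 137977, "genes": 6125},
}
TOTAL = {"sentences": 20000, "words": 547801, "genes": 23996}


def column_sum(key):
    """Sum one statistic over the four subsets."""
    return sum(counts[key] for counts in SUBSETS.values())


def summarize(counts):
    """Derive simple ratios for one subset from its raw counts."""
    return {
        "share_of_corpus": counts["sentences"] / TOTAL["sentences"],
        "words_per_sentence": counts["words"] / counts["sentences"],
        "genes_per_sentence": counts["genes"] / counts["sentences"],
    }


if __name__ == "__main__":
    for name, counts in SUBSETS.items():
        s = summarize(counts)
        print(
            f"{name:<7} share={s['share_of_corpus']:.1%} "
            f"words/sentence={s['words_per_sentence']:.2f} "
            f"genes/sentence={s['genes_per_sentence']:.2f}"
        )
```

Under these assumptions, the sentence counts alone show that Train holds 37.5% of the corpus, Test 12.5%, and Round1 and Round2 25% each.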