2008 Sep 1;9(Suppl 2):S9. doi: 10.1186/gb-2008-9-s2-s9

Table 4.

GN results: performance impact of the seven heuristics used to normalize gene names on the development data.

| Rule | Heuristic | Example | P | R | F |
|---|---|---|---|---|---|
| 0 | | | 0.783 | 0.469 | 0.586 |
| 1 | Substitution: Roman numerals > Arabic numerals | carbonic anhydrase XI to carbonic anhydrase 11 | 0.778 | 0.492 | 0.603 |
| 2 | Substitution: Greek letters > single letters | AP-2alpha to AP-2a | 0.779 | 0.497 | 0.607 |
| 3 | Normalization of case | CAMK2A to camk2a | 0.787 | 0.619 | 0.693 |
| 4 | Removal: parenthesized material | sialyltransferase 1 (beta-galactoside alpha-2,6-sialyltransferase) to sialyltransferase 1 | 0.782 | 0.623 | 0.694 |
| 5 | Removal: punctuation | VLA-2 to VLA2 | 0.768 | 0.667 | 0.714 |
| 6 | Removal: spaces | calcineurin B to calcineurinB | 0.784 | 0.742 | 0.762 |
| 7 | Removal: strings < 2 characters | P | 0.827 | 0.727 | 0.774 |

Presented are the seven heuristics used to normalize gene names, both in lexicon construction and in processing of the gene tagger output, along with the performance on the development data after each step was applied. F, F-measure; GN, gene normalization; P, precision; R, recall.
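
The seven rules in Table 4 can be read as a small string-processing pipeline applied in order. The following is a minimal sketch of that idea, not the authors' implementation. The function name `normalize_gene_name`, the two lookup tables, and the regular expressions are hypothetical. The source does not give the exact mappings, so only a few Roman numerals and Greek letter names are covered here. Rule 7 is interpreted as discarding any name left with fewer than two characters, which is also an assumption.

```python
import re
import string

# Hypothetical lookup tables: the source does not list the mappings it used.
ROMAN_TO_ARABIC = {
    "I": "1", "II": "2", "III": "3", "IV": "4", "V": "5", "VI": "6",
    "VII": "7", "VIII": "8", "IX": "9", "X": "10", "XI": "11", "XII": "12",
}
GREEK_TO_LETTER = {
    "alpha": "a", "beta": "b", "gamma": "g",
    "delta": "d", "epsilon": "e", "kappa": "k",
}

# Upper-case Roman numerals standing alone as a token (assumed matching policy).
_ROMAN_RE = re.compile(r"\b[IVX]+\b")
# Greek letter names not embedded in a longer alphabetic word.
_GREEK_RE = re.compile(
    r"(?<![A-Za-z])(" + "|".join(GREEK_TO_LETTER) + r")(?![A-Za-z])",
    re.IGNORECASE,
)
# A parenthesized span plus any whitespace directly before it.
_PAREN_RE = re.compile(r"\s*\([^()]*\)")
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_gene_name(name):
    """Apply rules 1-7 of Table 4 in order; return None if rule 7 drops the name."""
    # Rule 1: Roman numerals to Arabic numerals (unknown tokens are left alone).
    name = _ROMAN_RE.sub(
        lambda m: ROMAN_TO_ARABIC.get(m.group(0), m.group(0)), name
    )
    # Rule 2: Greek letter names to single letters.
    name = _GREEK_RE.sub(lambda m: GREEK_TO_LETTER[m.group(1).lower()], name)
    # Rule 3: normalize case.
    name = name.lower()
    # Rule 4: remove parenthesized material.
    name = _PAREN_RE.sub("", name)
    # Rule 5: remove punctuation.
    name = name.translate(_PUNCT_TABLE)
    # Rule 6: remove spaces.
    name = re.sub(r"\s+", "", name)
    # Rule 7: discard strings shorter than two characters (assumed reading).
    return name if len(name) >= 2 else None


if __name__ == "__main__":
    for raw in ["carbonic anhydrase XI", "AP-2alpha", "VLA-2", "calcineurin B", "P"]:
        print(repr(raw), "->", repr(normalize_gene_name(raw)))
```

Because the caption says the same heuristics were used in lexicon construction and on the gene tagger output, a function like this would be applied to both sides before matching, so that variants such as "VLA-2" and "VLA2" map to the same key.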