Table 4.
GN results: performance impact of the seven heuristics used to normalize gene names on the development data.
Step | Rule | Example | P | R | F |
0 | | | 0.783 | 0.469 | 0.586 |
1 | Substitution: Roman numerals > Arabic numerals | carbonic anhydrase XI to carbonic anhydrase 11 | 0.778 | 0.492 | 0.603 |
2 | Substitution: Greek letters > single letters | AP-2alpha to AP-2a | 0.779 | 0.497 | 0.607 |
3 | Normalization of case | CAMK2A to camk2a | 0.787 | 0.619 | 0.693 |
4 | Removal: parenthesized materials | sialyltransferase 1 (beta-galactoside alpha-2,6-sialytransferase) to sialyltransferase 1 | 0.782 | 0.623 | 0.694 |
5 | Removal: punctuation | VLA-2 to VLA2 | 0.768 | 0.667 | 0.714 |
6 | Removal: spaces | calcineurin B to calcineurinB | 0.784 | 0.742 | 0.762 |
7 | Removal: strings < 2 characters | P | 0.827 | 0.727 | 0.774 |
Presented are the seven heuristics used to normalize gene names, both during lexicon construction and during processing of the gene tagger output, together with the performance on the development data after each step was performed. F, F-measure; GN, gene normalization; P, precision; R, recall.
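The seven heuristics in the table can be sketched as a small pipeline of string transformations applied in the listed order. This is a minimal illustration and not the authors' implementation: the function names, the Greek-letter map (the table documents only alpha to a), the restriction of Roman numerals to standalone uppercase tokens built from I, V, and X, and the choice to return `None` for strings shorter than two characters are all assumptions made for the example.

```python
import re
import string

# Illustrative map only; the table documents just alpha -> a.
GREEK_TO_LATIN = {"alpha": "a", "beta": "b", "gamma": "g", "delta": "d"}

_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10}
_ROMAN_TOKEN = re.compile(r"\b[IVX]+\b")               # candidate tokens
_VALID_ROMAN = re.compile(r"X{0,3}(IX|IV|V?I{0,3})")   # well-formed 1..39
_GREEK = re.compile(
    r"(?<![A-Za-z])(" + "|".join(GREEK_TO_LATIN) + r")(?![A-Za-z])",
    re.IGNORECASE,
)


def _roman_match_to_arabic(match):
    token = match.group(0)
    if not _VALID_ROMAN.fullmatch(token):
        return token  # malformed numeral such as "IIII": leave untouched
    total = 0
    for ch, nxt in zip(token, token[1:] + " "):
        value = _ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        total += -value if _ROMAN_VALUES.get(nxt, 0) > value else value
    return str(total)


def roman_to_arabic(name):       # rule 1
    return _ROMAN_TOKEN.sub(_roman_match_to_arabic, name)


def greek_to_latin(name):        # rule 2
    return _GREEK.sub(lambda m: GREEK_TO_LATIN[m.group(1).lower()], name)


def lowercase(name):             # rule 3
    return name.lower()


def remove_parenthesized(name):  # rule 4
    return re.sub(r"\s*\([^()]*\)", "", name)


def remove_punctuation(name):    # rule 5
    return name.translate(str.maketrans("", "", string.punctuation))


def remove_spaces(name):         # rule 6
    return re.sub(r"\s+", "", name)


STEPS = [roman_to_arabic, greek_to_latin, lowercase,
         remove_parenthesized, remove_punctuation, remove_spaces]


def normalize(name):
    """Apply rules 1-6 in order; rule 7 drops results under 2 characters."""
    for step in STEPS:
        name = step(name)
    return name if len(name) >= 2 else None
```

Because the same heuristics are applied to lexicon entries and to gene tagger output, running one `normalize` function over both sides would let a tagged mention be matched to a lexicon entry by plain string equality.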