Table 1.
Feature Name | Description/Example |
---|---|
token_i | Normalized token at the current position i |
token_{i-1} | Normalized token at position i-1, if available |
token_{i-2} | Normalized token at position i-2, if available |
token_{i+1} | Normalized token at position i+1, if available |
token_{j,j+1} for j = i-2 to i+1 | Normalized token bigrams |
is token_i a sub-word | True if the token originates from within a consecutive letter sequence, such as a hyphenated word; false otherwise |
shape of normalized token_i | Given the token at position i (token_i), map each uppercase letter to ‘X’, each lowercase letter to ‘x’, each digit sequence to ‘9’, and each Greek letter to ‘G’. Every run of two to five consecutive ‘X’ (‘x’) is then converted to ‘XXX’ (‘xxx’) |
suffix of normalized token_i (length 4) | If token_i consists only of letters and its length is greater than 5, the last four letters, lowercased |
POS_i | Part-of-speech tag for token_i assigned by the GENIA tagger [52] |
BioThesaurus label_i | B/I labels (“BioT_{B, I}” or none) indicating mapping of token_i to a BioThesaurus entry |
UMLS label_i | B/I labels with semantic type information (“UMLS_{B,I}_SemT” or none) indicating mapping of token_i to a token in a UMLS entry |
CRF = conditional random fields; GM = gene mention; MEMM = maximum entropy Markov models.
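The context, shape, and suffix features in Table 1 can be illustrated with a minimal Python sketch. It is not the implementation used in this work. The function names (`token_shape`, `suffix4`, `context_features`) and the feature keys are hypothetical, tokens are assumed to be already normalized, characters not covered by the table (such as hyphens) are assumed to be kept unchanged, and runs of more than five ‘X’ or ‘x’ are assumed to be left as they are.

```python
import re
import unicodedata


def token_shape(token):
    """Word shape per Table 1 (sketch; unlisted characters are kept as-is)."""
    mapped = []
    for ch in token:
        # Greek must be checked first: Greek letters are also upper/lowercase.
        if unicodedata.name(ch, "").startswith("GREEK"):
            mapped.append("G")
        elif ch.isdigit():
            mapped.append("9")
        elif ch.isupper():
            mapped.append("X")
        elif ch.islower():
            mapped.append("x")
        else:
            mapped.append(ch)
    # A whole digit sequence becomes a single '9'.
    shape = re.sub(r"9+", "9", "".join(mapped))

    def squash(match):
        run = match.group()
        # Runs of two to five identical letter symbols become three symbols.
        return run[0] * 3 if 2 <= len(run) <= 5 else run

    return re.sub(r"X+|x+", squash, shape)


def suffix4(token):
    """Last four letters, lowercased, for all-letter tokens longer than 5."""
    if token.isalpha() and len(token) > 5:
        return token[-4:].lower()
    return None


def context_features(tokens, i):
    """Unigrams at i-2..i+1 and bigrams (j, j+1) for j = i-2..i+1."""
    feats = {}
    for offset in (-2, -1, 0, 1):
        j = i + offset
        if 0 <= j < len(tokens):
            feats[f"token[{offset:+d}]"] = tokens[j]
        if 0 <= j and j + 1 < len(tokens):
            feats[f"bigram[{offset:+d}]"] = tokens[j] + " " + tokens[j + 1]
    return feats


if __name__ == "__main__":
    sentence = ["the", "IL-2", "gene"]
    print(token_shape("IL-2"), suffix4("kinase"))
    print(context_features(sentence, 1))
```

Features that fall outside the sentence boundary are simply omitted, which corresponds to the "if available" condition in the table.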