2009 Mar-Apr;16(2):247–255. doi: 10.1197/jamia.M2844

Table 1. Features Considered for CRF Models and MEMMs in BioTagger-GM

Feature Name	Description/Example
token_i	Normalized token at the current position
token_i-1	Normalized token at the position i-1, if available
token_i-2	Normalized token at the position i-2, if available
token_i+1	Normalized token at the position i+1, if available
token_j,j+1 for j=i-2 to i+1	Normalized token bigrams
is token_i a sub-word	True if the token originates from a consecutive letter sequence such as a hyphenated word; false otherwise
shape of normalized token_i	Given the token at position i (token_i), convert each uppercase letter to ‘X’, each lowercase letter to ‘x’, each digit sequence to ‘9’, and each Greek letter to ‘G’. Every run of two to five consecutive ‘X’ (‘x’) is then converted to ‘XXX’ (‘xxx’)
suffix of normalized token_i (length 4)	If token_i consists only of alphabetic characters and its length is greater than 5, extract its last four letters in lowercase
POS_i	Part-of-speech for token_i assigned by the GENIA tagger⁵²
BioThesaurus label_i	B/I labels (BioT_{B,I} or none) indicating whether token_i maps to a BioThesaurus entry
UMLS label_i	B/I labels with semantic type information (UMLS_{B,I}_SemT or none) indicating whether token_i maps to a token in a UMLS entry

CRF = conditional random fields; GM = gene mention; MEMM = maximum entropy Markov models.
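
The shape and suffix features in Table 1 can be illustrated with a minimal Python sketch. It is not the BioTagger-GM implementation. The function names token_shape and suffix4 are hypothetical, and several details the table leaves open are assumptions here: "Greek letter" is taken to mean a character in the Unicode Greek and Coptic block, characters outside the four listed classes are kept unchanged, and runs longer than five 'X' or 'x' are left as they are.

```python
import re


def _is_greek(ch):
    # Assumption: a Greek letter is an alphabetic character in the
    # Unicode Greek and Coptic block (U+0370 to U+03FF).
    return "\u0370" <= ch <= "\u03ff" and ch.isalpha()


def _squash(match):
    # Runs of two to five identical shape symbols become exactly three;
    # other run lengths are left unchanged (assumption for runs > 5).
    run = match.group()
    return run[0] * 3 if 2 <= len(run) <= 5 else run


def token_shape(token):
    """Map a token to its shape string as described in Table 1."""
    symbols = []
    for ch in token:
        if _is_greek(ch):          # checked first: Greek letters also have case
            symbols.append("G")
        elif "0" <= ch <= "9":
            symbols.append("9")
        elif ch.isupper():
            symbols.append("X")
        elif ch.islower():
            symbols.append("x")
        else:
            symbols.append(ch)     # assumption: other characters are kept as-is
    shape = "".join(symbols)
    shape = re.sub(r"9+", "9", shape)      # a whole digit sequence becomes one '9'
    shape = re.sub(r"X+", _squash, shape)
    shape = re.sub(r"x+", _squash, shape)
    return shape


def suffix4(token):
    """Return the lowercase 4-letter suffix, or None if the feature does not fire."""
    if token.isalpha() and len(token) > 5:
        return token[-4:].lower()
    return None
```

Under these assumptions, a token such as "IL-2" gets a shape in which the two uppercase letters are widened to 'XXX' and the digit becomes '9', while suffix4 returns None for it because the token contains non-alphabetic characters.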