2009 Mar-Apr;16(2):247–255. doi: 10.1197/jamia.M2844

Table 1. Features Considered for CRF Models and MEMMs in BioTagger-GM

| Feature name | Description/example |
| --- | --- |
| token_i | Normalized token at the current position i |
| token_{i-1} | Normalized token at position i-1, if available |
| token_{i-2} | Normalized token at position i-2, if available |
| token_{i+1} | Normalized token at position i+1, if available |
| token_{j,j+1} for j = i-2 to i+1 | Normalized token bigrams |
| is token_i a sub-word | True if the token originates from a consecutive letter sequence such as a hyphenated word; false otherwise |
| shape of normalized token_i | Given the token at position i (token_i), convert each uppercase letter to 'X', each lowercase letter to 'x', each digit sequence to '9', and each Greek letter to 'G'. Every sequence of two to five consecutive 'X' ('x') is then converted to 'XXX' ('xxx') |
| suffix of normalized token_i (length 4) | If token_i consists only of alphabetic characters and its length is greater than 5, extract its last four letters in lowercase |
| POS_i | Part-of-speech tag for token_i assigned by the GENIA tagger (ref. 52) |
| BioThesaurus label_i | B/I labels ("BioT_{B, I}" or none) indicating mapping of token_i to a BioThesaurus entry |
| UMLS label_i | B/I labels with semantic type information (UMLS_{B,I}_SemT or none) indicating mapping of token_i to a token in a UMLS entry |

CRF = conditional random fields; GM = gene mention; MEMM = maximum entropy Markov models.
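
To make the purely string-based rows of Table 1 concrete, the following Python sketch shows one possible reading of the context-window, shape, and suffix features. It is an illustration under stated assumptions, not the BioTagger-GM implementation: the function names, the feature keys, and the Greek letter set are hypothetical, tokens are assumed to be normalized already, and the table does not say how runs longer than five characters are collapsed, so the sketch simply applies the substitution left to right.

```python
import re

# Assumption: "Greek letter" means the basic Greek alphabet, both cases.
GREEK = set("αβγδεζηθικλμνξοπρστυφχψωΑΒΓΔΕΖΗΘΙΚΛΜΝΞΟΠΡΣΤΥΦΧΨΩ")


def token_shape(token):
    """Map a token to its shape string as described in Table 1."""
    chars = []
    for ch in token:
        if ch in GREEK:          # checked first: Greek letters also have case
            chars.append("G")
        elif ch.isdigit():
            chars.append("9")
        elif ch.isupper():
            chars.append("X")
        elif ch.islower():
            chars.append("x")
        else:
            chars.append(ch)     # punctuation such as '-' is kept unchanged
    shape = "".join(chars)
    shape = re.sub(r"9+", "9", shape)         # a digit sequence becomes one '9'
    shape = re.sub(r"X{2,5}", "XXX", shape)   # 2 to 5 'X' become 'XXX'
    shape = re.sub(r"x{2,5}", "xxx", shape)   # 2 to 5 'x' become 'xxx'
    return shape


def suffix4(token):
    """Return the lowercased last four letters, or None if the rule does not apply."""
    if token.isalpha() and len(token) > 5:
        return token[-4:].lower()
    return None


def window_features(tokens, i):
    """Collect unigram and bigram context features around position i."""
    feats = {}
    for off in (-2, -1, 0, 1):
        j = i + off
        if 0 <= j < len(tokens):              # "if available"
            feats[f"token[{off}]"] = tokens[j]
        if 0 <= j and j + 1 < len(tokens):    # bigram (token_j, token_{j+1})
            feats[f"bigram[{off}]"] = tokens[j] + " " + tokens[j + 1]
    return feats


if __name__ == "__main__":
    print(token_shape("NF-κB"), token_shape("IL-2"), suffix4("Kinase"))
    print(window_features(["the", "il-2", "gene", "is", "expressed"], 2))
```

The remaining rows (sub-word flag, POS tags, BioThesaurus and UMLS labels) depend on the tokenizer and on external resources, so they are left out of this sketch.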