2016 Jun 5;2016:baw077. doi: 10.1093/database/baw077

Table 2.

Features used in two individual sequence labeling modules: CRFs and SSVMs.

| Feature | Description |
| --- | --- |
| Bag-of-words | Unigrams: w0, w−1, w1, w−2, w2; bigrams: w−2w−1, w−1w0, w0w1, w1w2; trigrams: w−2w−1w0, w−1w0w1, w0w1w2. wi is the token at position i relative to the current token. |
| Part-of-speech (POS) tags | Unigrams: p0, p−1, p1, p−2, p2; bigrams: p−2p−1, p−1p0, p0p1, p1p2; trigrams: p−2p−1p0, p−1p0p1, p0p1p2. pi is the POS tag at position i relative to the current token. |
| Combinations of tokens and POS tags | w−1p−2, w1p−1, w−1p0, w2p−1, w0p0, w0p1, w1p0, w1p1, w1p2 |
| Sentence information | Length of the current sentence; whether the current sentence contains an unmatched bracket. |
| Affixes | Prefixes and suffixes of length 1 to 5. |
| Orthographical features | Whether the current word is an all-caps word, contains a digit, has uppercase characters inside, etc. |
| Word shapes | Each uppercase character, lowercase character, digit and other character in the current word, or each run of consecutive characters of the same class, is replaced by ‘A’, ‘a’, ‘#’ and ‘-’, respectively. |
| Section information | Whether the current word belongs to the title or the abstract. |
| Word representation features [5] | Brown clustering (https://github.com/percyliang/brown-cluster); Word2vec (https://code.google.com/p/word2vec/). |
| Dictionary features | Chemical dictionary: CTD, DrugBank, MeSH, Pharmacogenetics Knowledge Base (PharmGKB) (26), UMLS and Wikipedia; disease dictionary: CTD, MeSH, UMLS, Disease Ontology (27), National Drug File Reference Terminology (NDF-RT) (28) and Wikipedia. |
| Frequency features | Whether the frequency of the current word is higher than a given threshold (4 in our system) and its inverse document frequency is lower than another given threshold (0.1 in our system). |
| Character N-grams | Character N-grams (N = 1, 2, 3, 4) within the current word. |
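
To make the purely surface-level rows of Table 2 concrete, the following is a minimal Python sketch of how the context-window, affix, orthographic, word-shape, bracket and character N-gram features might be computed for one token. It is an illustration under assumptions, not the authors' implementation: all function names, the feature-key format, the `<PAD>` symbol for positions outside the sentence, and the `|` separator are hypothetical choices. Features that need external resources (dictionaries, Brown clusters, Word2vec, corpus frequencies) are left out.

```python
import re


def word_shape(word, collapse=False):
    """Map uppercase -> 'A', lowercase -> 'a', digit -> '#', other -> '-'."""
    symbols = []
    for ch in word:
        if ch.isupper():
            symbols.append("A")
        elif ch.islower():
            symbols.append("a")
        elif ch.isdigit():
            symbols.append("#")
        else:
            symbols.append("-")
    shape = "".join(symbols)
    if collapse:
        # Collapse each run of identical symbols into one, e.g. "AAA-#" -> "A-#".
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


def affixes(word, max_len=5):
    """Prefixes and suffixes of length 1..max_len (bounded by the word length)."""
    feats = {}
    for n in range(1, min(max_len, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


def char_ngrams(word, max_n=4):
    """All character N-grams of the word for N = 1..max_n."""
    return [word[i:i + n]
            for n in range(1, max_n + 1)
            for i in range(len(word) - n + 1)]


def context_ngrams(seq, i, name, pad="<PAD>"):
    """Unigrams, bigrams and trigrams over the window [-2, 2] around position i."""
    def at(offset):
        j = i + offset
        return seq[j] if 0 <= j < len(seq) else pad  # pad symbol is an assumption

    feats = {}
    for k in (-2, -1, 0, 1, 2):
        feats[f"{name}[{k}]"] = at(k)
    for k in (-2, -1, 0, 1):
        feats[f"{name}[{k}]{name}[{k + 1}]"] = at(k) + "|" + at(k + 1)
    for k in (-2, -1, 0):
        feats[f"{name}[{k}..{k + 2}]"] = "|".join(at(k + d) for d in range(3))
    return feats


def has_unmatched_bracket(tokens):
    """True if any bracket in the sentence lacks a correctly nested partner."""
    closers = {")": "(", "]": "[", "}": "{"}
    stack = []
    for token in tokens:
        for ch in token:
            if ch in "([{":
                stack.append(ch)
            elif ch in closers:
                if not stack or stack.pop() != closers[ch]:
                    return True
    return bool(stack)


def token_features(tokens, pos_tags, i):
    """Collect the surface features for the token at position i."""
    word = tokens[i]
    feats = {}
    feats.update(context_ngrams(tokens, i, "w"))
    feats.update(context_ngrams(pos_tags, i, "p"))
    feats.update(affixes(word))
    feats["shape"] = word_shape(word)
    feats["shape_collapsed"] = word_shape(word, collapse=True)
    feats["all_caps"] = word.isupper()
    feats["has_digit"] = any(ch.isdigit() for ch in word)
    feats["inner_upper"] = any(ch.isupper() for ch in word[1:])
    feats["char_ngrams"] = char_ngrams(word)
    feats["sent_len"] = len(tokens)
    feats["unmatched_bracket"] = has_unmatched_bracket(tokens)
    return feats


if __name__ == "__main__":
    tokens = ["Aspirin", "inhibits", "COX-2", "."]
    pos_tags = ["NN", "VBZ", "NN", "."]
    for key, value in token_features(tokens, pos_tags, 2).items():
        print(key, "=", value)
```

Both readings of the word-shape row are covered: `word_shape(word)` replaces every character individually, while `collapse=True` replaces each run of same-class characters with a single symbol. The token/POS combination features from the table could be added in the same way by pairing `tokens` and `pos_tags` at the listed offsets.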