Bag-of-words |
Unigrams: w0, w−1, w1, w−2, w2; |
Bigrams: w−2w−1, w−1w0, w0w1, w1w2; |
Trigrams: w−2w−1w0, w−1w0w1, w0w1w2 |
wi is the token at position i relative to the current token. |
Part-of-speech (POS) tags |
Unigrams: p0, p−1, p1, p−2, p2; |
Bigrams: p−2p−1, p−1p0, p0p1, p1p2; |
Trigrams: p−2p−1p0, p−1p0p1, p0p1p2 |
pi is the POS tag at position i relative to the current token. |
Combinations of tokens and POS tags |
w−1p−2, w1p−1, w−1p0, w2p−1, w0p0, w0p1, w1p0, w1p1, w1p2 |
Sentence information |
Length of the current sentence; whether the current sentence contains an unmatched bracket. |
Affixes |
Prefixes and suffixes of length 1 to 5. |
Orthographical features |
Whether the current word is in upper case, contains a digit, has uppercase characters inside, etc. |
Word shapes |
Each uppercase character, lowercase character, digit and other character in the current word is replaced by ‘A’, ‘a’, ‘#’ and ‘-’ respectively; in a second variant, each run of consecutive characters of the same class is replaced by a single such symbol. |
Section information |
Whether the current word belongs to the title or the abstract. |
Word representation features [5] |
Brown clustering (https://github.com/percyliang/brown-cluster); Word2vec (https://code.google.com/p/word2vec/). |
Dictionary features |
Chemical dictionary: CTD, DrugBank, MeSH, Pharmacogenetics Knowledge Base (PharmGKB) (26), UMLS, and Wikipedia; |
Disease dictionary: CTD, MeSH, UMLS, Disease Ontology (27), National Drug File Reference Terminology (NDF-RT) (28) and Wikipedia. |
Frequency features |
Whether the frequency of the current word exceeds a given threshold (4 in our system) and its inverse document frequency is below another given threshold (0.1 in our system). |
Character N-grams |
Character N-grams (N = 1, 2, 3, 4) within the current word. |
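
Several of the token-level features listed in the table can be illustrated with a short Python sketch. This is not the system's actual implementation: the function names, the `<PAD>` symbol used for positions outside the sentence, the `|` separator inside n-gram values and the feature naming scheme are all assumptions made for illustration. The context-window function applies equally to tokens (prefix `w`) and POS tags (prefix `p`).

```python
def context_ngrams(items, i, prefix, pad="<PAD>"):
    """Unigrams, bigrams and trigrams in a [-2, +2] window around position i.

    `items` is a list of tokens or POS tags; `pad` (an assumption) stands in
    for positions that fall outside the sentence.
    """
    def at(k):
        j = i + k
        return items[j] if 0 <= j < len(items) else pad

    feats = {}
    for n in (1, 2, 3):
        # Every n-gram that fits entirely inside offsets -2..+2.
        for start in range(-2, 3 - n + 1):
            offsets = range(start, start + n)
            name = "".join(f"{prefix}[{k}]" for k in offsets)
            feats[name] = "|".join(at(k) for k in offsets)
    return feats


def affixes(word, max_len=5):
    """Prefixes and suffixes of length 1..max_len (capped at the word length)."""
    feats = {}
    for n in range(1, min(max_len, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


def orthographic(word):
    """A few orthographic flags; the table's 'etc.' is left open."""
    return {
        "all_upper": word.isupper(),
        "has_digit": any(ch.isdigit() for ch in word),
        "inner_upper": any(ch.isupper() for ch in word[1:]),
    }


def word_shape(word, collapse=False):
    """Map characters to 'A', 'a', '#', '-'; optionally collapse repeated symbols."""
    shape = []
    for ch in word:
        if ch.isupper():
            sym = "A"
        elif ch.islower():
            sym = "a"
        elif ch.isdigit():
            sym = "#"
        else:
            sym = "-"
        if collapse and shape and shape[-1] == sym:
            continue  # same class as the previous character: keep one symbol
        shape.append(sym)
    return "".join(shape)


def char_ngrams(word, max_n=4):
    """All character n-grams of length 1..max_n within the word."""
    return [word[s:s + n]
            for n in range(1, max_n + 1)
            for s in range(len(word) - n + 1)]
```

For example, `word_shape("IL-2")` yields `AA-#` and `word_shape("IL-2", collapse=True)` yields `A-#`, while `context_ngrams(tokens, i, "w")` produces the five unigrams, four bigrams and three trigrams of the bag-of-words row.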