2016 Jun 5;2016:baw077. doi: 10.1093/database/baw077

Table 2.

Features used in two individual sequence labeling modules: CRFs and SSVMs.

Feature	Description
Bag-of-words	Unigrams: w₀, w₋₁, w₁, w₋₂, w₂;
	Bigrams: w₋₂w₋₁, w₋₁w₀, w₀w₁, w₁w₂;
	Trigrams: w₋₂w₋₁w₀, w₋₁w₀w₁, w₀w₁w₂
	wᵢ is the token at position i relative to the current token.
Part-of-speech (POS) tags	Unigrams: p₀, p₋₁, p₁, p₋₂, p₂;
	Bigrams: p₋₂p₋₁, p₋₁p₀, p₀p₁, p₁p₂;
	Trigrams: p₋₂p₋₁p₀, p₋₁p₀p₁, p₀p₁p₂
	pᵢ is the POS tag at position i relative to the current token.
Combinations of tokens and POS tags	w₋₁p₋₂, w₁p₋₁, w₋₁p₀, w₂p₋₁, w₀p₀, w₀p₁, w₁p₀, w₁p₁, w₁p₂
Sentence information	Length of the current sentence; whether the current sentence contains an unmatched bracket.
Affixes	Prefixes and suffixes of lengths 1 to 5.
Orthographical features	Whether the current word is all uppercase, contains a digit, has internal uppercase characters, etc.
Word shapes	Each uppercase character, lowercase character, digit or other character in the current word (or each run of consecutive characters of the same type) is replaced by ‘A’, ‘a’, ‘#’ or ‘-’, respectively.
Section information	Whether the current word belongs to the title or the abstract.
Word representation features [5]	Brown clustering (https://github.com/percyliang/brown-cluster); Word2vec (https://code.google.com/p/word2vec/).
Dictionary features	Chemical dictionary: CTD, DrugBank, MeSH, Pharmacogenetics Knowledge Base (PharmGKB) (26), UMLS, and Wikipedia;
	Disease dictionary: CTD, MeSH, UMLS, Disease Ontology (27), National Drug File Reference Terminology (NDF-RT) (28) and Wikipedia.
Frequency features	Whether the frequency of the current word exceeds a given threshold (4 in our system) and its inverse document frequency is below another given threshold (0.1 in our system).
Character N-grams	Character N-grams (N = 1, 2, …, 4) within the current word.
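
Several of the surface features in Table 2 are simple string operations, and a small example may make their definitions more concrete. The following is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names, the feature-key format, the `<PAD>` symbol for positions outside the sentence, and the set of bracket characters are all hypothetical choices made for illustration. It covers the context-window n-grams, word shapes, affixes, character N-grams and the unmatched-bracket check.

```python
import re


def context_ngrams(tokens, i, prefix="w", pad="<PAD>"):
    """Unigrams, bigrams and trigrams over the window [-2, +2] around position i.

    Works for tokens (prefix="w") or POS tags (prefix="p").
    The padding symbol is an assumption; Table 2 does not specify one.
    """
    def at(k):
        j = i + k
        return tokens[j] if 0 <= j < len(tokens) else pad

    feats = {}
    for n in (1, 2, 3):
        # Every start offset whose n-gram fits inside the window [-2, +2].
        for start in range(-2, 3 - n + 1):
            offsets = range(start, start + n)
            name = "".join(f"{prefix}[{k}]" for k in offsets)
            feats[name] = "|".join(at(k) for k in offsets)
    return feats


def _char_class(c):
    """Map one character to 'A', 'a', '#' or '-'."""
    if c.isupper():
        return "A"
    if c.islower():
        return "a"
    if c.isdigit():
        return "#"
    return "-"


def word_shape(word, collapse=False):
    """Per-character shape; with collapse=True, runs of one class become one symbol."""
    shape = "".join(_char_class(c) for c in word)
    if collapse:
        shape = re.sub(r"(.)\1+", r"\1", shape)
    return shape


def affixes(word, max_len=5):
    """Prefixes and suffixes of lengths 1..max_len (capped at the word length)."""
    feats = {}
    for n in range(1, min(max_len, len(word)) + 1):
        feats[f"prefix{n}"] = word[:n]
        feats[f"suffix{n}"] = word[-n:]
    return feats


def char_ngrams(word, max_n=4):
    """All character n-grams of the word for n = 1..max_n."""
    return [word[k:k + n]
            for n in range(1, max_n + 1)
            for k in range(len(word) - n + 1)]


def has_unmatched_bracket(tokens):
    """True if any bracket in the sentence lacks a partner.

    The bracket set (), [], {} is an assumption.
    """
    pairs = {")": "(", "]": "[", "}": "{"}
    stack = []
    for c in "".join(tokens):
        if c in pairs.values():
            stack.append(c)
        elif c in pairs:
            if not stack or stack.pop() != pairs[c]:
                return True
    return bool(stack)


if __name__ == "__main__":
    sentence = ["IL-2", "induced", "toxicity", "("]
    print(context_ngrams(sentence, 1))
    print(word_shape("IL-2"), word_shape("IL-2", collapse=True))
    print(affixes("induced"))
    print(char_ngrams("IL-2"))
    print(has_unmatched_bracket(sentence))
```

In a sequence labeler, the dictionaries and lists returned by these functions would typically be merged into one feature mapping per token before being passed to the CRF or SSVM trainer; the exact encoding depends on the toolkit used.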