2015 Jan 19;7(Suppl 1):S6. doi: 10.1186/1758-2946-7-S1-S6

Table 7.

Character and word n-gram features extracted by NERsuite by default.

| Feature | Brief description | Sample features (bigrams) |
| --- | --- | --- |
| Character n-grams | the set of all possible combinations of a token's consecutive characters, taken n at a time (n = 2, 3, 4) | {GS}, {SK}, {K2}, {21}, {14}, {4a} |
| Token n-grams | unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms, where numbers are replaced with '0's, the consecutive instances of which are compressed | {It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a} |
| Lemma n-grams | unigrams and bigrams of lemmatised surface forms | {It, attenuate}, {attenuate, GSK214a} |
| POS tag n-grams | unigrams and bigrams of part-of-speech (POS) tags | {PRP, VBD}, {VBD, NN} |
| Lemma & POS tag n-grams | unigrams and bigrams of lemmatised forms combined with POS tags | {It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN} |
| Chunk information | chunk tag of current token; surface form of the enclosing chunk's | {B-NP}; {gestation} |
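
The character and token n-gram rows of Table 7 can be illustrated with a minimal Python sketch. It is not NERsuite's own code: the function names `char_ngrams`, `normalise` and `ngrams` are hypothetical, and the normalisation mapping (uppercase to 'A', lowercase to 'a', digits to '0') is an assumption inferred from the sample values in the table. The description mentions compressing consecutive '0's, whereas the sample value AAA000a is uncompressed, so compression is left as an optional flag.

```python
import re


def char_ngrams(token, sizes=(2, 3, 4)):
    """Return every run of n consecutive characters of `token`, for each n in `sizes`."""
    return [token[i:i + n] for n in sizes for i in range(len(token) - n + 1)]


def normalise(token, compress_digits=False):
    """Map uppercase letters to 'A', lowercase letters to 'a' and digits to '0'.

    Assumption: this mapping is inferred from the sample values in Table 7,
    not taken from NERsuite itself.
    """
    shape = "".join(
        "A" if ch.isupper() else "a" if ch.islower() else "0" if ch.isdigit() else ch
        for ch in token
    )
    # Optionally collapse runs of '0' into a single '0', as the description mentions.
    return re.sub(r"0+", "0", shape) if compress_digits else shape


def ngrams(items, n):
    """Return consecutive n-tuples over a sequence of items (tokens, lemmas, tags)."""
    return [tuple(items[i:i + n]) for i in range(len(items) - n + 1)]


tokens = ["It", "attenuated", "GSK214a"]

char_bigrams = char_ngrams("GSK214a", sizes=(2,))
surface_bigrams = ngrams(tokens, 2)
normalised_bigrams = ngrams([normalise(t) for t in tokens], 2)

print(char_bigrams)
print(surface_bigrams)
print(normalised_bigrams)
```

The same `ngrams` helper would apply unchanged to lemma or POS tag sequences, since those rows of the table differ only in which per-token attribute is fed in.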