Table 7.
Feature | Brief description | Sample features (bigrams) |
---|---|---|
Character n-grams | all contiguous sequences of n characters within a token (n = 2, 3, 4) | {GS}, {SK}, {K2}, {21}, {14}, {4a} |
Token n-grams | unigrams and bigrams of surface forms; unigrams and bigrams of normalised surface forms where numbers are replaced with '0's, the consecutive instances of which are compressed | {It, attenuated}, {attenuated, GSK214a}; {Aa, aaaaaaaaaa}, {aaaaaaaaaa, AAA000a} |
Lemma n-grams | unigrams and bigrams of lemmatised surface forms | {It, attenuate}, {attenuate, GSK214a} |
POS tag n-grams | unigrams and bigrams of part-of-speech (POS) tags | {PRP, VBD}, {VBD, NN} |
Lemma & POS tag n-grams | unigrams and bigrams of lemmatised forms combined with POS tags | {It:PRP, attenuate:VBD}, {attenuate:VBD, GSK214a:NN} |
Chunk information | chunk tag of current token; surface form of the enclosing chunk's | {B-NP}; {gestation} |
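
As an illustration of the first two rows of the table, the following minimal Python sketch extracts character and token n-grams. The function names `char_ngrams` and `token_ngrams` are hypothetical and chosen for this example only; the sketch does not reproduce the original feature extraction code, and it omits normalisation, lemmatisation, POS tagging and chunking.

```python
from typing import List, Sequence, Tuple


def char_ngrams(token: str, n: int) -> List[str]:
    """Return all contiguous character sequences of length n in a token."""
    # A token shorter than n yields no n-grams.
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def token_ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous token sequences of length n in a sentence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    # Character bigrams of the sample token from the table.
    print(char_ngrams("GSK214a", 2))
    # Token bigrams of the sample fragment from the table.
    print(token_ngrams(["It", "attenuated", "GSK214a"], 2))
```

Calling `char_ngrams` with n = 2, 3 and 4 in turn would give the full character n-gram set described in the table; lemma and POS tag n-grams could be obtained by passing lemmas or tags to `token_ngrams` instead of surface forms.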