2015 Jan 19;7(Suppl 1):S9. doi: 10.1186/1758-2946-7-S1-S9

Table 1.

The baseline features.

Feature description                           Note/Regular expression
Roman numeral                                 [ivxdlcm]+|[IVXDLCM]+
Punctuation                                   [,\\.;:?!]
Starts with dash                              -.*
Nucleotide sequence                           [atgcu]+
Number                                        [0-9]+
Capitalized                                   [A-Z][a-z]*
Quote                                         [\"`']
Lemma of the current token                    Provided by BioLemmatizer [23]
2-, 3- and 4-character prefixes and suffixes
2- and 3-character n-grams                    Token start or end indicators are included
2- and 3-word n-grams
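
The features in Table 1 can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions that the table does not state: each pattern is matched against the whole token, the patterns are written as Python raw strings with the backslash escaping dropped and the capitalized pattern read as `[A-Z][a-z]*`, the character n-grams use `^` and `$` as the token start and end indicators, and the word n-grams are taken to start at the current token. All function names are hypothetical. The lemma feature is omitted because it depends on BioLemmatizer.

```python
import re

# Patterns transcribed from Table 1. Matching the whole token with
# re.fullmatch is an assumption; the table does not say how they are applied.
PATTERNS = {
    "roman_number": re.compile(r"[ivxdlcm]+|[IVXDLCM]+"),
    "punctuation": re.compile(r"[,.;:?!]"),
    "starts_with_dash": re.compile(r"-.*"),
    "nucleotide_sequence": re.compile(r"[atgcu]+"),
    "number": re.compile(r"[0-9]+"),
    "capitalized": re.compile(r"[A-Z][a-z]*"),
    "quote": re.compile(r"""["`']"""),
}


def orthographic_features(token):
    """Return the names of the Table 1 patterns that match the whole token."""
    return [name for name, pat in PATTERNS.items() if pat.fullmatch(token)]


def affixes(token, lengths=(2, 3, 4)):
    """Return prefixes and suffixes of each length that fits in the token."""
    feats = []
    for n in lengths:
        if len(token) >= n:
            feats.append(f"prefix{n}={token[:n]}")
            feats.append(f"suffix{n}={token[-n:]}")
    return feats


def char_ngrams(token, sizes=(2, 3), start="^", end="$"):
    """Return character n-grams of the token padded with start/end markers.

    The marker characters are an assumption made for this sketch.
    """
    padded = f"{start}{token}{end}"
    return [padded[i:i + n] for n in sizes for i in range(len(padded) - n + 1)]


def word_ngrams(tokens, index, sizes=(2, 3)):
    """Return word n-grams that begin at position index.

    The window (starting at the current token) is an assumption.
    """
    return [" ".join(tokens[index:index + n])
            for n in sizes if index + n <= len(tokens)]


if __name__ == "__main__":
    sentence = ["Aspirin", "inhibits", "COX", "-", "2", "."]
    for i, tok in enumerate(sentence):
        print(tok, orthographic_features(tok), affixes(tok),
              char_ngrams(tok), word_ngrams(sentence, i))
```

Because the patterns overlap, a single token can trigger several features at once: under whole-token matching the one-letter token `c` is both a Roman number and a nucleotide sequence. A sequence labeller would normally receive all matching features and learn how to weight them.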