2015 Jan 19;7(Suppl 1):S3. doi: 10.1186/1758-2946-7-S1-S3

Table 1.

Comparison of Model 1 and Model 2.

| Aspect | Model 1 | Model 2 |
|---|---|---|
| System adapted | BANNER [22] | tmVar [24] |
| Preprocessing | | |
| Unicode transliteration | No | Yes |
| Tokenization | whitespace, punctuation, digits, lowercase to uppercase | whitespace, punctuation, digits, lowercase to uppercase, uppercase to lowercase |
| Sentence segmentation | Java BreakIterator | None |
| Conditional random field configuration and settings | | |
| Implementation | MALLET [25] | CRF++ [23] |
| Order | 1 | 2 |
| Label model | IOB with one entity label | IOB with one entity label |
| Regularization | L2 | L2 |
| Gaussian prior variance (σ) | 1.0 | 4.0 |
| Feature frequency threshold | 0 | 3 |
| Features | | |
| Individual tokens | Yes | Yes |
| Morphology | Lemmatization | Stemming |
| Part of speech | Yes | No |
| Word shapes | Yes | Yes |
| Characters | N-grams of length 2 to 4 | Prefixes and suffixes of length 2 to 5 |
| Character counts | None | Total characters, digits, uppercase, lowercase |
| ChemSpot [4] | Yes | No |
| Semantic affixes | None | Suffixes, alkane stems, trivial rings, simple multipliers, etc. |
| Chemical elements | Name and symbol | Name |
| Amino acids | Name, 3-char abbreviation, 1-char abbreviation | None |
| Chemical formulas | Within a single token | None |
| Amino acid sequences | Across tokens | None |
| Context window | 2 | 3 |
| Post-processing | | |
| Consistency | Yes | No |
| Abbreviation resolution | Yes | Yes |
| Parenthesis balancing | Yes | Yes |
| Chemical identifiers | Yes | Yes |

This table compares the setup and configuration of Model 1 and Model 2.
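
The tokenization row can be illustrated with a short sketch. It is one plausible reading of the listed break conditions, not the code used in either system. It assumes tokens are split at whitespace, around every punctuation character, between digits and letters, and at lowercase-to-uppercase transitions, with the uppercase-to-lowercase break enabled only for Model 2. The names `tokenize`, `char_class`, `is_boundary`, and `split_upper_to_lower` are hypothetical.

```python
def char_class(ch):
    """Coarse character class used to decide where tokens break."""
    if ch.isdigit():
        return "digit"
    if ch.isalpha():
        return "upper" if ch.isupper() else "lower"
    return "punct"  # assumption: any other non-space character is punctuation


def is_boundary(prev, ch, split_upper_to_lower):
    """Return True if a token break falls between prev and ch."""
    a, b = char_class(prev), char_class(ch)
    if a == "punct" or b == "punct":
        return True  # each punctuation character becomes its own token
    if (a == "digit") != (b == "digit"):
        return True  # break between digits and letters
    if a == "lower" and b == "upper":
        return True  # lowercase-to-uppercase break (both models)
    if split_upper_to_lower and a == "upper" and b == "lower":
        return True  # uppercase-to-lowercase break (Model 2 only)
    return False


def tokenize(text, split_upper_to_lower=False):
    """Split text at whitespace, then at the character-class boundaries."""
    tokens = []
    for chunk in text.split():  # whitespace break; chunks are never empty
        current = chunk[0]
        for prev, ch in zip(chunk, chunk[1:]):
            if is_boundary(prev, ch, split_upper_to_lower):
                tokens.append(current)
                current = ch
            else:
                current += ch
        tokens.append(current)
    return tokens


print(tokenize("NaCl 2-methylpropane"))
# ['Na', 'Cl', '2', '-', 'methylpropane']
print(tokenize("NaCl 2-methylpropane", split_upper_to_lower=True))
# ['N', 'a', 'C', 'l', '2', '-', 'methylpropane']
```

Under this reading, the extra uppercase-to-lowercase break in Model 2 yields finer-grained tokens, so element symbols such as "Na" are split into single characters.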