Table 1.
Comparison of Model 1 and Model 2.
| Aspect | Model 1 | Model 2 |
|---|---|---|
| System adapted | BANNER [22] | tmVar [24] |
| Preprocessing | ||
| Unicode transliteration | No | Yes |
| Tokenization | Whitespace, punctuation, digits, lowercase to uppercase | Whitespace, punctuation, digits, lowercase to uppercase, uppercase to lowercase |
| Sentence segmentation | Java BreakIterator | None |
| Conditional random field configuration and settings | ||
| Implementation | MALLET [25] | CRF++ [23] |
| Order | 1 | 2 |
| Label model | IOB with one entity label | IOB with one entity label |
| Regularization | L2 | L2 |
| Gaussian prior variance (σ) | 1.0 | 4.0 |
| Feature frequency threshold | 0 | 3 |
| Features | ||
| Individual tokens | Yes | Yes |
| Morphology | Lemmatization | Stemming |
| Part of speech | Yes | No |
| Word shapes | Yes | Yes |
| Characters | N-grams of length 2-4 | Prefixes and suffixes of length 2-5 |
| Character counts | None | Total characters, digits, uppercase, lowercase |
| ChemSpot [4] | Yes | No |
| Semantic affixes | None | Suffixes, alkane stems, trivial rings, simple multipliers, etc. |
| Chemical elements | Name and symbol | Name |
| Amino acids | Name, 3-char abbreviation, 1-char abbreviation | None |
| Chemical formulas | Within a single token | None |
| Amino acid sequences | Across tokens | None |
| Context window | 2 | 3 |
| Postprocessing | ||
| Consistency | Yes | No |
| Abbreviation resolution | Yes | Yes |
| Parenthesis balancing | Yes | Yes |
| Chemical identifiers | Yes | Yes |
This table compares the setup and configuration of Model 1 and Model 2.
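The tokenization row can be illustrated with a minimal sketch. It assumes that each listed item (whitespace, punctuation, digits, case transitions) marks a token boundary, that each punctuation character becomes its own token, and that only ASCII letters and digits need handling. The function name `tokenize` and the flag `split_upper_to_lower` are hypothetical and are not taken from BANNER or tmVar.

```python
import re

# Assumption: ASCII letters and digits only; any other non-space character
# is emitted as a one-character punctuation token.
_PIECES = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9\s]")
_LOWER_TO_UPPER = r"(?<=[a-z])(?=[A-Z])"  # boundary used by both models
_UPPER_TO_LOWER = r"(?<=[A-Z])(?=[a-z])"  # extra boundary listed for Model 2


def tokenize(text, split_upper_to_lower=False):
    """Split text at whitespace, punctuation, digits, and case transitions."""
    boundary = _LOWER_TO_UPPER
    if split_upper_to_lower:
        boundary += "|" + _UPPER_TO_LOWER
    tokens = []
    for piece in _PIECES.findall(text):
        if piece[0].isalpha():
            # Zero-width split at case transitions (requires Python 3.7+).
            tokens.extend(re.split(boundary, piece))
        else:
            tokens.append(piece)
    return tokens


print(tokenize("NaCl 2mg"))                             # Model 1 style
print(tokenize("NaCl 2mg", split_upper_to_lower=True))  # Model 2 style
```

Under these assumptions, the Model 1 style splits `NaCl` into `Na` and `Cl`, while the Model 2 style breaks it into single letters, which shows how much finer the second tokenization is.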