2021 Feb 6;24(3):102155. doi: 10.1016/j.isci.2021.102155

Table 2.

Examples of how different tokenizers split sentences into tokens

Reagents (NH4)2HPO4 and Sm2O3 were mixed
NLTK: Reagents | ( | NH4 | ) | 2HPO4 | and | Sm2O3 | were | mixed
SpaCy: Reagents | ( | NH4)2HPO4 | and | Sm2O3 | were | mixed
OSCAR4: Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed
ChemicalTagger: Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed
ChemDataExtractor: Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed

We made Eu2+-doped Ba3Ce(PO4)3 at 1200°C for 2 h
NLTK: We | made | Eu2+-doped | Ba3Ce | ( | PO4 | ) | 3 | at | 1200 | °C | for | 2 | h
SpaCy: We | made | Eu2 | + | -doped | Ba3Ce(PO4)3 | at | 1200 | ° | C | for | 2 | h
OSCAR4: We | made | Eu2+ | - | doped | Ba3Ce(PO4)3 | at | 1200 | °C | for | 2 | h
ChemicalTagger: We | made | Eu2+-doped | Ba3Ce(PO4)3 | at | 1200 | °C | for | 2 | h
ChemDataExtractor: We | made | Eu2+ | - | doped | Ba3Ce(PO4)3 | at | 1200 | ° | C | for | 2 | h

Lead-free a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 ceramics were investigated
NLTK: Lead-free | a | ( | Bi0.5NA0.5 | ) | TiO3-bBaTiO3-c | ( | Bi0.5K0.5 | ) | TiO3 | ceramics | were | investigated
SpaCy: Lead | - | free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated
OSCAR4: Lead | - | free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated
ChemicalTagger: Lead-free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated
ChemDataExtractor: Lead-free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated

NLTK (Bird et al., 2009) and SpaCy (Honnibal and Johnson, 2015) are general-purpose tokenizing tools, whereas ChemDataExtractor (Swain and Cole, 2016), OSCAR4 (Jessop et al., 2011), and ChemicalTagger (Hawizy et al., 2011) are tools trained for a scientific corpus. Token boundaries are marked by the “|” symbol.
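
The contrast in the first example can be illustrated with a minimal sketch that uses only the Python standard library. It does not reproduce the algorithm of any tool listed in the table: `naive_tokenize`, `formula_aware_tokenize`, and `show` are hypothetical helpers introduced here for illustration, under the assumption that the main difference is whether brackets are treated as standalone tokens.

```python
import re

# Hypothetical helpers for illustration only; none of the tools in the
# table is implemented this way.

def naive_tokenize(sentence):
    """General-purpose style: every bracket becomes its own token."""
    # Match either a single bracket, or a run of characters that are
    # neither whitespace nor brackets.
    return re.findall(r"[()\[\]]|[^\s()\[\]]+", sentence)

def formula_aware_tokenize(sentence):
    """Chemistry-friendly style: split on whitespace only, so a
    bracketed formula such as (NH4)2HPO4 stays in one piece."""
    # Simplification: trailing punctuation (e.g. a final period) would
    # remain attached to the last token.
    return sentence.split()

def show(tokens):
    """Render tokens with the same "|" boundary marker as the table."""
    return " | ".join(tokens)

sentence = "Reagents (NH4)2HPO4 and Sm2O3 were mixed"

print(show(naive_tokenize(sentence)))
# Reagents | ( | NH4 | ) | 2HPO4 | and | Sm2O3 | were | mixed

print(show(formula_aware_tokenize(sentence)))
# Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed
```

For this one sentence, the naive splitter fragments the formula into five tokens, while the whitespace-only splitter keeps it whole. The other two examples involve further rules (charges, units, hyphens) that this sketch does not attempt to model.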