Table 2.
Examples of how different tokenizers split sentences into tokens
| Reagents (NH4)2HPO4 and Sm2O3 were mixed | |
| NLTK | Reagents | ( | NH4 | ) | 2HPO4 | and | Sm2O3 | were | mixed |
| SpaCy | Reagents | ( | NH4)2HPO4 | and | Sm2O3 | were | mixed |
| OSCAR4 | Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed |
| ChemicalTagger | Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed |
| ChemDataExtractor | Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed |
| We made Eu2+-doped Ba3Ce(PO4)3 at 1200°C for 2 h | |
| NLTK | We | made | Eu2+-doped | Ba3Ce | ( | PO4 | ) | 3 | at | 1200 | °C | for | 2 | h |
| SpaCy | We | made | Eu2 | + | -doped | Ba3Ce(PO4)3 | at | 1200 | ° | C | for | 2 | h |
| OSCAR4 | We | made | Eu2+ | - | doped | Ba3Ce(PO4)3 | at | 1200 | °C | for | 2 | h |
| ChemicalTagger | We | made | Eu2+-doped | Ba3Ce(PO4)3 | at | 1200 | °C | for | 2 | h |
| ChemDataExtractor | We | made | Eu2+ | - | doped | Ba3Ce(PO4)3 | at | 1200 | ° | C | for | 2 | h |
| Lead-free a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 ceramics were investigated | |
| NLTK | Lead-free | a | ( | Bi0.5NA0.5 | ) | TiO3-bBaTiO3-c | ( | Bi0.5K0.5 | ) | TiO3 | ceramics | were | investigated |
| SpaCy | Lead | - | free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated |
| OSCAR4 | Lead | - | free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated |
| ChemicalTagger | Lead-free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated |
| ChemDataExtractor | Lead-free | a(Bi0.5NA0.5)TiO3-bBaTiO3-c(Bi0.5K0.5)TiO3 | ceramics | were | investigated |
NLTK (Bird et al., 2009) and SpaCy (Honnibal and Johnson, 2015) are general-purpose tokenizing tools, whereas ChemDataExtractor (Swain and Cole, 2016), OSCAR4 (Jessop et al., 2011), and ChemicalTagger (Hawizy et al., 2011) are tools trained on a scientific corpus. Token boundaries are marked by the “|” symbol.
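The contrast in the first example of Table 2 can be illustrated with a minimal sketch. The two functions below are hypothetical toy rules written for illustration only; they are not the algorithms used by NLTK, OSCAR4, or any other tool in the table. The first splits off every punctuation character, which breaks a formula at its brackets. The second splits on whitespace only, which keeps the formula whole.

```python
import re


def generic_tokenize(sentence: str) -> list:
    """Toy general-purpose rule (an assumption, not NLTK's algorithm):
    runs of letters/digits are tokens, and every other non-space
    character becomes its own token."""
    return re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", sentence)


def chemistry_aware_tokenize(sentence: str) -> list:
    """Toy chemistry-aware rule (an assumption, not OSCAR4's algorithm):
    split on whitespace only, so bracketed formulas stay intact."""
    return sentence.split()


sentence = "Reagents (NH4)2HPO4 and Sm2O3 were mixed"

# Join tokens with the same "|" delimiter used in Table 2.
print(" | ".join(generic_tokenize(sentence)))
# Reagents | ( | NH4 | ) | 2HPO4 | and | Sm2O3 | were | mixed
print(" | ".join(chemistry_aware_tokenize(sentence)))
# Reagents | (NH4)2HPO4 | and | Sm2O3 | were | mixed
```

For this one sentence the toy rules happen to reproduce the NLTK and OSCAR4 rows, but they would not match the tools on the other two examples, where handling of hyphens, charges, and units also differs.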