Table 1. Top-1 Accuracy Using Different Molecule Formats, Tokenization Schemes, and Embedding Strategies^a
| Task | Format | Atom-level, FS | Atom-level, PT | BPE, FS | BPE, PT |
|---|---|---|---|---|---|
| Product Prediction (with Reagents) | SMILES | 0.879 | 0.865 | 0.854 | 0.512 |
| Product Prediction (with Reagents) | SELFIES | 0.768 | 0.721 | 0.654 | 0.313 |
| Product Prediction (without Reagents) | SMILES | 0.837 | 0.827 | 0.807 | 0.589 |
| Product Prediction (without Reagents) | SELFIES | 0.745 | 0.695 | 0.623 | 0.379 |
| Reactant Prediction (with Reagents) | SMILES | 0.678 | 0.643 | 0.660 | 0.421 |
| Reactant Prediction (with Reagents) | SELFIES | 0.610 | 0.545 | 0.540 | 0.301 |
| Reactant Prediction (without Reagents) | SMILES | 0.525 | 0.504 | 0.514 | 0.401 |
| Reactant Prediction (without Reagents) | SELFIES | 0.472 | 0.449 | 0.427 | 0.311 |
| Reagent Prediction | SMILES | 0.196 | 0.135 | 0.183 | 0.211 |
| Reagent Prediction | SELFIES | 0.187 | 0.122 | 0.174 | 0.196 |
^a FS: input embeddings trained from scratch; PT: pre-trained input embeddings.