Table 5.
Performance analysis of tokenization schemes for molecular property prediction using the MoleculeNet benchmark suite
Dataset | SMILES | DeepSMILES | SELFIES | SmilesPE | AIS |
---|---|---|---|---|---|
Regression datasets: RMSE (lower is better) | | | | | |
ESOL | 0.628 | 0.631 | 0.675 | 0.689 | **0.553** |
FreeSolv | 0.545 | 0.544 | 0.564 | 0.761 | **0.441** |
Lip | 0.924 | 0.895 | 0.938 | 0.800 | **0.683** |
Classification datasets: ROC-AUC (higher is better) | | | | | |
BBBP | 0.758 | 0.777 | 0.799 | 0.847 | **0.885** |
BACE | 0.740 | 0.774 | 0.746 | **0.837** | 0.835 |
HIV | 0.649 | 0.648 | 0.653 | **0.739** | 0.729 |
Comparison of Random Forest regression and classification models evaluated with 5-fold cross-validation. Bold denotes the best-performing approach for each dataset (lowest RMSE for regression, highest ROC-AUC for classification).
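
Because the two metrics point in opposite directions, "best" means the minimum for the RMSE rows and the maximum for the ROC-AUC rows. The short sketch below shows one way to recover the best tokenization per dataset from the values in Table 5. The scores are transcribed from the table; the helper name `best_per_dataset` and the variable names are illustrative assumptions, not part of the original study.

```python
# Scores transcribed from Table 5; list order matches the table columns.
TOKENIZERS = ["SMILES", "DeepSMILES", "SELFIES", "SmilesPE", "AIS"]

RMSE = {  # regression datasets: lower is better
    "ESOL": [0.628, 0.631, 0.675, 0.689, 0.553],
    "FreeSolv": [0.545, 0.544, 0.564, 0.761, 0.441],
    "Lip": [0.924, 0.895, 0.938, 0.800, 0.683],
}

ROC_AUC = {  # classification datasets: higher is better
    "BBBP": [0.758, 0.777, 0.799, 0.847, 0.885],
    "BACE": [0.740, 0.774, 0.746, 0.837, 0.835],
    "HIV": [0.649, 0.648, 0.653, 0.739, 0.729],
}


def best_per_dataset(scores, lower_is_better):
    """Return {dataset: (tokenizer, score)} for the best score in each row.

    Hypothetical helper for reading the table, not the authors' code.
    """
    pick = min if lower_is_better else max
    best = {}
    for dataset, values in scores.items():
        # Index of the best value in this row.
        idx = pick(range(len(values)), key=values.__getitem__)
        best[dataset] = (TOKENIZERS[idx], values[idx])
    return best


best_rmse = best_per_dataset(RMSE, lower_is_better=True)
best_auc = best_per_dataset(ROC_AUC, lower_is_better=False)

for name, result in (("RMSE", best_rmse), ("ROC-AUC", best_auc)):
    for dataset, (tokenizer, score) in result.items():
        print(f"{name:8s} {dataset:9s} {tokenizer:9s} {score:.3f}")
```

On these numbers, AIS gives the lowest RMSE on all three regression datasets and the highest ROC-AUC on BBBP, while SmilesPE is marginally ahead on BACE and HIV.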