
Table 5.

List of popular small molecule pretrained embedders

| Name | Year | Training data | Architecture | # of parameters | Link | Ref. |
|---|---|---|---|---|---|---|
| Chemical VAE | 2018 | 250 000 drug-like molecules from ZINC; 108 000 molecules from the QM9 data set under 9 heavy atoms | Autoencoder | 4.2 M | https://github.com/aspuru-guzik-group/chemical_vae | [80] |
| SMILES-BERT | 2019 | 18.7 million compounds sampled from ZINC | Encoder-only transformer | 13 M | https://github.com/uta-smile/SMILES-BERT | [88] |
| ChemBERTa/ChemBERTa-v2 | 2020 | 250 000 drug-like molecules from ZINC | Encoder-only transformer | 5–77 M | https://github.com/seyonechithrananda/bert-loves-chemistry | [99] |
| MolBERT | 2020 | 1.27 million molecules from the GuacaMol benchmark data set | Encoder-only transformer | 85 M | https://github.com/BenevolentAI/MolBERT | [100] |
| MegaMolBART | 2021 | 1.45 billion ‘reactive’ molecules from ZINC under 500 Da and logP ≤ 5 | Encoder-decoder transformer | 45–230 M | https://github.com/NVIDIA/MegaMolBART | [101] |
| Molformer | 2022 | 1.1 billion molecules from ZINC and PubChem | Encoder-only transformer | 110 M | https://github.com/IBM/molformer | [102] |
| Chemformer/MolBART | 2022 | 100 million molecules randomly sampled from ZINC under 500 Da and logP ≤ 5 | Encoder-decoder transformer | 45–230 M | https://github.com/MolecularAI/Chemformer | [90] |
| X-MOL | 2022 | 1.1 billion molecules from the ZINC database | Encoder-decoder transformer | 110 M | https://github.com/bm2-lab/X-MOL | [103] |
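In practice, the encoder-only models in this table are typically used by tokenizing a SMILES string and pooling the transformer's hidden states into a fixed-length vector that downstream property-prediction models can consume. The sketch below illustrates this pattern with a ChemBERTa-style checkpoint loaded through the Hugging Face `transformers` library; the specific checkpoint name and the mean-pooling strategy are illustrative assumptions, not a prescription from the table's source papers.

```python
# Minimal sketch: embed a SMILES string with an encoder-only transformer
# (ChemBERTa-style). The checkpoint name below is an assumed example hosted
# on the Hugging Face Hub; substitute the embedder you actually intend to use.
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "seyonec/ChemBERTa-zinc-base-v1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, written as SMILES
inputs = tokenizer(smiles, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states over the token dimension to obtain a
# single fixed-length embedding per molecule (shape: [batch, hidden_size]).
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```

The pooled vector can then be fed to a simple regressor or classifier; encoder-decoder models such as Chemformer or MegaMolBART expose analogous encoder states, though their checkpoints are distributed through their respective repositories rather than a single common hub.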