| Model | Year | Pre-training data | Architecture | Parameters | Code | Ref. |
|---|---|---|---|---|---|---|
| Chemical VAE | 2018 | 250 000 drug-like molecules from ZINC; 108 000 molecules (up to nine heavy atoms) from the QM9 data set | Variational autoencoder | 4.2 M | https://github.com/aspuru-guzik-group/chemical_vae | [80] |
| SMILES-BERT | 2019 | 18.7 million compounds sampled from ZINC | Encoder-only transformer | 13 M | https://github.com/uta-smile/SMILES-BERT | [88] |
| ChemBERTa/ChemBERTa-v2 | 2020 | 250 000 drug-like molecules from ZINC | Encoder-only transformer | 5–77 M | https://github.com/seyonechithrananda/bert-loves-chemistry | [99] |
| MolBERT | 2020 | 1.27 million molecules from the GuacaMol benchmark data set | Encoder-only transformer | 85 M | https://github.com/BenevolentAI/MolBERT | [100] |
| MegaMolBART | 2021 | 1.45 billion ‘reactive’ molecules from ZINC (≤ 500 Da, logP ≤ 5) | Encoder–decoder transformer | 45–230 M | https://github.com/NVIDIA/MegaMolBART | [101] |
| Molformer | 2022 | 1.1 billion molecules from ZINC and PubChem | Encoder-only transformer | 110 M | https://github.com/IBM/molformer | [102] |
| Chemformer/MolBART | 2022 | 100 million molecules randomly sampled from ZINC (≤ 500 Da, logP ≤ 5) | Encoder–decoder transformer | 45–230 M | https://github.com/MolecularAI/Chemformer | [90] |
| X-MOL | 2022 | 1.1 billion molecules from the ZINC database | Encoder–decoder transformer | 110 M | https://github.com/bm2-lab/X-MOL | [103] |
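Several of the encoder-only models listed above (for example, the ChemBERTa family) distribute pretrained checkpoints that can be loaded through the Hugging Face `transformers` API. The sketch below shows one common usage pattern, embedding SMILES strings into fixed-size vectors for downstream property prediction; the checkpoint identifier is an illustrative assumption, not a recommendation, so substitute whichever released model you intend to use.

```python
# Minimal sketch: embedding SMILES strings with an encoder-only chemical
# language model via the Hugging Face `transformers` API.
import torch
from transformers import AutoModel, AutoTokenizer

# Assumed example checkpoint; replace with the pretrained model of your choice.
checkpoint = "seyonec/ChemBERTa-zinc-base-v1"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)
model.eval()

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O"]  # ethanol, benzene, aspirin
batch = tokenizer(smiles, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    outputs = model(**batch)

# Mean-pool the final hidden states over non-padding tokens to obtain one
# fixed-size embedding per molecule.
mask = batch["attention_mask"].unsqueeze(-1).float()  # (batch, seq_len, 1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # e.g. torch.Size([3, hidden_size])
```

The same pattern applies to the other encoder-only models in the table; encoder–decoder models such as Chemformer are instead typically used through their own repositories for sequence-to-sequence tasks (e.g., reaction prediction or molecule generation).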