2022 Oct 4;62(20):4852–4862. doi: 10.1021/acs.jcim.2c00715

Table 1. Unfiltered and Filtered Pre-Training Dataset Sizes and the Proportions They Represent of the Final Dataset^a.

| dataset | GDB-13 | Zinc 15 | PubChem | ChEMBL | USPTO | total |
|---|---|---|---|---|---|---|
| unfiltered size (molecules) | 977,468,301 | 389,000,000 | 206,550,222 | 1,920,027 | 994,838 | 1,575,933,388 |
| proportion of unfiltered total (%) | 81 | 10 | 9 | 0.2 | 0.00825 | 100 |
a The filtered aggregate dataset contains only unique SMILES, totaling 1,264,754,823 molecules.
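
The size figures in Table 1 can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original work: the variable names are illustrative assumptions, and the counts are copied from the unfiltered-size row and footnote a. It sums the per-source counts, compares the sum with the reported total, and computes how many molecules the filtering step removed.

```python
# Unfiltered molecule counts per source, copied from Table 1.
unfiltered = {
    "GDB-13": 977_468_301,
    "Zinc 15": 389_000_000,
    "PubChem": 206_550_222,
    "ChEMBL": 1_920_027,
    "USPTO": 994_838,
}
reported_total = 1_575_933_388  # "total" column of Table 1
filtered_total = 1_264_754_823  # unique SMILES after filtering (footnote a)

# Sum the per-source counts and compare with the reported total.
unfiltered_total = sum(unfiltered.values())
totals_match = unfiltered_total == reported_total

# Molecules dropped by filtering, and the fraction that remains.
removed = unfiltered_total - filtered_total
retained_fraction = filtered_total / unfiltered_total

print(f"sum of sources:         {unfiltered_total:,}")
print(f"matches reported total: {totals_match}")
print(f"removed by filtering:   {removed:,}")
print(f"retained after filter:  {retained_fraction:.1%}")
```

Under these inputs the per-source counts sum exactly to the reported unfiltered total, and filtering removes 311,178,565 entries, leaving roughly 80% of the unfiltered aggregate.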