Table 1. Unfiltered and Filtered Pre-Training Dataset Sizes and the Proportions They Represent of the Final Dataset.ᵃ
dataset | GDB-13 | Zinc 15 | PubChem | ChEMBL | USPTO | total |
---|---|---|---|---|---|---|
unfiltered size (molecules) | 977,468,301 | 389,000,000 | 206,550,222 | 1,920,027 | 994,838 | 1,575,933,388 |
proportion of unfiltered total (%) | 81 | 10 | 9 | 0.2 | 0.00825 | 100 |
ᵃ The filtered aggregate dataset contains only unique SMILES, totaling 1,264,754,823 molecules.
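
The footnote states that the filtered aggregate keeps only unique SMILES. As a minimal sketch of what such a merge-and-deduplicate step could look like, the Python below combines several sources and keeps the first occurrence of each string. It assumes the SMILES have already been canonicalized, so that exact string equality stands in for molecular identity; the source does not describe how filtering was actually done. The function name `unique_smiles` and the toy inputs are hypothetical and are not taken from the datasets in Table 1.

```python
from typing import Dict, Iterable, Iterator


def unique_smiles(sources: Dict[str, Iterable[str]]) -> Iterator[str]:
    """Yield each SMILES string once, in first-seen order across all sources.

    Assumption: strings are already canonical, so exact string equality
    is treated as molecular identity. This is an illustrative choice,
    not the procedure documented for the table.
    """
    seen = set()
    for _name, smiles_iter in sources.items():
        for smi in smiles_iter:
            smi = smi.strip()
            # Skip blank lines and anything already emitted.
            if smi and smi not in seen:
                seen.add(smi)
                yield smi


# Toy, made-up inputs for illustration only.
toy_sources = {
    "source_a": ["CCO", "c1ccccc1", "CCO"],
    "source_b": ["c1ccccc1", "CC(=O)O"],
}
merged = list(unique_smiles(toy_sources))
unfiltered = sum(len(v) for v in toy_sources.values())
print(f"{len(merged)} unique of {unfiltered} total")
```

An in-memory set is only practical for small inputs. At the scale reported in the table, a sort-based or sharded approach would likely be needed, which this sketch does not attempt.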