Table 6.
Impact of the training set size on GEN performance
Dataset and evaluated size | Augmented size (real factor) [a] | Best model epoch # | Validity% | Uniqueness% | Training% | Length match% [b] | HAC match% [c] |
---|---|---|---|---|---|---|---|
PubChem225k | |||||||
9k | 54,624 (4.8) | 10, 10, 10 | 81.3 ± 0.9 | 100.0 ± 0.0 | 0.3 ± 0.1 | 97.7 ± 0.0 | 90.5 ± 0.0 |
45k | 218,124 (4.8) | 5, 5, 5 | 95.6 ± 0.7 | 99.9 ± 0.1 | 2.6 ± 0.5 | 99.0 ± 0.0 | 94.7 ± 0.0 |
225k | 1,088,864 (4.8) | 4, 4, 4 | 98.3 ± 0.3 | 99.9 ± 0.0 | 11.2 ± 0.5 | 97.3 ± 0.7 | 96.6 ± 0.3 |
Chembl24 | |||||||
9k | 35,928 (4.0) | 44, 43, 45 | 74.2 ± 1.9 | 99.0 ± 0.2 | 0.2 ± 0.2 | 81.9 ± 5.4 | 95.9 ± 1.0 |
45k | 179,888 (4.0) | 5, 6, 5 | 91.9 ± 1.9 | 100.0 ± 0.0 | 0.2 ± 0.1 | 90.6 ± 2.8 | 97.6 ± 1.4 |
225k | 896,214 (4.0) | 9, 6, 6 | 94.6 ± 0.1 | 100.0 ± 0.0 | 1.4 ± 0.3 | 88.4 ± 1.6 | 98.1 ± 0.6 |
Zinc15 | |||||||
9k | 32,546 (3.6) | 24, 21, 21 | 77.2 ± 1.0 | 100.0 ± 0.0 | 0.0 ± 0.0 | 82.2 ± 3.3 | 91.2 ± 1.1 |
45k | 163,929 (3.6) | 10, 7, 11 | 90.4 ± 1.1 | 100.0 ± 0.0 | 0.1 ± 0.1 | 87.6 ± 1.2 | 92.6 ± 1.1 |
225k | 820,747 (3.6) | 4, 6, 6 | 95.2 ± 0.3 | 100.0 ± 0.0 | 0.3 ± 0.1 | 90.4 ± 1.2 | 93.5 ± 1.2 |
[a] Size of the augmented dataset after 5 random attempts per SMILES and de-duplication to unique SMILES. The real augmentation factor varies depending on the dataset.
[b] Length match between the SMILES length distributions of the training set and the generated set (see “Methods”).
[c] HAC match between the atom count distributions of the generated set and the training set (see “Methods”).
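
The real augmentation factor in the second column can fall below the nominal 5 attempts because different random attempts may yield the same SMILES string, which de-duplication then collapses. The following is a minimal sketch of that bookkeeping, not the authors' pipeline: the function name `real_augmentation_factor`, the dictionary input format, and the toy variants are all assumptions made for illustration, and the randomized SMILES are hard-coded rather than produced by a cheminformatics toolkit.

```python
def real_augmentation_factor(variants_by_molecule):
    """Return (augmented_size, real_factor) after de-duplication.

    variants_by_molecule is a hypothetical input format: a dict mapping each
    original SMILES to the list of randomized SMILES generated for it.
    """
    unique = set()
    for variants in variants_by_molecule.values():
        unique.update(variants)  # a set keeps only unique SMILES strings
    augmented_size = len(unique)
    return augmented_size, augmented_size / len(variants_by_molecule)


# Toy data (assumed, not from the paper): 5 attempts per molecule.
toy = {
    "CCO": ["CCO", "OCC", "C(O)C", "OCC", "CCO"],  # 3 unique strings
    "C": ["C", "C", "C", "C", "C"],                # only 1 possible string
}
size, factor = real_augmentation_factor(toy)
print(size, round(factor, 1))  # prints: 4 2.0
```

Small molecules admit few distinct SMILES strings, so datasets with smaller molecules would be expected to show a lower real factor under this scheme.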