2020 Apr 10;12:22. doi: 10.1186/s13321-020-00425-8

Table 6.

Impact of the training set size on GENs performance

| Dataset | Evaluated size | Augmented size (real factor) (a) | Best model epoch # | Validity % | Uniqueness % | Training % | Length match % (b) | HAC match % (c) |
|---|---|---|---|---|---|---|---|---|
| PubChem225k | 9k | 54,624 (4.8) | 10, 10, 10 | 81.3 ± 0.9 | 100.0 ± 0.0 | 0.3 ± 0.1 | 97.7 ± 0.0 | 90.5 ± 0.0 |
| PubChem225k | 45k | 218,124 (4.8) | 5, 5, 5 | 95.6 ± 0.7 | 99.9 ± 0.1 | 2.6 ± 0.5 | 99.0 ± 0.0 | 94.7 ± 0.0 |
| PubChem225k | 225k | 1,088,864 (4.8) | 4, 4, 4 | 98.3 ± 0.3 | 99.9 ± 0.0 | 11.2 ± 0.5 | 97.3 ± 0.7 | 96.6 ± 0.3 |
| Chembl24 | 9k | 35,928 (4.0) | 44, 43, 45 | 74.2 ± 1.9 | 99.0 ± 0.2 | 0.2 ± 0.2 | 81.9 ± 5.4 | 95.9 ± 1.0 |
| Chembl24 | 45k | 179,888 (4.0) | 5, 6, 5 | 91.9 ± 1.9 | 100.0 ± 0.0 | 0.2 ± 0.1 | 90.6 ± 2.8 | 97.6 ± 1.4 |
| Chembl24 | 225k | 896,214 (4.0) | 9, 6, 6 | 94.6 ± 0.1 | 100.0 ± 0.0 | 1.4 ± 0.3 | 88.4 ± 1.6 | 98.1 ± 0.6 |
| Zinc15 | 9k | 32,546 (3.6) | 24, 21, 21 | 77.2 ± 1.0 | 100.0 ± 0.0 | 0.0 ± 0.0 | 82.2 ± 3.3 | 91.2 ± 1.1 |
| Zinc15 | 45k | 163,929 (3.6) | 10, 7, 11 | 90.4 ± 1.1 | 100.0 ± 0.0 | 0.1 ± 0.1 | 87.6 ± 1.2 | 92.6 ± 1.1 |
| Zinc15 | 225k | 820,747 (3.6) | 4, 6, 6 | 95.2 ± 0.3 | 100.0 ± 0.0 | 0.3 ± 0.1 | 90.4 ± 1.2 | 93.5 ± 1.2 |

(a) Size of the augmented dataset after 5 random attempts per SMILES and de-duplication to unique SMILES. The real augmentation factor varies depending on the dataset.

(b) Length match between the SMILES length distributions of the training set and the generated set (see "Methods").

(c) HAC match between the atom count distributions of the generated set and the training set (see "Methods").
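
The first footnote describes the augmentation bookkeeping: several random SMILES attempts per molecule, de-duplication, and a resulting "real" factor below the nominal 5. The sketch below illustrates that arithmetic only and is not the authors' implementation. The function name `augment_smiles`, the `randomize` callable, and the toy lookup table are hypothetical; in practice `randomize` would wrap a cheminformatics toolkit's randomized SMILES writer, which this table does not specify. Whether the original SMILES is also kept in the augmented set is not stated here, so the sketch assumes it is not.

```python
import random
from typing import Callable, Iterable, Set, Tuple


def augment_smiles(
    smiles: Iterable[str],
    randomize: Callable[[str], str],
    attempts: int = 5,
) -> Tuple[Set[str], float]:
    """Collect `attempts` randomized strings per input SMILES, de-duplicate,
    and return (unique augmented set, real augmentation factor)."""
    originals = list(smiles)
    augmented: Set[str] = set()
    for smi in originals:
        for _ in range(attempts):
            # Repeated attempts can return the same string; the set drops duplicates.
            augmented.add(randomize(smi))
    # Real factor = unique augmented strings per original molecule (assumed definition).
    factor = len(augmented) / len(originals) if originals else 0.0
    return augmented, factor


# Toy stand-in for a real randomizer (assumption): a fixed lookup of
# equivalent SMILES spellings for ethanol and methane.
TOY_VARIANTS = {"CCO": ["CCO", "OCC", "C(O)C"], "C": ["C"]}
_rng = random.Random(0)


def toy_randomize(smi: str) -> str:
    return _rng.choice(TOY_VARIANTS[smi])


augmented_set, real_factor = augment_smiles(["CCO", "C"], toy_randomize)
print(sorted(augmented_set), real_factor)
```

Small or symmetric molecules admit few distinct SMILES spellings, so the de-duplicated count falls short of 5 per molecule, which is consistent with the real factors of 3.6 to 4.8 reported in the table.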