2020 Apr 10;12:22. doi: 10.1186/s13321-020-00425-8

Table 6.

Impact of the training set size on GEN performance

Dataset and evaluated size	Augmented size with real factor^a	Best model epoch #	Validity%	Uniqueness%	Training%	Length match%^b	HAC match%^c
PubChem225k
9k	54,624 (4.8)	10, 10, 10	81.3 ± 0.9	100.0 ± 0.0	0.3 ± 0.1	97.7 ± 0.0	90.5 ± 0.0
45k	218,124 (4.8)	5, 5, 5	95.6 ± 0.7	99.9 ± 0.1	2.6 ± 0.5	99.0 ± 0.0	94.7 ± 0.0
225k	1,088,864 (4.8)	4, 4, 4	98.3 ± 0.3	99.9 ± 0.0	11.2 ± 0.5	97.3 ± 0.7	96.6 ± 0.3
Chembl24
9k	35,928 (4.0)	44, 43, 45	74.2 ± 1.9	99.0 ± 0.2	0.2 ± 0.2	81.9 ± 5.4	95.9 ± 1.0
45k	179,888 (4.0)	5, 6, 5	91.9 ± 1.9	100.0 ± 0.0	0.2 ± 0.1	90.6 ± 2.8	97.6 ± 1.4
225k	896,214 (4.0)	9, 6, 6	94.6 ± 0.1	100.0 ± 0.0	1.4 ± 0.3	88.4 ± 1.6	98.1 ± 0.6
Zinc15
9k	32,546 (3.6)	24, 21, 21	77.2 ± 1.0	100.0 ± 0.0	0.0 ± 0.0	82.2 ± 3.3	91.2 ± 1.1
45k	163,929 (3.6)	10, 7, 11	90.4 ± 1.1	100.0 ± 0.0	0.1 ± 0.1	87.6 ± 1.2	92.6 ± 1.1
225k	820,747 (3.6)	4, 6, 6	95.2 ± 0.3	100.0 ± 0.0	0.3 ± 0.1	90.4 ± 1.2	93.5 ± 1.2

^aSize of the augmented dataset after 5 random attempts per SMILES and de-duplication to unique SMILES. The real augmentation factor (in parentheses) varies with the dataset

^bLength match for SMILES length distributions of the training set and generated set (See “Methods”)

^cHAC match for the atom count distributions of the generated set and training set (See “Methods”)
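
The augmentation bookkeeping described in footnote a can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `augment` function, the `toy_randomizer` stand-in, and the `EQUIVALENT_SMILES` lookup are all hypothetical names introduced here. In practice the randomizer would be supplied by a cheminformatics toolkit that can write non-canonical SMILES. The sketch also assumes that only the randomly generated strings are kept, and that the real factor is the number of unique augmented SMILES divided by the number of input SMILES.

```python
import random
from typing import Callable, List, Set, Tuple

# Hypothetical lookup of equivalent SMILES spellings, standing in for a
# real toolkit's random-SMILES writer (assumption, for illustration only).
EQUIVALENT_SMILES = {
    "CCO": ["CCO", "OCC", "C(C)O", "C(O)C"],  # ethanol
    "C": ["C"],                               # methane has a single spelling
}


def toy_randomizer(smiles: str, rng: random.Random) -> str:
    """Return one randomly chosen equivalent spelling of `smiles`."""
    return rng.choice(EQUIVALENT_SMILES[smiles])


def augment(
    smiles_list: List[str],
    randomize: Callable[[str, random.Random], str],
    attempts: int = 5,
    seed: int = 0,
) -> Tuple[Set[str], float]:
    """Make `attempts` random SMILES per input, then de-duplicate.

    Returns the set of unique augmented SMILES and the real augmentation
    factor, defined here (assumption) as unique outputs per input SMILES.
    """
    if not smiles_list:
        raise ValueError("smiles_list must not be empty")
    rng = random.Random(seed)
    augmented: Set[str] = set()  # a set de-duplicates automatically
    for smi in smiles_list:
        for _ in range(attempts):
            augmented.add(randomize(smi, rng))
    real_factor = len(augmented) / len(smiles_list)
    return augmented, real_factor


if __name__ == "__main__":
    unique_smiles, factor = augment(["CCO", "C"], toy_randomizer)
    print(sorted(unique_smiles), round(factor, 2))
```

Because repeated attempts can produce the same string, the real factor stays at or below the number of attempts. Under these assumptions, this is why the factors in the table fall below 5 and differ between datasets.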