Digital Discovery. 2025 Aug 14;4(10):2752–2764. doi: 10.1039/d5dd00028a

Table 1. Distribution learning of physico-chemical properties. We report the Kolmogorov–Smirnov (KS) distance between the de novo designs (3000 SMILES strings) and the training set molecules, computed for selected descriptors (HBA = number of hydrogen bond acceptors, HBD = number of hydrogen bond donors, MW = molecular weight, and log P = octanol–water partition coefficient). For each training set size (1000, 2500, 5000, 7500, and 10 000 molecules), the KS distance is reported for each augmentation strategy and each descriptor. The "Times top-2" column counts how often a given augmentation strategy achieved the best or second-best KS distance for a given descriptor across training set sizes. The KS distances between the training and test set molecules ("Train – test") and for the designs obtained with no augmentation are reported as references (n.a. = not available).

| Property | Method | 1000 | 2500 | 5000 | 7500 | 10 000 | Times top-2 |
|----------|--------|------|------|------|------|--------|-------------|
| HBA | Enumeration | 4 ± 2 | 12 ± 1 | 2.3 ± 0.7 | 7 ± 2 | 4.5 ± 0.7 | 3 |
| | Token deletion (random) | 25 ± 11 | 22 ± 6 | 22 ± 5 | 25 ± 8 | 27 ± 4 | 0 |
| | Token deletion (validity) | 16 ± 2 | 17 ± 2 | 13 ± 1 | 12.9 ± 0.5 | 20 ± 3 | 0 |
| | Token deletion (protected) | 33 ± 8 | 14 ± 5 | 17 ± 5 | 18 ± 2 | 17 ± 4 | 0 |
| | Atom masking (random) | 23 ± 2 | 21 ± 2 | 13 ± 5 | 10.8 ± 0.9 | 7.7 ± 0.7 | 0 |
| | Atom masking (funct. group) | 14 ± 4 | 8 ± 3 | 10 ± 1 | 6 ± 2 | 7 ± 2 | 2 |
| | Bioisosteric substitution | 2.6 ± 0.5 | 10 ± 2 | 2.1 ± 0.7 | 5 ± 2 | 6.8 ± 0.3 | 5 |
| | Self-training | 50.0 ± 0.5 | 18.0 ± 0.2 | 14 ± 3 | 13.0 ± 0.6 | 13.2 ± 0.9 | 0 |
| | No augmentation | 31 ± 4 | 16 ± 2 | 15.4 ± 0.5 | 18 ± 3 | 13.4 ± 0.4 | 0 |
| | Train – test | 2 | 1 | 1 | 1 | 1 | n.a. |
| HBD | Enumeration | 4 ± 3 | 2 ± 1 | 2 ± 2 | 1.8 ± 0.5 | 3 ± 1 | 4 |
| | Token deletion (random) | 10.3 ± 0.2 | 8 ± 2 | 8 ± 2 | 6 ± 2 | 10 ± 6 | 0 |
| | Token deletion (validity) | 4 ± 2 | 5 ± 2 | 3 ± 1 | 4.0 ± 0.5 | 4.1 ± 0.8 | 2 |
| | Token deletion (protected) | 11 ± 4 | 4 ± 1 | 4 ± 2 | 5 ± 2 | 2.9 ± 0.1 | 1 |
| | Atom masking (random) | 4 ± 2 | 5 ± 2 | 3.7 ± 0.2 | 3.2 ± 0.2 | 6 ± 2 | 2 |
| | Atom masking (funct. group) | 11 ± 3 | 11 ± 5 | 4 ± 3 | 7 ± 3 | 3.3 ± 0.9 | 0 |
| | Bioisosteric substitution | 5 ± 3 | 3 ± 2 | 6 ± 3 | 4 ± 1 | 2.2 ± 0.7 | 2 |
| | Self-training | 17 ± 2 | 4.7 ± 0.9 | 8 ± 1 | 14 ± 2 | 5.7 ± 0.9 | 0 |
| | No augmentation | 14 ± 3 | 7 ± 1 | 6 ± 2 | 4 ± 1 | 7.2 ± 0.7 | 0 |
| | Train – test | 3 | 4 | 2 | 2 | 2 | n.a. |
| MW | Enumeration | 12.6 ± 0.4 | 14 ± 2 | 8 ± 1 | 5.6 ± 0.6 | 5 ± 1 | 3 |
| | Token deletion (random) | 45 ± 6 | 31 ± 4 | 34 ± 7 | 31 ± 8 | 32 ± 4 | 0 |
| | Token deletion (validity) | 25.5 ± 0.7 | 22 ± 3 | 20 ± 3 | 20 ± 1 | 22 ± 2 | 0 |
| | Token deletion (protected) | 43 ± 5 | 26 ± 3 | 22 ± 6 | 28 ± 3 | 25 ± 4 | 0 |
| | Atom masking (random) | 21 ± 3 | 21 ± 1 | 6 ± 2 | 10 ± 5 | 4 ± 1 | 2 |
| | Atom masking (funct. group) | 11 ± 5 | 9 ± 3 | 6 ± 2 | 6 ± 2 | 5 ± 2 | 4 |
| | Bioisosteric substitution | 5.6 ± 1.0 | 8 ± 2 | 9 ± 1 | 7 ± 1 | 13.1 ± 0.5 | 2 |
| | Self-training | 16.1 ± 0.7 | 12.1 ± 0.8 | 11.2 ± 0.9 | 11 ± 1 | 7.5 ± 0.1 | 0 |
| | No augmentation | 40 ± 3 | 21 ± 1 | 15.3 ± 0.2 | 17 ± 1 | 16 ± 2 | 0 |
| | Train – test | 3 | 3 | 3 | 3 | 3 | n.a. |
| Log P | Enumeration | 11 ± 3 | 7 ± 2 | 8 ± 3 | 3 ± 1 | 5.1 ± 0.8 | 3 |
| | Token deletion (random) | 31 ± 3 | 19 ± 4 | 22 ± 4 | 18 ± 6 | 19 ± 2 | 0 |
| | Token deletion (validity) | 17 ± 5 | 12 ± 1 | 12 ± 2 | 13 ± 2 | 12 ± 3 | 0 |
| | Token deletion (protected) | 32 ± 11 | 22 ± 4 | 16 ± 3 | 22.1 ± 0.6 | 17 ± 2 | 0 |
| | Atom masking (random) | 11 ± 2 | 10 ± 1 | 8 ± 2 | 7 ± 2 | 8 ± 2 | 0 |
| | Atom masking (funct. group) | 8 ± 3 | 6 ± 2 | 4.7 ± 0.7 | 7 ± 3 | 8 ± 2 | 3 |
| | Bioisosteric substitution | 4.8 ± 0.4 | 6 ± 2 | 7 ± 4 | 4 ± 1 | 7.5 ± 0.5 | 3 |
| | Self-training | 20 ± 1 | 7.9 ± 0.5 | 11 ± 1 | 11.1 ± 0.8 | 11 ± 2 | 0 |
| | No augmentation | 14 ± 7 | 11 ± 2 | 3.2 ± 0.7 | 12 ± 2 | 5.7 ± 0.3 | 2 |
| | Train – test | 6 | 5 | 4 | 3 | 3 | n.a. |
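
As context for how such numbers can be obtained, the sketch below computes the four descriptors with RDKit and the two-sample KS statistic with SciPy. This is a minimal illustration, not the paper's exact pipeline: the specific descriptor functions (Descriptors.NumHAcceptors, NumHDonors, MolWt, MolLogP) and the helper names are assumptions. Note that the raw KS statistic lies in [0, 1], so the values in the table are presumably reported on a rescaled (e.g., ×10²) axis.

```python
# Minimal sketch (assumed pipeline, not the paper's code): compute
# descriptor distributions for two sets of SMILES strings and the
# two-sample Kolmogorov-Smirnov distance between them.
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import ks_2samp

DESCRIPTORS = {
    "HBA":  Descriptors.NumHAcceptors,  # number of hydrogen bond acceptors
    "HBD":  Descriptors.NumHDonors,     # number of hydrogen bond donors
    "MW":   Descriptors.MolWt,          # molecular weight (g/mol)
    "logP": Descriptors.MolLogP,        # Crippen octanol-water log P
}

def descriptor_values(smiles_list, descriptor_fn):
    """Evaluate one descriptor on every parseable SMILES string."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [descriptor_fn(m) for m in mols if m is not None]  # skip invalid SMILES

def ks_distances(designs, training_set):
    """KS distance per descriptor between designs and training molecules
    (the statistic only; the associated p-value is discarded)."""
    return {
        name: ks_2samp(
            descriptor_values(designs, fn),
            descriptor_values(training_set, fn),
        ).statistic
        for name, fn in DESCRIPTORS.items()
    }

# Hypothetical usage: compare 3000 sampled designs against the training set.
# distances = ks_distances(designs_smiles, training_smiles)
# print({name: round(d, 3) for name, d in distances.items()})
```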