Digital Discovery. 2025 Aug 14;4(10):2752–2764. doi: 10.1039/d5dd00028a

Table 1. Distribution learning of physico-chemical properties. We report the Kolmogorov–Smirnov (KS) distance between the de novo designs (3000 SMILES strings) and the training set molecules, computed for selected descriptors (HBA = number of hydrogen bond acceptors, HBD = number of hydrogen bond donors, MW = molecular weight, and log P = octanol–water partition coefficient). For each training set size (1000, 2500, 5000, 7500, and 10 000 molecules), the KS distance is reported for each augmentation strategy and each descriptor. The "Times top-2" column counts how often a given augmentation strategy achieved the best or second-best KS distance for a given descriptor across training set sizes. The KS distances between the training and test set molecules ("Train – test") and for the designs obtained with no augmentation are reported as references (n.a. = not available).

| Property | Method | 1000 | 2500 | 5000 | 7500 | 10 000 | Times top-2 |
|----------|--------|------|------|------|------|--------|-------------|
| HBA | Enumeration | 4 ± 2 | 12 ± 1 | 2.3 ± 0.7 | 7 ± 2 | 4.5 ± 0.7 | 3 |
| | Token deletion (random) | 25 ± 11 | 22 ± 6 | 22 ± 5 | 25 ± 8 | 27 ± 4 | 0 |
| | Token deletion (validity) | 16 ± 2 | 17 ± 2 | 13 ± 1 | 12.9 ± 0.5 | 20 ± 3 | 0 |
| | Token deletion (protected) | 33 ± 8 | 14 ± 5 | 17 ± 5 | 18 ± 2 | 17 ± 4 | 0 |
| | Atom masking (random) | 23 ± 2 | 21 ± 2 | 13 ± 5 | 10.8 ± 0.9 | 7.7 ± 0.7 | 0 |
| | Atom masking (funct. group) | 14 ± 4 | 8 ± 3 | 10 ± 1 | 6 ± 2 | 7 ± 2 | 2 |
| | Bioisosteric substitution | 2.6 ± 0.5 | 10 ± 2 | 2.1 ± 0.7 | 5 ± 2 | 6.8 ± 0.3 | 5 |
| | Self-training | 50.0 ± 0.5 | 18.0 ± 0.2 | 14 ± 3 | 13.0 ± 0.6 | 13.2 ± 0.9 | 0 |
| | No augmentation | 31 ± 4 | 16 ± 2 | 15.4 ± 0.5 | 18 ± 3 | 13.4 ± 0.4 | 0 |
| | Train – test | 2 | 1 | 1 | 1 | 1 | n.a. |
| HBD | Enumeration | 4 ± 3 | 2 ± 1 | 2 ± 2 | 1.8 ± 0.5 | 3 ± 1 | 4 |
| | Token deletion (random) | 10.3 ± 0.2 | 8 ± 2 | 8 ± 2 | 6 ± 2 | 10 ± 6 | 0 |
| | Token deletion (validity) | 4 ± 2 | 5 ± 2 | 3 ± 1 | 4.0 ± 0.5 | 4.1 ± 0.8 | 2 |
| | Token deletion (protected) | 11 ± 4 | 4 ± 1 | 4 ± 2 | 5 ± 2 | 2.9 ± 0.1 | 1 |
| | Atom masking (random) | 4 ± 2 | 5 ± 2 | 3.7 ± 0.2 | 3.2 ± 0.2 | 6 ± 2 | 2 |
| | Atom masking (funct. group) | 11 ± 3 | 11 ± 5 | 4 ± 3 | 7 ± 3 | 3.3 ± 0.9 | 0 |
| | Bioisosteric substitution | 5 ± 3 | 3 ± 2 | 6 ± 3 | 4 ± 1 | 2.2 ± 0.7 | 2 |
| | Self-training | 17 ± 2 | 4.7 ± 0.9 | 8 ± 1 | 14 ± 2 | 5.7 ± 0.9 | 0 |
| | No augmentation | 14 ± 3 | 7 ± 1 | 6 ± 2 | 4 ± 1 | 7.2 ± 0.7 | 0 |
| | Train – test | 3 | 4 | 2 | 2 | 2 | n.a. |
| MW | Enumeration | 12.6 ± 0.4 | 14 ± 2 | 8 ± 1 | 5.6 ± 0.6 | 5 ± 1 | 3 |
| | Token deletion (random) | 45 ± 6 | 31 ± 4 | 34 ± 7 | 31 ± 8 | 32 ± 4 | 0 |
| | Token deletion (validity) | 25.5 ± 0.7 | 22 ± 3 | 20 ± 3 | 20 ± 1 | 22 ± 2 | 0 |
| | Token deletion (protected) | 43 ± 5 | 26 ± 3 | 22 ± 6 | 28 ± 3 | 25 ± 4 | 0 |
| | Atom masking (random) | 21 ± 3 | 21 ± 1 | 6 ± 2 | 10 ± 5 | 4 ± 1 | 2 |
| | Atom masking (funct. group) | 11 ± 5 | 9 ± 3 | 6 ± 2 | 6 ± 2 | 5 ± 2 | 4 |
| | Bioisosteric substitution | 5.6 ± 1.0 | 8 ± 2 | 9 ± 1 | 7 ± 1 | 13.1 ± 0.5 | 2 |
| | Self-training | 16.1 ± 0.7 | 12.1 ± 0.8 | 11.2 ± 0.9 | 11 ± 1 | 7.5 ± 0.1 | 0 |
| | No augmentation | 40 ± 3 | 21 ± 1 | 15.3 ± 0.2 | 17 ± 1 | 16 ± 2 | 0 |
| | Train – test | 3 | 3 | 3 | 3 | 3 | n.a. |
| Log P | Enumeration | 11 ± 3 | 7 ± 2 | 8 ± 3 | 3 ± 1 | 5.1 ± 0.8 | 3 |
| | Token deletion (random) | 31 ± 3 | 19 ± 4 | 22 ± 4 | 18 ± 6 | 19 ± 2 | 0 |
| | Token deletion (validity) | 17 ± 5 | 12 ± 1 | 12 ± 2 | 13 ± 2 | 12 ± 3 | 0 |
| | Token deletion (protected) | 32 ± 11 | 22 ± 4 | 16 ± 3 | 22.1 ± 0.6 | 17 ± 2 | 0 |
| | Atom masking (random) | 11 ± 2 | 10 ± 1 | 8 ± 2 | 7 ± 2 | 8 ± 2 | 0 |
| | Atom masking (funct. group) | 8 ± 3 | 6 ± 2 | 4.7 ± 0.7 | 7 ± 3 | 8 ± 2 | 3 |
| | Bioisosteric substitution | 4.8 ± 0.4 | 6 ± 2 | 7 ± 4 | 4 ± 1 | 7.5 ± 0.5 | 3 |
| | Self-training | 20 ± 1 | 7.9 ± 0.5 | 11 ± 1 | 11.1 ± 0.8 | 11 ± 2 | 0 |
| | No augmentation | 14 ± 7 | 11 ± 2 | 3.2 ± 0.7 | 12 ± 2 | 5.7 ± 0.3 | 2 |
| | Train – test | 6 | 5 | 4 | 3 | 3 | n.a. |
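
As context for how such numbers can be obtained, the sketch below computes the four descriptors with RDKit and the two-sample KS statistic with SciPy. This is a minimal illustration, not the paper's exact pipeline: the specific descriptor functions (Descriptors.NumHAcceptors, NumHDonors, MolWt, MolLogP) and the helper names are assumptions. Note that the raw KS statistic lies in [0, 1], so the values in the table are presumably reported on a rescaled (e.g., ×10²) axis.

```python
# Minimal sketch (assumed pipeline, not the paper's code): compute
# descriptor distributions for two sets of SMILES strings and the
# two-sample Kolmogorov-Smirnov distance between them.
from rdkit import Chem
from rdkit.Chem import Descriptors
from scipy.stats import ks_2samp

DESCRIPTORS = {
    "HBA":  Descriptors.NumHAcceptors,  # number of hydrogen bond acceptors
    "HBD":  Descriptors.NumHDonors,     # number of hydrogen bond donors
    "MW":   Descriptors.MolWt,          # molecular weight (g/mol)
    "logP": Descriptors.MolLogP,        # Crippen octanol-water log P
}

def descriptor_values(smiles_list, descriptor_fn):
    """Evaluate one descriptor on every parseable SMILES string."""
    mols = (Chem.MolFromSmiles(s) for s in smiles_list)
    return [descriptor_fn(m) for m in mols if m is not None]  # skip invalid SMILES

def ks_distances(designs, training_set):
    """KS distance per descriptor between designs and training molecules
    (the statistic only; the associated p-value is discarded)."""
    return {
        name: ks_2samp(
            descriptor_values(designs, fn),
            descriptor_values(training_set, fn),
        ).statistic
        for name, fn in DESCRIPTORS.items()
    }

# Hypothetical usage: compare 3000 sampled designs against the training set.
# distances = ks_distances(designs_smiles, training_smiles)
# print({name: round(d, 3) for name, d in distances.items()})
```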