2020 May 20;12:35. doi: 10.1186/s13321-020-00439-2

Fig. 1.

Data set summary. The training set was used to derive SYBA scores and to train a random forest classifier. It consists of 693 353 molecules randomly selected from the ZINC15 database [57], which are considered easy to synthesize (ES; S+ data set), and the same number of hard-to-synthesize (HS) molecules generated by Nonpher [58] (S- data set). Two test sets were used to compare the performance of SYBA, the random forest, SAScore [45] and SCScore [43]. The manually curated test set (TMC) contains 40 compounds considered HS by experienced medicinal chemists [58] (TMC- data set), supplemented by 40 ES compounds randomly selected from the ZINC15 database (TMC+ data set). Thirty instances of the TMC data set, differing in their TMC+ compounds, were constructed. The computationally picked test set (TCP) consists of 3 581 HS compounds obtained from the GDB-17 database [61] (TCP- data set), complemented by the same number of compounds randomly selected from the ZINC15 database (TCP+ data set).
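
The construction of the 30 TMC instances can be illustrated with a minimal Python sketch: a fixed set of 40 HS compounds is paired with a fresh random draw of 40 ES compounds for each instance. This is not the authors' code. The function name `build_balanced_instances`, the placeholder identifiers, the pool size, and the seed are all hypothetical, and the caption does not state whether draws were independent across instances or filtered against the training set, so independent draws are assumed here.

```python
import random


def build_balanced_instances(negatives, positive_pool, n_instances, seed=0):
    """Pair a fixed negative (HS) set with independent random draws of positives (ES).

    Hypothetical helper for illustration only. Each instance holds
    len(negatives) positives sampled without replacement from positive_pool,
    followed by the same negatives every time.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    k = len(negatives)
    instances = []
    for _ in range(n_instances):
        positives = rng.sample(positive_pool, k)  # no duplicates within one draw
        # Label 1 marks ES compounds, label 0 marks HS compounds.
        instance = [(mol, 1) for mol in positives] + [(mol, 0) for mol in negatives]
        instances.append(instance)
    return instances


# Placeholder identifiers standing in for real structures (assumption).
tmc_minus = [f"HS_{i}" for i in range(40)]          # fixed curated HS set
zinc_pool = [f"ZINC_{i}" for i in range(10_000)]    # stand-in for the ZINC15 pool

tmc_instances = build_balanced_instances(tmc_minus, zinc_pool, n_instances=30)
print(len(tmc_instances), len(tmc_instances[0]))  # 30 80
```

Evaluating a score on each of the 30 instances and reporting the spread of the resulting metric shows how sensitive a comparison is to the particular random choice of ES compounds.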