2022 May 4;12:7244. doi: 10.1038/s41598-022-09309-3

Table 2.

Number of active and inactive compounds and the year threshold used for the time split. ChEMBL data were temporally split into training, update1, update2 and holdout sets based on the publication year. Models for the micro nucleus test and liver toxicity endpoints were trained on public data, while the in-house data were split into an update set and a holdout set based on the internal measurement date.

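| Target (ID) | Training set: Thresh* | Training set: Inactive | Training set: Active | Update1 set: Thresh* | Update1 set: Inactive | Update1 set: Active | Update2 set: Thresh* | Update2 set: Inactive | Update2 set: Active | Holdout set: Thresh* | Holdout set: Inactive | Holdout set: Active |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CHEMBL220 | 2014 | 802 | 840 | 2016 | 211 | 248 | 2017 | 217 | 138 | 2020 | 104 | 113 |
| CHEMBL4078 | 2014 | 1031 | 1008 | 2015 | 259 | 275 | 2016 | 267 | 202 | 2020 | 499 | 270 |
| CHEMBL5763 | 2015 | 1125 | 600 | 2016 | 302 | 75 | 2017 | 307 | 95 | 2020 | 137 | 114 |
| CHEMBL203 | 2012 | 1660 | 433 | 2014 | 526 | 213 | 2016 | 428 | 291 | 2020 | 341 | 167 |
| CHEMBL206 | 2006 | 437 | 325 | 2012 | 117 | 63 | 2016 | 114 | 97 | 2020 | 158 | 105 |
| CHEMBL279 | 2010 | 1955 | 649 | 2013 | 523 | 307 | 2014 | 618 | 137 | 2020 | 686 | 299 |
| CHEMBL230 | 2010 | 475 | 542 | 2013 | 218 | 78 | 2015 | 237 | 80 | 2020 | 218 | 172 |
| CHEMBL340 | 2012 | 1272 | 496 | 2014 | 439 | 153 | 2015 | 341 | 59 | 2020 | 449 | 107 |
| CHEMBL240 | 2012 | 797 | 1938 | 2014 | 301 | 413 | 2016 | 265 | 526 | 2020 | 238 | 498 |
| CHEMBL2039 | 2014 | 710 | 645 | 2015 | 189 | 192 | 2017 | 380 | 212 | 2020 | 134 | 72 |
| CHEMBL222 | 2009 | 231 | 673 | 2011 | 61 | 227 | 2015 | 40 | 206 | 2020 | 74 | 54 |
| CHEMBL228 | 2009 | 242 | 858 | 2011 | 97 | 373 | 2014 | 31 | 235 | 2020 | 79 | 196 |
| Micro nucleus test | - | 1475 | 316 | 2005 | 70 | 134 | | | | 2020 | 98 | 50 |
| Liver toxicity | - | 247 | 445 | 2011 | 42 | 48 | | | | 2020 | 35 | 15 |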

*Thresh: year threshold. Data points published (ChEMBL) or measured (micro nucleus test, liver toxicity) up to this year are included in the corresponding subset.
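
The temporal split summarized in Table 2 can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `time_split`, the `(compound_id, year)` record layout, and the toy records are hypothetical. The sketch also assumes that the threshold year itself belongs to the corresponding subset and that records newer than the last threshold are discarded. Only the CHEMBL220 thresholds are taken from the table.

```python
from bisect import bisect_left


def time_split(records, thresholds):
    """Assign each record to the earliest subset whose year threshold it does not exceed.

    records:    iterable of (compound_id, year) pairs (hypothetical layout).
    thresholds: list of (subset_name, year_threshold), ascending by year.
    """
    names = [name for name, _ in thresholds]
    years = [year for _, year in thresholds]
    if years != sorted(years):
        raise ValueError("thresholds must be in ascending year order")

    subsets = {name: [] for name in names}
    for compound_id, year in records:
        # bisect_left returns the index of the first threshold >= year, so a
        # record dated exactly at a threshold falls into that subset
        # (inclusive threshold; this is an assumption of the sketch).
        idx = bisect_left(years, year)
        if idx < len(names):
            subsets[names[idx]].append(compound_id)
        # Records newer than the last threshold are dropped (assumption).
    return subsets


# Year thresholds for CHEMBL220 as listed in Table 2.
chembl220 = [("training", 2014), ("update1", 2016),
             ("update2", 2017), ("holdout", 2020)]

# Made-up records, purely for illustration.
toy_records = [("cpd_a", 2009), ("cpd_b", 2014), ("cpd_c", 2015),
               ("cpd_d", 2017), ("cpd_e", 2019), ("cpd_f", 2021)]

split = time_split(toy_records, chembl220)
for name, members in split.items():
    print(name, members)
```

For the in-house endpoints, which have a single update set, the same function could be called with a shorter threshold list of update and holdout years applied to the measurement date.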