Table 2.
Number of active and inactive compounds and year thresholds used for the time split. ChEMBL data were temporally split into training, update1, update2 and holdout sets based on the publication year. Models for the micro nucleus test and liver toxicity endpoints were trained on public data, while the in-house data were split into update and holdout sets based on the internal measurement date.
Target (ID) | Training: Thresh* | Training: Inactive | Training: Active | Update1: Thresh* | Update1: Inactive | Update1: Active | Update2: Thresh* | Update2: Inactive | Update2: Active | Holdout: Thresh* | Holdout: Inactive | Holdout: Active |
---|---|---|---|---|---|---|---|---|---|---|---|---|
CHEMBL220 | 2014 | 802 | 840 | 2016 | 211 | 248 | 2017 | 217 | 138 | 2020 | 104 | 113 |
CHEMBL4078 | 2014 | 1031 | 1008 | 2015 | 259 | 275 | 2016 | 267 | 202 | 2020 | 499 | 270 |
CHEMBL5763 | 2015 | 1125 | 600 | 2016 | 302 | 75 | 2017 | 307 | 95 | 2020 | 137 | 114 |
CHEMBL203 | 2012 | 1660 | 433 | 2014 | 526 | 213 | 2016 | 428 | 291 | 2020 | 341 | 167 |
CHEMBL206 | 2006 | 437 | 325 | 2012 | 117 | 63 | 2016 | 114 | 97 | 2020 | 158 | 105 |
CHEMBL279 | 2010 | 1955 | 649 | 2013 | 523 | 307 | 2014 | 618 | 137 | 2020 | 686 | 299 |
CHEMBL230 | 2010 | 475 | 542 | 2013 | 218 | 78 | 2015 | 237 | 80 | 2020 | 218 | 172 |
CHEMBL340 | 2012 | 1272 | 496 | 2014 | 439 | 153 | 2015 | 341 | 59 | 2020 | 449 | 107 |
CHEMBL240 | 2012 | 797 | 1938 | 2014 | 301 | 413 | 2016 | 265 | 526 | 2020 | 238 | 498 |
CHEMBL2039 | 2014 | 710 | 645 | 2015 | 189 | 192 | 2017 | 380 | 212 | 2020 | 134 | 72 |
CHEMBL222 | 2009 | 231 | 673 | 2011 | 61 | 227 | 2015 | 40 | 206 | 2020 | 74 | 54 |
CHEMBL228 | 2009 | 242 | 858 | 2011 | 97 | 373 | 2014 | 31 | 235 | 2020 | 79 | 196 |
Micro nucleus test | – | 1475 | 316 | 2005 | 70 | 134 | – | – | – | 2020 | 98 | 50 |
Liver toxicity | – | 247 | 445 | 2011 | 42 | 48 | – | – | – | 2020 | 35 | 15 |
*Thresh: Year threshold. Data points published (ChEMBL) or measured (micro nucleus test, liver toxicity) up to this year are included in the corresponding subset.
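
To illustrate how the year thresholds partition a ChEMBL data set, the following is a minimal sketch rather than the authors' implementation. The function name `assign_subset` and the constant `SUBSETS` are hypothetical. It assumes that each threshold is an inclusive upper bound and that data points newer than the holdout threshold are discarded. It covers only the four-way ChEMBL split, not the in-house endpoints.

```python
from bisect import bisect_left

# Subset names in chronological order (hypothetical constant for this sketch).
SUBSETS = ("training", "update1", "update2", "holdout")


def assign_subset(year, thresholds):
    """Return the subset for a data point published in `year`.

    `thresholds` holds the year threshold of each subset in ascending
    order, e.g. (2014, 2016, 2017, 2020) for CHEMBL220. Thresholds are
    treated as inclusive upper bounds, which is an assumption.
    """
    # Index of the first threshold that is >= year.
    idx = bisect_left(thresholds, year)
    if idx >= len(SUBSETS):
        return None  # newer than the holdout threshold: not assigned
    return SUBSETS[idx]
```

For example, with the CHEMBL220 thresholds a compound published in 2015 would fall into the update1 set under these assumptions, and one published in 2018 into the holdout set.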