Abstract
Background
The half-life time of a compound is the most important factor characterizing its metabolic excretion from the body, and provides an indication of the stability of, for example, drug molecules. We report on our efforts to develop QSAR models for the metabolic stability of compounds, based on in vitro half-life assay data measured in human liver microsomes.
Method
A variety of QSAR models generated using different statistical methods and descriptor sets implemented in both open-source and commercial programs (KNIME, GUSAR and StarDrop) were analyzed. The models obtained were compared using four different external validation sets from public and commercial data sources, including two smaller sets of in vivo half-life data in humans.
Conclusion
In many cases, the accuracy of prediction achieved on one external test set did not correspond to the results achieved with another test set. The most predictive models were used for predicting the metabolic stability of compounds from the open NCI database, the results of which are publicly available on the NCI/CADD Group web server (http://cactus.nci.nih.gov).
In the companion technology review article [1], we have given a general overview of the currently existing freely or commercially available resources in the field of metabolism-related predictions, and reviewed a number of available databases and predictive software tools. In this article, we report on a more specific predictive effort in this field, that is, to build, test, and apply a number of models, using different approaches and tools, of one important metabolic stability measure: the half-life time of a compound measured in human liver microsomes.
Metabolic processes play an important role in the elimination of drugs and other xenobiotics from the human organism. The majority of metabolic processes are catalyzed by drug-metabolizing enzymes, which are, for the most part, produced in the liver. They perform biotransformation, degradation and excretion of chemical compounds. Degradation and excretion have a strong influence on the efficacy and safety of chemical compounds used as, or being studied for potential use as, drugs. While in vivo stability (i.e., half-life time in humans) is obviously the most directly relevant pharmacological parameter, the most easily and most widely measured value to quantify metabolic excretion and thus stability of compounds is their half-life time (t1/2) determined in human liver microsomes (HLMs). More details of the relationship between in vivo and in vitro clearance have been published, for example, by Clarke and Jeffrey [2], Masimirembwa et al. [3], and Di et al. [4]. These and other papers have generally found a significant correlation between in vitro Phase I microsomal stability values and in vivo clearance, with nevertheless a frequently observed underestimation of the in vivo clearance by the HLM data, readily explained by compounds likely being cleared in vivo by non-CYP pathways such as phase II metabolism, extra-microsomal hepatocyte metabolism, renal extraction, biliary clearance, lung metabolism, plasma degradation and red blood cell binding [4].
High-throughput in vitro metabolic stability assays such as t1/2 in liver microsomes are widely used for investigation of the stability of compounds during drug discovery, for selection of lead compounds, and for prediction of their in vivo performance. At the same time, there are many computational approaches that analyze the relationships between structures and metabolic activity. These methods can be applied to prioritize compounds for in vivo measurements, and for optimization of structures in the early drug discovery stage.
Recently, several in silico predictive classification models of human or rodent liver microsomal stability have been reported [5–8]. Classification models of apparent intrinsic clearance (CLint, a compound’s rate of disappearance, be it in vitro or in vivo [typically measured in units of ml/min/kg protein], reflecting the actual metabolic capacity of the enzyme system in the limit of free access to substrate, a quantity inversely proportional to half-life time) were developed by Lee et al. in 2007 [5], for 14,557 compounds, using three descriptor sets and random forest as well as naive Bayesian methods as the mathematical approaches. The same endpoint was also analyzed by Sakiyama et al. in 2008 [6], for 2439 compounds, using MOE descriptors and four different mathematical approaches: random forest, support vector machine (SVM), logistic regression and recursive partitioning. In 2008, Schwaighofer et al. [7] developed human and rodent liver microsomal stability models for more than 3000 compounds from Bayer Schering Pharma in-house data using Dragon descriptors and a Gaussian process classifier. In 2010, Hu et al. [8] used more than 15,000 compounds to build classification models for prediction of human and rodent half-life microsomal stability data based on SciTegic’s FCFP_6 fingerprints using a naive Bayesian classifier. A small number of somewhat older studies exist that report on work to directly predict metabolic stability, based on generally smaller numbers of compounds [9–12]. Although the reported models showed reasonable prediction accuracy, achieving a truly desirable prediction accuracy (>0.9) for microsomal stability is still challenging.
Most published studies applied only a single or a few computational approaches for creating QSAR models, which may lead to an underestimation of the problem of applicability domain (AD) variance of different approaches for the same training set [13]. Another issue is that most published studies validated their models on only one independent (external) test set, although it has been shown [14] that prediction results achieved on one external test set may not translate to other external test sets. Therefore, we believe it is important to validate model predictivity on several external datasets.
One additional important aspect of this field is that, even though many liver microsomal stability models have been developed and some reports on them published, the impact of most of these efforts has remained limited because in most cases the authors did not share the obtained QSAR models or prediction results with the public community.
In this article, we describe our efforts in developing various QSAR models for metabolic stability of compounds in HLMs based on in vitro half-life assay data. We compare the performance of a variety of models generated using different statistical methods and descriptor sets implemented in both open-source and commercial programs. The models obtained are compared using several external validation sets from various types of sources. We briefly describe how the most predictive models were then used to predict the metabolic stability of compounds in a publicly available database with the goal to share the results of our HLM t1/2 model building efforts with the scientific community.
Experimental
Microsomal stability assay (typical procedure as reported in literature)
In vitro metabolic stability assays typically utilize liver microsomes to provide key phase I enzymes, such as the CYP enzymes, for testing the stability of compounds. To provide background for the modeling efforts described in the following, we give a very brief description of the typical experimental procedure as reported in the literature: the liver microsome/dissolved sample solution is added to a buffer solution (5 mM EDTA, 100 mM potassium phosphate buffer, pH 7.4, glucose-6-phosphate, glucose-6-phosphate dehydrogenase and NADP+) and is then incubated, aliquoted, and eventually quenched. The mixture is subsequently analyzed by LC-MS/MS to determine the quantity of starting material left in solution [15]. Liver microsome stability assays follow first-order kinetics, so the half-life of the compound can be obtained from a plot of the logarithm of the percent compound remaining versus time, typically via a linear fit through two points at t = 0 min and t = 15 min, respectively [15].
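The two-point log-linear estimate described above can be sketched as follows; the timepoint and percent-remaining values are hypothetical, for illustration only, not data from this study:

```python
import numpy as np

# Hypothetical two-timepoint measurement: 100% parent compound at t = 0 min,
# 40% remaining at t = 15 min (illustrative values, not assay data).
t = np.array([0.0, 15.0])
pct_remaining = np.array([100.0, 40.0])

# First-order kinetics: log(percent remaining) is linear in time.
slope, intercept = np.polyfit(t, np.log(pct_remaining), 1)
t_half = np.log(2) / -slope  # half-life in minutes
```

With only two points the "fit" is exact, but the same call generalizes to assays that sample more timepoints.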
Microsomal stability datasets
We collected information about chemical structures and their half-life data from several different public and commercially available sources.
A commercial database was obtained from the Evolvus Group of companies (‘Evolvus’) [101]. Evolvus has collected and organized a database with compounds from several mammalian species and their half-life and hepatic clearance data by manual extraction and curation from the scientific literature and patents. We used a subset of 1300 compounds that had experimentally determined in vitro HLM t1/2 data. This database was used as the primary database for preparing training and test sets.
Several public databases were used for construction of the additional external validation sets. External validation sets were selected to investigate whether the QSAR models could be applied to data from different sources. Due to both the general paucity of data available and the typically significant correlation of in vitro (HLM) results with in vivo stability data obtained for drugs in humans [2–4], such human pharmacokinetic data were also harnessed as validation sets for our QSAR models.
External test sets of this type were downloaded by us from ChEMBL [102], a database of bioactive drug-like molecules supported by the European Bioinformatics Institute [103], extracted from Goodman and Gilman’s (G&G) widely used reference work in this field, ‘The Pharmacological Basis of Therapeutics’ [16], and generated by our own measurement of HLM t1/2 data.
ChEMBL’s data are abstracted and curated from the primary scientific literature, and include information about the structures of compounds and their biological activities. We used data from the half-life assay CHEMBL1614674 [17] for the compilation of the first external test set, which includes 669 chemical structures and half-life data in humans after intravenous administration.
The second external test set was put together from human pharmacokinetic data reported in Appendix II of the G&G work [16]. It should be noted that this dataset was primarily intended for healthcare professionals and medical students to understand the pharmacokinetic basis for dosing regimens of frequently used drugs, rather than for the development of structure–pharmacokinetic relationships. More than 250 structures of frequently used drugs were extracted from this textbook along with their t1/2 data.
Finally, data measured at the Sanford Burnham Medical Research Institute (SBMRI) in the context of the NIH Molecular Libraries Probe Production Centers Network [104] were used for the creation of the third external test set. These data, 91 compounds with stability data in HLMs measured as the percentage of the original substance remaining after 60 min, are planned to be added to PubChem assay data (AID 1555) [105].
All collected datasets were prepared and manually curated as described below before undergoing QSAR analysis.
Data preparation
All data were cleaned in order to eliminate duplicate compounds, establish well-defined and validated stereochemistry, and standardize t1/2 data to units of min. Data from the SBMRI assays were converted into t1/2 values using the following first-order kinetics equation:

t1/2 = (t × ln 2) / ln(N0/Nt)

where t is the time of measurement (min), N0 is the initial quantity of the substance, and Nt is the quantity that still remains (i.e., has not yet been metabolized) after time t. The half-life data from all other sources were already in the correct t1/2 format.
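Assuming first-order decay, the conversion of a single-timepoint percent-remaining readout into a half-life can be sketched as follows (the function name and example values are ours, for illustration only):

```python
import math

def t_half_from_remaining(t, n0, nt):
    """Convert a single-timepoint stability readout into t1/2 (min),
    assuming first-order decay: t1/2 = t * ln 2 / ln(N0/Nt)."""
    return t * math.log(2) / math.log(n0 / nt)

# 50% of the substance remaining after 60 min corresponds to t1/2 = 60 min
```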
The Evolvus dataset was randomly divided into training and test sets in the relative proportion of 80% and 20%, respectively. The other datasets (Chext from ChEMBL, Shext from SBMRI, Ghext from G&G) were used as external validation datasets (Table 1). The compounds in each dataset were placed into one of two categories, ‘metabolically unstable’ or ‘metabolically stable’, defined for the bulk of the study as t1/2 ≤15 min and t1/2 >15 min, respectively, following a previous microsomal stability assay study that used a cut-off time of 15 min [11]. The distributions of the compounds in each dataset according to this cut-off criterion are shown in Table 1. For an additional test of the influence of the chosen cut-off time on the produced QSAR models, the datasets were also divided using a 30 min cut-off (see ‘Results’ and ‘Discussion’ sections).
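The categorization and the random 80/20 split described above can be sketched as follows (the helper names are ours for illustration; the actual curation workflow is not published with this article):

```python
import random

def stability_class(t_half_min, cutoff=15.0):
    """'unstable' if t1/2 <= cutoff minutes, 'stable' otherwise."""
    return "unstable" if t_half_min <= cutoff else "stable"

def train_test_split(records, test_frac=0.2, seed=42):
    """Random split of (structure, t1/2) records into train and test sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_frac))
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)
```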
Table 1.
Human metabolic stability datasets used in this study.
| Dataset | Data type | Number of compounds | Unstable† | Stable‡ |
|---|---|---|---|---|
| Evolvus training set (Etrain) | HLM in vitro | 998 | 282 | 716 |
| Evolvus test set (Etest) | HLM in vitro | 244 | 63 | 181 |
| ChEMBL human external set (Chext) | Human in vivo | 669 | 5 | 664 |
| Sanford Burnham Medical Research Institute human external set (Shext) | HLM in vitro | 80 | 21 | 59 |
| Goodman and Gilman human external set (Ghext) | Human in vivo | 246 | 5 | 241 |
†t1/2 of 15 min or less.
‡t1/2 of 16 min or more.
HLM: Human liver microsome.
Table 1 shows that the external sets have different distributions of stable versus unstable compounds. For example, Chext includes only five unstable compounds versus 664 stable compounds (i.e., the ratio of unstable versus stable compounds is 1:133). A similar situation, although not to the same extent, is found for the Shext set, where the ratio of unstable versus stable compounds is 1:3. As can be seen from Table 1, all datasets available to us had, to a larger or smaller degree, an unbalanced distribution of compound activities, which typically presents a challenge for the development of accurate QSAR models. The union set of all compounds was unbalanced, too (376 unstable versus 1861 stable compounds), with an approximate 5:1 skewing toward stable compounds.
It should be noted that, while the majority of drugs fall in the ‘stable’ class, ‘unstable’ is not a synonym for ‘bad’ in every case in the context of drug development. While half-life times in the range of a few to several tens of hours are certainly the goal for, say, typical anticancer drugs or antibiotics, very short half-life times of just a few minutes may be desirable, or even essential, for example for short-acting analgesics administered during surgery (such as remifentanil, InChIKey=ZTVQQQVZCWLTDF-UHFFFAOYSA-N) or drugs to treat emergency situations (such as esmolol, InChIKey=AQNDDEOPVVGCPG-UHFFFAOYSA-N, a drug of choice in cases of suspected aortic dissection).
QSAR methods
For the development of the QSAR models we used two commercial programs, GUSAR [18,19] and StarDrop (StarDrop v5.0, Optibrium Ltd, Cambridge, UK), and one open-source software – Konstanz Information Miner (KNIME V.2.4.2, KNIME GmbH, Konstanz, Germany).
GUSAR uses two types of descriptors, called Multilevel and Quantitative Neighborhoods of Atoms (MNA and QNA), respectively [19,20]. The calculation of QNA descriptors is based on the connectivity matrix (C) and on the standard values of the ionization potential and electron affinity of the atoms in a molecule [18,19].
For any given atom i, the QNA descriptors P and Q are calculated as follows:

P(i) = B_i Σ_k (exp(−½C))_ik B_k
Q(i) = B_i Σ_k (exp(−½C))_ik B_k A_k

with A_k = ½(IP_k + EA_k) and B_k = (IP_k − EA_k)^(−1/2), where IP_k and EA_k are the ionization potential and electron affinity of atom k, respectively, and C is the connectivity matrix [18,19].
The estimate of the target property of a compound is calculated as the mean of the values of the functions P and Q over all atoms of the molecule in QNA descriptor space. Two-dimensional Chebyshev polynomials are used for approximating the functions P and Q. Thus, the independent regression variables are calculated as the average values of particular two-dimensional Chebyshev polynomials of the P and Q values for the atoms in a molecule.
In addition, GUSAR allows creation of QSAR models based on predicted biological activity profiles of compounds. This is done by using the PASS algorithm on each compound’s representation as a list of MNA descriptors for the prediction of the compound’s biological activity profile [18,20]. Version 10.1 of PASS predicts 4130 types of biological activity with a mean prediction accuracy of approximately 95%. The list of predicted biological activities includes pharmacotherapeutic effects, mechanisms of action, adverse and toxic effects, metabolic terms, susceptibility to transporter proteins and activities related to gene expression. The results of the PASS procedure are output, for each biological activity, as the difference between the probabilities of the compound being active (Pa) and inactive (Pi). For obtaining the different QSAR models in GUSAR, subsets of these Pa–Pi values were randomly selected from the total list of predicted biological activities as input independent variables for the regression analysis. The size of these subsets depends on the number of compounds in the training set. If the number of compounds in the training set falls between 100 and 2000, then the number of initial variables is chosen as one-half the number of compounds in the training set.
For generation of the QSAR models, GUSAR uses a self-consistent regression (SCR) algorithm. SCR is based on the regularized least-squares method. Unlike stepwise regression and other methods of combinatorial search, the initial SCR model includes all regressors. The basic purpose of the SCR method is to remove the variables that poorly describe the modeled value but to retain the set of variables correctly representing the existing relationship. The number of final variables in the QSAR equation selected after the SCR procedure is typically significantly lower than the number of the initial variables. The details of the algorithms for descriptor calculation and the self-consistent regression methods have been described previously [18,19].
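The idea behind such a selection scheme can be illustrated with a loose sketch; this is our own simplified stand-in, not GUSAR's actual SCR implementation, and the regularization and drop schedule are arbitrary assumptions:

```python
import numpy as np

def scr_like_selection(X, y, lam=1.0, drop_frac=0.2, n_rounds=3):
    """Loose sketch of a self-consistent-regression-style variable
    selection (NOT GUSAR's actual SCR code): start from regularized
    least squares on all regressors, then repeatedly drop the fraction
    of variables whose standardized coefficients are smallest and refit.
    Returns the indices of the surviving descriptor columns."""
    cols = np.arange(X.shape[1])
    for _ in range(n_rounds):
        Xs = X[:, cols]
        # ridge-regularized least squares on the current variable set
        w = np.linalg.solve(Xs.T @ Xs + lam * np.eye(len(cols)), Xs.T @ y)
        score = np.abs(w) * Xs.std(axis=0)  # small score = weak regressor
        n_keep = max(1, int(np.ceil(len(cols) * (1 - drop_frac))))
        cols = cols[np.argsort(score)[::-1][:n_keep]]
    return np.sort(cols)
```

As in SCR, the final variable count is substantially smaller than the initial one, while variables that carry the relationship survive the elimination rounds.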
StarDrop uses 2D SMARTS-based descriptors, which are counts of atom types and functionalities, along with whole molecule properties such as log P, molecular weight and polar surface area (for a total of 330 descriptors). For model building, StarDrop uses several different techniques: partial least squares, radial basis function fitting, Gaussian processes and decision trees. For the categorical QSAR models built in this study, only radial basis function fitting (RBF SD), Gaussian processes (GP SD) and decision trees (DT SD) were used.
In addition, model building was performed with the program KNIME, using the following methods: k-nearest neighbor (kNN), multilayer perceptron, SVM, Bayes network (BayesNet), RBF network and logistic regression (Logistic). We employed the program Mold2 [21] to calculate all descriptors used for model building with KNIME. Mold2 calculates a large and diverse set of molecular descriptors from the 2D structure of a molecule, including physicochemical properties, fragmental descriptors, structural features and functional group counts, for a total of 777 different descriptors. The descriptors calculated for compounds from all datasets were normalized, a pair-wise correlation analysis was performed, and the most-correlated (R2 >0.95) and low variance descriptors (4% cut-off) were removed, which left 311 descriptors.
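A minimal sketch of the correlation/variance filtering step reads as follows; this is our own simplified reimplementation for illustration (normalization is omitted, since Pearson correlation is scale-invariant), not the exact KNIME workflow:

```python
import numpy as np

def prune_descriptors(X, r2_cutoff=0.95, var_cutoff=1e-8):
    """Drop near-constant descriptor columns, then keep only one column
    out of every pair whose squared Pearson correlation exceeds
    r2_cutoff. Returns the indices of the retained columns."""
    kept = [j for j in range(X.shape[1]) if np.var(X[:, j]) > var_cutoff]
    R = np.corrcoef(X[:, kept], rowvar=False)
    selected = []
    for j in range(len(kept)):
        # keep column j only if it is not too correlated with any kept one
        if all(R[j, s] ** 2 <= r2_cutoff for s in selected):
            selected.append(j)
    return [kept[j] for j in selected]
```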
Evaluation of prediction accuracy
The statistical parameters that were calculated as an estimate of the accuracy of prediction are shown in Table 2.
Table 2.
Prediction accuracy parameters calculated.
| Parameter | Description |
|---|---|
| Accuracy = (TP + TN) / (TP + TN + FP + FN) | Accuracy: probability of correctly classifying compounds. |
| Sensitivity = TP / (TP + FN) | Sensitivity: probability of predicting positive (unstable) when the true state is positive. |
| Specificity = TN / (TN + FP) | Specificity: probability of predicting negative (stable) when the true state is negative. |
| Youden = Sensitivity + Specificity − 1 | Youden index: indicator of the balance between Sensitivity and Specificity. |
FN: Number of false negatives; FP: Number of false positives; TN: Number of true negatives; TP: Number of true positives.
Due to the nature of the unbalanced data in the external validation sets mentioned previously, the Youden index [22] was calculated in addition to the accuracy to provide an additional evaluation of model performance. The Youden index, the difference between the true positive rate and the false positive rate, has the advantage that it allows one to take into account unbalanced data in external validation sets.
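The parameters of Table 2 are straightforward to compute from a confusion matrix; as a sanity check, the example below reproduces the GUSAR/Etest row of Table 4:

```python
def classification_stats(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity and Youden index as defined
    in Table 2 ('positive' = metabolically unstable)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1,
    }

# GUSAR on Etest (Table 4): TP = 40, TN = 170, FP = 11, FN = 23
stats = classification_stats(40, 170, 11, 23)
```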
Results & discussion
Data analysis
We performed both an analysis of compound diversity in the training set and a determination of the structural overlap between the various evaluation test sets. The chemical diversity in the set Etrain (Table 1) was estimated using pair-wise similarity analysis. The similarity between two compounds A and B was calculated from their MNA descriptors (vector representation) as the Tanimoto coefficient:

T(A,B) = N_AB / (N_A + N_B − N_AB)

where N_A and N_B are the numbers of MNA descriptors describing compounds A and B, respectively, and N_AB is the number of MNA descriptors common to both compounds.
The distribution of compounds from Etrain according to their pair-wise similarity is shown in Figure 1.
Figure 1.
Distribution of pair-wise compound similarity in the Evolvus training set.
One can see that approximately 30% of the compounds in the Evolvus training set are diverse, that is, they have no neighbor in the training set closer than 0.7 in similarity. On the other hand, approximately 70% of compounds are so similar to each other that it might be possible to construct QSAR models of essentially the same quality while leaving out half of this part of the training set.
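With each compound represented as a collection of descriptor strings, the pair-wise similarity reduces to the familiar set-based Tanimoto coefficient; the descriptor strings below are made up for illustration:

```python
def tanimoto(desc_a, desc_b):
    """Tanimoto coefficient between two compounds, each represented as a
    collection of (here hypothetical) MNA descriptor strings."""
    a, b = set(desc_a), set(desc_b)
    n_common = len(a & b)
    return n_common / (len(a) + len(b) - n_common)

# Two compounds sharing 2 of 4 distinct descriptors -> similarity 0.5
sim = tanimoto(["C(HHN)", "N(CC)", "C(CCH)"], ["C(HHN)", "N(CC)", "O(CH)"])
```

Under the 0.7 threshold used above, this pair (similarity 0.5) would count as "diverse", having no close neighbor.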
Our analysis of structural overlap between the evaluation test sets showed that only two of the evaluation sets had significant structural overlap: Chext and Ghext. These two sets have 156 structures in common. A plot of the experimental t1/2 data in Chext versus Ghext is shown in Figure 2.
Figure 2.
Plot of observed t1/2 data in the ChEMBL (Chext) versus Goodman and Gilman’s (Ghext) set.
Since the Chext and Ghext sets have the same end-point data for 156 structures obtained from different sources, this allows us to estimate the experimental error of the t1/2 measurements. The averages of the differences of the measured t1/2 values in the two different databases, binned by t1/2 value ranges, are shown in Table 3.
Table 3.
Experimental difference of t1/2 measurement.
| t1/2 (min) | 0–15 | 15–60 | 60–360 | 360–720 | 720–1440 | 1440–6000 |
| t1/2 (hours) | 0–0.25 | 0.25–1 | 1–6 | 6–12 | 12–24 | 24–100 |
| Number of compounds | 3 | 10 | 82 | 28 | 19 | 14 |
| Average difference† (min) | 12 | 9 | 50 | 197 | 188 | 411 |
| Average difference (%) | 43 | 18 | 23 | 31 | 18 | 20 |
†Average difference in the t1/2 data for equivalent structures. The upper limit in each t1/2 range is an exclusive bound, that is, the intervals could mathematically be written as, for example, [0, 15), [15, 60).
Table 3 shows that the estimated experimental difference of measurement for unstable compounds is approximately 12 min. For stable compounds, the estimated experimental difference of measurement increases to 3 h or more. The average relative difference across all time ranges is on the order of 25%. This suggests that the experimental measurement of metabolic half-life time is still a complicated process and that these data may therefore be fully suited only for categorical QSAR models, that is, models that produce quantitative predictions as their primary output but for which only the correct classification according to the chosen cut-off is taken as the actual predicted outcome.
QSAR modeling
A series of categorical QSAR models was created from the Evolvus training set (Etrain), using different descriptor sets and methods. Each method generated a single model, except for GUSAR, for which we calculated a consensus model of 35 individual models obtained using both MNA and QNA descriptors. All created models were validated on both the Evolvus and the external validation test sets. Unfortunately, not all techniques used here provided assessments of the AD of the generated models; this was notably the case for the methods implemented in KNIME. Thus, in this work, predictions for all compounds in the evaluation sets were used for estimation of model performance. The validation results of all created models are shown in Table 4.
Table 4.
Accuracy of half-life time data prediction.
| Method | Dataset | TP | TN | FP | FN | Sens. | Spec. | Youden | Accuracy | Place, Youden | Place, Acc. | Rank† |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GUSAR | Etest | 40 | 170 | 11 | 23 | 0.635 | 0.939 | 0.574 | 0.861 | 1 | 1 | 1 |
| RBF SD (StarDrop) | Etest | 42 | 163 | 18 | 21 | 0.667 | 0.901 | 0.567 | 0.840 | 3 | 2 | 2.5 |
| DT SD (StarDrop) | Etest | 43 | 152 | 29 | 20 | 0.683 | 0.840 | 0.522 | 0.799 | 4 | 5 | 4.5 |
| GP SD (StarDrop) | Etest | 34 | 167 | 14 | 29 | 0.540 | 0.923 | 0.462 | 0.824 | 6 | 3 | 4.5 |
| kNN (KNIME) | Etest | 36 | 159 | 22 | 27 | 0.571 | 0.878 | 0.450 | 0.799 | 7 | 6 | 6.5 |
| MLP (KNIME) | Etest | 40 | 151 | 30 | 23 | 0.635 | 0.834 | 0.469 | 0.783 | 5 | 8 | 6.5 |
| SVM (KNIME) | Etest | 23 | 136 | 45 | 40 | 0.365 | 0.751 | 0.116 | 0.652 | 10 | 9 | 9.5 |
| RBF network (KNIME) | Etest | 22 | 172 | 9 | 41 | 0.349 | 0.950 | 0.299 | 0.795 | 9 | 7 | 8 |
| Logistic (KNIME) | Etest | 45 | 155 | 26 | 18 | 0.714 | 0.856 | 0.571 | 0.820 | 2 | 4 | 3 |
| BayesNet (KNIME) | Etest | 46 | 113 | 68 | 17 | 0.730 | 0.624 | 0.354 | 0.652 | 8 | 10 | 9 |
| GUSAR | Chext | 1 | 600 | 64 | 4 | 0.200 | 0.904 | 0.104 | 0.898 | 2 | 1 | 1.5 |
| RBF SD (StarDrop) | Chext | 0 | 580 | 84 | 5 | 0.000 | 0.873 | −0.127 | 0.867 | 6 | 2 | 4 |
| DT SD (StarDrop) | Chext | 1 | 533 | 131 | 4 | 0.200 | 0.803 | 0.003 | 0.798 | 3 | 7 | 5 |
| GP SD (StarDrop) | Chext | 0 | 546 | 118 | 5 | 0.000 | 0.822 | −0.178 | 0.816 | 9 | 5 | 7 |
| kNN (KNIME) | Chext | 1 | 532 | 132 | 4 | 0.200 | 0.801 | 0.001 | 0.797 | 4 | 8 | 6 |
| MLP (KNIME) | Chext | 0 | 564 | 100 | 5 | 0.000 | 0.849 | −0.151 | 0.843 | 8 | 4 | 6 |
| SVM (KNIME) | Chext | 2 | 525 | 139 | 3 | 0.400 | 0.791 | 0.191 | 0.788 | 1 | 9 | 5 |
| RBF network (KNIME) | Chext | 0 | 577 | 87 | 5 | 0.000 | 0.869 | −0.131 | 0.862 | 7 | 3 | 5 |
| Logistic (KNIME) | Chext | 0 | 537 | 127 | 5 | 0.000 | 0.809 | −0.191 | 0.803 | 10 | 6 | 8 |
| BayesNet (KNIME) | Chext | 1 | 512 | 152 | 4 | 0.200 | 0.771 | −0.029 | 0.767 | 5 | 10 | 7.5 |
| GUSAR | Ghext | 1 | 214 | 27 | 4 | 0.200 | 0.888 | 0.088 | 0.874 | 2 | 2 | 2 |
| RBF SD (StarDrop) | Ghext | 0 | 207 | 34 | 5 | 0.000 | 0.859 | −0.141 | 0.841 | 5 | 3 | 4 |
| DT SD (StarDrop) | Ghext | 0 | 190 | 51 | 5 | 0.000 | 0.788 | −0.212 | 0.772 | 10 | 10 | 10 |
| GP SD (StarDrop) | Ghext | 0 | 195 | 46 | 5 | 0.000 | 0.809 | −0.191 | 0.793 | 7 | 7 | 7 |
| kNN (KNIME) | Ghext | 1 | 203 | 38 | 4 | 0.200 | 0.842 | 0.042 | 0.829 | 3 | 4 | 3.5 |
| MLP (KNIME) | Ghext | 0 | 200 | 41 | 5 | 0.000 | 0.830 | −0.170 | 0.813 | 6 | 5 | 5.5 |
| SVM (KNIME) | Ghext | 0 | 194 | 47 | 5 | 0.000 | 0.805 | −0.195 | 0.789 | 8 | 8 | 8 |
| RBF network (KNIME) | Ghext | 1 | 216 | 25 | 4 | 0.200 | 0.896 | 0.096 | 0.882 | 1 | 1 | 1 |
| Logistic (KNIME) | Ghext | 1 | 195 | 46 | 4 | 0.200 | 0.809 | 0.009 | 0.797 | 4 | 6 | 5 |
| BayesNet (KNIME) | Ghext | 0 | 192 | 49 | 5 | 0.000 | 0.797 | −0.203 | 0.780 | 9 | 9 | 9 |
| GUSAR | Shext | 10 | 51 | 8 | 11 | 0.476 | 0.864 | 0.341 | 0.763 | 1 | 1 | 1 |
| RBF SD (StarDrop) | Shext | 7 | 53 | 6 | 14 | 0.333 | 0.898 | 0.232 | 0.750 | 5 | 2 | 3.5 |
| DT SD (StarDrop) | Shext | 4 | 46 | 13 | 17 | 0.190 | 0.780 | −0.030 | 0.625 | 10 | 10 | 10 |
| GP SD (StarDrop) | Shext | 12 | 45 | 14 | 9 | 0.571 | 0.763 | 0.334 | 0.713 | 2 | 6 | 4 |
| kNN (KNIME) | Shext | 5 | 54 | 5 | 16 | 0.238 | 0.915 | 0.153 | 0.738 | 6 | 4 | 5 |
| MLP (KNIME) | Shext | 11 | 46 | 13 | 10 | 0.524 | 0.780 | 0.303 | 0.713 | 3 | 7 | 5 |
| SVM (KNIME) | Shext | 3 | 56 | 3 | 18 | 0.143 | 0.949 | 0.092 | 0.738 | 8 | 5 | 6.5 |
| RBF network (KNIME) | Shext | 3 | 57 | 2 | 18 | 0.143 | 0.966 | 0.109 | 0.750 | 7 | 3 | 5 |
| Logistic (KNIME) | Shext | 10 | 45 | 14 | 11 | 0.476 | 0.763 | 0.239 | 0.688 | 4 | 9 | 6.5 |
| BayesNet (KNIME) | Shext | 2 | 54 | 5 | 19 | 0.095 | 0.915 | 0.010 | 0.700 | 9 | 8 | 8.5 |
†Rank is calculated as the average of the place achieved by the QSAR method in the Youden and Accuracy results. Note that ‘Positive’ in this sense is defined as a compound being unstable (i.e., having t1/2 ≤15 min).
Acc.: Accuracy; DT: Decision tree; GP: Gaussian process; kNN: k-nearest neighbor; MLP: Multilayer perceptron; FP: Number of false positives; FN: Number of false negatives; RBF: Radial basis function fitting; Sens.: Sensitivity; Spec.: Specificity; SVM: Support vector machine; TN: Number of true negatives; TP: Number of true positives.
The half-life time of compounds in the Evolvus test set (Etest) was accurately predicted by all methods except SVM and BayesNet. This set is more balanced compared with the others and has the same distribution of unstable and stable compounds as the training set. Two methods showed better sensitivity – BayesNet and Logistic regression – while others had higher specificity values: RBF network, GP SD, RBF SD and GUSAR. The best results in terms of the accuracy and Youden index were achieved by Logistic regression, the GUSAR software, and radial basis function fitting implemented in StarDrop (RBF SD).
The ChEMBL set (Chext) is more unbalanced and has only five unstable compounds. However, five methods out of ten predicted at least one unstable compound out of the five: DT SD, SVM, kNN, GUSAR and BayesNet. The highest specificity results (>0.85) were achieved by RBF SD, GUSAR and RBF network. Although Logistic regression showed a high sensitivity value for the Evolvus test set (Etest), for the Chext this approach gave poor results. This suggests that prediction results achieved by a QSAR method on one test set might not be replicated in another test set. Taking into account both the accuracy of prediction and the Youden index, the most accurate prediction for Chext was produced by GUSAR.
The G&G set (Ghext), like Chext, is also unbalanced and has only five unstable compounds. In this case, only four methods out of ten predicted at least one unstable compound out of the five: GUSAR, RBF network, kNN and Logistic regression. It is necessary to emphasize that SVM showed poor sensitivity values in comparison with its performance for Chext. Conversely, RBF network showed better sensitivity results here than for Chext. This appears to be another example of models showing different performance for different test sets. All methods predicted Ghext with high accuracy of prediction (>0.8) except GP SD, DT SD, SVM, logistic regression and BayesNet. The best results in terms of combined accuracy of prediction and the Youden index were achieved by RBF network.
The SBMRI set (Shext) is more balanced compared with Chext and Ghext, and has a similar distribution of unstable and stable compounds as Etest. All methods showed reasonable prediction accuracy, but nevertheless significantly lower than the accuracy achieved with Etest. The most balanced result in terms of the sensitivity and specificity was produced by GUSAR, whereas some other methods showed a large difference between sensitivity and specificity, in particular DT SD and BayesNet. In comparison with Etest, BayesNet and DT SD performed poorly for Shext in terms of sensitivity. Taking into account both the accuracy of prediction and the Youden index, the most accurate prediction of the Shext set was produced by GUSAR.
Evaluation of the QSAR techniques used
All methods showed reasonable accuracy of prediction on the external independent sets. To differentiate them further, a rank statistic analysis was performed to determine the best QSAR technique for prediction of t1/2. The rank was calculated for each QSAR method as the average of the places it achieved in the Youden and accuracy statistics, respectively, for each external dataset (Table 4). For example, if a method was better than all others in the accuracy results and had the second-highest Youden index for a particular external set, it would be assigned a rank of 1.5 (the average of first place and second place) for this external set. All methods were ranked for each external test set (last column of Table 4). According to the rank statistic, the best results were achieved by the program GUSAR, which showed the highest accuracy and Youden index values on three of the four external sets.
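The rank statistic can be sketched as follows (ties are broken arbitrarily in this simplified version; the method names in the example are illustrative):

```python
def rank_methods(youden, accuracy):
    """Average of each method's place in the Youden ranking and in the
    accuracy ranking (1 = best), as in the last column of Table 4.
    Ties are broken arbitrarily in this simplified sketch."""
    def places(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {method: i + 1 for i, method in enumerate(ordered)}
    p_youden, p_acc = places(youden), places(accuracy)
    return {m: (p_youden[m] + p_acc[m]) / 2 for m in youden}

# e.g. best Youden place + second-best accuracy place -> rank 1.5
```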
It is of interest in this context that similar models were built by Hu et al. [8] for two test sets. In comparison with their models, which had an accuracy of 0.72 and 0.77, respectively, the models presented in this study showed better prediction results for our four independent test sets, having accuracies of 0.86, 0.90, 0.87 and 0.76, respectively.
Influence of data distribution on prediction accuracy
All four external validation sets are unbalanced, two of them (Chext and Ghext) particularly so, with stable:unstable ratios >48 at the 15 min cut-off. This might be the reason for the particularly poor sensitivity results of the investigated QSAR techniques for both these sets. To investigate this hypothesis, the four external evaluation sets were combined into one set (all structures common to two or more of the four evaluation sets were excluded) and then randomly re-split into four new test sets with the same relative ratio of compounds as in the original external evaluation sets, yielding new test sets 1–4 (Table 5) with 208, 570, 68 and 209 compounds, respectively (vs 244 [Etest], 669 [Chext], 80 [Ghext] and 246 [Shext] for the original external test sets). The same QSAR models created by GUSAR that had been used for prediction of the original external evaluation sets were used for prediction of the new test sets. The results (Table 5) show that the sensitivity values for all four new test sets are significantly higher than those achieved for the original external evaluation sets. This strongly supports our hypothesis that the poor sensitivity results for the Ghext and Chext sets are mostly related to these test sets being very unbalanced, and are not due to poor quality of the models per se.
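The pooling-and-re-splitting procedure can be sketched as follows. The compound identifiers are synthetic; the four target sizes are those quoted above:

```python
import random

# Randomly partition a pooled, deduplicated evaluation set into disjoint
# test sets whose sizes mirror the original external sets' relative sizes.
def resplit(compounds, sizes, seed=42):
    pool = list(compounds)
    random.Random(seed).shuffle(pool)  # fixed seed for reproducibility
    out, start = [], 0
    for n in sizes:
        out.append(pool[start:start + n])
        start += n
    return out

pool = [f"cmpd_{i}" for i in range(1055)]   # combined set after deduplication
tests = resplit(pool, [208, 570, 68, 209])  # new test sets 1-4
print([len(t) for t in tests])              # [208, 570, 68, 209]
```

Note that 208 + 570 + 68 + 209 = 1055, so the four new sets exactly partition the combined set.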
Table 5.
Accuracy of t1/2 predictions for combined and randomly re-split test sets.
| Method | Dataset | TP | TN | FP | FN | Sensitivity | Specificity | Youden | Accuracy |
|---|---|---|---|---|---|---|---|---|---|
| GUSAR | Test 1 | 11 | 156 | 36 | 5 | 0.688 | 0.813 | 0.500 | 0.803 |
| GUSAR | Test 2 | 27 | 442 | 78 | 23 | 0.540 | 0.850 | 0.390 | 0.823 |
| GUSAR | Test 3 | 5 | 46 | 12 | 5 | 0.500 | 0.793 | 0.293 | 0.750 |
| GUSAR | Test 4 | 8 | 165 | 29 | 7 | 0.533 | 0.851 | 0.384 | 0.828 |
| GUSAR | One combined test set | 51 | 809 | 155 | 40 | 0.560 | 0.839 | 0.400 | 0.815 |
FP: Number of false positives; FN: Number of false negatives; TN: Number of true negatives; TP: Number of true positives.
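As a consistency check, the statistics reported in Table 5 can be recomputed directly from the confusion-matrix counts; the snippet below does this for the combined test set row:

```python
# Sensitivity, specificity, Youden index and accuracy from TP/TN/FP/FN.
def class_stats(tp, tn, fp, fn):
    sens = tp / (tp + fn)                  # recall on the unstable class
    spec = tn / (tn + fp)                  # recall on the stable class
    youden = sens + spec - 1               # Youden index J [22]
    acc = (tp + tn) / (tp + tn + fp + fn)
    return sens, spec, youden, acc

# "GUSAR, one combined test set" row of Table 5:
sens, spec, youden, acc = class_stats(51, 809, 155, 40)
print(f"{sens:.3f} {spec:.3f} {youden:.3f} {acc:.3f}")  # 0.560 0.839 0.400 0.815
```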
Influence of cut-off time on model quality
We also conducted a limited test of how a different cut-off time for classifying t1/2 values as ‘metabolically unstable’ versus ‘metabolically stable’ would change our models. Other studies, for example Schwaighofer et al. [7], used a value of 30 min. We recomputed only the GUSAR models with this longer cut-off time. Table 6 shows both the distribution of the various test sets and the model results for the 15 min and the 30 min cut-offs.
Table 6.
Test set distribution and GUSAR model results for cut-off times of 15 and 30 min.
| Dataset | Unstable | Stable | TP | TN | FP | FN | Sensitivity | Specificity | Youden | Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|
| Cut-off time: 15 min | ||||||||||
| Etest | 63 | 181 | 40 | 170 | 11 | 23 | 0.635 | 0.939 | 0.574 | 0.861 |
| Shext | 21 | 59 | 10 | 51 | 8 | 11 | 0.476 | 0.864 | 0.341 | 0.763 |
| Ghext | 5 | 241 | 1 | 214 | 27 | 4 | 0.200 | 0.888 | 0.088 | 0.874 |
| Chext | 5 | 664 | 1 | 600 | 64 | 4 | 0.200 | 0.904 | 0.104 | 0.898 |
| Cut-off time: 30 min | ||||||||||
| Etest | 119 | 125 | 101 | 104 | 21 | 18 | 0.849 | 0.832 | 0.681 | 0.840 |
| Shext | 36 | 44 | 20 | 24 | 20 | 16 | 0.556 | 0.545 | 0.101 | 0.550 |
| Ghext | 7 | 239 | 2 | 151 | 88 | 5 | 0.286 | 0.632 | −0.082 | 0.622 |
| Chext | 18 | 651 | 5 | 442 | 209 | 13 | 0.278 | 0.679 | −0.043 | 0.668 |
FP: Number of false positives; FN: Number of false negatives; TN: Number of true negatives; TP: Number of true positives.
While the distributions of the two sets of in vitro data (i.e., Evolvus [Etest] and SBMRI [Shext]) became more balanced, approaching a 50:50 ratio, the distribution of the ChEMBL set (Chext) of in vivo data changed only marginally, and the distribution of the G&G set (Ghext, in vivo) practically not at all. Nevertheless, if the degree of imbalance were the only factor (negatively) influencing model quality, one would expect model quality to improve at least somewhat. Interestingly, however, while sensitivity improved slightly, including for Ghext and Chext, all other model quality parameters declined. In fact, the Youden index became negative for both Ghext and Chext. One possible explanation for this effect is the well-known (see above) tendency of predictive models based on in vitro microsomal stability data to underestimate in vivo clearance, especially for rapidly cleared compounds. For example, if a drug(-like) compound has an HLM t1/2 of 40 min but an in vivo half-life of 20 min, it would be predicted with a large quantitative error yet still correctly classified by a highly accurate HLM t1/2 model based on the 15 min cut-off, whereas a 30 min cut-off would lead to misclassification. Since the models for the generally best-performing software declined in quality for the longer cut-off, no further model-building was pursued with the 30 min cut-off.
QSAR model application to public structures
The 35 QSAR models created by GUSAR that went into the consensus model used for prediction of the evaluation sets (Table 4) were also applied, in the same consensus mode, to categorical prediction of the microsomal stability of the compounds in the Open NCI database [23]. The Open NCI database comprises 265,251 compounds, the publicly available part of the more than half a million structures assembled by the US NCI in the course of its 55-year-long effort of screening compounds against cancer and, more recently, AIDS. Each compound from the NCI database was classified as stable or unstable. The prediction output also includes an assessment of the AD as provided by GUSAR (described in detail elsewhere [18,24]). We briefly outline the three approaches implemented in GUSAR for estimating the AD of the obtained models: similarity, leverage and accuracy assessment.
Similarity
For each compound under study, the three nearest neighbors from the training set are determined by similarity estimation. The pair-wise similarity of the compound with each of its three neighbors is estimated as Pearson’s correlation coefficient calculated in the space of the independent variables obtained after SCR. The average of these three similarity values is used to assess whether the compound lies within the AD of the model. In this study, an AD threshold of 0.7 was used.
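A minimal sketch of this similarity check follows. The descriptor vectors are made up for illustration; GUSAR's actual independent variables come from SCR and are not reproduced here:

```python
import math

# Pearson correlation between two descriptor vectors of equal length.
def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = math.sqrt(sum((x - ma) ** 2 for x in a))
    vb = math.sqrt(sum((y - mb) ** 2 for y in b))
    return cov / (va * vb)

# In-domain if the mean similarity to the three nearest training
# compounds reaches the threshold (0.7 in this study).
def in_domain(query, training, threshold=0.7):
    sims = sorted((pearson(query, t) for t in training), reverse=True)
    return sum(sims[:3]) / 3 >= threshold

training = [[0.1, 0.9, 0.3, 0.5], [0.2, 0.8, 0.4, 0.6],
            [0.0, 1.0, 0.2, 0.4], [0.9, 0.1, 0.8, 0.2]]
print(in_domain([0.15, 0.85, 0.35, 0.55], training))  # True: close to the first three
```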
Leverage
The leverage value (also called the ‘hat value’) [25] is also used for AD assessment. Leverage values represent the ‘distance’ of a molecule from the structural space of the model and measure the contribution of the nth molecule to its own predicted value; they can thus be used to identify outliers. They were calculated as:

h = xᵀ(XᵀX)⁻¹x

where x is the vector of descriptors of a query compound, and X is the matrix whose rows are the descriptors of the molecules from the training set. The leverage value was calculated for each compound of the training set and the distribution of the obtained values was determined. In this study, the warning level for leverage values was set to the 99th percentile; that is, if a compound from the external test set had a leverage value exceeding this warning level, the compound was considered to be outside the AD.
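The leverage computation and the percentile-based warning level can be sketched with NumPy; a toy random descriptor matrix stands in for the real training set:

```python
import numpy as np

# Leverage (hat value) of a query compound: h = x^T (X^T X)^(-1) x.
def leverage(x, X):
    """x: descriptor vector of one molecule; X: training matrix (rows = molecules)."""
    core = np.linalg.inv(X.T @ X)
    return float(x @ core @ x)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))               # 50 training molecules, 4 descriptors
h_train = [leverage(row, X) for row in X]  # leverages of the training compounds
warning = np.percentile(h_train, 99)       # 99th-percentile warning level
query = rng.normal(size=4)
print(leverage(query, X) <= warning)       # True if the query falls inside the AD
```

A useful sanity check on the implementation: the training leverages sum to the number of descriptors (the trace of the hat matrix equals the rank of X).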
Accuracy assessment
For this type of AD assessment, the following equation is used:

ADvalue = RMSE3NN / RMSEtrain

where ADvalue is the AD value, RMSE3NN is the root-mean-square error of prediction for the three most similar compounds from the training set (see ‘Similarity’ section), and RMSEtrain is the root-mean-square error of prediction for the training set. In this study, an AD threshold of 1 was used.
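A sketch of this accuracy-based check, using hypothetical prediction errors (observed minus predicted t1/2, arbitrary units):

```python
import math

# Root-mean-square error of a list of prediction errors.
def rmse(errors):
    return math.sqrt(sum(e * e for e in errors) / len(errors))

# AD value: RMSE over the query's three most similar training compounds
# divided by the training-set RMSE; values <= 1 fall inside the AD.
def ad_value(errors_3nn, errors_train):
    return rmse(errors_3nn) / rmse(errors_train)

errors_train = [0.3, -0.2, 0.4, -0.5, 0.1, 0.2, -0.3, 0.35]  # hypothetical
errors_3nn = [0.2, -0.1, 0.25]                               # hypothetical
print(ad_value(errors_3nn, errors_train) <= 1.0)  # True: within the AD threshold of 1
```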
The microsomal stability predictions fell within the AD of the GUSAR models for 196,460 of the NCI database compounds. These data can be downloaded as an SD file from our web server [106]. Both the data and the developed models may be integrated into some of our interactive web services in the future.
Conclusion
We have developed categorical QSAR models for prediction of human liver microsomal stability using different QSAR approaches. A high accuracy of prediction was achieved for several external test sets. For three of the four external sets, the best results were achieved by the GUSAR program; the KNIME software performed better on the remaining external set. We also showed that the accuracy of prediction achieved on one external test set might not transfer to another external set. We therefore recommend using several external sets, ideally from different sources, for a realistic estimate of the predictivity of metabolic QSAR models. The most predictive categorical QSAR models produced by GUSAR were used for prediction of the metabolic stability of a sizable subset of the quarter-million structures of the Open NCI database, made available for download to the public on the NCI/CADD Group’s web site [106].
Future perspective
During the last decade, the number of publications dealing with the development of QSAR models for the prediction of metabolic half-life data has steadily increased, indicating strong interest in this field. Although t1/2 values are known to have been measured for many compounds, mainly in industry, the limited size and chemical-space coverage of the datasets we were able to compile for this study demonstrate that the field is far from saturated with freely available, or affordably priced, experimental data. Efforts along the lines of the large-scale biological assaying performed for hundreds of thousands of compounds in the context of the NIH Roadmap (now Common Fund) Screening Project [107], but geared specifically toward the generation of metabolism-related data independent of any specific disease-related target, would most likely move the field forward considerably. In a similar vein, we found fairly large differences between experimental measurements from different sources for identical compounds, suggesting that current experimental methods, and hence model-building, might still benefit from increased accuracy. A decrease in experimental error, if achievable, would give more confidence when developing continuous QSAR models, which remain a challenge for this endpoint. At the same time, one has to be aware that any increase in the experimental accuracy and prediction quality of HLM t1/2 values will find its limits of relevance for what truly counts (i.e., the metabolic fate of drugs in the patient) in both the intrinsically less-than-perfect correlation between in vitro and in vivo stability and the inter-personal variability of clearance rates and metabolism.
Although the results obtained in this specific study gave commercial QSAR software a slight advantage over open-source software, we believe it will be beneficial for the field to find avenues for freely sharing the obtained predictive QSAR models with the scientific community. One of these avenues might be the development of online web services based on an open application programming interface platform, which could be used for sharing models and for property predictions for compounds of interest.
Executive summary.
The most important parameter related to the metabolic excretion of compounds and thus to potential drug molecules’ stability in the body is their half-life time. A widely used substitute for determination of stability in the human body is the measurement of the compounds’ stability in human liver microsomes.
Information about chemical structures and their half-life data was collected from several public and commercial sources and used for the construction of categorical QSAR models.
Predictive QSAR models were developed using both commercial (StarDrop, GUSAR) and open-source software (KNIME).
For estimation of the predictivity of the models, several external test sets were used.
The obtained QSAR models showed generally high accuracy of prediction.
The accuracy of prediction achieved for one external test set did not necessarily translate to a similar accuracy for a different test set.
Models based on a longer cut-off of 30 min led to a decline of model quality, in particular for the in vivo stability data test sets.
The best obtained model was used to predict metabolic stability for 196,460 structures from the Open NCI database. These data are available for free download.
Acknowledgments
The authors thank A Mangravita-Novo and M Vicchiarelli for their work in gathering the SBMRI data.
Footnotes
Disclaimer
The content of this publication does not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the US Government.
Financial & competing interests disclosure
This work was supported in part by NIH grant HG005033 (L Smith), in part by the Intramural Research Program of NIH, Center for Cancer Research, and in part with Federal funds from the Frederick National Laboratory for Cancer Research, National Institutes of Health, under contract HHSN261200800001E. AV Zakharov is a member of the GUSAR Development Group, and co-holder of the Russian Patent No. 2006613591 of the GUSAR software, issued 16 October 2006 by the Russian State Patent Agency. The authors have no other relevant affiliations or financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript apart from those disclosed.
No writing assistance was utilized in the production of this manuscript.
References
- 1. Peach ML, Zakharov AV, Liu R. Computational tools and resources for metabolism-related property predictions. 1. Overview of publicly available (free and commercial) databases and software. Future Med Chem. 2012;4(15):1907–1932. doi:10.4155/fmc.12.150
- 2. Clarke SE, Jeffrey P. Utility of metabolic stability screening: comparison of in vitro and in vivo clearance. Xenobiotica. 2001;31:591–598. doi:10.1080/00498250110057350
- 3. Masimirembwa CM, Bredberg U, Andersson TB. Metabolic stability for drug discovery and development: pharmacokinetic and biochemical challenges. Clin Pharmacokinet. 2003;42:515–528. doi:10.2165/00003088-200342060-00002
- 4. Di L, Kerns EH, Ma XJ, Huang Y, Carter GT. Applications of high throughput microsomal stability assay in drug discovery. Comb Chem High Throughput Screen. 2008;11:469–476. doi:10.2174/138620708784911429
- 5. Lee PH, Cucurull-Sanchez L, Lu J, Du YJ. Development of in silico models for human liver microsomal stability. J Comput Aided Mol Des. 2007;21:665–673. doi:10.1007/s10822-007-9124-0
- 6. Sakiyama Y, Yuki H, Moriya T, et al. Predicting human liver microsomal stability with machine learning techniques. J Mol Graph Model. 2008;26:907–915. doi:10.1016/j.jmgm.2007.06.005
- 7. Schwaighofer A, Schroeter T, Mika S, et al. A probabilistic approach to classifying metabolic stability. J Chem Inf Model. 2008;48:785–796. doi:10.1021/ci700142c
- 8. Hu Y, Unwalla R, Denny RA, Bikker J, Di L, Humblet C. Development of QSAR models for microsomal stability: identification of good and bad structural features for rat, human and mouse microsomal stability. J Comput Aided Mol Des. 2010;24:23–35. doi:10.1007/s10822-009-9309-9
- 9. Bursi R, de Gooyer ME, Grootenhuis A, Jacobs PL, van der Louw J, Leysen D. (Q)SAR study on the metabolic stability of steroidal androgens. J Mol Graph Model. 2001;19:552–556. doi:10.1016/s1093-3263(01)00089-4
- 10. Shen M, Xiao Y, Golbraikh A, Gombar V, Tropsha A. Development and validation of k-nearest-neighbor QSPR models of metabolic stability of drug candidates. J Med Chem. 2003;46:3013–3020. doi:10.1021/jm020491t
- 11. Jensen BF, Sorensen MD, Kissmeyer AM, et al. Prediction of in vitro metabolic stability of calcitriol analogs by QSAR. J Comput Aided Mol Des. 2003;17:849–859. doi:10.1023/b:jcam.0000021861.31978.da
- 12. Gombar VK, Alberts JJ, Cassidy KC, Mattioni BE, Mohutsky MA. In silico metabolism studies in drug discovery: prediction of metabolic stability. J Comput Aided Drug Des. 2006;2:177–188.
- 13. Tetko IV, Sushko I, Pandey AK, et al. Critical assessment of QSAR models of environmental toxicity against Tetrahymena pyriformis: focusing on applicability domain and overfitting by variable selection. J Chem Inf Model. 2008;48:1733–1746. doi:10.1021/ci800151m
- 14. Huang J, Fan X. Why QSAR fails: an empirical evaluation using conventional computational approach. Mol Pharm. 2011;8:600–608. doi:10.1021/mp100423u
- 15. Kerns EH, Di L. Drug-like Properties: Concepts, Structure Design and Methods. Academic Press; Amsterdam, The Netherlands; 2008. p. 526.
- 16. Hardman JG, Limbird LE, Gilman AG, editors. The Pharmacological Basis of Therapeutics. 10th ed. McGraw-Hill; NY, USA; 2001. pp. 1917–2023.
- 17. Obach RS, Lombardo F, Waters NJ. Trend analysis of a database of intravenous pharmacokinetic parameters in humans for 670 drug compounds. Drug Metab Dispos. 2008;36:1385–1405. doi:10.1124/dmd.108.020479
- 18. Lagunin AA, Zakharov AV, Filimonov DA, Poroikov VV. QSAR modelling of rat acute toxicity on the basis of PASS prediction. Mol Inform. 2011;30:241–250. doi:10.1002/minf.201000151
- 19. Filimonov DA, Zakharov AV, Lagunin AA, Poroikov VV. QNA-based ‘Star Track’ QSAR approach. SAR QSAR Environ Res. 2009;20:679–709. doi:10.1080/10629360903438370
- 20. Poroikov VV, Filimonov DA, Borodina YV, Lagunin AA, Kos A. Robustness of biological activity spectra predicting by computer program PASS for noncongeneric sets of chemical compounds. J Chem Inf Comput Sci. 2000;40:1349–1355. doi:10.1021/ci000383k
- 21. Hong H, Xie Q, Ge W, et al. Mold2, molecular descriptors from 2D structures for chemoinformatics and toxicoinformatics. J Chem Inf Model. 2008;48:1337–1344. doi:10.1021/ci800038f
- 22. Youden WJ. Index for rating diagnostic tests. Cancer. 1950;3:32–35. doi:10.1002/1097-0142(1950)3:1<32::aid-cncr2820030106>3.0.co;2-3
- 23. Ihlenfeldt W-D, Voigt JH, Bienfait B, Oellien F, Nicklaus MC. Enhanced CACTVS browser of the open NCI database. J Chem Inf Comput Sci. 2002;42:46–57. doi:10.1021/ci010056s
- 24. Kokurkina GV, Dutov MD, Shevelev SA, Popkov SV, Zakharov AV, Poroikov VV. Synthesis, antifungal activity and QSAR study of 2-arylhydroxynitroindoles. Eur J Med Chem. 2011;46:4374–4382. doi:10.1016/j.ejmech.2011.07.008
- 25. Tropsha A, Gramatica P, Gombar VK. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR Comb Sci. 2003;22:69–77.
Websites
- 101.Evolvus. www.evolvus.com/di.htm.
- 102.EMBL-EBI. ChEMBL database. www.ebi.ac.uk/chembldb.
- 103.EMBL-EBI. www.ebi.ac.uk.
- 104.Molecular Libraries Program Resources. http://mli.nih.gov/mli/mlpcn.
- 105.PubChem. Bioassay summary. http://pubchem.ncbi.nlm.nih.gov/assay/assay.cgi?aid=1555&version=3.1.
- 106.NCI. Downloadable structure files of NCI open database compounds. http://cactus.nci.nih.gov/download/nci.
- 107.Molecular Libraries Program. http://mli.nih.gov/mli.