Author manuscript; available in PMC: 2013 Apr 2.
Published in final edited form as: Mol Pharm. 2012 Mar 16;9(4):996–1010. doi: 10.1021/mp300023x

ADMET Evaluation in Drug Discovery. 12. Development of Binary Classification Models for Prediction of hERG Potassium Channel Blockage

Sichao Wang a, Youyong Li a, Junmei Wang b, Lei Chen a, Liling Zhang a, Huidong Yu a, Tingjun Hou a,*
PMCID: PMC3324100  NIHMSID: NIHMS364930  PMID: 22380484

Abstract

Inhibition of the human Ether-a-go-go Related Gene (hERG) potassium channel may result in QT interval prolongation, which causes severe cardiac side effects and is a major problem in clinical studies of drug candidates. The development of in silico tools to filter out potential hERG potassium channel blockers in the early stages of the drug discovery process is of considerable interest. Here, a diverse set of 806 compounds with hERG inhibition data was assembled, and binary hERG classification models using naïve Bayesian classification and recursive partitioning (RP) techniques were established and evaluated. The naïve Bayesian classifier based on molecular properties and the ECFP_8 fingerprints yielded 84.8% accuracy for the training set using the leave-one-out (LOO) cross-validation procedure and 85% accuracy for the test set of 120 molecules. For the two additional test sets, the model achieved 89.4% accuracy for the WOMBAT-PK test set and 86.1% accuracy for the PubChem test set. The naïve Bayesian classifiers gave better predictions than the RP classifiers. Moreover, the Bayesian classifier, employing molecular fingerprints, highlights the important structural fragments favorable or unfavorable for hERG potassium channel blockage, which offers extra valuable information for the design of compounds avoiding undesirable hERG activity.

Keywords: hERG, ADMET, QSAR, Naïve Bayesian Classification, Recursive Partitioning

Introduction

The human ether-a-go-go-related gene (hERG) protein is a tetrameric potassium channel that plays an important role in the cardiac action potential.1–3 Blockage of the hERG potassium channel is believed to be the major cause of drug-induced QT syndrome, which can lead to sudden death.4 It may also cause acquired long QT syndrome, leading to Torsades de Pointes (TdP), a severe cardiac side effect that represents a major problem in clinical studies of drug candidates.5 Many major drugs, such as terfenadine, cisapride, sertindole, thioridazine, and grepafloxacin, were withdrawn from the market as a result of their undesirable hERG-related cardiotoxicity. Therefore, it is important to assess the hERG channel binding of lead compounds early in the preclinical phase of drug discovery.

The hERG IC50 value can be measured experimentally through in vitro or in vivo assays. There are many techniques for early cardiotoxicity assessment, such as rubidium-flux assays, radioligand binding assays, in vitro electrophysiology measurements, and fluorescence-based assays.6 However, these in vitro experimental assays for evaluating hERG binding are time-consuming, expensive, and labor-intensive. Therefore, it is valuable to develop in silico models that provide rapid and cheap screening platforms for identifying hERG inhibitors in the early stage of drug design and optimization.

Many computational models have been developed by structure-based and ligand-based approaches to discriminate hERG blockers from non-blockers. Since the crystal structure of the hERG potassium channel is not available, homology modeling, in combination with molecular docking and free energy calculations, has been used to predict the binding capabilities of drug candidates to the hERG potassium channel. For example, Farid and co-workers docked five sertindole analogues into the active site of a homology model of the hERG channel built using KvAP as a template,7 and the predicted binding affinities showed a good relationship with the experimental values. Rajamani et al. docked 27 known hERG channel binders into the active sites of multiple homology models of the hERG channel,8 and they found that flexibility was essential to rank the binding affinities of a set of diverse ligands correctly. However, the structure-based studies were based only on very limited analog datasets, and the prediction capability of the homology models has still not been validated on large datasets with high structural diversity. It is still difficult to successfully utilize hERG potassium channel homology models for prospective molecular docking studies.

Due to the lack of a crystal structure of the hERG potassium channel, ligand-based prediction models are still the mainstream for predicting hERG potassium channel blockers. In 2002, Ekins et al. first built a preliminary hERG pharmacophore model based on a training set of 11 molecules.9 This pharmacophore model, with four hydrophobic points and one positive ionizable point, could successfully rank 15 molecules in a test set. In 2002, Cavalli et al. established a pharmacophore model from the CoMFA analysis of a set of 31 QT-prolonging drugs,10 and this pharmacophore model contained a positively charged tertiary amine flanked by three aromatic or hydrophobic centers. In 2003, Pearlstein and co-workers performed CoMSIA analysis on a dataset of 10 structurally diverse hERG blockers and 22 sertindole analogues,11 and the results given by the CoMSIA analysis were consistent with the structural information given by a homology model of hERG. In 2005, Cianchetta et al. developed correlation models using pharmacophore-based GRIND descriptors for datasets with and without a basic nitrogen, and the models showed satisfactory predictions for the training and test sets.12 In 2006, Sun developed a naive Bayes classifier based on a large training set of 1979 compounds, and this model exhibited an ROC accuracy of 0.87 and could correctly classify 58 of 66 tested molecules.13 In 2008, Thai and co-workers developed binary classification models based on P_VSA descriptors, and the prediction accuracy of the models was 82–88%.14 In 2006, Song and Clark calculated 2D fragment-based descriptors for a training set of 71 compounds and a test set of 19 compounds,15 resulting in a hERG QSAR model, developed by linear support vector regression, with q² values of 0.912 and 0.848 for the training and test sets, respectively. In 2008, Li et al. developed a hERG classification model based on GRIND descriptors and a support vector machine (SVM),16 and this model achieved good prediction accuracy for the training set (up to 92%); however, it could not give satisfactory predictions for the test set (global accuracy of ~60%). Recently, Su and co-workers assembled a dataset of 250 structurally diverse compounds with hERG activity, and they developed a regression model and a binary classification model based on 204 traditional 2D descriptors and 76 3D VolSurf-like descriptors.17 This binary model achieved 91% accuracy for the training set, but it could not give satisfactory predictions for the test sets.

In summary, most reported datasets contain a limited number of compounds (fewer than or close to 300). Although Sun used a large dataset of 1979 molecules in modeling,13 the dataset is not an open data source. The broad multi-specificity of hERG and the lack of an extensive dataset of hERG activities are two barriers to establishing accurate prediction models. Here we present a large dataset of 806 molecules that are categorized into blocker and non-blocker classes. We examined the impact on hERG blocking of eight important molecular properties widely used in ADME prediction. Then, utilizing the new dataset, naïve Bayesian and recursive partitioning classifiers based on molecular properties and fingerprints were developed and validated on external test sets. The comparison shows that naïve Bayesian classification outperformed RP with respect to prediction accuracy. Moreover, the naïve Bayesian classifiers identified the key structural features that differentiate hERG-active from hERG-inactive compounds.

Methods and Materials

1. Preparation of Dataset

The dataset has 806 molecules in total, and about 60% of the data (495) were collected by Li and co-workers.16 The second data source is an external test set of 66 compounds with hERG information extracted from the WOMBAT-PK database. In addition, we updated the dataset with an extra set of 245 molecules collected from recent publications.6, 14, 15, 18–37 The hERG blocking activity was determined primarily by IC50 measurements using mammalian cell lines, such as HEK, CHO, and COS. When mammalian cell line data were not available, IC50 measurements from a non-mammalian cell line, XO (Xenopus laevis oocytes), were included. An ideal training set needs to be extensive, consistent, and diverse. However, since the purpose of our work is to create a qualitative rather than a quantitative model for predicting hERG blockage, the variation of the activities among different cell lines may be tolerated.38 The distributions of molecular weight and IC50 shown in Figure S1 in the supplementary materials indicate that the dataset is relatively diverse and will not lead to data-biased prediction models. The molecules in the dataset were minimized in MOE by using molecular mechanics (MM) with the MMFF94 force field.39 The dataset is available from the supporting website: http://cadd.suda.edu.cn/admet.

We used six different thresholds (1, 5, 10, 20, 30, and 40 µM) to find the most suitable threshold for separating the hERG blockers and non-blockers in our dataset. A threshold as high as 50 µM was not considered because it is not typically relevant according to a previous investigation.16 First, the whole dataset was split into three parts: a training set of 620 molecules randomly selected from the dataset, a test set (test set I) of 120 molecules randomly selected from the dataset, and another test set (test set II) of 66 molecules from the WOMBAT-PK database. The classification models were developed based on the training set and validated on the two test sets. Moreover, in order to give a more stringent validation, a dataset of 1953 molecules with hERG binding activity from the PubChem bioassay database was used.
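The random partition described above can be sketched as follows. This is only an illustration: the compound identifiers, the function name `split_dataset`, and the seed are hypothetical, and the source does not state which random procedure was actually used.

```python
import random


def split_dataset(ids, n_train=620, n_test=120, seed=0):
    """Randomly partition compound identifiers into a training set and test set I."""
    if n_train + n_test > len(ids):
        raise ValueError("not enough compounds for the requested split")
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # fixed seed for reproducibility (assumption)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]


# 740 compounds = 806 in total minus the 66 WOMBAT-PK molecules kept apart as test set II.
# The identifiers below are placeholders, not the real compound names.
compound_ids = [f"cmpd_{i:03d}" for i in range(740)]
train_ids, test1_ids = split_dataset(compound_ids)
print(len(train_ids), len(test1_ids))
```

Keeping the WOMBAT-PK molecules out of the shuffle entirely is what makes test set II a truly external set in this sketch.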

2. Molecular Descriptors

Fourteen molecular descriptors widely used in ADME predictions40, 41 were used in our study. The descriptors include the octanol-water partition coefficient (AlogP) based on the method of Ghose and Crippen,42 the apparent partition coefficient at pH = 7.4 (logD) based on Csizmadia’s method,43 molecular solubility (logS) based on the multiple linear regression model developed by Tetko et al.,44 molecular weight (MW), the number of hydrogen bond donors (nHBD), the number of hydrogen bond acceptors (nHBA), the number of rotatable bonds (nrot), the number of rings (nR), the number of aromatic rings (nAR), the sum of oxygen and nitrogen atoms (nO+N), polar surface area (PSA), molecular fractional polar surface area (MFPSA), and molecular surface area (MSA). The descriptors can be divided into two classes: physicochemical properties (AlogP, logD, logS, MW, nHBD, nHBA, nR, nAR and nO+N) and geometry-related descriptors (PSA, MFPSA and MSA).45 All the descriptors were calculated by using the Discovery Studio molecular simulation package.46

3. Calculation of Molecular Fingerprints

The SciTegic extended-connectivity fingerprints (ECFP, FCFP and LCFP) and Daylight-style path-based fingerprints (EPFP, FPFP and LPFP) were generated using a variant of the Morgan algorithm.47 The initial assignment of atom identifiers follows one of three rules (F, E or L): F represents the atom’s functional role code, E represents the properties used in the Daylight atomic invariants rule, and L represents the AlogP atom type code. The number of connections to an atom, the element type, the charge, and the atomic mass form the atom type code. The second character in the fingerprint name, C or P, denotes the type of fingerprint: C represents extended-connectivity fingerprints and P represents path-based fingerprints. A fingerprint class is followed by an underscore and a number, which designates the effective diameter of the largest feature. Here, for each fingerprint class, three diameters (4, 6 and 8) were considered. The smaller diameter of 2 was not used because the structural fragments based on a diameter of 2 are too small and too general. Detailed descriptions of these fingerprints were reported in the literature.48
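To make the idea of a Morgan-style, diameter-limited fingerprint concrete, the following is a deliberately simplified sketch in plain Python. It is not the SciTegic implementation: the integer atom codes, the SHA-1 based hash, and the function names are assumptions made for illustration, and the duplicate-substructure removal used in real ECFP generation is omitted. The only convention taken from the text is that a diameter of 4, 6 or 8 corresponds to 2, 3 or 4 rounds of neighbor aggregation.

```python
import hashlib


def _hash(values):
    """Deterministic 32-bit hash of a (nested) tuple of integers."""
    digest = hashlib.sha1(repr(values).encode()).digest()
    return int.from_bytes(digest[:4], "big")


def circular_fingerprint(atom_codes, bonds, diameter=4):
    """Return the set of feature identifiers for atom environments up to `diameter` bonds wide.

    atom_codes: initial integer identifier per atom (hypothetical atom typing)
    bonds: list of (i, j) atom index pairs
    """
    neighbors = {i: [] for i in range(len(atom_codes))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)
    current = [_hash((0, code)) for code in atom_codes]
    features = set(current)
    # Each iteration widens every atom environment by one bond in each direction.
    for iteration in range(1, diameter // 2 + 1):
        current = [
            _hash((iteration, current[i], tuple(sorted(current[n] for n in neighbors[i]))))
            for i in range(len(atom_codes))
        ]
        features.update(current)
    return features


# Toy heavy-atom graph C-C-O, with atomic numbers standing in for atom type codes.
codes, bonds = [6, 6, 8], [(0, 1), (1, 2)]
fp4 = circular_fingerprint(codes, bonds, diameter=4)
fp8 = circular_fingerprint(codes, bonds, diameter=8)
print(len(fp4), len(fp8))
```

Because each identifier depends only on an atom's own code and the sorted codes of its neighbors, the resulting feature set does not depend on how the atoms are numbered, and the features at a smaller diameter are always contained in those at a larger one.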

The fingerprints we used here are significantly different from the substructures previously used in the prediction of ADME properties.49–52 Firstly, the fingerprints used here represent a much larger set of features than a set of pre-defined substructures. Furthermore, these fingerprints are generated directly from molecules, and they do not need to be pre-selected or pre-defined. Therefore, novel molecular classes are handled as easily as the common classes. Here, we generated the structural fingerprints by using the Discovery Studio molecular simulation package.46

4. Naïve Bayesian Classifiers

The naïve Bayesian classification technique was used to develop classifiers to discriminate between hERG blockers and non-blockers. Bayes' theorem relates the conditional and marginal probabilities of events A and B, provided that the probability of B is not equal to zero. Its simple form is shown by Eq. 1:

$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \qquad (1)$$

A naïve Bayesian classifier is a simple probabilistic classifier based on applying Bayes' theorem with strong (naïve) independence assumptions. A more descriptive term for the underlying probability model is "independent feature model". Abstractly, the probability model for a classifier is a conditional model P(C|A1,…, An) over a dependent class variable C with a small number of outcomes or classes, conditional on feature variables A1 through An. In our study, C represents the hERG activity class of a molecule: blocker class (+) or non-blocker class (−). A1 to An represent the calculated values of the feature variables (molecular properties and fingerprints). Applying Bayes' theorem, we get:

$$P(+ \mid A_1, \ldots, A_n) = \frac{P(A_1, \ldots, A_n \mid +)\,P(+)}{P(A_1, \ldots, A_n)} \qquad (2)$$

In plain English, the above equation is written as

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}} \qquad (3)$$

So in Equation 2, P(A1,…, An|+) is the conditional probability of observing the descriptor values A1,…, An for a compound that is hERG active (the likelihood); P(+) is the prior probability, estimated from the set of compounds in the training set; and P(A1,…, An) is the marginal probability that the given descriptor values occur in the training set. The naïve assumption is then that each feature Ai is conditionally independent of every other feature Aj, given the class. The mathematical procedure to train a naïve Bayesian classifier was described previously.53 An advantage of naïve Bayesian classification is that it requires only a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification. Moreover, Bayesian classification can process large amounts of data, learns fast, and is tolerant of random noise. The naïve Bayesian classifiers were developed in the Discovery Studio molecular simulation package.46
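As an illustration of how the independence assumption turns Equation 2 into a simple additive score, here is a minimal sketch of a naïve Bayesian classifier over binary (present/absent) fingerprint features. It is a textbook Bernoulli model with add-one smoothing, not the estimator implemented in Discovery Studio; the function names and the toy feature labels ("amine", "aryl", "acid") are hypothetical. A positive score favors the blocker class.

```python
import math


def train(samples, labels):
    """samples: list of sets of present features; labels: list of bools (True = blocker)."""
    features = sorted(set().union(*samples))
    n = {True: sum(labels), False: len(labels) - sum(labels)}
    prob = {}
    for cls in (True, False):
        for f in features:
            count = sum(1 for s, y in zip(samples, labels) if y == cls and f in s)
            prob[(f, cls)] = (count + 1) / (n[cls] + 2)  # add-one smoothing (assumption)
    return {"features": features, "prob": prob,
            "log_prior_odds": math.log(n[True] / n[False])}


def score(model, sample):
    """Log posterior odds of blocker vs. non-blocker under feature independence."""
    total = model["log_prior_odds"]
    for f in model["features"]:
        p_pos = model["prob"][(f, True)]
        p_neg = model["prob"][(f, False)]
        if f in sample:
            total += math.log(p_pos / p_neg)
        else:
            total += math.log((1 - p_pos) / (1 - p_neg))
    return total


# Toy training data: four made-up molecules described by made-up fragment labels.
toy_samples = [{"amine", "aryl"}, {"amine"}, {"acid"}, {"acid", "aryl"}]
toy_labels = [True, True, False, False]
model = train(toy_samples, toy_labels)
print(score(model, {"amine"}), score(model, {"acid"}))
```

Because the evidence term P(A1,…, An) is the same for both classes, it cancels in the odds and never has to be computed. The per-feature log ratios are also what allow individual fragments to be ranked as favorable or unfavorable for blockage.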

5. Recursive Partitioning Classifiers

RP is a statistical method for multivariable analysis that can create a decision tree to classify the members in a dataset based on predefined variables. Models are constructed by successively splitting a dataset into increasingly homogeneous subsets until it is infeasible to continue, based on a set of "stopping rules." At each splitting point or node, the RP algorithm searches a pool of independent variables (i.e., descriptors) and identifies a single variable and the corresponding splitting value that best purifies the group of compounds entering the node. The splitting process continues until either no further improvement can be achieved, or the number of compounds in each purified group is too small to justify further splitting. 5-fold cross-validation was used to determine the degree of pruning required for the best predictive performance. The minimum number of samples at each node was set to 7, and the maximum tree depth was changed from 2 to 10 systematically in order to evaluate the effect of the tree depth on the prediction. The decision trees were created in Discovery Studio molecular simulation package46 based on the training set and validated by three different test sets.
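The node-splitting step described above can be sketched as an exhaustive search over descriptors and candidate cut points. The source does not state which purity criterion the software uses, so the Gini impurity here is an assumption, as are the function names and the toy descriptor values; `min_samples` plays the role of the minimum number of samples per node.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 class labels (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels, min_samples=1):
    """Return (descriptor_index, threshold, weighted_impurity) of the purest split, or None."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted({row[j] for row in rows})
        # Candidate cut points are midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [y for row, y in zip(rows, labels) if row[j] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[j] > threshold]
            if len(left) < min_samples or len(right) < min_samples:
                continue
            impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or impurity < best[2]:
                best = (j, threshold, impurity)
    return best


# Toy data: column 0 is a logP-like value, column 1 a molecular-weight-like value.
toy_rows = [(1.0, 300.0), (4.5, 310.0), (0.5, 450.0), (5.0, 420.0)]
toy_classes = [0, 1, 0, 1]  # 1 = blocker, 0 = non-blocker (made-up labels)
split = best_split(toy_rows, toy_classes)
print(split)
```

A full tree would apply `best_split` recursively to the left and right subsets until a stopping rule (maximum depth, minimum node size, or no further gain) is met.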

6. Validating the prediction accuracy of the RP and Bayesian models

The quality of the Bayesian and RP classifiers was measured by the numbers of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and by the sensitivity (SE), the specificity (SP), the prediction accuracy for actives (blockers) (PRE1), the prediction accuracy for non-actives (non-blockers) (PRE2), the global accuracy (GA), and the Matthews correlation coefficient (C):

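$$SE = \frac{TP}{TP + FN} \qquad (10)$$

$$SP = \frac{TN}{TN + FP} \qquad (11)$$

$$PRE1 = \frac{TP}{TP + FP} \qquad (12)$$

$$PRE2 = \frac{TN}{TN + FN} \qquad (13)$$

$$GA = \frac{TP + TN}{TP + TN + FP + FN} \qquad (14)$$

$$C = \frac{TP \times TN - FN \times FP}{\sqrt{(TP + FN)(TP + FP)(TN + FN)(TN + FP)}} \qquad (15)$$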

The values of GA and C are two important indicators for the classification accuracy of models. The above quantities were calculated for both training and test sets.
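The six quantities can be computed directly from the four confusion-matrix counts. The sketch below is a plain transcription of the definitions; the function name is hypothetical, and returning 0.0 for C when a marginal count is zero is a convention chosen here, not one stated in the source.

```python
import math


def classification_metrics(tp, fn, fp, tn):
    """Compute SE, SP, PRE1, PRE2, GA and the Matthews coefficient C from counts."""
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    return {
        "SE": tp / (tp + fn),
        "SP": tn / (tn + fp),
        "PRE1": tp / (tp + fp),
        "PRE2": tn / (tn + fn),
        "GA": (tp + tn) / (tp + tn + fp + fn),
        "C": (tp * tn - fn * fp) / denom if denom else 0.0,
    }


# Counts taken from the BC-13 (ECFP_8, 30 µM) training-set row of Table 1.
metrics = classification_metrics(tp=242, fn=41, fp=49, tn=261)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Rounded to three decimals, these values agree with the BC-13 training-set entries in Table 1 (SE 0.855, SP 0.842, PRE1 0.832, PRE2 0.864, GA 0.848, C 0.696).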

Results and Discussions

1. The impact of eight molecular properties on hERG binding

For the prediction of ADME properties,41, 50, 52, 54–67 many molecular descriptors have proven helpful, and they have been used to describe a variety of molecular properties, such as lipophilicity, hydrogen bonding ability, molecular flexibility, and molecular weight. Here, we systematically examined the relationships between hERG blockage and eight molecular properties that are widely used in ADME predictions: MW, logS, logD7.4, AlogP, PSA, nHBA, nHBD and nrot. The distributions of these eight molecular properties for the active and inactive classes at the threshold of 30 µM are shown in Figure 1. We evaluated the significance of the difference between the class means by Student’s t-test. As a complementary test, the linear correlations between each of these eight molecular properties and the IC50 values of 557 compounds are shown in Figure 2.
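The two statistics used in this section can be sketched with the standard library alone. The code below computes the pooled-variance Student's t statistic for two groups and the Pearson correlation coefficient r; converting t into a p-value requires the t distribution, which is left out here. The function names and all numbers are illustrative and are not taken from the dataset.

```python
import math
from statistics import mean, variance


def student_t(a, b):
    """Two-sample Student's t statistic with pooled (equal) variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))


def pearson_r(x, y):
    """Pearson linear correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    norm = math.sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return cov / norm


# Made-up descriptor values for an "active" and an "inactive" group.
active = [2.0, 3.0, 4.0]
inactive = [1.0, 2.0, 3.0]
t_value = student_t(active, inactive)
r_value = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(round(t_value, 4), round(r_value, 4))
```

A large absolute t (and hence a small p-value) only says that the class means differ; as the overlapping histograms show, it does not by itself make a descriptor a good classifier.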

Figure 1.

Distributions of eight molecular properties, including AlogP, logD, logS, MW, PSA, nrot, nHBD and nHBA, for the active and inactive classes.

Figure 2.

Correlations between eight molecular properties, including AlogP, logD, logS, MW, PSA, nrot, nHBD and nHBA, and pIC50 values.

Among the eight molecular properties studied here, AlogP, logD and logS are related to the hydrophobicity of a molecule. AlogP is distributed between −5.339 and 11.512, with a mean of 1.856; logD between −5.122 and 11.512, with a mean of 2.186; and logS between −14.751 and 2.218, with a mean of −5.226. The mean values of AlogP for the 369 inactives versus the 344 actives are 1.334 and 2.416, respectively, suggesting that hERG binding capability increases with hydrophobicity. The p-value, which measures the difference in the means of the two distributions, is 3.26e−18 for AlogP at the 95% confidence level, indicating that the distributions of actives and inactives are significantly different. Compared with AlogP, logD shows better classification capability, with a smaller p-value of 5.14e−23, as shown in Figure 1. It is obvious that the actives tend to be more lipophilic than the inactives. Beyond the hERG channel, it is a general phenomenon that increasing lipophilicity is usually favorable for the binding of molecules to protein receptors.21 The mean values of logS for the 369 inactives versus the 344 actives are −4.01 and −6.53, respectively. Interestingly, logS shows even better classification capability than AlogP and logD, with a p-value of 3.85e−49. In Figure 2, we observed similar results: logS showed a better linear correlation (r = −0.480) with IC50 than AlogP (r = 0.28) and logD (r = 0.338). Based on the above analysis, we conclude that logS predicts hERG blockage better than logD and AlogP. However, the two distributions of logS for the actives and inactives still overlap strongly. Our results are quite interesting because hERG blockage does not appear to have a direct connection with solubility. However, solubility is also an important indicator of hydrophobicity.52 Therefore, it is quite possible that logS gives a better description of the hydrophobicity of the studied molecules.

As seen from Figure 1, molecular weight also has a strong impact on hERG binding. The mean values for the actives and inactives are 405.77 and 313.65, respectively. At the 95% confidence level, the p-value of the difference in MW between the two classes is 5.47e−27. This means that MW also has a great impact on hERG activity, but its ability to discriminate the actives from the inactives is worse than that of logS. Moreover, in Figure 1, the descriptor nrot is by itself a less desirable classifier for hERG binding because the p-value associated with the difference in the mean nrot values of the two classes is 1.74e−14, which is less significant than that of logD (5.20e−21). nrot has been used as a descriptor in many cases to characterize the bulkiness of a molecule because a larger molecule usually has more rotatable bonds.68 In this study, however, nrot does not show an obvious correlation with molecular weight (r = −0.13). According to a previous analysis, a molecule with a molecular weight of less than 250 is less likely to be a hERG blocker,69 which is consistent with the analysis shown in Figure 1. Almost all hERG blockers have a molecular weight larger than 250; however, most non-blockers also have a molecular weight larger than 250. Therefore, a molecular weight of 250 is not a good indicator for hERG binding.

The other three descriptors (PSA, nHBD and nHBA), which characterize electrostatic or H-bond features, show different capabilities to discriminate the actives from the inactives. The p-values associated with the difference in the mean values of the two classes are 7.836e−6, 0.0041 and 2.69e−4 for PSA, nHBD and nHBA, respectively. In Figure 2, very low correlations between IC50 values and PSA (r = −0.157), nHBD (r = 0.0378) and nHBA (r = −0.164) were observed. Our findings are not consistent with the reported pharmacophore models of hERG blockers because all these pharmacophore models have H-bond acceptor features. These three descriptors, PSA, nHBD and nHBA, are very general and have a low capacity to characterize the spatial constraints of pharmacophore features imposed by a pharmacophore model.

2. Naïve Bayesian classifiers

The analysis in the previous section shows that a single molecular property is not a good classification criterion for hERG binding. For this reason, a statistical technique, naïve Bayesian classification, popularly used in bioinformatic analysis, was applied to develop classification models.

We divided our dataset into three parts: a training set of 620 molecules, a test set of 120 molecules (test set I) and an external test set of 66 molecules from the WOMBAT-PK database (test set II). Different IC50 values have been proposed as the threshold to classify a compound as a hERG blocker or non-blocker. Aronov and Goldman adopted a threshold of 40 µM;38 Sun selected a threshold value of 30 µM;13 Tobita et al. used IC50 thresholds of 1 and 40 µM;19 Roche et al. opted for threshold values of 1 and 10 µM.21 Here, six thresholds were used: 1, 5, 10, 20, 30, and 40 µM. It must be noted that some compounds were removed from the training set because they are ungroupable at certain thresholds. For example, dexrazoxane (IC50 > 30 µM) is not included in the training set when a threshold of 40 µM is used. To find the best threshold and fingerprints for hERG classification, 36 classifiers, each based on the 14 molecular properties and one type of structural fingerprint set, were developed. The 36 structural fingerprint sets comprise ECFC, ECFP, EPFC, EPFP, FCFC, FCFP, FPFC, FPFP, LCFC, LCFP, LPFC and LPFP, each with three different diameters: 4, 6 and 8. The statistical results for these Bayesian classifiers are summarized in Table 1. For the training set, among all the fingerprint sets, the best Bayesian classifier, based on the 14 molecular properties and the LPFC_8 fingerprint set, has a sensitivity of 85.0%, a specificity of 92.9%, a prediction accuracy of 72.2% for the active class, a prediction accuracy of 96.6% for the inactive class and a GA of 91.5% based on leave-one-out (LOO) cross-validation when the threshold is 1 µM. However, the fingerprint sets that achieve the highest global accuracies are not always the same at different thresholds. For example, the fingerprint sets in the best Bayesian classifiers at the other five thresholds are EPFC_8 (5 µM, GA = 87.1%), EPFC_8 (10 µM, GA = 84.3%), LPFC_8 (20 µM, GA = 90.4%), LPFC_6 (30 µM, GA = 89.0%), and LPFC_8 (40 µM, GA = 88.7%), respectively.
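The handling of qualified IC50 values at different thresholds can be sketched as a small labeling function. The qualifier strings and the function name are hypothetical, and treating an exact IC50 strictly below the threshold as a blocker is an assumption about the boundary convention, which the text does not spell out.

```python
def label_compound(ic50_um, threshold_um, qualifier="="):
    """Return True (blocker), False (non-blocker) or None (ungroupable at this threshold).

    qualifier: "=" for an exact IC50, ">" for "greater than", "<" for "less than".
    """
    if qualifier == "=":
        return ic50_um < threshold_um  # boundary convention is an assumption
    if qualifier == ">":
        # IC50 > x is certainly a non-blocker only if x is at or above the threshold.
        return False if ic50_um >= threshold_um else None
    if qualifier == "<":
        # IC50 < x is certainly a blocker only if x is at or below the threshold.
        return True if ic50_um <= threshold_um else None
    raise ValueError(f"unknown qualifier: {qualifier!r}")


# A compound reported as IC50 > 30 µM, as in the dexrazoxane example above.
for threshold in (1, 5, 10, 20, 30, 40):
    print(threshold, label_compound(30.0, threshold, qualifier=">"))
```

Under this reading, a compound reported as IC50 > 30 µM is a usable non-blocker at thresholds up to 30 µM but must be dropped at 40 µM, which is why the training-set size varies slightly from one threshold to another.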

Table 1.

The performance of the Bayesian classifiers for the training and test sets based on different thresholds

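Training set

| Threshold | Model | Fingerprints | TP | FN | FP | TN | SE | SP | PRE1 | PRE2 | GA | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 µM | BC-1 | LPFC_8 | 91 | 16 | 35 | 458 | 0.850 | 0.929 | 0.722 | 0.966 | 0.915 | 0.733 |
| 1 µM | BC-2 | ECFP_4 | 81 | 26 | 46 | 447 | 0.757 | 0.907 | 0.638 | 0.945 | 0.880 | 0.622 |
| 1 µM | BC-3 | EPFC_8 | 94 | 13 | 62 | 431 | 0.879 | 0.874 | 0.603 | 0.971 | 0.875 | 0.657 |
| 5 µM | BC-4 | FCFP_8 | 135 | 47 | 59 | 358 | 0.742 | 0.859 | 0.696 | 0.884 | 0.823 | 0.590 |
| 5 µM | BC-5 | LCFC_6 | 148 | 34 | 70 | 347 | 0.813 | 0.832 | 0.679 | 0.911 | 0.826 | 0.617 |
| 5 µM | BC-6 | SEFP_4 | 144 | 38 | 64 | 353 | 0.791 | 0.847 | 0.692 | 0.903 | 0.830 | 0.616 |
| 10 µM | BC-7 | FCFP_6 | 191 | 34 | 101 | 273 | 0.849 | 0.730 | 0.654 | 0.889 | 0.775 | 0.561 |
| 10 µM | BC-8 | LCFP_8 | 167 | 58 | 72 | 302 | 0.742 | 0.807 | 0.699 | 0.839 | 0.783 | 0.544 |
| 10 µM | BC-9 | FCFC_6 | 193 | 32 | 90 | 284 | 0.858 | 0.759 | 0.682 | 0.899 | 0.796 | 0.599 |
| 20 µM | BC-10 | FCFP_8 | 229 | 35 | 66 | 264 | 0.867 | 0.800 | 0.776 | 0.883 | 0.830 | 0.663 |
| 20 µM | BC-11 | LCFP_6 | 240 | 24 | 84 | 246 | 0.909 | 0.745 | 0.741 | 0.911 | 0.818 | 0.653 |
| 20 µM | BC-12 | FCFP_4 | 220 | 44 | 59 | 271 | 0.833 | 0.821 | 0.789 | 0.860 | 0.827 | 0.652 |
| 30 µM | BC-13 | ECFP_8 | 242 | 41 | 49 | 261 | 0.855 | 0.842 | 0.832 | 0.864 | 0.848 | 0.696 |
| 30 µM | BC-14 | LCFP_6 | 249 | 34 | 65 | 245 | 0.880 | 0.790 | 0.793 | 0.878 | 0.833 | 0.671 |
| 30 µM | BC-15 | FCFP_6 | 248 | 35 | 61 | 249 | 0.876 | 0.803 | 0.803 | 0.877 | 0.838 | 0.679 |
| 40 µM | BC-16 | LCFP_4 | 254 | 49 | 65 | 235 | 0.838 | 0.783 | 0.796 | 0.827 | 0.811 | 0.623 |
| 40 µM | BC-17 | LCFC_4 | 260 | 43 | 60 | 240 | 0.858 | 0.800 | 0.813 | 0.848 | 0.829 | 0.659 |
| 40 µM | BC-18 | ECFC_8 | 233 | 70 | 29 | 271 | 0.769 | 0.903 | 0.889 | 0.795 | 0.836 | 0.678 |

Test set

| Threshold | Model | Fingerprints | TP | FN | FP | TN | SE | SP | PRE1 | PRE2 | GA | C |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 µM | BC-1 | LPFC_8 | 23 | 4 | 10 | 83 | 0.852 | 0.892 | 0.697 | 0.954 | 0.883 | 0.696 |
| 1 µM | BC-2 | ECFP_4 | 25 | 2 | 16 | 77 | 0.926 | 0.828 | 0.610 | 0.975 | 0.850 | 0.664 |
| 1 µM | BC-3 | EPFC_8 | 23 | 4 | 14 | 79 | 0.852 | 0.849 | 0.622 | 0.952 | 0.850 | 0.634 |
| 5 µM | BC-4 | FCFP_8 | 39 | 3 | 10 | 68 | 0.929 | 0.872 | 0.796 | 0.958 | 0.892 | 0.777 |
| 5 µM | BC-5 | LCFC_6 | 40 | 2 | 15 | 63 | 0.952 | 0.808 | 0.727 | 0.969 | 0.858 | 0.728 |
| 5 µM | BC-6 | SEFP_4 | 35 | 7 | 10 | 68 | 0.833 | 0.872 | 0.778 | 0.907 | 0.858 | 0.695 |
| 10 µM | BC-7 | FCFP_6 | 48 | 6 | 13 | 53 | 0.889 | 0.803 | 0.787 | 0.898 | 0.842 | 0.689 |
| 10 µM | BC-8 | LCFP_8 | 45 | 9 | 11 | 55 | 0.833 | 0.833 | 0.804 | 0.859 | 0.833 | 0.665 |
| 10 µM | BC-9 | FCFC_6 | 37 | 5 | 13 | 65 | 0.881 | 0.833 | 0.740 | 0.929 | 0.850 | 0.691 |
| 20 µM | BC-10 | FCFP_8 | 49 | 8 | 11 | 52 | 0.860 | 0.825 | 0.817 | 0.867 | 0.842 | 0.684 |
| 20 µM | BC-11 | LCFP_6 | 53 | 4 | 15 | 48 | 0.930 | 0.762 | 0.779 | 0.923 | 0.842 | 0.697 |
| 20 µM | BC-12 | FCFP_4 | 48 | 9 | 11 | 52 | 0.842 | 0.825 | 0.814 | 0.852 | 0.833 | 0.667 |
| 30 µM | BC-13 | ECFP_8 | 53 | 8 | 10 | 49 | 0.869 | 0.831 | 0.841 | 0.860 | 0.850 | 0.700 |
| 30 µM | BC-14 | LCFP_6 | 55 | 6 | 12 | 47 | 0.902 | 0.797 | 0.821 | 0.887 | 0.850 | 0.703 |
| 30 µM | BC-15 | FCFP_6 | 53 | 8 | 11 | 48 | 0.869 | 0.814 | 0.828 | 0.857 | 0.842 | 0.684 |
| 40 µM | BC-16 | LCFP_4 | 54 | 7 | 12 | 47 | 0.885 | 0.797 | 0.818 | 0.870 | 0.842 | 0.685 |
| 40 µM | BC-17 | LCFC_4 | 53 | 8 | 12 | 47 | 0.869 | 0.797 | 0.815 | 0.855 | 0.833 | 0.668 |
| 40 µM | BC-18 | ECFC_8 | 48 | 13 | 8 | 51 | 0.787 | 0.864 | 0.857 | 0.797 | 0.825 | 0.653 |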

All the Bayesian classifiers were then validated by predictions on test set I of 120 molecules. As shown in Table 1, the classifier that gives the best prediction for this test set is also based on the threshold of 1 µM. In addition, if we choose GA ≥ 0.85 for both the training and test sets as a criterion for good prediction, only the classifiers based on the threshold of 1 µM satisfy this criterion. The good performance of the Bayesian classifier based on the threshold of 1 µM is understandable: 1 µM is a relatively low threshold for hERG binding; therefore, the number of non-blockers in both the training and test sets is much larger than that of blockers. Classifiers based on such a highly biased dataset tend to give better predictions for the non-blocker class and yield better global accuracy. At each threshold, the average and standard deviation of the GA and C values of the 36 classifiers are shown in Table 2. For the training set, the classifiers based on the threshold of 30 µM perform best, judged by the mean value and standard deviation of GA. For the test set, the classifiers based on the threshold of 5 µM give the highest mean GA, but the standard deviation of GA is also the highest. The 5 µM models have the best mean C value, but their standard deviation is much higher than those of the models based on the other five thresholds, so they may not be stable models for classification. In our study, taking into consideration both GA and C for the training and test sets, the threshold of 30 µM may be the best choice. For the training and test sets, we compared the performance of the classifiers based on different fingerprint sets with or without the 14 molecular properties, and the results are shown in Figure 3. Clearly, compared with the classifiers based only on molecular properties, the addition of structural fingerprints improves the classification accuracy significantly.

Table 2.

Mean and standard deviation (SD) of global accuracy (GA) and C values for the Bayesian classifiers based on different thresholds

| Threshold | Training GA (mean) | Training GA (SD) | Training C (mean) | Training C (SD) | Test GA (mean) | Test GA (SD) | Test C (mean) | Test C (SD) |
|---|---|---|---|---|---|---|---|---|
| 1 µM | 0.829 | 0.0470 | 0.558 | 0.0751 | 0.800 | 0.0431 | 0.572 | 0.0655 |
| 5 µM | 0.806 | 0.0368 | 0.576 | 0.0694 | 0.809 | 0.0471 | 0.618 | 0.1050 |
| 10 µM | 0.776 | 0.0340 | 0.549 | 0.0650 | 0.774 | 0.0405 | 0.553 | 0.0858 |
| 20 µM | 0.826 | 0.0286 | 0.653 | 0.0575 | 0.785 | 0.0407 | 0.573 | 0.0828 |
| 30 µM | 0.831 | 0.0254 | 0.663 | 0.0507 | 0.794 | 0.0424 | 0.591 | 0.0836 |
| 40 µM | 0.819 | 0.0260 | 0.643 | 0.0524 | 0.792 | 0.0405 | 0.588 | 0.0795 |

Figure 3.

The comparison of global accuracy of the Bayesian classifiers using different thresholds based on molecular properties and fingerprints for the (a) training and (b) test sets.

The ability of the Bayesian classifier based on the 14 molecular properties and the ECFP_8 fingerprint set to distinguish blockers from non-blockers was evaluated with two bimodal histograms for the training and test sets (Figure 4). The histograms show that the Bayesian scores of the blockers tend to be more positive while those of the non-blockers tend to be more negative. Compared with the histograms of the molecular properties shown in Figure 1, the Bayesian classifier separates the two classes substantially better, and the overlapping area is much smaller than that shown in Figure 1. For the training set, the compounds in both classes overlap between −10 and 10. So the region between −10 and 0 can be defined as the “uncertain zone”. When the Bayesian score of a molecule is in the uncertain zone, the prediction for this molecule is not reliable. Of the 120 molecules in the test set, 28 are located in the uncertain zone. If these 28 molecules without reliable predictions are eliminated from the test set, the sensitivity and specificity improve from 86.9% and 83.0% to 88.9% and 92.1%, respectively.
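The effect of discarding low-confidence predictions can be sketched as follows. The scores and labels are made up, the function name is hypothetical, and both the decision cutoff of 0 and the zone bounds passed in the example are assumptions for illustration; the zone is a parameter so that any interval can be supplied.

```python
def sensitivity_specificity(scores, labels, cutoff=0.0, zone=None):
    """Return (SE, SP) for Bayesian scores; molecules scoring inside `zone` are skipped.

    scores: Bayesian scores, higher means more blocker-like
    labels: True for blockers, False for non-blockers
    zone: optional (low, high) interval treated as the uncertain zone
    """
    tp = fn = fp = tn = 0
    for s, is_blocker in zip(scores, labels):
        if zone is not None and zone[0] <= s <= zone[1]:
            continue  # prediction considered unreliable, so it is not counted
        predicted_blocker = s > cutoff
        if is_blocker:
            tp += predicted_blocker
            fn += not predicted_blocker
        else:
            fp += predicted_blocker
            tn += not predicted_blocker
    se = tp / (tp + fn) if (tp + fn) else float("nan")
    sp = tn / (tn + fp) if (tn + fp) else float("nan")
    return se, sp


# Made-up scores: the first four molecules are blockers, the last four are not.
toy_scores = [25.0, 12.0, 4.0, -3.0, -2.0, 6.0, -15.0, -30.0]
toy_is_blocker = [True, True, True, True, False, False, False, False]
all_molecules = sensitivity_specificity(toy_scores, toy_is_blocker)
confident_only = sensitivity_specificity(toy_scores, toy_is_blocker, zone=(-10.0, 10.0))
print(all_molecules, confident_only)
```

The trade-off is coverage: the metrics improve only because some molecules are left without a prediction, so the fraction of excluded molecules should always be reported alongside them.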

Figure 4.

The distributions of the Bayesian scores given by the Bayesian classifier based on molecular properties and the ECFP_8 fingerprint set for the active and inactive classes for the (a) training and (b) test sets. The Bayesian scores for the training set were obtained by using the LOO cross-validation process

3. Recursive partitioning models

Compared with the Bayesian classifiers, the decision trees given by RP are more “visible” because an RP model can be converted to simple hierarchical rules and is easier to interpret. Here the threshold of 30 µM was used. First, a decision tree based on the 14 molecular properties without fingerprints was constructed and evaluated, with the tree depth set to 7. For the training and test sets, the GA values are 84.5% and 79.2%, and the C values are 0.690 and 0.585, respectively, indicating that the prediction accuracy for the test set is much worse than that for the training set. Then, molecular fingerprints, together with molecular properties, were used simultaneously as the descriptors in the RP analysis. Here the ECFP_8 fingerprint set was added to the RP modeling. After the addition of fingerprints, the performance of the RP model becomes better: the GA values for the training and test sets are 86.8% and 80.8%, respectively.

The depth of a decision tree plays a very important role in RP analysis because it reflects the complexity of the tree, so the tree depth should be carefully calibrated based on the predictions for the external test set. Here, the tree depth was varied from 2 to 10 and the corresponding performance of the models for the training and test sets was evaluated. The change of the GA values versus the tree depth is shown in Figure 5. For the training set the value of GA increases with the tree depth; for the test set, however, the value of GA does not always increase. For the test set, a tree depth of 3 or 4 gives the best prediction, so according to our analysis the depth of 4 is the best choice. The best RP model, shown in Figure 6, gives a sensitivity of 0.869, a specificity of 0.763, a classification accuracy of 0.791 for the blocker class, a classification accuracy of 0.849 for the non-blocker class, and a C of 0.636 for the test set. Based on the same descriptors and data set, the naïve Bayesian classifier performs better than the RP classifier.

Figure 5.

The change of GA versus the tree depth for the training and test sets. The RP model was constructed based on molecular properties and the ECFP_8 fingerprint set.

Figure 6.

Decision tree created by RP to classify compounds into the blocker and non-blocker classes (tree depth is set to 4).
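The depth calibration described above can be retraced from the GA values reported in Table 3. The sketch below is a hedged illustration: the variable names are hypothetical, and the tie-break (prefer the tied depth with the higher training GA) is an assumption that happens to agree with the authors' choice of depth 4; the text does not state how the tie between depths 3 and 4 was resolved.

```python
# GA values copied from Table 3 (RP models with properties + ECFP_8):
# tree depth -> (training-set GA, test-set GA)
ga_by_depth = {
    2: (0.774, 0.792),
    3: (0.818, 0.817),
    4: (0.825, 0.817),
    5: (0.847, 0.792),
    6: (0.863, 0.808),
    7: (0.868, 0.808),
    8: (0.868, 0.800),
    9: (0.868, 0.808),
    10: (0.868, 0.808),
}

# Depths that reach the highest test-set GA.
best_test_ga = max(test for _, test in ga_by_depth.values())
best_depths = sorted(d for d, (_, test) in ga_by_depth.items() if test == best_test_ga)

# Assumed tie-break: among tied depths, take the one with the higher training GA.
chosen_depth = max(best_depths, key=lambda d: ga_by_depth[d][0])

# Training GA never decreases with depth, while test GA peaks early.
train = [ga_by_depth[d][0] for d in sorted(ga_by_depth)]
train_monotonic = all(a <= b for a, b in zip(train, train[1:]))

print(best_depths, chosen_depth, train_monotonic)
```

The widening gap between training and test GA beyond depth 4 is the usual sign that the deeper trees fit the training set more closely without improving external predictions.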

4. Model validation using the additional test sets

To evaluate and validate the actual prediction capability of the Bayesian classifier, two additional data sets were used.

Test set II, taken from the WOMBAT-PK database, contains 66 molecules. The six best Bayesian classifiers, one for each of six thresholds, were used to evaluate the prediction accuracy for this test set (Table 4). Here the Bayesian classifiers were built on the dataset of 740 molecules (the training set plus test set I) rather than only on the separate training set of 620 molecules. For test set II, the Bayesian classifier based on the threshold of 30 µM reaches a GA of 89.4%, a specificity of 53.8% and a sensitivity of 98.1%, while the classifier based on the threshold of 40 µM reaches a GA of 89.4%, a specificity of 72.7% and a sensitivity of 92.7%. Among the six classifiers shown in Table 4, five achieve GA values larger than 80% for the tested molecules. Compared with Li’s results, our predictions for the 66 tested molecules are much better. Based on the threshold of 40 µM, Li’s model achieved a best overall accuracy of only 72%, with 85% (47/55) for the blockers and 36% (4/11) for the non-blockers. Therefore, we believe that the Bayesian classifiers based on molecular properties and structural fingerprints are reliable and useful for the prediction of hERG toxicity across a wide range of chemical space.

Table 4.

Performance of the Bayesian classifiers for the WOMBAT-PK test set

Threshold Model Fingerprint | Training set: TP FN FP TN SE SP PRE1 PRE2 GA C | Test set: TP FN FP TN SE SP PRE1 PRE2 GA C
1 µM BC-19 ECFC_8 | 106 28 54 532 0.791 0.908 0.663 0.950 0.886 0.654 | 13 6 7 40 0.684 0.851 0.650 0.870 0.803 0.527
5 µM BC-20 FCFP_8 | 180 44 76 419 0.804 0.846 0.703 0.905 0.833 0.629 | 31 1 19 15 0.969 0.441 0.620 0.938 0.697 0.478
10 µM BC-21 LCFC_6 | 240 39 101 339 0.860 0.770 0.704 0.897 0.805 0.615 | 42 3 10 11 0.933 0.524 0.808 0.786 0.803 0.521
20 µM BC-22 LCFC_4 | 278 43 79 314 0.866 0.799 0.779 0.880 0.829 0.662 | 46 1 9 10 0.979 0.526 0.836 0.909 0.848 0.614
30 µM BC-23 ECFC_4 | 279 65 48 321 0.811 0.870 0.853 0.832 0.842 0.683 | 52 1 6 7 0.981 0.538 0.897 0.875 0.894 0.633
40 µM BC-24 ECFC_4 | 289 75 40 319 0.794 0.889 0.878 0.810 0.841 0.685 | 51 4 3 8 0.927 0.727 0.944 0.667 0.894 0.632
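The columns of Table 4 can be recomputed from the four confusion-matrix counts. The sketch below assumes the usual definitions: SE and SP are sensitivity and specificity, PRE1 and PRE2 are the prediction accuracies for the blocker and non-blocker classes, GA is the global accuracy, and C is taken to be the Matthews correlation coefficient. The definitions are not restated in this section, so they are assumptions, although they reproduce the tabulated values for the 30 µM classifier on the WOMBAT-PK test set. The function name is hypothetical.

```python
import math
from typing import Dict

def classification_metrics(tp: int, fn: int, fp: int, tn: int) -> Dict[str, float]:
    """Binary classification statistics from confusion-matrix counts."""
    return {
        "SE": tp / (tp + fn),                    # sensitivity: blockers recovered
        "SP": tn / (tn + fp),                    # specificity: non-blockers recovered
        "PRE1": tp / (tp + fp),                  # accuracy of 'blocker' predictions
        "PRE2": tn / (tn + fn),                  # accuracy of 'non-blocker' predictions
        "GA": (tp + tn) / (tp + fn + fp + tn),   # global accuracy
        # Matthews correlation coefficient (assumed meaning of C)
        "C": (tp * tn - fp * fn)
             / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }

# Test-set counts for the 30 µM classifier (BC-23) in Table 4.
m = classification_metrics(tp=52, fn=1, fp=6, tn=7)
print({name: round(value, 3) for name, value in m.items()})
```

For this row the high GA (0.894) sits alongside a specificity of only 0.538, because the test set contains 53 blockers and just 13 non-blockers; C, which accounts for all four counts, is a useful companion to GA on such unbalanced sets.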

We also downloaded a larger test set of hERG bioassay data from PubChem. The blocking activities provided by PubChem are expressed as the percentage of hERG blockade rather than as actual IC50 values. The molecules in the PubChem data set were classified as blockers or non-blockers using a threshold of 20% hERG blockade. The IC50 values collected in our data set cannot be mapped directly onto the percentage of hERG blockade in PubChem, so we investigated the relationship between them. In the PubChem bioassay, the value obtained for the positive control, 10 µM terfenadine, is defined as 100% hERG blockade. To find an appropriate IC50 threshold for validation against the PubChem test set, we tested the prediction performance of the best six classifiers shown in Table 1, and the results are summarized in Table 5. For the same PubChem data, Su and co-workers reported a rather low overall accuracy of 52% when using the threshold of 40 µM.17 Using the threshold of 40 µM, the predictions given by the Bayesian classifier developed here are much better than Su’s results, but they are still not satisfactory, as indicated by a relatively low GA (GA = 70.0%). Besides the threshold of 40 µM, Su et al. also tested the threshold of 10 µM, and they found that the classification model based on this threshold gave an overall accuracy of 65% for the PubChem data set. As shown in Table 5, the Bayesian classifier (GA = 71%) still outperforms Su’s model at the threshold of 10 µM. Among the six thresholds, the classifier based on the threshold of 1 µM yields the best GA value (86.1%). The PubChem test set is highly biased: the number of inactive compounds is almost seven times that of active ones, and the degree of bias is quite similar to that of the training set when the threshold of 1 µM is used.

Table 5.

Performance of the best Bayesian classifiers for the PubChem bioassay test set of hERG blockage using the percentage of 20% or 30% as the threshold

Threshold Model Fingerprint | Training set: TP FN FP TN SE SP PRE1 PRE2 GA C | Test set: TP FN FP TN SE SP PRE1 PRE2 GA C

Percentage of 20%
1 µM BC-25 LPFC_8 | 117 17 36 550 0.873 0.939 0.765 0.970 0.926 0.772 | 32 218 53 1650 0.128 0.969 0.376 0.883 0.861 0.159
5 µM BC-20 FCFP_8 | 180 44 76 419 0.804 0.846 0.703 0.905 0.833 0.629 | 87 163 231 1472 0.348 0.864 0.274 0.900 0.798 0.192
10 µM BC-26 FCFP_6 | 247 32 125 315 0.885 0.716 0.664 0.908 0.782 0.586 | 135 115 456 1247 0.540 0.732 0.228 0.916 0.708 0.198
20 µM BC-27 FCFP_8 | 281 40 73 320 0.875 0.814 0.794 0.889 0.842 0.686 | 124 126 393 1310 0.496 0.769 0.240 0.912 0.734 0.201
30 µM BC-28 ECFP_8 | 281 63 63 306 0.817 0.829 0.817 0.829 0.823 0.646 | 115 135 389 1314 0.460 0.772 0.228 0.907 0.732 0.177
40 µM BC-29 LCFP_4 | 292 72 58 301 0.802 0.838 0.834 0.807 0.820 0.641 | 92 158 307 1396 0.368 0.820 0.231 0.898 0.762 0.156

Percentage of 30%
1 µM BC-25 LPFC_8 | 117 17 36 550 0.873 0.939 0.765 0.970 0.926 0.772 | 42 518 43 1350 0.075 0.969 0.494 0.723 0.713 0.098
5 µM BC-20 FCFP_8 | 180 44 76 419 0.804 0.846 0.703 0.905 0.833 0.629 | 149 411 169 1224 0.266 0.879 0.469 0.749 0.703 0.177
10 µM BC-26 FCFP_6 | 247 32 125 315 0.885 0.716 0.664 0.908 0.782 0.586 | 254 306 337 1056 0.454 0.758 0.430 0.775 0.671 0.208
20 µM BC-27 FCFP_8 | 281 40 73 320 0.875 0.814 0.794 0.889 0.842 0.686 | 331 229 288 1105 0.591 0.793 0.535 0.828 0.735 0.374
30 µM BC-28 ECFP_8 | 281 63 63 306 0.817 0.829 0.817 0.829 0.823 0.646 | 345 215 289 1104 0.616 0.793 0.544 0.837 0.742 0.395
40 µM BC-29 LCFP_4 | 292 72 58 301 0.802 0.838 0.834 0.807 0.820 0.641 | 178 382 221 1172 0.318 0.841 0.446 0.754 0.691 0.179
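The labelling of the PubChem compounds and the class imbalance noted above can be illustrated with the counts in Table 5. This is a minimal sketch: the function name is hypothetical, and treating a compound exactly at the cutoff as a blocker is an assumption, since the text does not say how ties are handled.

```python
def label_by_blockade(percent_blockade: float, cutoff: float = 20.0) -> bool:
    """True = blocker. Counting equality as 'blocker' is an assumption."""
    return percent_blockade >= cutoff

# Test-set counts for the 1 µM classifier (BC-25) at the 20% cutoff (Table 5).
tp, fn, fp, tn = 32, 218, 53, 1650

actives = tp + fn            # blockers in the PubChem test set
inactives = fp + tn          # non-blockers in the PubChem test set
imbalance = inactives / actives

se = tp / actives
sp = tn / inactives
# GA is the prevalence-weighted mean of SE and SP, so on a set dominated by
# inactives it tracks the specificity much more closely than the sensitivity.
ga = (se * actives + sp * inactives) / (actives + inactives)

print(actives, inactives, round(imbalance, 1), round(se, 3), round(sp, 3), round(ga, 3))
```

With roughly 6.8 inactives per active, the GA of 0.861 for this classifier is driven almost entirely by its specificity (0.969), while its sensitivity is 0.128, which is why the SE, SP and C columns of Table 5 should be read together with GA.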

5. Analysis of the important fragments given by naïve Bayesian classifier

The important fragments given by the Bayesian classifier may be useful for experimental scientists when designing molecules that avoid hERG binding. The 15 good and 15 bad fragments, favorable and unfavorable for hERG binding respectively and ranked by their Bayesian scores, are shown in Figure 7.

By analyzing the fingerprints with positive contributions to hERG binding shown in Figure 7a, we observe that most fragments (14 fragments) have nitrogen atoms, and four of the top five fragments have positively charged nitrogen atoms. Previous studies suggest that a positively charged nitrogen in general increases the likelihood of hERG binding.69 The basic nitrogen is involved in the cation-π interaction with Tyr652. Moreover, in some fragments, such as fragments 1, 2, 3, and 4, the tertiary amine group is linked to a hydrophobic tail, and the hydrophobic part of these fragments may form strong van der Waals or hydrophobic interactions with some residues in hERG, such as Phe656. We also found that some fragments in Figure 7a, such as fragments 2, 5, 10, 11, 13 and 15, have hydrophobic or aromatic rings. It is believed that these hydrophobes may form favorable hydrophobic or π-π interactions with Phe656 and Tyr652.

Figure 7.

(a) The 15 good and (b) 15 bad fragments for hERG identified by the Bayesian classifier based on molecular properties and the ECFP_8 fingerprint set.

The top 15 fingerprints unfavorable for hERG binding are shown in Figure 7b. Analysis of these fragments indicates that the top 5 fragments have at least one oxygen atom or a carboxylic acid. As discussed above, the positive ionizable nitrogen is important for the ligand-hERG interaction. The oxygen atom is negatively charged and cannot form a cation-π interaction with Tyr652; moreover, fragments with a carboxylic acid group are usually hydrophilic and are therefore also unfavorable for the ligand-hERG hydrophobic interactions. Another interesting finding is that two fragments (9 and 12) in Figure 7b have nitrogen atoms; however, the nitrogen in fragment 9 is not positively charged and that in fragment 12 is a quaternary amine, and neither is favorable for forming a cation-π interaction with Tyr652. Clearly, not all nitrogen atoms are favorable for ligand binding.
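How a fingerprint-based Bayesian classifier arrives at "good" and "bad" fragments can be sketched with the Laplacian-corrected estimator described by Rogers et al. (ref 48). This section does not spell out the formula, so its use here is an assumption: a fragment present in n molecules, a of them blockers, receives the weight ln[(a + 1) / (n · p + 1)], where p is the overall fraction of blockers, and the score of a molecule is the sum of the weights of its fragments. The fragment labels and the four-molecule data set are hypothetical and purely illustrative.

```python
import math
from typing import Dict, List, Set

def fragment_weights(fingerprints: List[Set[str]], labels: List[bool]) -> Dict[str, float]:
    """Laplacian-corrected naive Bayesian weight for every fragment."""
    base_rate = sum(labels) / len(labels)        # overall fraction of blockers
    counts: Dict[str, List[int]] = {}            # fragment -> [n molecules, n blockers]
    for fragments, is_blocker in zip(fingerprints, labels):
        for frag in fragments:
            entry = counts.setdefault(frag, [0, 0])
            entry[0] += 1
            entry[1] += int(is_blocker)
    return {frag: math.log((a + 1) / (n * base_rate + 1))
            for frag, (n, a) in counts.items()}

def bayesian_score(fragments: Set[str], weights: Dict[str, float]) -> float:
    """Sum of fragment weights; fragments unseen in training contribute 0."""
    return sum(weights.get(frag, 0.0) for frag in fragments)

# Hypothetical toy data: two blockers, two non-blockers.
toy_fps = [{"tertiary_amine", "aryl"}, {"tertiary_amine"},
           {"aryl", "carboxylic_acid"}, {"carboxylic_acid"}]
toy_labels = [True, True, False, False]

weights = fragment_weights(toy_fps, toy_labels)
print(sorted(weights.items(), key=lambda kv: -kv[1]))
print(bayesian_score({"tertiary_amine", "aryl"}, weights))
```

In this toy set the amine fragment gets a positive weight, the carboxylic acid a negative one, and the aryl fragment, seen equally often in both classes, a weight of zero. Ranking the weights gives lists analogous to Figure 7, and because the score is a plain sum over fragments, it cannot encode how the fragments are arranged in space.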

6. Why can some molecules not be predicted correctly?

In the test set, 10 of the 59 non-blockers were misclassified as false positives and 8 of the 61 blockers were misclassified as false negatives by the naïve Bayesian classifier based on the threshold of 30 µM (Figure 8). Among the 18 outliers, 14 have Bayesian scores located in the uncertain zone from −10 to 10. We believe that the following reasons can account for the misclassification.

Figure 8.

(a) The 10 non-blockers and (b) 8 blockers misclassified by the Bayesian classifier based on molecular properties and the ECFP_8 fingerprint set.

First, the misclassification may be caused primarily by the intrinsic limitation of Bayesian classifiers based on fingerprints. All of the 10 misclassified non-blockers have aromatic rings that are thought to form π-π interactions with Phe656 and Tyr652. Moreover, most of them contain one or more of the important groups favorable for hERG blockage shown in Figure 7, e.g. an amine. It should be noted that the hERG binding capability of a molecule is not solely determined by the presence of the important fragments favorable for hERG binding. The spatial location and arrangement of these fragments are also essential, but they cannot be well characterized by the 2-D fingerprints used in our study. Moreover, if one positively charged nitrogen is already involved in the interaction with Tyr652, the other positively charged nitrogen atoms cannot make any further positive contribution. For example, esevenlafaxine has seven nitrogen atoms, and its hERG blockage may therefore be highly overestimated.

Second, IC50 is an important but not a direct measure of hERG inhibition, and the IC50 threshold used to define the blocker and non-blocker classes is arbitrary. Six misclassified blockers do not have precise IC50 values, suggesting that they may be weak hERG blockers and therefore prone to misclassification. Third, the experimental data were collected from a variety of sources, and this heterogeneity increases the data uncertainty and the prediction complexity. Moreover, it is quite possible that some of the data collected from the literature contain experimental errors.

Conclusions

In this study, based on an extensive hERG inhibition dataset of 806 molecules, we first examined the relationships between eight important molecular properties and hERG binding capability. We found that solubility and molecular weight were more important for hERG binding than the other six molecular properties, but no single molecular property can discriminate between hERG blockers and non-blockers efficiently. Then, naïve Bayesian classification and the RP technique were used to establish classifiers that distinguish blockers from non-blockers. The Bayesian classifiers perform better than the RP models. Using the threshold of 30 µM, the best Bayesian classifier with the 14 molecular properties and the ECFP_8 fingerprint set demonstrates good predictivity, with high prediction accuracies for the training set (GA = 84.8%), test set I (GA = 85%), test set II (GA = 89.4%), and the PubChem test set (GA = 75.3%), indicating that the naïve Bayesian classifiers are efficient for the prediction of hERG blockage. Moreover, the important molecular fragments favorable or unfavorable for hERG blockage are highlighted by the Bayesian analysis, and they are helpful for the design of new drugs that avoid hERG blockage.

The spatial arrangement of the features important for hERG binding may not be well characterized by the Bayesian classifier alone if these features are not located in the same fingerprint. In the future, prediction models based on different principles could be developed to complement each other. The combination of two or more such models may then give higher confidence in predicting hERG blockage.
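One simple way to combine models built on different principles is to accept a prediction only when all models agree. The sketch below is an assumption about how such a combination might look, not a method proposed in this study; the function name and the unanimity rule are hypothetical.

```python
from typing import Optional, Sequence

def consensus_call(calls: Sequence[bool]) -> Optional[bool]:
    """Return the shared call (True = blocker) when every model agrees,
    otherwise None to flag the molecule as having no confident prediction."""
    if not calls:
        raise ValueError("at least one model prediction is required")
    first = calls[0]
    return first if all(call == first for call in calls) else None

# Hypothetical calls from a fingerprint-based and a structure-based model.
print(consensus_call([True, True]))    # both models predict 'blocker'
print(consensus_call([True, False]))   # disagreement: no confident call
```

Molecules on which the models disagree play a role similar to those in the uncertain zone of the Bayesian score: they are set aside rather than forced into a class.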

Table 3.

The performance of the RP models based on the ECFP_8 fingerprint set for the training and test sets based on different tree depths

Tree depth Model | Training set: TP FN FP TN SE SP PRE1 PRE2 GA C | Test set: TP FN FP TN SE SP PRE1 PRE2 GA C
2 RP-1 | 209 74 60 250 0.739 0.806 0.777 0.772 0.774 0.547 | 48 13 12 47 0.787 0.797 0.800 0.783 0.792 0.583
3 RP-2 | 235 48 60 250 0.830 0.806 0.797 0.839 0.818 0.636 | 53 8 14 45 0.869 0.763 0.791 0.849 0.817 0.636
4 RP-3 | 233 50 54 256 0.823 0.826 0.812 0.837 0.825 0.649 | 53 8 14 45 0.869 0.763 0.791 0.849 0.817 0.636
5 RP-4 | 254 29 62 248 0.898 0.800 0.804 0.895 0.847 0.698 | 55 6 19 40 0.902 0.678 0.743 0.870 0.792 0.596
6 RP-5 | 255 28 53 257 0.901 0.829 0.828 0.902 0.863 0.730 | 55 6 17 42 0.902 0.712 0.764 0.875 0.808 0.626
7 RP-6 | 251 32 46 264 0.887 0.852 0.845 0.892 0.868 0.738 | 54 7 16 43 0.885 0.729 0.771 0.860 0.808 0.623
8 RP-7 | 251 32 46 264 0.887 0.852 0.845 0.892 0.868 0.738 | 53 8 16 43 0.869 0.729 0.768 0.843 0.800 0.604
9 RP-8 | 251 32 46 264 0.887 0.852 0.845 0.892 0.868 0.738 | 54 7 16 43 0.885 0.729 0.771 0.860 0.808 0.623
10 RP-9 | 251 32 46 264 0.887 0.852 0.845 0.892 0.868 0.738 | 54 7 16 43 0.885 0.729 0.771 0.860 0.808 0.623

Acknowledgements

This study was supported by the National Science Foundation of China (20973121 to T. Hou), the National Basic Research Program of China (973 program, 2012CB932600 to T. Hou), the NIH (R21GM097617 to J. Wang) and the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD).

Footnotes

Supporting materials

The distributions of molecular weight and IC50 are shown in Figure S1.

References

  • 1.Sanguinetti MC, Tristani-Firouzi M. hERG potassium channels and cardiac arrhythmia. Nature. 2006;440:463–469. doi: 10.1038/nature04710. [DOI] [PubMed] [Google Scholar]
  • 2.Recanatini M, Poluzzi E, Masetti M, Cavalli A, De Ponti F. QT prolongation through hERG K+ channel blockade: Current knowledge and strategies for the early prediction during drug development. Medicinal Research Reviews. 2005;25:133–166. doi: 10.1002/med.20019. [DOI] [PubMed] [Google Scholar]
  • 3.Witchel HJ. The hERG potassium channel as a therapeutic target. Expert Opinion on Therapeutic Targets. 2007;11:321–336. doi: 10.1517/14728222.11.3.321. [DOI] [PubMed] [Google Scholar]
  • 4.Keating MT, Sanguinetti MC. Molecular genetic insights into cardiovascular disease. Science. 1996;272:681–685. doi: 10.1126/science.272.5262.681. [DOI] [PubMed] [Google Scholar]
  • 5.Hancox JC, Mitcheson JS. Combined hERG channel inhibition and disruption of trafficking in drug-induced long QT syndrome by fluoxetine: a case-study in cardiac safety pharmacology. British Journal of Pharmacology. 2006;149:457–459. doi: 10.1038/sj.bjp.0706890. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6.Polak S, Wisniowska B, Brandys J. Collation, assessment and analysis of literature in vitro data on hERG receptor blocking potency for subsequent modeling of drugs' cardiotoxic properties. Journal of Applied Toxicology. 2009;29:183–206. doi: 10.1002/jat.1395. [DOI] [PubMed] [Google Scholar]
  • 7.Farid R, Day T, Friesner RA, Pearlstein RA. New insights about HERG blockade obtained from protein modeling, potential energy mapping, and docking studies. Bioorganic & Medicinal Chemistry. 2006;14:3160–3173. doi: 10.1016/j.bmc.2005.12.032. [DOI] [PubMed] [Google Scholar]
  • 8.Rajamani R, Tounge BA, Li J, Reynolds CH. A two-state homology model of the hERG K+ channel: application to ligand binding. Bioorganic & Medicinal Chemistry Letters. 2005;15:1737–1741. doi: 10.1016/j.bmcl.2005.01.008. [DOI] [PubMed] [Google Scholar]
  • 9.Ekins S, Crumb WJ, Sarazan RD, Wikel JH, Wrighton SA. Three-dimensional quantitative structure-activity relationship for inhibition of human ether-a-go-go-related gene potassium channel. Journal of Pharmacology and Experimental Therapeutics. 2002;301:427–434. doi: 10.1124/jpet.301.2.427. [DOI] [PubMed] [Google Scholar]
  • 10.Cavalli A, Poluzzi E, De Ponti F, Recanatini M. Toward a pharmacophore for drugs inducing the long QT syndrome: Insights from a CoMFA study of HERG K+ channel blockers. Journal of Medicinal Chemistry. 2002;45:3844–3853. doi: 10.1021/jm0208875. [DOI] [PubMed] [Google Scholar]
  • 11.Pearlstein RA, Vaz RJ, Kang JS, Chen XL, Preobrazhenskaya M, Shchekotikhin AE, Korolev AM, Lysenkova LN, Miroshnikova OV, Hendrix J, Rampe D. Characterization of HERG potassium channel inhibition using CoMSiA 3D QSAR and homology modeling approaches. Bioorganic & Medicinal Chemistry Letters. 2003;13:1829–1835. doi: 10.1016/s0960-894x(03)00196-3. [DOI] [PubMed] [Google Scholar]
  • 12.Cianchetta G, Li Y, Kang JS, Rampe D, Fravolini A, Cruciani G, Vaz RJ. Predictive models for hERG potassium channel blockers. Bioorganic & Medicinal Chemistry Letters. 2005;15:3637–3642. doi: 10.1016/j.bmcl.2005.03.062. [DOI] [PubMed] [Google Scholar]
  • 13.Sun HM. An accurate and interpretable Bayesian classification model for prediction of hERG liability. Chemmedchem. 2006;1:315–322. doi: 10.1002/cmdc.200500047. [DOI] [PubMed] [Google Scholar]
  • 14.Thai KM, Ecker GF. Binary QSAR model for classification of hERG potassium channel blockers. Bioorganic & Medicinal Chemistry. 2008;16:4107–4119. doi: 10.1016/j.bmc.2008.01.017. [DOI] [PubMed] [Google Scholar]
  • 15.Song MH, Clark M. Development and evaluation of an in silico model for hERG binding. Journal of Chemical Information and Modeling. 2006;46:392–400. doi: 10.1021/ci050308f. [DOI] [PubMed] [Google Scholar]
  • 16.Li QY, Jorgensen FS, Oprea T, Brunak S, Taboureau O. hERG classification model based on a combination of support vector machine method and GRIND descriptors. Molecular Pharmaceutics. 2008;5:117–127. doi: 10.1021/mp700124e. [DOI] [PubMed] [Google Scholar]
  • 17.Su BH, Shen MY, Esposito EX, Hopfinger AJ, Tseng YJ. In Silico Binary Classification QSAR Models Based on 4D-Fingerprints and MOE Descriptors for Prediction of hERG Blockage. Journal of Chemical Information and Modeling. 2010;50:1304–1318. doi: 10.1021/ci100081j. [DOI] [PubMed] [Google Scholar]
  • 18.Kongsamut S, Kang JS, Chen XL, Roehr J, Rampe D. A comparison of the receptor binding and HERG channel affinities for a series of antipsychotic drugs. European Journal of Pharmacology. 2002;450:37–41. doi: 10.1016/s0014-2999(02)02074-5. [DOI] [PubMed] [Google Scholar]
  • 19.Tobita M, Nishikawa T, Nagashima R. A discriminant model constructed by the support vector machine method for HERG potassium channel inhibitors. Bioorganic & Medicinal Chemistry Letters. 2005;15:2886–2890. doi: 10.1016/j.bmcl.2005.03.080. [DOI] [PubMed] [Google Scholar]
  • 20.Du LP, Li MY, You QD, Xia L. A novel structure-based virtual screening model for the hERG channel blockers. Biochemical and Biophysical Research Communications. 2007;355:889–894. doi: 10.1016/j.bbrc.2007.02.068. [DOI] [PubMed] [Google Scholar]
  • 21.Roche O, Trube G, Zuegge J, Pflimlin P, Alanine A, Schneider G. A virtual screening method for prediction of the hERG potassium channel liability of compound libraries. Chembiochem. 2002;3:455–459. doi: 10.1002/1439-7633(20020503)3:5<455::AID-CBIC455>3.0.CO;2-L. [DOI] [PubMed] [Google Scholar]
  • 22.Gintant G. An evaluation of hERG current assay performance: Translating preclinical safety studies to clinical QT prolongation. Pharmacology & Therapeutics. 2011;129:109–119. doi: 10.1016/j.pharmthera.2010.08.008. [DOI] [PubMed] [Google Scholar]
  • 23.De Bruin ML, Pettersson M, Meyboom RHB, Hoes AW, Leufkens HGM. Anti-HERG activity and the risk of drug-induced arrhythmias and sudden death. European Heart Journal. 2005;26:590–597. doi: 10.1093/eurheartj/ehi092. [DOI] [PubMed] [Google Scholar]
  • 24.Gavaghan CL, Arnby CH, Blomberg N, Strandlund G, Boyer S. Development, interpretation and temporal evaluation of a global QSAR of hERG electrophysiology screening data. Journal of Computer-Aided Molecular Design. 2007;21:189–206. doi: 10.1007/s10822-006-9095-6. [DOI] [PubMed] [Google Scholar]
  • 25.Mittelstadt SW, Hemenway CL, Craig MP, Hove JR. Evaluation of zebrafish embryos as a model for assessing inhibition of hERG. Journal of Pharmacological and Toxicological Methods. 2008;57:100–105. doi: 10.1016/j.vascn.2007.10.004. [DOI] [PubMed] [Google Scholar]
  • 26.Obrezanova O, Csanyi G, Gola JMR, Segall MD. Gaussian processes: A method for automatic QSAR Modeling of ADME properties. Journal of Chemical Information and Modeling. 2007;47:1847–1857. doi: 10.1021/ci7000633. [DOI] [PubMed] [Google Scholar]
  • 27.Obrezanova O, Segall MD. Gaussian Processes for Classification: QSAR Modeling of ADMET and Target Activity. Journal of Chemical Information and Modeling. 2010;50:1053–1061. doi: 10.1021/ci900406x. [DOI] [PubMed] [Google Scholar]
  • 28.Fraley ME, Garbaccio RM, Arrington KL, Hoffman WF, Tasber ES, Coleman PJ, Buser CA, Walsh ES, Hamilton K, Fernandes C, Schaber MD, Lobell RB, Tao WK, South VJ, Yan YW, Kuo LC, Prueksaritanont T, Shu C, Torrent M, Heimbrook DC, Kohl NE, Huber HE, Hartman GD. Kinesin spindle protein (KSP) inhibitors. Part 2: The design, synthesis, and characterization of 2,4-diaryl-2,5-dihydropyrrole inhibitors of the mitotic kinesin KSP. Bioorganic & Medicinal Chemistry Letters. 2006;16:1775–1779. doi: 10.1016/j.bmcl.2006.01.030. [DOI] [PubMed] [Google Scholar]
  • 29.Garbaccio RM, Fraley ME, Tasber ES, Olson CM, Hoffman WF, Arrington KL, Torrent M, Buser CA, Walsh ES, Hamilton K, Schaber MD, Fernandes C, Lobell RB, Tao WK, South VJ, Yan YW, Kuo LC, Prueksaritanont T, Slaughter DE, Shu C, Heimbrook DC, Kohl NE, Huber HE, Hartman GD. Kinesin spindle protein (KSP) inhibitors. Part 3: Synthesis and evaluation of phenolic 2,4-diaryl-2,5-dihydropyrroles with reduced hERG binding and employment of a phosphate prodrug strategy for aqueous solubility. Bioorganic & Medicinal Chemistry Letters. 2006;16:1780–1783. doi: 10.1016/j.bmcl.2005.12.094. [DOI] [PubMed] [Google Scholar]
  • 30.McBriar MD, Guzik H, Shapiro S, Xu R, Paruchova J, Clader JW, O'Neill K, Hawes B, Sorota S, Margulis M, Tucker K, Weston DJ, Cox K. Bicyclo[3.1.0]hexyl urea melanin concentrating hormone (MCH) receptor-1 antagonists: Impacting hERG liability via aryl modifications. Bioorganic & Medicinal Chemistry Letters. 2006;16:4262–4265. doi: 10.1016/j.bmcl.2006.05.069. [DOI] [PubMed] [Google Scholar]
  • 31.Alberati D, Hainzl D, Jolidon S, Krafft EA, Kurt A, Maier A, Pinard E, Thomas AW, Zimmerli D. Discovery of 4-substituted-8-(2-hydroxy-2-phenyl-cyclohexyl)-2,8-diaza-spiro[4.5]decan-1-one as a novel class of highly selective GlyT1 inhibitors with improved metabolic stability. Bioorganic & Medicinal Chemistry Letters. 2006;16:4311–4315. doi: 10.1016/j.bmcl.2006.05.058. [DOI] [PubMed] [Google Scholar]
  • 32.Price DA, Armour D, de Groot M, Leishman D, Napier C, Perros M, Stammen BL, Wood A. Overcoming HERG affinity in the discovery of the CCR5 antagonist maraviroc. Bioorganic & Medicinal Chemistry Letters. 2006;16:4633–4637. doi: 10.1016/j.bmcl.2006.06.012. [DOI] [PubMed] [Google Scholar]
  • 33.Zhu BY, Jia ZJ, Zhang PL, Su T, Huang WR, Goldman E, Tumas D, Kadambi V, Eddy P, Sinha U, Scarborough RM, Song YH. Inhibitory effect of carboxylic acid group on hERG binding. Bioorganic & Medicinal Chemistry Letters. 2006;16:5507–5512. doi: 10.1016/j.bmcl.2006.08.039. [DOI] [PubMed] [Google Scholar]
  • 34.Fluxe A, Wu SD, Sheffer JB, Janusz JM, Murawsky M, Fadayel GM, Fang B, Hare M, Djandjighian L. Discovery and synthesis of tetrahydroindolone-derived carbamates as Kv1.5 blockers. Bioorganic & Medicinal Chemistry Letters. 2006;16:5855–5858. doi: 10.1016/j.bmcl.2006.08.059. [DOI] [PubMed] [Google Scholar]
  • 35.Wu SD, Fluxe A, Janusz JM, Sheffer JB, Browning G, Blass B, Cobum K, Hedges R, Murawsky M, Fang B, Fadayel GM, Hare M, Djandjighian L. Discovery and synthesis of tetrahydroindolone derived semicarbazones as selective Kv1.5 blockers. Bioorganic & Medicinal Chemistry Letters. 2006;16:5859–5863. doi: 10.1016/j.bmcl.2006.08.057. [DOI] [PubMed] [Google Scholar]
  • 36.Lynch JK, Freeman JC, Judd AS, Iyengar R, Mulhern M, Zhao G, Napier JJ, Wodka D, Brodjian S, Dayton BD, Falls D, Ogiela C, Reilly RM, Campbell TJ, Polakowski JS, Hernandez L, Marsh KC, Shapiro R, Knourek-Segel V, Droz B, Bush E, Brune M, Preusser LC, Fryer RM, Reinhart GA, Houseman K, Diaz G, Mikhail A, Limberis JT, Sham HL, Collins CA, Kym PR. Optimization of chromone-2-carboxamide melanin concentrating hormone receptor 1 antagonists: Assessment of potency, efficacy, and cardiovascular safety. Journal of Medicinal Chemistry. 2006;49:6569–6584. doi: 10.1021/jm060683e. [DOI] [PubMed] [Google Scholar]
  • 37.Nisius B, Goller AH, Bajorath J. Combining Cluster Analysis, Feature Selection and Multiple Support Vector Machine Models for the Identification of Human Ether-a-go-go Related Gene Channel Blocking Compounds. Chemical Biology & Drug Design. 2009;73:17–25. doi: 10.1111/j.1747-0285.2008.00747.x. [DOI] [PubMed] [Google Scholar]
  • 38.Aronov AM, Goldman BB. A model for identifying HERG K+ channel blockers. Bioorganic & Medicinal Chemistry. 2004;12:2307–2315. doi: 10.1016/j.bmc.2004.02.003. [DOI] [PubMed] [Google Scholar]
  • 39.Halgren TA. Merck molecular force field .1. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry. 1996;17:490–519. [Google Scholar]
  • 40.Hou T, Wang J. Structure - ADME relationship: still a long way to go? Expert Opinion on Drug Metabolism & Toxicology. 2008;4:759–770. doi: 10.1517/17425255.4.6.759. [DOI] [PubMed] [Google Scholar]
  • 41.Hou TJ, Li YY, Zhang W, Wang JM. Recent Developments of In Silico Predictions of Intestinal Absorption and Oral Bioavailability. Combinatorial Chemistry & High Throughput Screening. 2009;12:497–506. doi: 10.2174/138620709788489082. [DOI] [PubMed] [Google Scholar]
  • 42.Ghose AK, Viswanadhan VN, Wendoloski JJ. Prediction of hydrophobic (lipophilic) properties of small organic molecules using fragmental methods: An analysis of ALOGP and CLOGP methods. Journal of Physical Chemistry A. 1998;102:3762–3772. [Google Scholar]
  • 43.Csizmadia F, TsantiliKakoulidou A, Panderi I, Darvas F. Prediction of distribution coefficient from structure .1. Estimation method. Journal of Pharmaceutical Sciences. 1997;86:865–871. doi: 10.1021/js960177k. [DOI] [PubMed] [Google Scholar]
  • 44.Tetko IV, Tanchuk VY, Kasheva TN, Villa AEP. Estimation of aqueous solubility of chemical compounds using E-state indices. Journal of Chemical Information and Computer Sciences. 2001;41:1488–1493. doi: 10.1021/ci000392t. [DOI] [PubMed] [Google Scholar]
  • 45.Xue Y, Li ZR, Yap CW, Sun LZ, Chen X, Chen YZ. Effect of molecular descriptor feature selection in support vector machine classification of pharmacokinetic and toxicological properties of chemical agents. Journal of Chemical Information and Computer Sciences. 2004;44:1630–1638. doi: 10.1021/ci049869h. [DOI] [PubMed] [Google Scholar]
  • 46.Discovery Studio 2.5 Guide. San Diego: Accelrys Inc.; 2009. http://www.accelrys.com. [Google Scholar]
  • 47.Rogers D, Hahn M. Extended-Connectivity Fingerprints. Journal of Chemical Information and Modeling. 2010;50:742–754. doi: 10.1021/ci100050t. [DOI] [PubMed] [Google Scholar]
  • 48.Rogers D, Brown RD, Hahn M. Using extended-connectivity fingerprints with Laplacian-modified Bayesian analysis in high-throughput screening follow-up. Journal of Biomolecular Screening. 2005;10:682–686. doi: 10.1177/1087057105281365. [DOI] [PubMed] [Google Scholar]
  • 49.Yoshida F, Topliss JG. QSAR model for drug human oral bioavailability. Journal of Medicinal Chemistry. 2000;43:2575–2585. doi: 10.1021/jm0000564. [DOI] [PubMed] [Google Scholar]
  • 50.Hou TJ, Xia K, Zhang W, Xu XJ. ADME evaluation in drug discovery. 4. Prediction of aqueous solubility based on atom contribution approach. Journal of Chemical Information and Computer Sciences. 2004;44:266–275. doi: 10.1021/ci034184n. [DOI] [PubMed] [Google Scholar]
  • 51.Hou TJ, Xu XJ. ADME evaluation in drug discovery. 2. Prediction of partition coefficient by atom-additive approach based on atom-weighted solvent accessible surface areas (vol 43, pg 1058, 2003) Journal of Chemical Information and Computer Sciences. 2004;44:1516–1516. doi: 10.1021/ci034007m. [DOI] [PubMed] [Google Scholar]
  • 52.Wang JM, Hou TJ, Xu XJ. Aqueous Solubility Prediction Based on Weighted Atom Type Counts and Solvent Accessible Surface Areas. Journal of Chemical Information and Modeling. 2009;49:571–581. doi: 10.1021/ci800406y. [DOI] [PubMed] [Google Scholar]
  • 53.Chen L, Li YY, Zhao Q, Peng H, Hou TJ. ADME Evaluation in Drug Discovery. 10. Predictions of P-Glycoprotein Inhibitors Using Recursive Partitioning and Naive Bayesian Classification Techniques. Molecular Pharmaceutics. 2011;8:889–900. doi: 10.1021/mp100465q. [DOI] [PubMed] [Google Scholar]
  • 54.Hou TJ, Wang JM, Li YY. ADME evaluation in drug discovery. 8. The prediction of human intestinal absorption by a support vector machine. Journal of Chemical Information and Modeling. 2007;47:2408–2415. doi: 10.1021/ci7002076. [DOI] [PubMed] [Google Scholar]
  • 55.Beresford AP, Selick HE, Tarbit MH. The emerging importance of predictive ADME simulation in drug discovery. Drug Discovery Today. 2002;7:109–116. doi: 10.1016/s1359-6446(01)02100-6. [DOI] [PubMed] [Google Scholar]
  • 56.Cruciani C, Crivori P, Carrupt PA, Testa B. Molecular fields in quantitative structure-permeation relationships: the VolSurf approach. Journal of Molecular Structure-Theochem. 2000;503:17–30. [Google Scholar]
  • 57.Gola J, Obrezanova O, Champness E, Segall M. ADMET property prediction: The state of the art and current challenges. Qsar & Combinatorial Science. 2006;25:1172–1180. [Google Scholar]
  • 58.Hou TJ, Wang JM, Zhang W, Wang W, Xu X. Recent advances in computational prediction of drug absorption and permeability in drug discovery. Current Medicinal Chemistry. 2006;13:2653–2667. doi: 10.2174/092986706778201558. [DOI] [PubMed] [Google Scholar]
  • 59.Hou TJ, Wang JM, Zhang W, Xu XJ. ADME evaluation in drug discovery. 7. Prediction of oral absorption by correlation and classification. Journal of Chemical Information and Modeling. 2007;47:208–218. doi: 10.1021/ci600343x. [DOI] [PubMed] [Google Scholar]
  • 60.Huuskonen J. Estimation of water solubility from atom-type electrotopological state indices. Environmental Toxicology and Chemistry. 2001;20:491–497. [PubMed] [Google Scholar]
  • 61. Dearden JC. In silico prediction of ADMET properties: How far have we come? Expert Opinion on Drug Metabolism & Toxicology. 2007;3:635–639. doi: 10.1517/17425255.3.5.635.
  • 62. Jolivette LJ, Ekins S. Methods for predicting human drug metabolism. Advances in Clinical Chemistry, Vol 43. 2007;43:131–176. doi: 10.1016/s0065-2423(06)43005-5.
  • 63. Liu RF, Sun HM, So SS. Development of quantitative structure-property relationship models for early ADME evaluation in drug discovery. 2. Blood-brain barrier penetration. Journal of Chemical Information and Computer Sciences. 2001;41:1623–1632. doi: 10.1021/ci010290i.
  • 64. Norinder U, Haeberlein M. Computational approaches to the prediction of the blood-brain distribution. Advanced Drug Delivery Reviews. 2002;54:291–313. doi: 10.1016/s0169-409x(02)00005-4.
  • 65. Norinder U, Bergstrom CAS. Prediction of ADMET properties. ChemMedChem. 2006;1:920–937. doi: 10.1002/cmdc.200600155.
  • 66. van de Waterbeemd H, Gifford E. ADMET in silico modelling: Towards prediction paradise? Nature Reviews Drug Discovery. 2003;2:192–204. doi: 10.1038/nrd1032.
  • 67. Zhao YH, Le J, Abraham MH, Hersey A, Eddershaw PJ, Luscombe CN, Boutina D, Beck G, Sherborne B, Cooper I, Platts JA. Evaluation of human intestinal absorption data and subsequent derivation of a quantitative structure-activity relationship (QSAR) with the Abraham descriptors. Journal of Pharmaceutical Sciences. 2001;90:749–784. doi: 10.1002/jps.1031.
  • 68. Hou TJ, Wang JM, Zhang W, Xu XJ. ADME evaluation in drug discovery. 6. Can oral bioavailability in humans be effectively predicted by simple molecular property-based rules? Journal of Chemical Information and Modeling. 2007;47:460–463. doi: 10.1021/ci6003515.
  • 69. Aronov AM. Predictive in silico modeling for hERG channel blockers. Drug Discovery Today. 2005;10:149–155. doi: 10.1016/S1359-6446(04)03278-7.

Supplementary Materials

1_si_001
