Author manuscript; available in PMC: 2009 Nov 5.
Published in final edited form as: J Med Chem. 2006 May 4;49(9):2713–2724. doi: 10.1021/jm050260x

The Development of Quantitative Structure-Binding Affinity Relationship (QSBR) Models Based on Novel Geometrical Chemical Descriptors of the Protein-Ligand Interfaces

Shuxing Zhang 1, Alexander Golbraikh 1, Alexander Tropsha 1,*
PMCID: PMC2773514  NIHMSID: NIHMS144489  PMID: 16640331

Abstract

Novel geometrical chemical descriptors have been derived based on the computational geometry of protein-ligand interfaces and Pauling atomic electronegativities (EN). Delaunay tessellation has been applied to a diverse set of 517 X-ray characterized protein-ligand complexes yielding a unique collection of interfacial nearest neighbor atomic quadruplets for each complex. Each quadruplet composition was characterized by a single descriptor calculated as the sum of the EN values for the four participating atom types. We termed these simple descriptors generated from atomic EN values and derived with the Delaunay Tessellation the ENTess descriptors and used them in the variable selection k-Nearest Neighbor quantitative structure-binding affinity relationship (QSBR) studies of 264 diverse protein-ligand complexes with known binding constants. 24 complexes with chemically dissimilar ligands were set aside as an independent validation set, and the remaining dataset of 240 complexes was divided into multiple training and test sets. The best models were characterized by the leave-one-out cross-validated correlation coefficient q2 as high as 0.66 for the training set and the correlation coefficient R2 as high as 0.83 for the test set. High predictive power of these models was confirmed independently by applying them to the validation set of 24 complexes yielding R2 as high as 0.85. We conclude that QSBR models built with the ENTess descriptors can be instrumental for predicting the binding affinity of receptor-ligand complexes.

Keywords: Receptor-Ligand Interactions, Delaunay Tessellation, k-Nearest Neighbors, Quantitative Structure-Activity Relationships, QSAR, Binding Affinity, Geometrical Chemical Descriptors, Model Validation, Consensus Prediction

INTRODUCTION

The prediction of the protein-ligand binding affinity is a critical component of computational drug discovery. Rapid growth of the Protein Data Bank1 provides opportunities to enhance current protocols for molecular docking and scoring, which are at the core of structure-based drug design25 and hit identification68. Accurate estimation of binding affinities, or at least correct relative ranking of different ligands has proven to be a difficult task due to multiple energetic and entropic factors that must be accounted for9,10. The limited accuracy of current scoring functions is one of the problems hampering the broad application of docking and virtual screening in lead optimization.

Many scoring functions have been developed over the years. Force field scoring is based on the classical molecular force field (such as AMBER11, CHARMM12, MMFF9413) to compute non-bonded interaction terms between the receptor and ligand atoms. Additional empirical terms taking into account the effects of solvation and entropy have also been considered14. The second family of methods includes so-called empirical scoring functions such as LUDI15, VALIDATE16, and ChemScore17. They are based on the concept that the receptor-ligand interaction energy can be approximated by a multivariate regression of different parameters, e.g., the number of hydrogen bonds, lipophilicity, ionic interactions, entropy penalties, etc. Recently, a third family of methods, based on statistical scoring functions (e.g., DrugScore18, SMoG19,20, PMF21, BLEEP22, and distance dependent atom pair descriptors23) has become popular. These methods employ the statistical analysis of known receptor-ligand complexes to define the pairwise inter-atomic potential of protein-ligand interaction. After the calibration on the training set of complexes, these scoring functions are validated by predicting binding affinities for the complexes of the test sets.

Since the force field based scoring functions are too computationally demanding to allow for efficient virtual screening of large databases24, the application of this method is usually limited to small datasets. Of the three approaches outlined above, empirical scoring functions are the most computationally efficient and therefore most widely used in current docking programs.

Knowledge-based scoring functions are based on the compositional analysis of protein-ligand complexes. They derive their origin from protein fold recognition studies in the 70’s25. Today the growing sources1,2628 of structural information on protein-ligand complexes provide great advantages for the continuing development and enhancement of statistical scoring functions. Studies have shown that in many cases knowledge-based scoring functions surpass both force field-based and empirical ones in predicting correct binding modes and affinities of the ligands. At the same time, they are fast and accurate, and at least comparable to empirical scoring functions in the efficiency of virtual screening of large databases and combinatorial lead design24,8,18,2022,29.

All methodologies discussed above rely on the availability of structural information about protein-ligand complexes and are classified as structure-based drug design approaches. In contrast, ligand-based approaches rely only on the experimental structure-activity relationships for ligands. Quantitative structure-activity relationship (QSAR)30 methods are typically used to find correlations between ligands’ binding affinities and their chemical descriptors. Some 3D-QSAR methods such as comparative molecular field analysis (CoMFA) have been developed to find correlations between binding affinities and the steric, electrostatic, hydrophobic, and other fields surrounding small molecules.3133 The “fields” are thought to simulate the active site environment, but they do not actually consider the receptor geometry or the structural information of the active site (although CoMFA does provide an option to use active site atoms as opposed to a “probe” atom to sample the interaction fields). Several so-called receptor-dependent quantitative structure-activity relationship (RD-QSAR) methods have been developed that rely on the receptor structure information to calculate independent variables23,34. Holloway and co-workers35 have derived a highly significant 3D-QSAR model for HIV-1 protease and its peptidomimetic inhibitors and used it to predict binding affinities for newly designed ligands. Several other authors16,36,37 have developed new methodologies by considering all of the enthalpic and entropic contributions as well as solvation effects of the receptor-ligand interactions and treating them as independent variables in the RD-QSAR development.

In this paper, we present a hybrid methodology to predict the binding affinities for a highly diverse dataset of protein-ligand complexes using concepts from both structure-based and ligand-based approaches. It is based on a four-body statistical scoring function derived by the combined application of the Delaunay tessellation of protein-ligand complexes and the definition of chemical atom types using the fundamental chemical concept of atomic electronegativity. As described in our previous publications,3842 Delaunay tessellation naturally partitions a tertiary structure of a protein or a protein-ligand complex into an aggregate of space-filling, irregular tetrahedra, or simplices; the vertices of the simplices are quadruplets of nearest neighbor residues or atoms, respectively (Figure 1). Thus Delaunay tessellation reduces a complex three-dimensional structure to a collection of explicit, elementary atomic quadruplet structural motifs. The four vertices (atoms) of a simplex form a particular quadruplet composition, and the chemical properties of the atom types can characterize the type of the tetrahedron.

Figure 1.

Illustration of Voronoi/Delaunay tessellation in 2D space (Voronoi polyhedra are represented by dashed lines, and Delaunay simplices by solid lines). For a collection of points with 3D coordinates, such as the atoms of a protein-ligand complex, Delaunay simplices are tetrahedra whose vertices correspond to the atoms.

Atom types can be defined in a number of ways16,2022,43. In general, atoms can be classified into polar and non-polar carbon atoms, HBA (hydrogen bond acceptor) and HBD (hydrogen bond donor), X (halogens), M (metals), cations, anions, and hydrophobic atoms. Herein we present an unconventional way to define atom types using a scale of Pauling electronegativities (EN). To the best of our knowledge, EN has never been used previously to define atom types in a statistical scoring function. We apply atomic EN values to generate descriptors of all quadruplet atomic compositions observed frequently at the interface of ligand-receptor complexes in a training set of 517 diverse X-ray characterized protein-ligand complexes: the single descriptor for a specific composition is obtained as the sum of the EN values for the composing chemical atom types. Since these descriptors are based on constructs from computational geometry (Delaunay Tessellation) combined with a fundamental chemical property of the composing atom types, the Pauling EN, we term them geometrical chemical, or ENTess, descriptors. Herein, we report on the use of the ENTess descriptors as independent variables in multivariate correlation analysis of the experimental dataset of 264 diverse protein-ligand complexes with known binding constants. Following the protocols for developing validated and predictive QSAR models established in the course of our previous studies4447, we have divided this dataset into training, test, and independent validation sets. We report statistically significant Quantitative Structure-Binding Affinity Relationships (QSBR) models capable of predicting the binding affinities of ligands in the independent validation set with an R2 as high as 0.85.

MATERIALS AND METHODS

1. Datasets

In order to develop the ENTess descriptors, we have used two datasets. The first dataset included 517 protein-ligand complexes with high resolution (below 3.0Å) X-ray crystal structures2,4,16,18,2022,28,4850. This dataset was used to generate the statistics of quadruplet atom compositions resulting from Delaunay tessellation of protein-ligand interfaces as discussed below. The second dataset was a subset of the first dataset. It included 264 protein-ligand complexes with known binding affinities (pKi) ranging between 1.48 (1XLI) and 13.96 (7CPA) log units of molar concentration. The molecular weight of ligands ranged from tens to more than one thousand Daltons. The data were collected from the recent publications2,4,16,18,2022,28,4850. All of the structures in the datasets were prepared for the subsequent analysis as follows: hydrogen atoms and water molecules were discarded; ligands were extracted from the protein-ligand complex structures using SYBYL 6.9 and the ligand structures were fixed according to Relibase which is an online ligand-receptor structure database51. We followed the routine that was used by Gohlke and co-workers in their DrugScore development18.

2. Structural and Functional Diversity Analysis of the 264 Complexes

In order to evaluate the structural and functional diversity of this dataset, we have classified the 264 complexes into different families based on their structural and functional annotations using SWISS-PROT/PDB cross-referencing system52. According to this system, each PDB entry is cross-referenced with the SWISS-PROT code, primary gene name (gene expressing that protein) and its source or species of origin. If two proteins have the same primary gene names, they will have very high sequence identity and their structures will be very similar. The family associations of all training set complexes are shown in Table 1. In those cases where no cross-referenced information was available (e.g., PDB entries 1dbb, 1mcf, etc.) the complexes were placed in a group called “MISC”.

Table 1.

The 264 Protein-Ligand Complexes and the Primary Gene Name-based Family Classification

Family Name    Number of Complexes    PDB Codes of the Complexes
SUBI 1 1sbp
ACON 3 8acn 7acn 5acn
6PGD 1 1pgp
PHHY 2 1phh 2phh
F16P 3 1fbc 1fbf 1fbp
IDH 2 5icd 8icd
TRY1 9 1ppc 1pph 3ptb 1tng 1tnh 1tni 1tnj
1tnk 1tnl
FKB1 1 1fkf
SAV 1 1stp
MDHC 1 4mdh
DAPB 1 1dih
RBSB 1 2dri
RBL2 2 1rus 9rub
TYSY 2 2tsc 1tlc
PENP 7 1ppk 1ppl 1ppm 1apt 1apu 1apv 1apw
RENI 1 1rne
CARP 13 6apr 4er1 4er2 4er4 1eed 2er0 2er6
2er7 2er9 5er2 3er3 1epo 1epp
PYRB 1 8atc
XYLA 6 4xia 1xli 2xim 2xis 5xia 8xia
THER 10 2tmn 5tln 5tmn 3tmn 6tmn 1tlp 1tmn
4tln 4tmn 7tln
AMYG 1 1dog
PMG1 1 3pgm
HISJ 1 1hsl
PLMN 1 2pk4
ENO1 3 1ebg 5enl 6enl
CPXA 4 5cpp 1phf 1phg 2cpp
CAH2 16 1a42 1cil 1cim 1cin 1bn1 1bn3 1bn4
1bnm 1bnn 1bnq 1bnt 1bnu 1bnv 1bnw
1bcd 1am6
LDH 1 2ldb
CBPA 7 2ctc 8cpa 3cpa 6cpa 1cps 1cbx 7cpa
HV20 1 2mcp
NUC 2 1snc 2sns
TTHY 1 1tha
POL 27 1hih 4hvp 1pro 1dif 2upj 5hvp 1hpv
1hpx 8hvp 1hbv 4phv 1sbg 1hsg 1hvk
1hvr 1hvs 1hps 9hvp 1hos 1hte 1htf
1htg 1hvi 1hvj 1hvl 1aaq 7hvp
RASH 1 5p21
SYY 1 4ts1
TPIS 5 2ypi 6tim 4tim 7tim 5tim
FABI 1 2ifb
CAT3 3 3cla 1cla 4cla
KAD3 1 2ak3
HEMA 1 4hmg
RNT1 3 6rnt 1rnt 2rnt
LDHA 2 1ldm 9ldt
LDHB 1 5ldh
OPPA 27 1b05 1b0h 1b1h 1b32 1jet 1jeu 1jev
1b2h 1b40 1b46 1b3f 1b3g 1b3h 1b3l
1b51 1b58 1b4h 1b4z 1b5h 1b5i 1b5j
1b6h 1b7h 1b9j 1qka 1qkb 2olb
THRB 4 1etr 1ets 1ett 1tmt
PRLA 6 8lpr 3lpr 6lpr 9lpr 7lpr 5lpr
MM07 3 1mmp 1mmq 1mmr
MM08 3 1mmb 1mnc 1jao
PNPH 1 1ulb
CATA 1 7cat
LYCV 10 181l 182l 1nhb 183l 184l 185l 186l
187l 1l83 188l
GSHR 1 4gr1
CATD 1 1lyb
AATM 1 9aat
NRAM 3 1nnb 1nsc 1nsd
GLNA 1 1lgr
MYG 1 1mbi
PRTA 2 4sga 5sga
ARAF 9 1apb 6abp 1abe 1abf 9abp 1bap 7abp
5abp 8abp
TRY1_TRY2 1 1bra
RETB 1 1rbp
ADHE 2 1adb 1adf
CISY 3 2csc 3csc 1csc
DYR 5 1dhf 4dfr 7dfr 1dr1 1drf
ITHH 3 1dwb 1dwc 1dwd
DGAL 1 2gbp
MALE 1 1mdq
FLAV 1 3fx2
EL1 4 7est 1ela 1elb 1elc
CONA 1 5cna
MISC 14 1dbb 1dbj 1dbk 1dbm 2dbl 1mcb 1mcf
1mch 1mcj 1mcs 1mfe 2cgr 3gap 4fab

Based on the SWISS-PROT annotation, the 264 complexes were classified into 71 families reflecting the high functional and structural diversity of this dataset. Some families had multiple members and some had only one member. All of the protein structures within one family were similar but the ligand structures were different; for different families both protein and ligand structures were dissimilar. We have found that 14 PDB entries were not annotated in SWISS-PROT/PDB cross-referencing system and they have been classified into “MISC” family.

3. Atom Type Definitions

In order to develop simple yet robust chemical geometrical descriptors, we sought some fundamental atomic property that could be attributed to any chemical atom type of either receptor or ligand and could be useful in describing interatomic interactions at the ligand-receptor interface. We decided to use the Pauling electronegativity53 as a parameter to characterize atom types. According to the chemical potential equalization principle as described by Itskowitz and Berkowitz,54 electronegativity is the first order term in the energy function of molecules:

$$E(Q_a) = E_0 + \sum_a \mu_a^{*} Q_a + \frac{1}{2} \sum_a \tilde{\eta}_a Q_a^{2} + \cdots \tag{1}$$

where E is the energy of the molecule, μa is the electronegativity of atom a, Qa is the partial charge on atom a, and η̃ is the hardness kernel. E0 is the collection of terms independent of Qa; so electronegativity is the main factor determining the atom’s polarity and its ability to form a hydrogen bond. For example, oxygen has high electronegativity and high ability to form hydrogen bond and it is a polar atom type in most cases. Thus, electronegativity could be used to describe the interactions between protein and ligand atoms. Hall et al. have introduced electrotopological state (E-state) indices, which are indirectly related to electronegativity, and successfully used them in QSAR studies of many datasets55. Recently Zefirov et al. used electronegativity equalization scheme as a source of electronic descriptors to study some types of chemical reactivity and obtained good models for thermodynamic and kinetic data such as proton affinity and Taft's inductive sigma* constants.56

To collect the most representative statistics of possible ligand atom types, we relied on chemical databases of biologically active organic compounds from the National Cancer Institute (NCI). The first database contains 237,771 compounds57 and the second includes 30,000 compounds tested against 60 human cancer cell lines58. If an atom type occurred in more than 5,000 of the 237,771 compounds in the first NCI database, and in more than 1,500 of the 30,000 compounds in the NCI cancer database, we classified it as an independent atom type. For example, O (EN=3.4), N (3.0), C (2.5) and S (2.4) were classified as independent atom types according to their electronegativity values and their high occurrence in the databases. Although halogens (F, Cl, Br and I) and P are also important atom types, each of them occurs fewer than 5,000 times in the NCI database and fewer than 1,500 times in the NCI cancer database, so they were classified into the same atom type X [P has very similar electronegativity value to that of halogens except for F (between 2.0 ~ 2.4)]. Similarly, all metal atoms have electronegativity values within 0.6 ~ 1.6 and, along with some other rare atom types, were classified into the same atom type M. Atom type definition for proteins is simpler, since only four atom types (C, N, O and S) occur in natural amino acids.

In order to distinguish ligand vs. protein atoms, we have classified the protein and ligand C, N, O and S as different atom types. Hydrogen atoms were not considered since usually they are not defined explicitly in the X-ray structures. Thus, we have defined four atom types for receptor proteins and six atom types for the ligands. In total, there were 554 possible types of interfacial atomic quadruplet compositions, and each of them gave rise to an independent variable (a sum of EN values for composing atom types) for our QSBR studies. Atom type definitions are summarized in Table 2.

Table 2.

Atom Type Definitions

Ligand Atom Types
O EN = 3.4
N EN = 3.0
C EN = 2.5
S EN = 2.4
X P and Halogens, EN = 2.0 ~ 2.4, 4.0
M Metal and all other rare atom types, EN = 0.6 ~ 1.6
Receptor Atom Types
O EN = 3.4
N EN = 3.0
C EN = 2.5
S EN = 2.4

4. Delaunay Tessellation of the Protein-Ligand Interfaces

We have developed programs for the protein-ligand complex tessellation based on the nnsort method59. The protein-ligand interfaces were defined by tetrahedra formed by both protein and ligand atoms. A distance cutoff value of 8Å was used to exclude Delaunay simplices with long edges (exceeding the physically meaningful interaction distance) between vertices. As shown in Figure 2, we have distinguished three classes of interfacial tetrahedra, i.e., RRRL, RRLL and RLLL, where each R and L corresponds to a receptor and ligand atom, respectively. Across these three classes we further defined a total of 554 types of quadruplet compositions based on our definition of chemical atom types (cf. Table 2) without taking into account the order of the atoms in the quadruplet. For example, all quadruplets with atom types C_L, C_R, S_L and X_L were assigned to the same [X_L, S_L, C_L, C_R] composition type.
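The total of 554 composition types can be checked by direct enumeration. The sketch below assumes only what the text states: four receptor atom types, six ligand atom types, and unordered quadruplets containing at least one atom from each side. The type labels (`C_R`, `X_L`, and so on) follow the example in the text; the variable and function names are illustrative.

```python
from itertools import combinations_with_replacement

# Atom types from Table 2; the "_R"/"_L" suffixes mark receptor and ligand atoms.
RECEPTOR_TYPES = ["C_R", "N_R", "O_R", "S_R"]
LIGAND_TYPES = ["C_L", "N_L", "O_L", "S_L", "X_L", "M_L"]

def compositions(n_receptor, n_ligand):
    """Unordered quadruplet compositions with the given numbers of
    receptor and ligand atoms (order within the quadruplet is ignored)."""
    return [
        r + l
        for r in combinations_with_replacement(RECEPTOR_TYPES, n_receptor)
        for l in combinations_with_replacement(LIGAND_TYPES, n_ligand)
    ]

counts = {
    "RRRL": len(compositions(3, 1)),
    "RRLL": len(compositions(2, 2)),
    "RLLL": len(compositions(1, 3)),
}
total = sum(counts.values())
print(counts, total)  # {'RRRL': 120, 'RRLL': 210, 'RLLL': 224} 554
```

The three classes contribute 120, 210 and 224 types, which add up to the 554 independent variables mentioned above.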

Figure 2.

Topological Tetrahedral Types:

RRRL: Formed by three receptor atoms and one ligand atom;

RRLL: Formed by two receptor atoms and two ligand atoms;

RLLL: Formed by one receptor atom and three ligand atoms.
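As an illustration of the interface definition, the following sketch classifies a single Delaunay simplex as RRRL, RRLL or RLLL and applies the 8Å edge cutoff. It assumes the simplices have already been produced by a tessellation program (the authors used their own nnsort-based code, which is not reproduced here); the function name, argument layout and toy coordinates are hypothetical.

```python
from itertools import combinations
from math import dist

EDGE_CUTOFF = 8.0  # Angstrom, the cutoff quoted in the text

def interface_class(simplex, coords, is_ligand, cutoff=EDGE_CUTOFF):
    """Return 'RRRL', 'RRLL' or 'RLLL' for an interfacial tetrahedron,
    or None if it is not interfacial or has an edge longer than cutoff."""
    n_lig = sum(1 for i in simplex if is_ligand[i])
    if n_lig in (0, 4):  # all-receptor or all-ligand: not at the interface
        return None
    if any(dist(coords[i], coords[j]) > cutoff
           for i, j in combinations(simplex, 2)):
        return None
    return "R" * (4 - n_lig) + "L" * n_lig

# Toy coordinates (hypothetical): atoms 0-2 are receptor, 3-5 are ligand.
coords = [(0, 0, 0), (1.5, 0, 0), (0, 1.5, 0),
          (0, 0, 1.5), (9.5, 0, 0), (1.5, 1.5, 1.5)]
is_ligand = [False, False, False, True, True, True]

print(interface_class((0, 1, 2, 3), coords, is_ligand))  # RRRL
print(interface_class((0, 1, 3, 5), coords, is_ligand))  # RRLL
print(interface_class((0, 1, 2, 4), coords, is_ligand))  # None (edge > 8)
```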

5. Dataset Division into Training, Test, and Independent Validation Sets

It is generally accepted that the internal validation of the QSAR models built for the training set is sufficient to establish their predictive power6069. However, our previous studies as well as those conducted by other groups have demonstrated that there exists no correlation between leave-one-out (LOO) cross-validated R2 (q2) for the training set and the correlation coefficient R2 between the predicted and observed activities for the test set44,70. Our group has advocated the importance of the external model validation which requires an independent set of compounds.45,46,71 We have developed a rational approach to dividing the dataset into multiple training and test sets for internal and external validations, respectively45,71,72. As described below, we have extended our validation requirements to require not only test sets, but also a second external test set (an independent validation set) for the additional validation.

The dataset of 264 complexes was divided into three subsets in the beginning of the calculations. The first subset of 24 complexes for independent validation was selected randomly. The remaining 240 complexes were divided into multiple chemically diverse training and test sets with the algorithm based on Sphere Exclusion (SE) developed in our group45. SE is a general procedure that is typically applied to databases of organic molecules characterized by multiple descriptors of their chemical structure such that each compound is represented as a point (or vector) in multidimensional descriptor space. The goal of the SE method is to divide a dataset (i.e., a collection of points in multidimensional chemometric space) into two subsets (training and test set) using diversity sampling procedure as follows. SE starts with the calculation of the distance matrix D between representative points in the descriptor space. Let Dmin and Dmax be the minimum and maximum elements of D, respectively. N probe sphere radii are defined by the following formulas. Rmin=R1=Dmin, Rmax=RN=Dmax/4, Ri=R1+(i−1)*(RN−R1)/(N−1), where i=2,…,N−1. Each probe sphere radius corresponds to one division into the training and test set.
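The probe sphere radii defined above form an evenly spaced grid between Dmin and Dmax/4. A minimal sketch of that formula follows; the function name is illustrative and the input values are made up.

```python
def probe_radii(d_min, d_max, n):
    """Radii R_1..R_N with R_1 = Dmin, R_N = Dmax/4 and
    R_i = R_1 + (i - 1) * (R_N - R_1) / (N - 1)."""
    r_first, r_last = d_min, d_max / 4.0
    if n == 1:
        return [r_first]
    step = (r_last - r_first) / (n - 1)
    return [r_first + i * step for i in range(n)]

print(probe_radii(1.0, 20.0, 5))  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Each radius in the list would give one division of the dataset into a training and a test set.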

In this paper, each receptor-ligand complex was characterized with multiple ENTess descriptors as discussed in the first section under Results below. The entire dataset was then treated as a collection of points (each corresponding to an individual receptor-ligand complex) in the ENTess descriptor space. Thus, the SE algorithm used in this study consisted of the following steps. (i) Select randomly a point in the ENTess descriptor space. (ii) Include it in the training set. (iii) Construct a probe sphere around this point. (iv) Select points from this sphere and include them alternately in the test and training sets. (v) Exclude all points within this sphere from further consideration. (vi) If no points are left, stop. Otherwise let m be the number of probe spheres constructed and n be the number of remaining points. Let dij (i=1,…,m; j=1,…,n) be the distances between the remaining points and the probe sphere centers. Select the point corresponding to the lowest dij value and go to step (ii). The random division was repeated three times and the results are summarized in Table 3. The training sets were used to build models and the test sets were used for validation. The independent validation sets of 24 complexes were used for an additional external validation.
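Steps (i) to (vi) can be sketched for a single probe sphere radius as follows. This is an illustration under stated assumptions, not the authors' SE program: points inside a sphere are visited in order of increasing distance from its center and the first of them goes to the test set, neither of which is specified in the text. All names are hypothetical.

```python
import random
from math import dist

def sphere_exclusion(points, radius, seed=0):
    """Split point indices into (train, test) following steps (i)-(vi)."""
    rng = random.Random(seed)
    remaining = set(range(len(points)))
    centers, train, test = [], [], []
    current = rng.choice(sorted(remaining))        # (i) random starting point
    while True:
        remaining.discard(current)
        train.append(current)                      # (ii) center joins training set
        centers.append(current)                    # (iii) probe sphere around it
        inside = sorted(
            (i for i in remaining
             if dist(points[i], points[current]) <= radius),
            key=lambda i: dist(points[i], points[current]),
        )
        for rank, i in enumerate(inside):          # (iv) alternate test / training
            (test if rank % 2 == 0 else train).append(i)
        remaining.difference_update(inside)        # (v) exclude the sphere
        if not remaining:                          # (vi) stop when nothing is left
            return train, test
        # otherwise continue from the remaining point closest to any center
        current = min(
            remaining,
            key=lambda i: min(dist(points[i], points[c]) for c in centers),
        )

points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,)]
train, test = sphere_exclusion(points, radius=2.5)
print(sorted(train), sorted(test))
```

Larger radii put more points into each sphere and therefore tend to give larger test sets.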

Table 3.

The Randomly Selected 24 Complexes in Three Experiments

Experiment 1 Experiment 2 Experiment 3
188l.pdb 1aaq.pdb 1adf.pdb
1b0h.pdb 1b3l.pdb 1b3f.pdb
1b4h.pdb 1b4z.pdb 1b58.pdb
1b58.pdb 1dbm.pdb 1b5h.pdb
1cim.pdb 1dih.pdb 1cim.pdb
1dbb.pdb 1ebg.pdb 1ebg.pdb
1dbm.pdb 1epo.pdb 1fkf.pdb
1dif.pdb 1hos.pdb 1hte.pdb
1fbc.pdb 1hvj.pdb 1hvl.pdb
1fbf.pdb 1hvr.pdb 1jao.pdb
1hvs.pdb 1mmr.pdb 1phh.pdb
1lgr.pdb 1ppc.pdb 1ppc.pdb
1lyb.pdb 1pph.pdb 1pph.pdb
1mmr.pdb 1qka.pdb 1qka.pdb
1nnb.pdb 1qkb.pdb 1stp.pdb
1nsc.pdb 1rne.pdb 1tmn.pdb
1phg.pdb 1rus.pdb 1tnh.pdb
1tlc.pdb 1sbg.pdb 1tnk.pdb
1tnh.pdb 1stp.pdb 2dri.pdb
2upj.pdb 3fx2.pdb 2sns.pdb
2xim.pdb 3lpr.pdb 3cpa.pdb
5ldh.pdb 4dfr.pdb 4tln.pdb
7dfr.pdb 7abp.pdb 4tmn.pdb
9abp.pdb 7tln.pdb 5ldh.pdb

6. k-Nearest Neighbor (kNN) QSBR with Variable Selection

We have described this approach elsewhere73,74 and present here only a brief overview. kNN QSAR is a stochastic variable selection procedure in which the model optimization is driven by simulated annealing, as illustrated in Figure 3. The kNN procedure is aimed at developing the model with the highest leave-one-out (LOO) cross-validated correlation coefficient R2 (q2) for the training set.

$$q^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2}, \tag{2}$$

where N and ȳ are the number of compounds and the average observed activity of the training set, and yi and ŷi are the observed and predicted activities of the i-th compound.
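Equation 2 translates directly into a few lines of code. The sketch below assumes the LOO predictions have already been obtained; the function name `q2` is illustrative.

```python
def q2(observed, predicted):
    """Cross-validated q2: one minus the ratio of the squared prediction
    errors to the squared deviations of the observations from their mean."""
    mean_obs = sum(observed) / len(observed)
    press = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    ss_total = sum((y - mean_obs) ** 2 for y in observed)
    return 1.0 - press / ss_total

print(q2([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # 1.0 (perfect prediction)
print(q2([1.0, 2.0, 3.0], [2.0, 2.0, 2.0]))  # 0.0 (no better than the mean)
```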

Figure 3.

Flow chart of kNN-QSAR with Variable Selection.

The procedure starts with the random selection of a predefined number of descriptors from all descriptors. Activity of a compound yi excluded in the LOO cross-validation procedure is predicted as the weighted average of activities of its nearest neighbors according to the following formula:

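$$y_i = \frac{\sum_{j=1}^{k} y_j \exp\left(-d_{ij} / \sum_{l=1}^{k} d_{il}\right)}{\sum_{j=1}^{k} \exp\left(-d_{ij} / \sum_{l=1}^{k} d_{il}\right)}, \tag{3}$$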

where dij are distances between the i-th compound and its k nearest neighbors (j=1,…,k). The optimal number of nearest neighbors that yields the highest q2 value is determined as part of the LOO cross-validation process as well. After each run of the LOO procedure, a predefined number of descriptors are randomly changed, and the new value of q2 is calculated. If q2(new) > q2(old), the new set of descriptors is accepted. If q2(new) ≤ q2(old), the new set of descriptors is accepted with probability p = exp((q2(new) − q2(old))/T), and rejected with probability (1−p), where T is a simulated annealing “temperature” parameter. During the process, T is decreased until it reaches a predefined value, at which point the optimization is terminated.
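The distance-weighted prediction and the acceptance rule can be sketched together. Two assumptions are made here: the exponent in the weights carries a minus sign, so that closer neighbors receive larger weights, and the acceptance probability is read as exp((q2(new) − q2(old))/T). The function names are illustrative.

```python
import random
from math import exp

def knn_predict(distances, activities):
    """Weighted average of neighbor activities; weights are
    exp(-d_ij / sum_l d_il) (minus sign assumed)."""
    total = sum(distances)
    if total == 0:  # all neighbors coincide with the query point
        return sum(activities) / len(activities)
    weights = [exp(-d / total) for d in distances]
    return sum(w * y for w, y in zip(weights, activities)) / sum(weights)

def accept(q2_new, q2_old, temperature, rng=random):
    """Simulated annealing acceptance of a new descriptor subset."""
    if q2_new > q2_old:
        return True
    return rng.random() < exp((q2_new - q2_old) / temperature)

print(knn_predict([1.0, 1.0], [2.0, 4.0]))  # equal weights, close to 3.0
print(knn_predict([1.0, 3.0], [2.0, 4.0]))  # pulled toward the nearer neighbor
print(accept(0.60, 0.50, temperature=0.1))  # True: improvement always accepted
```

At high T almost any change is accepted, while at low T only improvements survive, which is what drives the search toward descriptor subsets with high q2.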

7. Y-randomization Test

The robustness of the models was examined by comparing them to those obtained when using randomized binding affinities of the training set (this procedure is commonly referred to as the Y-randomization test). Briefly, we repeated the QSAR calculations with the randomized activities of the training sets. We also compared the q2 values obtained during the iterations of the simulated annealing procedure for the actual and random activities of the training sets to see whether there was any significant difference. This randomization was repeated five times for each splitting.
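The Y-randomization loop can be sketched generically. The model-building routine is passed in as a callable, here named `fit_q2` (a hypothetical stand-in for the full kNN variable selection run, which is not reproduced); only the shuffling and repetition logic is shown.

```python
import random

def y_randomization(fit_q2, descriptors, activities, n_repeats=5, seed=0):
    """Refit the model n_repeats times on shuffled activities and
    return the resulting q2 values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = list(activities)  # copy so the original order is kept
        rng.shuffle(shuffled)
        scores.append(fit_q2(descriptors, shuffled))
    return scores

# Dummy stand-in for model building, used only to show the call pattern.
def dummy_fit(descriptors, activities):
    return 0.0

print(y_randomization(dummy_fit, [[0.1], [0.2], [0.3]], [5.0, 6.0, 7.0]))
```

A robust model is expected to give clearly higher q2 with the real activities than with any of the shuffled ones.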

8. Model Validation and the Applicability Domain

QSAR models were validated using test sets. They were considered as acceptable, if (i) q2>0.5 and R2>0.6; (ii) [R2− R02]/R2<0.1 and 0.85<k<1.15 or [R2− R’02]/R2<0.1 and 0.85<k’ <1.15, and (iii) |R02− R’02|<0.3,45 where R02 and R’02 are the coefficients of determination for regressions through the origin between predicted and observed, and observed and predicted binding energies, respectively, and k and k’ are the corresponding slopes. The whole QSAR model validation procedure, as is illustrated in Figure 4, has been successfully used in our laboratory for many datasets and is described in detail elsewhere7375.
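The three acceptance criteria can be written as a single predicate. The sketch assumes that q2, R2, the two through-origin coefficients of determination and the two slopes have already been computed elsewhere; the function and argument names are illustrative and the numbers in the example calls are made up.

```python
def is_acceptable(q2, r2, r02, r02_prime, k, k_prime):
    """Apply criteria (i)-(iii) to one training/test set model."""
    # (i) internal and external correlation thresholds
    if not (q2 > 0.5 and r2 > 0.6):
        return False
    # (ii) at least one through-origin regression close to the free fit,
    #      with a slope near 1
    direct = (r2 - r02) / r2 < 0.1 and 0.85 < k < 1.15
    reverse = (r2 - r02_prime) / r2 < 0.1 and 0.85 < k_prime < 1.15
    if not (direct or reverse):
        return False
    # (iii) the two through-origin fits agree with each other
    return abs(r02 - r02_prime) < 0.3

print(is_acceptable(0.54, 0.77, 0.77, 0.76, 0.93, 1.05))  # True
print(is_acceptable(0.45, 0.77, 0.77, 0.76, 0.93, 1.05))  # False: q2 too low
```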

Figure 4.

Statistical data modeling and model validation workflow using kNN variable selection approach.

The binding affinity of the test set compounds was predicted only if these compounds were within the applicability domain of the respective training set models. We define this domain45 as a threshold distance in multidimensional descriptor space between a test set compound and its k nearest neighbors in the training set. If the distance is beyond the threshold, the prediction is considered unreliable. This threshold distance is calculated as D2cutoff = <D2nn> + Z*VAR, where <D2nn> is the squared mean distance between each of the training set compounds and its k nearest neighbors, VAR is the variance of Dnn, and Z is a user-defined parameter (the default value is 0.5).
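A possible reading of the threshold formula is sketched below. The interpretation is an assumption: <D2nn> is taken as the mean of the squared nearest-neighbor distances in the training set, VAR as the variance of those distances, and a test compound is treated as inside the domain only if all of its k neighbor distances fall within the cutoff. The function names are hypothetical.

```python
from statistics import mean, pvariance

def ad_cutoff_sq(nn_distances, z=0.5):
    """Squared distance threshold D2cutoff = <D2nn> + Z * VAR."""
    return mean(d * d for d in nn_distances) + z * pvariance(nn_distances)

def in_domain(test_nn_distances, cutoff_sq):
    """True if every neighbor distance of the test compound is within
    the threshold (one possible reading of the criterion)."""
    return all(d * d <= cutoff_sq for d in test_nn_distances)

cutoff = ad_cutoff_sq([1.0, 3.0])     # mean of squares 5.0, variance 1.0
print(cutoff)                         # 5.5
print(in_domain([1.0, 2.0], cutoff))  # True
print(in_domain([1.0, 3.0], cutoff))  # False: 9.0 exceeds 5.5
```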

Training set models that passed our validation criteria (i)–(iii) were used for the prediction of the independent validation set of randomly selected compounds. For this exercise, we relied on the consensus prediction, which consists of averaging the binding affinities of each compound predicted by all acceptable models37.

9. QSBR Model Validation Using Computational Docking Studies

The goal of this component of our studies was to query the QSBR models with respect to their ability to differentiate between native bound conformations of the ligands and their decoys. In addition, we have also questioned whether QSBR models could discriminate known binders from molecules that are known not to bind to the receptors, which is a rigorous test for any docking method. We have randomly selected three complexes from the PDB. They were human dihydrofolate reductase complexed with folate (1DHF)76, orotidine 5'-phosphate decarboxylase complexed with 6-hydroxyuridine 5'-phosphate (BMP) (1DQX)77 and human P38 MAP kinase in complex with BIRB796 (1KV2)78. The 1DHF docking study was performed with FlexX79 as implemented in SYBYL80, while the 1DQX and 1KV2 poses were created using Autodock 3.081. In addition, arabinose was docked into dihydrofolate reductase using FlexX79 and the enzyme coordinates from 1DHF to create an unnatural complex, since it is known that arabinose does not bind to dihydrofolate reductase. We have employed the default docking parameters unless otherwise specified. The ligands were considered flexible and 50 conformations were docked and scored for each ligand.

RESULTS AND DISCUSSION

1. Atom Type Definition and ENTess Descriptor Generation

The nearest neighbor interacting atoms at the protein-ligand interface were defined by means of Delaunay tessellation as described in Methods. Examples of interfacial tetrahedra are shown in Figure 5 for the complex between HIV protease and acetylpepstatin (PDB code 5HVP). Tetrahedra with edges (i.e., interatomic distances) exceeding 8Å were excluded. We have applied this procedure to the 517 protein-ligand complexes in the training set as described in Methods and counted the number of occurrences of each of the 554 atom quadruplet types. If a particular type occurred more than 50 times, we considered this quadruplet type significant. Otherwise, the type was discarded, which reduced the number of independent variables for the subsequent analysis. 132 types of quadruplets were found to occur with sufficiently high frequency (Figure 6). For each tetrahedron of a given composition type, the EN values of the four composing atoms were added up, and the resulting sums for all of the tetrahedra belonging to this composition type were then added up again. The result of these calculations represented the value of the descriptor (i.e., one of the 132 possible descriptors) for the particular protein-ligand complex (see Figure 7 for an illustration).
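The descriptor calculation described above amounts to a grouped sum. The sketch below uses the EN values from Table 2 for C, N, O and S only, since the text gives ranges rather than single values for the X and M types. The label convention (`C_L`, `O_R`) follows the Methods; the function name and the toy tetrahedra are hypothetical.

```python
from collections import defaultdict

# Pauling EN values as listed in Table 2 (X and M omitted: only ranges given).
EN = {"O": 3.4, "N": 3.0, "C": 2.5, "S": 2.4}

def entess_descriptors(tetrahedra):
    """Map each unordered composition type to the sum, over all tetrahedra
    of that type, of the EN values of their four atoms."""
    descriptors = defaultdict(float)
    for tet in tetrahedra:
        key = tuple(sorted(tet))  # atom order in the quadruplet is ignored
        descriptors[key] += sum(EN[label.split("_")[0]] for label in tet)
    return dict(descriptors)

tets = [
    ("C_L", "C_R", "O_R", "N_R"),
    ("N_R", "C_R", "C_L", "O_R"),  # same composition as the first one
    ("O_L", "O_R", "C_R", "C_R"),
]
for composition, value in entess_descriptors(tets).items():
    print(composition, round(value, 2))
```

In this toy case the first composition occurs twice (2 x 11.4 = 22.8) and the second once (11.8), so the complex would be described by two non-zero ENTess values.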

Figure 5.

Full atom-based protein-ligand interface tessellation for 5HVP. The magenta and red ribbons are two chains of the protein. Acetylpepstatin ligand is in the spacefill display. Tetrahedra formed by ligand and protein atoms are shown in yellow.

Figure 6.

Frequency analysis of 554 composition types for the 517 protein-ligand complex dataset. All of the quadruplets on the left of the dashed line were found more than 50 times.

Figure 7.

Calculation of the ENTess descriptors. The same atom type from receptor and ligand is treated differently. In the formulas, m is the m-th quadruplet composition type; n represents the number of occurrences of this composition type in a given protein-ligand complex, and j is the vertex index within the quadruplet.

All 132 descriptors were initially calculated for the dataset of 264 complexes with known binding constants. Because the 264-complex dataset is only a subset of the 517-complex dataset, 32 of the 132 descriptors had zero values for all 264 complexes and were excluded from further consideration. The final data matrix included 100 columns for the variables and 264 rows for the protein-ligand complexes. We have applied variable selection k-nearest neighbor (kNN)74 to this matrix to build models and establish correlations between binding affinities and the ENTess descriptors as described below.

2. Building QSBR Models

In order to build validated QSBR models, we have divided the dataset of 264 receptor-ligand complexes with known binding constants into training, test, and validation subsets multiple times. Three different subsets of the entire dataset were generated initially by removing 24 randomly selected complexes that constituted the independent validation sets. In each case of this initial division, the remaining subset of 240 complexes was divided into multiple training and test sets using the SE program as described in Methods. For every division, 55 training set models were generated and then validated by predicting the binding constants of the test sets. Due to the stochastic nature of the SE algorithm, the number of divisions differed between the chemically diverse samples selected from the original dataset. In the end, 1155 models for 21 divisions of the first sample of 240 complexes, 1045 models for 19 divisions of the second sample, and 2310 models for 42 divisions of the third sample were built and validated using variable selection kNN.
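
The model counts quoted above follow directly from 55 models per division. The short check below simply reproduces that arithmetic; the variable names are illustrative.

```python
MODELS_PER_DIVISION = 55  # stated in the text

# Number of training/test divisions for the three samples of 240 complexes.
divisions = {"sample 1": 21, "sample 2": 19, "sample 3": 42}

# Models built per sample and in total.
models_built = {name: n * MODELS_PER_DIVISION for name, n in divisions.items()}
total_models = sum(models_built.values())
```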

Application of the acceptability criteria discussed in the Methods section resulted in 354, 515 and 567 models for the three samples described above with q2 > 0.50 and R2 > 0.60. To allow a statistically meaningful evaluation of the predictive power of the training set models, our test sets typically included no less than 15% of the dataset. As could be expected given the high diversity of the dataset, q2 and R2 were found to depend on the division of the dataset. For example, we were unable to obtain acceptable training set models for the 173/67 (training/test set complexes) division but were able to generate highly predictive models for the 167/73 division, where the best model had R2 as high as 0.71 (cf. model 28 in Table 4).

Table 4.

The best 10 models for each of the three dataset divisions.

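| Experiment | Model | k | q2 | n | R2 | S2 | Slope | R02 | R452 | R12 | Rcomb2 | Rcons2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 3 | 0.54 | 48 | 0.77 | 1.04 | 0.93 | 0.77 | 0.76 | 0.71 | 0.82 | 0.85 |
| 1 | 2 | 4 | 0.55 | 48 | 0.76 | 1.13 | 0.92 | 0.76 | 0.75 | 0.67 | 0.75 | |
| 1 | 3 | 4 | 0.52 | 48 | 0.76 | 1.05 | 0.93 | 0.76 | 0.75 | 0.76 | 0.79 | |
| 1 | 4 | 2 | 0.57 | 41 | 0.76 | 1.46 | 0.99 | 0.75 | 0.75 | 0.68 | 0.78 | |
| 1 | 5 | 3 | 0.65 | 47 | 0.74 | 1.27 | 0.93 | 0.73 | 0.72 | 0.76 | 0.77 | |
| 1 | 6 | 3 | 0.61 | 44 | 0.74 | 1.25 | 1.01 | 0.74 | 0.73 | 0.69 | 0.73 | |
| 1 | 7 | 3 | 0.56 | 65 | 0.73 | 1.44 | 0.98 | 0.73 | 0.73 | 0.74 | 0.84 | |
| 1 | 8 | 3 | 0.59 | 53 | 0.70 | 1.24 | 0.97 | 0.70 | 0.70 | 0.73 | 0.82 | |
| 1 | 9 | 2 | 0.54 | 65 | 0.70 | 1.56 | 1.00 | 0.70 | 0.69 | 0.74 | 0.81 | |
| 1 | 10 | 3 | 0.60 | 44 | 0.70 | 1.44 | 0.98 | 0.70 | 0.70 | 0.66 | 0.77 | |
| 2 | 11 | 3 | 0.65 | 40 | 0.83 | 0.89 | 0.95 | 0.83 | 0.83 | 0.77 | 0.74 | 0.77 |
| 2 | 12 | 3 | 0.66 | 40 | 0.83 | 0.92 | 0.97 | 0.83 | 0.83 | 0.57 | 0.55 | |
| 2 | 13 | 3 | 0.66 | 41 | 0.82 | 0.99 | 0.95 | 0.82 | 0.82 | 0.68 | 0.72 | |
| 2 | 14 | 3 | 0.63 | 47 | 0.81 | 0.92 | 0.97 | 0.80 | 0.80 | 0.6 | 0.64 | |
| 2 | 15 | 2 | 0.58 | 51 | 0.80 | 0.96 | 1.02 | 0.80 | 0.78 | 0.64 | 0.72 | |
| 2 | 16 | 3 | 0.63 | 51 | 0.83 | 0.82 | 1.05 | 0.82 | 0.78 | 0.61 | 0.62 | |
| 2 | 17 | 3 | 0.60 | 47 | 0.80 | 0.95 | 0.98 | 0.79 | 0.79 | 0.72 | 0.77 | |
| 2 | 18 | 3 | 0.63 | 47 | 0.80 | 0.95 | 0.97 | 0.79 | 0.79 | 0.58 | 0.64 | |
| 2 | 19 | 3 | 0.57 | 44 | 0.76 | 1.19 | 0.98 | 0.76 | 0.76 | 0.8 | 0.83 | |
| 2 | 20 | 2 | 0.64 | 50 | 0.78 | 0.93 | 1.00 | 0.77 | 0.76 | 0.62 | 0.77 | |
| 3 | 21 | 3 | 0.55 | 49 | 0.78 | 0.99 | 0.97 | 0.78 | 0.78 | 0.77 | 0.81 | 0.81 |
| 3 | 22 | 2 | 0.52 | 49 | 0.77 | 1.20 | 0.97 | 0.76 | 0.76 | 0.73 | 0.74 | |
| 3 | 23 | 3 | 0.52 | 49 | 0.75 | 1.15 | 0.98 | 0.75 | 0.75 | 0.61 | 0.74 | |
| 3 | 24 | 5 | 0.51 | 49 | 0.75 | 0.90 | 0.94 | 0.72 | 0.72 | 0.65 | 0.69 | |
| 3 | 25 | 5 | 0.52 | 49 | 0.74 | 0.99 | 0.98 | 0.72 | 0.72 | 0.63 | 0.65 | |
| 3 | 26 | 4 | 0.52 | 49 | 0.74 | 1.08 | 0.99 | 0.73 | 0.72 | 0.78 | 0.83 | |
| 3 | 27 | 3 | 0.55 | 45 | 0.70 | 1.14 | 0.94 | 0.70 | 0.70 | 0.8 | 0.83 | |
| 3 | 28 | 4 | 0.53 | 73 | 0.71 | 1.24 | 0.91 | 0.68 | 0.67 | 0.65 | 0.84 | |
| 3 | 29 | 3 | 0.55 | 73 | 0.68 | 1.44 | 0.92 | 0.68 | 0.67 | 0.72 | 0.74 | |
| 3 | 30 | 2 | 0.53 | 118 | 0.63 | 1.69 | 0.91 | 0.57 | 0.54 | 0.73 | 0.74 | |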

Note: k – number of the nearest neighbors; q2 – cross validated correlation coefficient for training sets; n – number of complexes in the test sets which are within the applicability domain; R2 – correlation coefficient for test sets; S2 – square of standard deviation between predicted and actual pKi; Slope – slope of the regression through the origin. R02 – correlation coefficient for test sets for the regression through the origin; R452 – correlation coefficient for test sets for the line which has slope 45 degrees; R12 – correlation coefficient for the external set; Rcomb2 – correlation coefficient for the external set by using the combination of training and test sets for predictions; Rcons2 – correlation coefficient for the external set by consensus prediction with top 10 best models.
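
For readers who want to reproduce statistics of the kind listed in the note above, the following sketch computes R2, the slope of the regression through the origin, and R02 for a set of observed and predicted values. The formulas are the commonly used ones (squared Pearson correlation, and a least-squares fit of observed on predicted values forced through the origin); whether they coincide exactly with the definitions given in Methods is an assumption, and the data are invented for illustration.

```python
def regression_statistics(observed, predicted):
    """Return (R2, slope through the origin, R02) for observed vs. predicted."""
    n = len(observed)
    mean_obs = sum(observed) / n
    mean_pred = sum(predicted) / n
    s_oo = sum((o - mean_obs) ** 2 for o in observed)
    s_pp = sum((p - mean_pred) ** 2 for p in predicted)
    s_op = sum((o - mean_obs) * (p - mean_pred) for o, p in zip(observed, predicted))
    r2 = s_op ** 2 / (s_oo * s_pp)  # squared Pearson correlation
    # Least-squares slope of observed = slope * predicted (no intercept).
    slope = sum(o * p for o, p in zip(observed, predicted)) / sum(p * p for p in predicted)
    residual = sum((o - slope * p) ** 2 for o, p in zip(observed, predicted))
    r02 = 1.0 - residual / s_oo
    return r2, slope, r02

# Invented pKi values, for illustration only.
observed = [4.2, 5.1, 6.3, 7.0, 8.4]
predicted = [4.8, 5.0, 5.9, 7.4, 8.0]
r2, slope, r02 = regression_statistics(observed, predicted)
```

With these definitions R02 can never exceed R2, so a small gap between the two, together with a slope close to 1, indicates that predictions fall near the ideal line through the origin.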

These results could be explained as follows. As a result of the division, some complexes that are potential outliers are included in the test set, which reduces the R2. Conversely, if these structures are included in the training set, the test set R2 could be much higher than the training set q2. With the criteria described above, an acceptable model was obtained with a test set as large as 118 complexes, i.e., almost half of the entire dataset, with q2 = 0.53 and R2 = 0.63 (cf. model 30 in Table 4).

3. Prediction of the Independent Validation Sets

It should be noted that the studies described above rely on the test sets to select the acceptable training set models. Strictly speaking, therefore, the above procedure cannot be regarded as truly external validation. In contrast, successful prediction of the randomly selected independent validation set of 24 compounds can be viewed as a realistic test of the models' predictive power. We now discuss the results of this test under different prediction scenarios.

3.1. Prediction with the Best Individual Models

Table 4 presents 10 best models for each experiment. Model 11 tops the list with R2 as high as 0.83 and q2 of 0.65. Figure 8 shows the data fitting of experimental and predicted binding affinities for training and test sets. This model was built with 45 descriptors resulting from variable selection procedures and three nearest neighbors appeared to be optimal in the leave-one-out (LOO) cross-validation.
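
To make the roles of k, the selected descriptors, and the LOO q2 concrete, here is a minimal sketch of a kNN prediction and its leave-one-out cross-validation. It is a simplification, not the authors' program: the prediction is an unweighted mean over the k nearest neighbors, the descriptor subset is passed in by hand rather than found by simulated annealing, and the data are a toy set in which descriptor 0 tracks the activity and descriptor 1 is noise.

```python
import math

def knn_predict(query, train_x, train_y, k, selected):
    """Mean activity of the k nearest training complexes, with Euclidean
    distance measured only on the selected descriptor indices."""
    def dist(a, b):
        return math.sqrt(sum((a[i] - b[i]) ** 2 for i in selected))
    nearest = sorted(range(len(train_x)), key=lambda j: dist(query, train_x[j]))[:k]
    return sum(train_y[j] for j in nearest) / k

def loo_q2(x, y, k, selected):
    """Leave-one-out q2 = 1 - PRESS / total sum of squares."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        press += (y[i] - knn_predict(x[i], rest_x, rest_y, k, selected)) ** 2
    return 1.0 - press / sum((v - mean_y) ** 2 for v in y)

# Toy data: descriptor 0 is informative, descriptor 1 is not.
x = [[1, 9], [2, 1], [3, 8], [4, 2], [5, 7], [6, 3]]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
q2_informative = loo_q2(x, y, k=1, selected=[0])
q2_noise = loo_q2(x, y, k=1, selected=[1])
```

Variable selection amounts to searching for the descriptor subset and k that maximize this q2; in the toy case the informative descriptor gives a clearly higher value than the noisy one.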

Figure 8.

Predictive power of the best model (Model 11, cf. Table 4).

Grey open triangles: prediction for the 200 complexes of the training set (q2 = 0.65). Black points: prediction for the 40 complexes of the test set (R2 = 0.83, RMSD = 1.06).

Figure 9 shows the trajectory of the SA-driven optimization of the q2 in developing the best kNN models and Figure 10 shows the relationship between the number of the descriptors and the q2 for the training set with real vs. randomized binding energies. The latter figure demonstrates that the models built using true binding affinities for the training set afford significantly higher q2 values as compared to the models generated with the randomized binding energies.

Figure 9.

Trajectories for q2 of the best model [Model 11] (solid black) and the model with the lowest q2 (dashed grey). Trajectory of the model with the highest q2 (shadowed grey) built with randomized binding energies of the training set.

Figure 10.

q2 vs. the number of variables selected for kNN QSAR models. The results are for both actual (black) and random (grey) datasets. Every q2 is the average of 10 independent calculations.

To further validate the models, we made predictions for the independent validation set of 24 randomly selected complexes in three independent experiments (Table 3). For each individual model, we obtained a fairly good correlation between the actual and predicted binding affinity (Table 4): with the exception of Models 12 and 18, where R2 fell below 0.60, all models had R2 ranging from 0.60 to 0.80.

3.2. Predictions Using the Combined Training and Test Sets

All predictions described in the previous section were made using training sets only. Since the dataset of 240 complexes was divided into the training and test sets rationally, and the test set predictions were used to select acceptable models, it is logical to employ the (re)combined set for the prediction of the independent validation set. Thus, all 240 compounds of the recombined dataset were used for the binding affinity prediction of the independent validation set. We used the descriptors selected and the optimal number of nearest neighbors obtained by the kNN training set modeling. Perez et al.37 have reported previously that a similar approach improves the prediction accuracy. Following this approach, we made predictions for the 24 complexes with the 10 best models for each experiment (cf. Table 3), and the results were significantly better than those obtained using only the training set compounds. In addition to R2, the root mean squared deviation (RMSD) between predicted and observed binding affinities is also used to measure the accuracy of the prediction. It is defined as in the literature15,48:

$$\mathrm{RMSD} = \sqrt{\frac{\sum \left(pK_i^{pred} - pK_i^{obs}\right)^2}{N - 1}} \qquad (4)$$

where $pK_i^{pred}$ and $pK_i^{obs}$ are the predicted and observed logarithmic binding affinities, respectively, and N is the number of complexes. The Gibbs free energy of binding ΔG is related to the binding constant by:

$$\Delta G = -RT \ln K \qquad (5)$$

For instance, for the predictions made with Model 7, R2 increased from 0.74 to 0.84 and RMSD decreased from 0.97 (5.5 kJ/mol) to 0.90 (5.1 kJ/mol) (cf. Table 4 and Figure 11). Since we only use training set models that have both high internal and high external predictive power, every compound in the combined set has nearest neighbors in the selected descriptor space with approximately the same binding affinity. Combining the training and test sets enriches the structural diversity of the dataset used for prediction, so that there is a greater chance that every external compound finds close nearest neighbors. Furthermore, because we are using the applicability domain threshold, the nearest neighbor relationships translate into similar binding affinities, leading to high values of the external R2.
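
The conversion between RMSD in logarithmic units and kJ/mol quoted above can be checked with a short calculation. The sketch assumes a temperature of 298.15 K, which is not stated in the text, and reads Eq. 4 with N - 1 in the denominator; one log unit then corresponds to 2.303RT, or about 5.7 kJ/mol.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)
T = 298.15             # assumed temperature in K (not stated in the text)

def rmsd(predicted, observed):
    """Root mean squared deviation with N - 1 in the denominator (Eq. 4)."""
    n = len(predicted)
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / (n - 1))

def log_units_to_kj(delta_pk, temperature=T):
    """Free energy equivalent (kJ/mol) of a difference of delta_pk log units."""
    return delta_pk * math.log(10) * R_KJ * temperature

kj_training_only = log_units_to_kj(0.97)  # Model 7, training set only
kj_combined = log_units_to_kj(0.90)       # Model 7, combined set
```

Rounded to one decimal place, the two values reproduce the 5.5 and 5.1 kJ/mol given in the text.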

Figure 11.

Prediction of binding affinities for the external validation test set (24 complexes) with different approaches. (cf. Table 4). Asterisks: prediction with Model 7, Table 4. R2 = 0.74 and RMSD = 0.97; Black open triangles: prediction with Model 7 using the whole dataset of 240 complexes to select k nearest neighbors for compounds in the independent test set, R2 = 0.84 and RMSD = 0.90; Grey points: consensus prediction by the top 10 best models using the whole dataset of 240 complexes as the training set. R2 = 0.85 and RMSD = 0.98.

3.3. Prediction with the Consensus Method

With the consensus approach, the binding affinity of each of the 24 complexes in the independent validation set was predicted as the average of the values predicted for that complex by the individual models. The results, as shown in Table 4, demonstrate that the consensus prediction is relatively stable, with R2 of 0.85, 0.77 and 0.81 for the three experiments, respectively. Figure 11 shows that the consensus approach predicts more data with a higher correlation coefficient than any single model. Notably, as shown in Table 4, Model 12 has a good q2 (0.66) and a very high R2 (0.83), but the R2 for the prediction of the 24 external complexes is below 0.60. This indicates that even very high q2 and R2 values do not guarantee that the external predictive power of an individual model is acceptable. In contrast, the consensus prediction usually yields acceptable predictive power. This result is consistent with our previous observations44.
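
The consensus step is a plain average over models, complex by complex. A minimal sketch, with a hypothetical function name and invented predictions for three complexes, is:

```python
def consensus_prediction(per_model_predictions):
    """Average, complex by complex, the predictions of several models.

    `per_model_predictions` maps a model name to a list of predicted values,
    one per complex, all given in the same complex order.
    """
    rows = list(per_model_predictions.values())
    return [sum(column) / len(column) for column in zip(*rows)]

# Invented predictions of three models for three complexes.
toy = {
    "model A": [6.0, 7.5, 4.0],
    "model B": [7.0, 7.0, 5.0],
    "model C": [6.5, 8.0, 4.5],
}
consensus = consensus_prediction(toy)
```

Averaging damps the error of any single poorly performing model, which is consistent with the observation that Model 12 alone predicts the external set poorly while the consensus remains acceptable.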

Table 5.

Comparison of predictive power of ENTess models vs. that obtained with alternative scoring functions

| Methods | References | Training Set Size | Test Set Size | R2 for Test Sets | Consensus R2 for the external set |
|---|---|---|---|---|---|
| BLEEP | Ref. 22 | 351 | 90 | 0.53 | N/A |
| PMF | Ref. 21 | 697 | 77 | 0.61 | N/A |
| SMoG96 | Ref. 19 | 120 | 46 | 0.42 | N/A |
| SMoG2001 | Ref. 20 | 725 | 111 | 0.436 | N/A |
| DT2002 | Feng, J. Unpublished | 319 | 67 | 0.71 | N/A |
| SCORE | Ref. 49 | 170 | 11 | 0.65 | N/A |
| XSCORE | Ref. 50 | 200 | 30 | 0.36 | N/A |
| LUDI | Ref. 15 | 82 | 12 | 0.45 | N/A |
| VALIDATE | Ref. 16 | 51 | 14 | 0.81 | N/A |
| ChemScore | Ref. 17 | 82 | 20 | 0.63 | N/A |
| ENTess1 | | 189 ~ 200 | 40 ~ 51 | 0.76 ~ 0.83 | 0.77 |
| ENTess2 | | 199 ~ 175 | 41 ~ 65 | 0.70 ~ 0.77 | 0.85 |
| ENTess3 | | 122 ~ 195 | 45 ~ 118 | 0.63 ~ 0.78 | 0.81 |

4. Analysis of Outliers

For each complex, if the difference between the predicted and experimental binding affinities was greater than three logarithmic units (i.e., pKd), we regarded the complex as an outlier. Based on this definition, we have observed several outliers in different experiments: 1STP82 in experiment 1, 1PHG83 in experiment 2, and 1STP and 7TLN84 in experiment 3. 1STP is a very interesting complex which was observed as an outlier by several groups working in the area of scoring function development17,21,79. The 1STP complex is unique, and our predictions with different models underestimated the observed binding affinity by 4 to 7 pKd units. The biotin–streptavidin complex has the highest known binding constant82 and it is the only member of the SAV family (Table 1). Consequently, there are no analogs of this complex in the training set. More importantly, Muegge and Martin21 pointed out that streptavidin functions as a tetramer; we only have monomeric complex crystal structures available, whereas the interaction with a second subunit increases the binding of biotin by eight orders of magnitude.

1PHG83 was predicted to have a binding affinity ca. three pKd units lower than the experimental value (for instance, Model 7 predicts a pKd value of 5.52 for this complex, while the observed binding affinity is 8.66). It is cytochrome P450cam (camphor 5-monooxygenase) complexed with metyrapone, and it contains the heme group as a cofactor. The crystal structure indicates that there is some interaction between the ligand and the heme group which is not taken into account by our scoring function.
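
The outlier criterion is a simple threshold on the absolute prediction error. Applied to the 1PHG values quoted above (the function name is illustrative):

```python
OUTLIER_THRESHOLD = 3.0  # logarithmic units, as stated in the text

def is_outlier(predicted_pkd, observed_pkd, threshold=OUTLIER_THRESHOLD):
    """True if the prediction error exceeds the threshold in either direction."""
    return abs(predicted_pkd - observed_pkd) > threshold

# 1PHG with Model 7: predicted 5.52, observed 8.66.
deviation_1phg = 8.66 - 5.52
flag_1phg = is_outlier(5.52, 8.66)
```

The deviation of about 3.1 log units places 1PHG just above the threshold.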

7TLN84 is a metalloproteinase covalently bound to its ligand INC (CH2CO(N-OH)Leu-OCH3). In addition, there are four Ca2+ ions and one Zn2+ ion in the complex. In this case, the concurrent binding of these ions could affect the prediction of the binding affinity, as was observed with 1LYB85. There are too few metal-containing complexes in our training dataset, and our approach may not accurately describe interactions mediated by metal ions.

In addition to the outliers, several complexes were found to be out of the applicability domain in our experiments. This means they are too different from their respective training set complexes in the 100 descriptor space. As described above, most of them have metal ions which may induce large conformational changes upon ligand binding. For example, 1EBG86 and 4TMN87 are metal complexes with four magnesium ions and four calcium ions respectively. Although we have descriptors for quadruplets that contain metal atoms, the representation of the interaction interface is probably insufficient to characterize their metal-mediated large conformational change upon ligand binding. In addition, ligands in these two complexes contain PO3 and PO2 groups, respectively, which are not frequent in the entire dataset. Another example is 1FKF,88 which is an immunophilin-immunosuppressant complex in which the protein conformation changes insignificantly upon the ascomocin (FK506) binding, but interestingly, the ligand FK506 undergoes a very large conformational change when it binds. FK506 is an antibiotic with a very large molecular weight (804 Daltons). The drug's association with the protein involves five hydrogen bonds; the protein hydrophobic binding pocket is lined with conserved aromatic residues, and contains an unusual carbonyl binding pocket88. We suppose that the training set model is incapable of describing these unique interactions accurately. However, despite the small number of outliers, we suggest that the ENTess descriptors as applied in kNN QSBR calculations in general led to highly predictive models.

5. Robustness of the Models

As described in Methods, to evaluate the model robustness, we have performed the Y-randomization test. As shown in Figure 9 and Figure 10, q2 values for models built with real activities of the training set were always much higher than for those built with randomized activities. In order to exclude the possibility of chance correlations and overfitting, the Y-randomization test was repeated five times for each splitting. The highest q2 for the random datasets was 0.14, while the lowest q2 for the real datasets was 0.51. In general, if the relationships between binding affinities and descriptors are not random, the models built with randomized affinities of the training set complexes should have no predictive ability. Indeed, no predictive model built with randomized training set data was found.
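
The logic of the Y-randomization test can be sketched as follows. The helper `y_randomization` and the scoring function are illustrative assumptions: any model-quality score could be plugged in (in the paper it would be the LOO q2 of a kNN model), and here a squared correlation between a single descriptor and the activity stands in for it.

```python
import random

def squared_correlation(x, y):
    """Squared Pearson correlation, used here as a stand-in quality score."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy ** 2 / (sxx * syy)

def y_randomization(x, y, score_fn, n_repeats=5, seed=0):
    """Score the real data, then score it again after shuffling the
    activities among the complexes n_repeats times."""
    rng = random.Random(seed)
    real = score_fn(x, y)
    randomized = []
    for _ in range(n_repeats):
        shuffled = list(y)
        rng.shuffle(shuffled)
        randomized.append(score_fn(x, shuffled))
    return real, randomized

# Toy data with a perfect linear relationship between descriptor and activity.
x = [float(v) for v in range(1, 11)]
y = [2.0 * v + 1.0 for v in x]
real_score, random_scores = y_randomization(x, y, squared_correlation)
```

A model is considered robust when the score on the real activities is clearly separated from the scores obtained on every shuffled copy, as with the 0.51 versus 0.14 bounds reported above.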

6. Comparison with Other Scoring Functions

Our results were compared with those obtained earlier using both knowledge-based and empirical scoring functions, as shown in Table 5. Since different groups have not used standard training and test sets, a direct comparison is not possible. Compared to SMoG96,19 our training sets were a little bigger, but our prediction accuracy was much better, even for a much bigger test set (118 complexes). As compared to other published results, we had test sets of comparable size and much smaller training sets, but nevertheless our correlation coefficients are much higher. Importantly, we have demonstrated that our method afforded high predictive power for an external structurally diverse dataset. The alternative empirical scoring functions demonstrated comparable results with relatively smaller training sets (except SCORE and XSCORE48,49), but their test sets are also small, which strongly influences the value of R2. In summary, our models were rigorously validated using test sets, using the additional external prediction set of 24 compounds to simulate the real application of the models, and by performing the Y-randomization test. The results demonstrate the high predictive power of our models and the applicability of our novel geometrical chemical descriptors to binding affinity prediction.

7. Validation Using Docking Studies

For each docking case, the resulting poses were grouped into different bins based on their RMSD against the crystal structure (for 1DQX, 1DHF and 1KV2) or the lowest energy binding conformation (for the unnatural arabinose-DHFR complex); the bin width was 0.5 Å. The poses with RMSD above 8 Å were not considered. This process led to six non-empty bins for both 1DHF (actual pKd = 7.4)76 and 1KV2 (actual pKd = 10.0)78, and four non-empty bins for both 1DQX (actual pKd = 11.05)89 and the DHFR-arabinose unnatural complex. The poses with the lowest estimated binding free energy were selected as representatives of each bin. Thus, we have obtained six poses for 1DHF and 1KV2 and four poses for the 1DQX and DHFR-arabinose complexes.
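
The binning and selection of representative poses described above could be implemented along the following lines. This is a sketch under stated assumptions: the function name, the tuple layout of a pose, and the example poses and energies are all hypothetical.

```python
def representative_poses(poses, bin_width=0.5, max_rmsd=8.0):
    """Group poses into RMSD bins and keep the lowest-energy pose per bin.

    `poses` is a list of (name, rmsd_in_angstrom, estimated_binding_energy).
    Returns a dict mapping bin index to the representative pose.
    """
    best = {}
    for pose in poses:
        name, rmsd, energy = pose
        if rmsd > max_rmsd:
            continue  # poses with RMSD above the cutoff are not considered
        index = int(rmsd // bin_width)
        if index not in best or energy < best[index][2]:
            best[index] = pose
    return best

# Hypothetical poses: (name, RMSD in Å, estimated binding energy).
toy_poses = [
    ("pose_1", 1.64, -30.0),
    ("pose_2", 1.70, -28.0),  # same 1.5-2.0 Å bin as pose_1, higher energy
    ("pose_3", 1.12, -25.0),
    ("pose_4", 9.30, -40.0),  # beyond the 8 Å cutoff
]
representatives = representative_poses(toy_poses)
```

Only non-empty bins appear in the result, which is why the number of representative poses differs between complexes.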

The pKd values resulting from consensus prediction using the best 30 ENTess models were used to rank the aforementioned poses, and the results are shown in Table 6. These results demonstrate that, in all cases, ENTess predictions could clearly differentiate the native crystallographic bound conformation from the other decoy poses. For instance, our results for 1DHF are consistent with FlexX79 for the top ranked poses: the ENTess top 1 and 2 poses were ranked 1 and 4 by FlexX79, with RMSD of 1.64 Å and 1.12 Å, respectively. Both of them belong to the same binding conformation and orientation mode. All of the poses ranked low by FlexX were also ranked low by ENTess. The low binding affinity (ca. 1 mM) predicted by ENTess corresponded to poses with weak binding to the DHFR receptor. Similarly, ENTess estimations were accurate for 1DQX and 1KV2: based on ENTess predictions, all ligand conformations with low RMSD are strong binders, while the low ranked poses are decoys. Most interestingly, arabinose was successfully docked into the DHFR binding pocket using FlexX79, although we knew that the binding did not happen at all. This is probably a problem of many if not all existing docking programs. In contrast, ENTess suggests that all of the docked poses have very low binding affinity (lower than 1 mM). This observation suggests that binding affinity estimates using ENTess for poses generated with available docking programs can be used to eliminate false positives.

Table 6.

Binding affinity prediction and the ranking of docked poses based on their predicted pKd.

| Docking Poses | Predicted pKd By ENTess | RMSD (Å) | Ranking | Docking Poses | Predicted pKd By ENTess | RMSD (Å) | Ranking |
|---|---|---|---|---|---|---|---|
| 1dqx.pdb | 10.694 | 0 | Native | abp_1dhf_1.pdb | 2.687 | 0 | Lowest Energy |
| 1dqx_2.pdb | 7.696 | 2.06 | 1 | abp_1dhf_22.pdb | 2.687 | 0.70 | 1 |
| 1dqx_6.pdb | 7.685 | 1.93 | 2 | abp_1dhf_13.pdb | 2.686 | 1.56 | 2 |
| 1dqx_1.pdb | 4.813 | 3.32 | 3 | abp_1dhf_41.pdb | 2.685 | 4.37 | 3 |
| 1dqx_47.pdb | 3.786 | 6.17 | 4 | abp_1dhf_3.pdb | 2.668 | 6.28 | 4 |
| 1dhf.pdb | 7.760 | 0 | Native | 1kv2.pdb | 8.702 | 0 | Native |
| 1dhf_1.pdb | 6.678 | 1.64 | 1 | 1kv2_5.pdb | 5.698 | 1.53 | 1 |
| 1dhf_4.pdb | 5.246 | 1.12 | 2 | 1kv2_21.pdb | 4.863 | 1.21 | 2 |
| 1dhf_49.pdb | 4.111 | 2.18 | 3 | 1kv2_46.pdb | 3.741 | 7.61 | 3 |
| 1dhf_31.pdb | 4.110 | 2.97 | 4 | 1kv2_34.pdb | 3.702 | 4.55 | 4 |
| 1dhf_26.pdb | 3.839 | 7.78 | 5 | 1kv2_13.pdb | 3.699 | 2.73 | 5 |
| 1dhf_8.pdb | 3.637 | 6.31 | 6 | 1kv2_40.pdb | 3.613 | 6.04 | 6 |

Note: the numbers after the pdb codes are the rankings in the original docking methods.
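
As a reading aid for Table 6, the sketch below ranks the six 1DHF decoy poses by their predicted pKd and converts a pKd into a dissociation constant, using the standard relation pKd = -log10(Kd). The pKd values are copied from the table; the function and variable names are illustrative.

```python
def kd_molar(pkd):
    """Dissociation constant in mol/L for a given pKd = -log10(Kd)."""
    return 10.0 ** (-pkd)

# Predicted pKd values of the six 1DHF decoy poses (Table 6).
predicted_pkd = {
    "1dhf_1": 6.678,
    "1dhf_4": 5.246,
    "1dhf_49": 4.111,
    "1dhf_31": 4.110,
    "1dhf_26": 3.839,
    "1dhf_8": 3.637,
}

# Rank poses from strongest to weakest predicted binding.
ranking = sorted(predicted_pkd, key=predicted_pkd.get, reverse=True)

# Highest predicted pKd among the arabinose poses (2.687).
arabinose_kd = kd_molar(2.687)
```

A pKd of 3 corresponds to 1 mM, so the arabinose poses, all with predicted pKd below 2.7, are predicted to bind more weakly than 1 mM.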

8. Chemical Properties of Descriptors Implicated in Significant QSBR Models

QSBR models generated with the variable selection kNN method can be characterized not only by their statistical parameters but also in terms of the ENTess descriptors with which the best models are built. To this end, we have calculated the frequency of occurrence of the selected descriptors found in the 30 best models used for the prediction of external test sets. Table 7 shows the most frequently occurring descriptor types. They demonstrate that frequent quadruplet compositions of atom types include purely hydrophobic (such as four carbon atom tetrahedra), hydrophilic (such as four oxygens or nitrogens or mixed polar atom type quadruplet compositions) as well as tetrahedra with mixed polar and non-polar atom composition (e.g., including two carbon and two oxygen or nitrogen atoms). These results indicate that variable selection kNN models tend to rely on chemically diverse descriptor types that capture major intermolecular binding interactions such as the hydrophobic effect and hydrogen bonds.

Table 7.

The occurrence of the 100 tetrahedra types in the best 30 QSBR models.

Descriptor Types Occurrence Descriptor Types Occurrence Descriptor Types Occurrence
CL-CL-CL-NR 27 CL-NR-NR-OR 16 CL-NL-OL-NR 12
CL-OR-OR-OR 24 CL-CL-CL-OR 15 CL-OL-OL-NR 12
CL-CL-NL-NR 22 CL-CL-OL-OR 15 CL-CL-OR-OR 12
CL-NL-OL-OR 22 CL-OL-OL-CR 15 OL-OL-NR-NR 12
CL-CL-NR-NR 22 CL-OL-OL-OR 15 CL-SR-CR-CR 12
CL-NL-CR-CR 22 NL-NL-OL-CR 15 NL-NR-OR-OR 12
OL-OL-CR-OR 22 OL-OL-OL-NR 15 XL-OL-OL-OR 11
OL-OL-OR-OR 22 CL-NL-CR-NR 15 CL-CL-NL-OR 11
NL-SR-CR-OR 22 CL-OL-CR-CR 15 NL-OL-OL-NR 11
NL-NL-CR-CR 21 NL-NL-NR-OR 15 OL-OL-OL-CR 11
XL-CR-CR-CR 21 NL-OL-CR-OR 15 SL-OL-CR-NR 11
CL-NL-NL-NR 20 OL-OL-CR-CR 15 CL-CL-SR-CR 11
XL-CR-CR-OR 20 OL-OL-NR-OR 15 CL-CL-CR-OR 11
CL-SR-CR-OR 20 CL-CR-NR-OR 15 CL-NL-NR-OR 11
OL-OL-OL-OR 19 XL-OL-OL-NR 14 CL-OL-OR-OR 11
CL-OL-NR-NR 18 NL-OL-OL-OR 14 NL-OL-OR-OR 11
NL-NL-OR-OR 18 SL-CL-CR-NR 14 CL-CR-CR-OR 11
NL-OL-CR-CR 18 CL-CL-CR-NR 14 CL-CL-OL-NR 10
XL-CR-NR-OR 18 CL-OL-SR-CR 14 SL-OL-CR-CR 10
SL-CR-CR-OR 18 CL-CR-CR-NR 14 CL-OL-CR-NR 10
CL-CL-NR-OR 17 CL-NR-OR-OR 14 CL-OL-CR-OR 10
CL-NL-CR-OR 17 NL-CR-NR-OR 14 NL-CR-OR-OR 10
CL-NL-OR-OR 17 NL-NR-NR-OR 14 CL-CL-CL-SR 9
SL-CR-CR-NR 17 XL-OL-OL-CR 13 CL-NL-OL-CR 9
CL-SR-CR-NR 17 CL-NL-NL-CR 13 NL-CR-NR-NR 9
NL-CR-CR-CR 17 CL-NL-NL-OR 13 CL-CR-CR-CR 8
NL-CR-CR-NR 17 CL-NL-NR-NR 13 NL-CR-CR-OR 8
SL-CL-CL-CR 16 CL-OL-NR-OR 13 CL-CL-CL-CR 7
SL-CL-OL-CR 16 NL-OL-NR-OR 13 CL-CL-OL-CR 7
NL-OL-OL-CR 16 OL-OL-CR-NR 13 SL-CL-CR-OR 7
CL-CL-CR-CR 16 SL-CR-CR-CR 13 SL-CL-CR-CR 6
NL-NL-CR-OR 16 CL-CR-NR-NR 13 NL-ML-CR-NR 6
NL-OL-CR-NR 16 CL-CR-OR-OR 13
XL-CR-CR-NR 16 CL-CL-NL-CR 12
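
Occurrence counts of the kind listed in Table 7 can be tallied by counting, for each descriptor type, the number of models whose selected descriptor subset contains it. The sketch below uses three invented models with descriptor names borrowed from the table; the function name is an assumption.

```python
from collections import Counter

def descriptor_occurrence(models):
    """Count in how many models each descriptor was selected.

    `models` is an iterable of collections of descriptor names, one
    collection per model.
    """
    counts = Counter()
    for selected in models:
        counts.update(set(selected))  # count each descriptor once per model
    return counts

# Invented selections for three models.
toy_models = [
    {"CL-CL-CL-NR", "CL-OR-OR-OR", "OL-OL-OR-OR"},
    {"CL-CL-CL-NR", "OL-OL-OR-OR"},
    {"CL-CL-CL-NR", "NL-NL-CR-CR"},
]
occurrence = descriptor_occurrence(toy_models)
most_frequent = occurrence.most_common(1)[0]
```

With 30 models, the maximum possible count for any descriptor is 30, which puts the top entry of Table 7 (27 occurrences) in context.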

9. The Importance of Electronegativity for ENTess Descriptors

ENTess descriptors are very simple; since their values are approximately proportional to the number of quadruplets with certain compositions it may appear that significant models could be generated without taking into account the electronegativity values at all. In order to address the importance of EN, we have repeated all calculations described above but using only the numbers of occurrence of different tetrahedra as descriptors. Interestingly, the statistical parameters for training and test set models were comparable with those using the ENTess descriptors, with q2 ranging from 0.5 to 0.7 and R2 from 0.6 to 0.8 (data not shown). However, the predictions of the external validation set with these models were much less accurate than using the ENTess descriptors (the consensus prediction R2 values were always below 0.5). Furthermore, the acceptable training set models, on average, constituted only about 15% of all of the models built, which is far fewer than the 40% obtained when using the ENTess descriptors.

In a separate experiment, we used atomic weights as the property to generate descriptors in place of EN. Similarly, the q2 and R2 for training/test set models, respectively, were comparable with those generated with the ENTess descriptors. However, although the prediction of the external validation set gave better results than using the occurrence numbers, the models were not as robust and stable as those built using EN values (the best R2 value for consensus prediction was 0.63 for only one of the three external validation sets and much lower for the other two validation sets, data not shown). We reason that using electronegativity to calculate the ENTess descriptors affords better models since EN implicitly incorporates major atomic properties that are important in intermolecular interactions, such as polarity, energy, and the ability to form hydrogen bonds. Including other atomic parameters could further improve our method as we continue its development. In future studies, we plan to combine charges with EN to derive more sophisticated and perhaps more robust descriptors. Nevertheless, we believe that the simplicity of the approach proposed in this paper and our demonstrated ability to generate reliable QSBR models using ENTess descriptors make these descriptors attractive for a wide range of QSBR studies.

CONCLUSIONS

To the best of our knowledge, our studies represent the first attempt to use electronegativity (EN) as a main parameter for the definition of atom types and descriptors for protein-ligand binding affinity prediction based on the QSBR approach. To develop a structure-based scoring function, we have combined the atomic EN with the geometrical description of the receptor-ligand interface using Delaunay tessellation. Delaunay tessellation is a unique way to represent the geometrical complementarity between receptors and ligands. Electronegativity has been found to define important terms in the molecular energy functions. Based on these two concepts, we have developed novel geometrical chemical descriptors. The descriptors have been applied in QSBR studies of binding energies for a dataset of 264 receptor-ligand complexes. QSBR models were built with the variable selection k-nearest neighbors (kNN) algorithm based on simulated annealing.

Using the ENTess descriptors, we have built and validated the QSBR models for receptor-ligand binding affinity prediction. Robust and accurate binding affinity predictions with R2 up to 0.83 for the test sets and 0.85 for the independent validation set have been obtained (Table 4). Compared to the conventional atom type definitions16,20–22,43, our method is very simple yet uses fundamental chemical and geometrical principles. Our current analysis relies on only 10 atom types in total and a relatively small number of descriptors, which can be considered an additional advantage of this methodology. Comparison with other scoring functions has demonstrated that our approach is accurate and efficient for the prediction of binding affinities for diverse protein-ligand structures. Our QSBR models can be used to predict binding free energy for protein-ligand complexes resulting from experimental studies or docking calculations. We expect that as additional data become available90, the accuracy and the range of applicability of our statistical scoring function will increase.

ACKNOWLEDGEMENTS

Special thanks are due to Dr. M. Karthikeyan for providing the statistics for different atom types in chemical databases and Dr. P. Itskowitz for providing the docking poses from AutoDock and valuable discussions concerning the use of electronegativity in deriving the ENTess descriptors. We also thank Drs. J. Feng, B. Krishnamoorthy and S.Q. Zong for their help with programming, and Mr. R. Shah for the discussions concerning the protein family classification. The studies presented in this paper were supported by the NIH research grant GM066940.

REFERENCES

  • 1. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28:235–242. doi: 10.1093/nar/28.1.235.
  • 2. Gohlke H, Klebe G. Statistical potentials and scoring functions applied to protein-ligand binding. Curr. Opin. Struct. Biol. 2001;11:231–235. doi: 10.1016/s0959-440x(00)00195-0.
  • 3. Halperin I, Ma B, Wolfson H, Nussinov R. Principles of docking: An overview of search algorithms and a guide to scoring functions. Proteins. 2002;47:409–443. doi: 10.1002/prot.10115.
  • 4. Tame JR. Scoring functions: a view from the bench. J. Comput. Aided Mol. Des. 1999;13:99–108. doi: 10.1023/a:1008068903544.
  • 5. Taylor RD, Jewsbury PJ, Essex JW. A review of protein-small molecule docking methods. J. Comput. Aided Mol. Des. 2002;16:151–166. doi: 10.1023/a:1020155510718.
  • 6. Bohm HJ, Boehringer M, Bur D, Gmuender H, Huber W, Klaus W, Kostrewa D, Kuehne H, Luebbers T, Meunier-Keller N, Mueller F. Novel inhibitors of DNA gyrase: 3D structure based biased needle screening, hit validation by biophysical methods, and 3D guided optimization. A promising alternative to random screening. J. Med. Chem. 2000;43:2664–2674. doi: 10.1021/jm000017s.
  • 7. Gruneberg S, Wendt B, Klebe G. Subnanomolar Inhibitors from Computer Screening: A Model Study Using Human Carbonic Anhydrase II. Angew. Chem. Int. Ed Engl. 2001;40:389–393. doi: 10.1002/1521-3773(20010119)40:2<389::aid-anie389>3.0.co;2-#.
  • 8. Grzybowski BA, Ishchenko AV, Shimada J, Shakhnovich EI. From knowledge-based potentials to combinatorial lead design in silico. Acc. Chem. Res. 2002;35:261–269. doi: 10.1021/ar970146b.
  • 9. Ajay, Murcko MA. Computational methods to predict binding free energy in ligand-receptor complexes. J. Med. Chem. 1995;38:4953–4967. doi: 10.1021/jm00026a001.
  • 10. Martin YC. Diverse viewpoints on computational aspects of molecular diversity. J. Comb. Chem. 2001;3:231–250. doi: 10.1021/cc000073e.
  • 11. Cornell WD, Cieplak P, Bayly CI, Gould IR, Merz KM, Jr, Ferguson DM, Spellmeyer DC, Fox T, Caldwell JW, Kollman PA. A second generation force-field for the simulation of proteins, nucleic acids and organic molecules. J. Am. Chem. Soc. 1995;117:5179–5187.
  • 12. MacKerell AD, Jr, Banavali N, Foloppe N. Development and current status of the CHARMM force field for nucleic acids. Biopolymers. 2000;56:257–265. doi: 10.1002/1097-0282(2000)56:4<257::AID-BIP10029>3.0.CO;2-W.
  • 13. Halgren TA. Merck molecular force field: 1. Basis, form, scope, parameterization, and performance of MMFF94. J. Comput. Chem. 1996;17:490–519.
  • 14. Shoichet BK, Leach AR, Kuntz ID. Ligand solvation in molecular docking. Proteins. 1999;34:4–16. doi: 10.1002/(sici)1097-0134(19990101)34:1<4::aid-prot2>3.0.co;2-6.
  • 15. Bohm HJ. Prediction of binding constants of protein ligands: a fast method for the prioritization of hits obtained from de novo design or 3D database search programs. J. Comput. Aided Mol. Des. 1998;12:309–323. doi: 10.1023/a:1007999920146.
  • 16. Head RD, Smythe ML, Oprea TI, Waller CL, Green SM, Marshall GR. VALIDATE: A new method for the receptor-based prediction of binding affinities of novel ligands. J. Am. Chem. Soc. 1996;118:3959–3969.
  • 17. Eldridge MD, Murray CW, Auton TR, Paolini GV, Mee RP. Empirical scoring functions: I. The development of a fast empirical scoring function to estimate the binding affinity of ligands in receptor complexes. J. Comput. Aided Mol. Des. 1997;11:425–445. doi: 10.1023/a:1007996124545.
  • 18. Gohlke H, Hendlich M, Klebe G. Knowledge-based scoring function to predict protein-ligand interactions. J. Mol. Biol. 2000;295:337–356. doi: 10.1006/jmbi.1999.3371.
  • 19. DeWitte RS, Shakhnovich EI. SMoG: de novo design method based on simple, fast, and accurate free energy estimates. 1. Methodology and supporting evidence. J. Am. Chem. Soc. 1996;118:11733–11744.
  • 20. Ishchenko AV, Shakhnovich EI. SMall Molecule Growth 2001 (SMoG2001): an improved knowledge-based scoring function for protein-ligand interactions. J. Med. Chem. 2002;45:2770–2780. doi: 10.1021/jm0105833.
  • 21. Muegge I, Martin YC. A general and fast scoring function for protein-ligand interactions: a simplified potential approach. J. Med. Chem. 1999;42:791–804. doi: 10.1021/jm980536j.
  • 22. Mitchell JBO, Laskowski RA, Alex A, Thornton JM. BLEEP-potential of mean force describing protein-ligand interactions: I. Generating potential. J. Comput. Chem. 1999;20:1165–1176.
  • 23. Deng W, Breneman C, Embrechts MJ. Predicting protein-ligand binding affinities using novel geometrical descriptors and machine-learning methods. J. Chem. Inf. Comput. Sci. 2004;44:699–703. doi: 10.1021/ci034246+.
  • 24. Kollman PA. Free energy calculations: application to chemical and biochemical phenomenon. Chem. Rev. 1993;93:2395–2417.
  • 25.Tanaka S, Scheraga HA. Statistical mechanical treatment of protein conformation. I. Conformational properties of amino acids in proteins. Macromolecules. 1976;9:142–159. doi: 10.1021/ma60049a026. [DOI] [PubMed] [Google Scholar]
  • 26.Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003;31:248–250. doi: 10.1093/nar/gkg056. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Zhang SX, Ying WS, Siahaan TJ, Jois SDS. Solution structure of a peptide derived from the beta subunit of LFA-1. Peptides. 2003;24:827–835. doi: 10.1016/s0196-9781(03)00170-0. [DOI] [PubMed] [Google Scholar]
  • 28.Roche O, Kiyama R, Brooks CL., III Ligand-protein database: linking protein-ligand complex structures to binding data. J. Med. Chem. 2001;44:3592–3598. doi: 10.1021/jm000467k. [DOI] [PubMed] [Google Scholar]
  • 29.Muegge I, Martin YC, Hajduk PJ, Fesik SW. Evaluation of PMF scoring in docking weak ligands to the FK506 binding protein. J. Med. Chem. 1999;42:2498–2503. doi: 10.1021/jm990073x. [DOI] [PubMed] [Google Scholar]
  • 30.Martin YC. Quantiative Drug Design: A Critical Introduction. New York, Basel: Marcel Decker Inc; 1978. pp. 1–425. [Google Scholar]
  • 31.Cramer RD, III, Patterson DE, Bunce JD. Comparative molecular field analysis (CoMFA). 1. Effect of shape on binding of steroids to carrier proteins. J. Am. Chem. Soc. 1988;110:5959–5967. doi: 10.1021/ja00226a005. [DOI] [PubMed] [Google Scholar]
  • 32.Kulkarni SS, Gediya LK, Kulkarni VM. Three-dimensional quantitative structure activity relationships (3-D-QSAR) of antihyperglycemic agents. Bioorg. Med. Chem. 1999;7:1475–1485. doi: 10.1016/s0968-0896(99)00063-2. [DOI] [PubMed] [Google Scholar]
  • 33.Kulkarni SS, Kulkarni VM. Three-dimensional quantitative structure-activity relationship of interleukin 1-beta converting enzyme inhibitors: A comparative molecular field analysis study. J. Med. Chem. 1999;42:373–380. doi: 10.1021/jm9708442. [DOI] [PubMed] [Google Scholar]
  • 34.Tokarski JS, Hopfinger AJ. Prediction of ligand-receptor binding thermodynamics by free energy force field (FEFF) 3D-QSAR analysis: application to a set of peptidometic renin inhibitors. J. Chem. Inf. Comput. Sci. 1997;37:792–811. doi: 10.1021/ci970006g. [DOI] [PubMed] [Google Scholar]
  • 35.Holloway MK, Wai JM, Halgren TA, Fitzgerald PM, Vacca JP, Dorsey BD, Levin RB, Thompson WJ, Chen LJ, deSolms SJ. A priori prediction of activity for HIV-1 protease inhibitors employing energy minimization in the active site. J. Med. Chem. 1995;38:305–317. doi: 10.1021/jm00002a012. [DOI] [PubMed] [Google Scholar]
  • 36.Ortiz AR, Pisabarro MT, Gago F, Wade RC. Prediction of drug binding affinities by comparative binding energy analysis. J. Med. Chem. 1995;38:2681–2691. doi: 10.1021/jm00014a020. [DOI] [PubMed] [Google Scholar]
  • 37.Perez C, Pastor M, Ortiz AR, Gago F. Comparative binding energy analysis of HIV-1 protease inhibitors: incorporation of solvent effects and validation as a powerful tool in receptor-based drug design. J. Med. Chem. 1998;41:836–852. doi: 10.1021/jm970535b. [DOI] [PubMed] [Google Scholar]
  • 38.Carter CW, Jr, LeFebvre BC, Cammer SA, Tropsha A, Edgell MH. Four-body potentials reveal protein-specific correlations to stability changes caused by hydrophobic core mutations. J. Mol. Biol. 2001;311:625–638. doi: 10.1006/jmbi.2001.4906. [DOI] [PubMed] [Google Scholar]
  • 39.Sherman DB, Zhang SX, Pitner JB, Tropsha A. Evaluation of the relative stability of liganded versus ligand-free protein conformations using simplicial neighborhood analysis of protein packing (SNAPP) method. Proteins. 2004;56:828–838. doi: 10.1002/prot.20131. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Zhang SX, Kaplan AH, Tropsha A. HIV-1 Protease Function and Structure Studies with Novel Computational Geometrical Method. Proteins. doi: 10.1002/prot.22094. Unpublished. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Singh RK, Tropsha A, Vaisman II. Delaunay tessellation of proteins: four body nearest-neighbor propensities of amino acid residues. J. Comput. Biol. 1996;3:213–221. doi: 10.1089/cmb.1996.3.213. [DOI] [PubMed] [Google Scholar]
  • 42.Tropsha A, Singh RK, Vaisman II, Zheng W. Statistical geometry analysis of proteins: implications for inverted structure prediction. Pac. Symp. Biocomput. 1996:614–623. [PubMed] [Google Scholar]
  • 43.Bush BL, Sheridan RP. PATTY: A Programmable Atom Typer and Language for Automatic Classification of Atoms in Molecular Databases. J. Chem. Inf. Comput. Sci. 1993;33:756–762. [Google Scholar]
  • 44.Golbraikh A, Tropsha A. Beware of q2! J. Mol. Graph. Model. 2002;20:269–276. doi: 10.1016/s1093-3263(01)00123-1. [DOI] [PubMed] [Google Scholar]
  • 45.Golbraikh A, Shen M, Xiao Z, Xiao YD, Lee KH, Tropsha A. Rational selection of training and test sets for the development of validated QSAR models. J. Comput. Aided Mol. Des. 2003;17:241–253. doi: 10.1023/a:1025386326946. [DOI] [PubMed] [Google Scholar]
  • 46.Tropsha A, Gramatica P, Gomba VK. The improtance of being earnest: validation is the absolute essential for the successful application and interpretaion of QSPR models. QSAR Comb. Sci. 2003;22:69–77. [Google Scholar]
  • 47.Tropsha A. Recent Trends in Quantitative Structure-Activity Relationships. In: Abraham D, editor. Burger's Medicinal Chemistry and Drug Discovery. New York: John Wiley & Sons, Inc; 2003. pp. 49–77. [Google Scholar]
  • 48.Wang RX, Liu L, Lai LH, Tang YQ. SCORE: A new empirical method for estimating the binding affinity of a protein-ligand complex. J. Mol. Model. 1998;4:379–394. [Google Scholar]
  • 49.Wang RX, Lai LH, Wang SM. Further development and validation of empirical scoring functions for structure-based binding affinity prediction. J. Comput. Aided Mol. Des. 2002;16:11–26. doi: 10.1023/a:1016357811882. [DOI] [PubMed] [Google Scholar]
  • 50.Wang RX, Lu YP, Wang SM. Comparative evaluation of 11 scoring functions for molecular docking. J. Med. Chem. 2003;46:2287–2303. doi: 10.1021/jm0203783. [DOI] [PubMed] [Google Scholar]
  • 51.Hendlich M, Bergner A, Gunther J, Klebe G. Relibase: design and development of a database for comprehensive analysis of protein-ligand interactions. J. Mol. Biol. 2003;326:607–620. doi: 10.1016/s0022-2836(02)01408-0. [DOI] [PubMed] [Google Scholar]
  • 52.2005 http://www.imb-jena.de/ImgLibPDB/pages/SWP/index.php. [Google Scholar]
  • 53.Pauling L. The Nature of the Chemical Bond. IV. The Energy of Single Bonds and the Relative Electronegativity of Atoms. J. Am. Chem. Soc. 1932;54:3570–3582. [Google Scholar]
  • 54.Itskowitz P, Berkowitz ML. Chemical potential equalization principle: Direct approach from density functional theory. J. Phys. Chem. A. 1997;101:5687–5691. [Google Scholar]
  • 55.Kellogg GE, Kier LB, Gaillard P, Hall LH. E-state fields: applications to 3D QSAR. J. Comput. Aided Mol. Des. 1996;10:513–520. doi: 10.1007/BF00134175. [DOI] [PubMed] [Google Scholar]
  • 56.Oliferenko AA, Krylenko PV, Palyulin VA, Zefirov NS. A new scheme for electronegativity equalization as a source of electronic descriptors: application to chemical reactivity. SAR QSAR Environ. Res. 2002;13:297–305. doi: 10.1080/10629360290002785. [DOI] [PubMed] [Google Scholar]
  • 57.2005 http://dtp.nci.nih.gov/docs/3d_database/structural_information/smiles_strings.html.
  • 58.1999 http://dtp.nci.nih.gov/docs/cancer/cancer_data.html.
  • 59.Watson DF. Computing the n-dimensional Delaunay tessellation with application to Voronoi polytopes. The Computer J. 1981;24:167–172. [Google Scholar]
  • 60.Basak SC, Mills D. Prediction of mutagenicity utilizing a hierarchical QSAR approach. SAR QSAR Environ. Res. 2001;12:481–496. doi: 10.1080/10629360108039830. [DOI] [PubMed] [Google Scholar]
  • 61.Benigni R, Giuliani A, Franke R, Gruska A. Quantitative structure-activity relationships of mutagenic and carcinogenic aromatic amines. Chem. Rev. 2000;100:3697–3714. doi: 10.1021/cr9901079. [DOI] [PubMed] [Google Scholar]
  • 62.Cronin MT, Dearden JC, Duffy JC, Edwards R, Manga N, Worth AP, Worgan AD. The importance of hydrophobicity and electrophilicity descriptors in mechanistically-based QSARs for toxicological endpoints. SAR QSAR Environ. Res. 2002;13:167–176. doi: 10.1080/10629360290002316. [DOI] [PubMed] [Google Scholar]
  • 63.Fan Y, Shi LM, Kohn KW, Pommier Y, Weinstein JN. Quantitative structure-antitumor activity relationships of camptothecin analogues: cluster analysis and genetic algorithm-based studies. J. Med. Chem. 2001;44:3254–3263. doi: 10.1021/jm0005151. [DOI] [PubMed] [Google Scholar]
  • 64.Girones X, Gallegos A, Carbo-Dorca R. Modeling antimalarial activity: application of Kinetic Energy Density Quantum Similarity Measures as descriptors in QSAR. J. Chem. Inf. Comput. Sci. 2000;40:1400–1407. doi: 10.1021/ci0004558. [DOI] [PubMed] [Google Scholar]
  • 65.Moss GP, Dearden JC, Patel H, Cronin MT. Quantitative structure-permeability relationships (QSPRs) for percutaneous absorption. Toxicol. In Vitro. 2002;16:299–317. doi: 10.1016/s0887-2333(02)00003-6. [DOI] [PubMed] [Google Scholar]
  • 66.Randic M, Basak SC. Construction of high-quality structure-property-activity regressions: the boiling points of sulfides. J. Chem. Inf. Comput. Sci. 2000;40:899–905. doi: 10.1021/ci990115q. [DOI] [PubMed] [Google Scholar]
  • 67.Suzuki T, Ide K, Ishida M, Shapiro S. Classification of environmental estrogens by physicochemical properties using principal component analysis and hierarchical cluster analysis. J. Chem. Inf. Comput. Sci. 2001;41:718–726. doi: 10.1021/ci000333f. [DOI] [PubMed] [Google Scholar]
  • 68.Trohalaki S, Gifford E, Pachter R. Improved QSARs for predictive toxicology of halogenated hydrocarbons. Comput. Chem. 2000;24:421–427. doi: 10.1016/s0097-8485(99)00093-5. [DOI] [PubMed] [Google Scholar]
  • 69.Wang X, Yin C, Wang L. Structure-activity relationships and response-surface analysis of nitroaromatics toxicity to the yeast (Saccharomyces cerevisiae) Chemosphere. 2002;46:1045–1051. doi: 10.1016/s0045-6535(01)00148-5. [DOI] [PubMed] [Google Scholar]
  • 70.Kubinyi H, Hamprecht FA, Mietzner T. Three-dimensional quantitative similarity-activity relationships (3D QSiAR) from SEAL similarity matrices. J. Med. Chem. 1998;41:2553–2564. doi: 10.1021/jm970732a. [DOI] [PubMed] [Google Scholar]
  • 71.Golbraikh A, Tropsha A. Predictive QSAR modeling based on diversity sampling of experimental datasets for the training and test set selection. J. Comput. Aided Mol. Des. 2002;16:357–369. doi: 10.1023/a:1020869118689. [DOI] [PubMed] [Google Scholar]
  • 72.Shen M, LeTiran A, Xiao Y, Golbraikh A, Kohn H, Tropsha A. Quantitative structure-activity relationship analysis of functionalized amino acid anticonvulsant agents using k nearest neighbor and simulated annealing PLS methods. J. Med. Chem. 2002;45:2811–2823. doi: 10.1021/jm010488u. [DOI] [PubMed] [Google Scholar]
  • 73.Hoffman B, Cho SJ, Zheng WF, Wyrick S, Nichols DE, Mailman RB, Tropsha A. Quantitative structure-activity relationship modeling of dopamine D-1 antagonists using comparative molecular field analysis, genetic algorithms-partial least-squares, and K nearest neighbor methods. J. Med. Chem. 1999;42:3217–3226. doi: 10.1021/jm980415j. [DOI] [PubMed] [Google Scholar]
  • 74.Zheng W, Tropsha A. Novel variable selection quantitative structure--property relationship approach based on the k-nearest-neighbor principle. J. Chem. Inf. Comput. Sci. 2000;40:185–194. doi: 10.1021/ci980033m. [DOI] [PubMed] [Google Scholar]
  • 75.Golbraikh A, Bonchev D, Tropsha A. Novel ZE-isomerism descriptors derived from molecular topology and their application to QSAR analysis. J. Chem. Inf. Comput. Sci. 2002;42:769–787. doi: 10.1021/ci0103469. [DOI] [PubMed] [Google Scholar]
  • 76.Davies JF, Delcamp TJ, Prendergast NJ, Ashford VA, Freisheim JH, Kraut J. Crystal-Structures of Recombinant Human Dihydrofolate-Reductase Complexed with Folate and 5-Deazafolate. Biochem. 1990;29:9467–9479. doi: 10.1021/bi00492a021. [DOI] [PubMed] [Google Scholar]
  • 77.Miller BG, Hassell AM, Wolfenden R, Milburn MV, Short SA. Anatomy of a proficient enzyme: The structure of orotidine 5 '-monophosphate decarboxylase in the presence and absence of a potential transition state analog. Proceedings of the National Academy of Sciences of the United States of America. 2000;97:2011–2016. doi: 10.1073/pnas.030409797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 78.Pargellis C, Tong L, Churchill L, Cirillo PF, Gilmore T, Graham AG, Grob PM, Hickey ER, Moss N, Pav S, Regan J. Inhibition of p38 MAP kinase by utilizing a novel allosteric binding site. Nature Structural Biology. 2002;9:268–272. doi: 10.1038/nsb770. [DOI] [PubMed] [Google Scholar]
  • 79.Rarey M, Kramer B, Lengauer T, Klebe G. A fast flexible docking method using an incremental construction algorithm. J. Mol. Biol. 1996;261:470–489. doi: 10.1006/jmbi.1996.0477. [DOI] [PubMed] [Google Scholar]
  • 80.SYBYL. Version 6.9. St. Louis, MO: Tripos, Inc.; 2002. [Google Scholar]
  • 81.Goodsell DS, Olson AJ. Automated docking of substrates to proteins by simulated annealing. Proteins. 1990;8:195–202. doi: 10.1002/prot.340080302. [DOI] [PubMed] [Google Scholar]
  • 82.Weber PC, Ohlendorf DH, Wendoloski JJ, Salemme FR. Structural origins of high-affinity biotin binding to streptavidin. Science. 1989;243:85–88. doi: 10.1126/science.2911722. [DOI] [PubMed] [Google Scholar]
  • 83.Poulos TL, Howard AJ. Crystal structures of metyrapone- and phenylimidazole-inhibited complexes of cytochrome P-450cam. Biochem. 1987;26:8165–8174. doi: 10.1021/bi00399a022. [DOI] [PubMed] [Google Scholar]
  • 84.Holmes MA, Tronrud DE, Matthews BW. Structural analysis of the inhibition of thermolysin by an active-site-directed irreversible inhibitor. Biochem. 1983;22:236–240. doi: 10.1021/bi00270a034. [DOI] [PubMed] [Google Scholar]
  • 85.Baldwin ET, Bhat TN, Gulnik S, Hosur MV, Sowder RC, Cachau RE, Collins J, Silva AM, Erickson JW. Crystal structures of native and inhibited forms of human cathepsin D: implications for lysosomal targeting and drug design. Proc. Natl. Acad. Sci. U. S. A. 1993;90:6796–6800. doi: 10.1073/pnas.90.14.6796. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 86.Wedekind JE, Poyner RR, Reed GH, Rayment I. Chelation of serine 39 to Mg2+ latches a gate at the active site of enolase: structure of the bis(Mg2+) complex of yeast enolase and the intermediate analog phosphonoacetohydroxamate at 2.1-A resolution. Biochem. 1994;33:9333–9342. doi: 10.1021/bi00197a038. [DOI] [PubMed] [Google Scholar]
  • 87.Holden HM, Tronrud DE, Monzingo AF, Weaver LH, Matthews BW. Slow-and fast-binding inhibitors of thermolysin display different modes of binding: crystallographic analysis of extended phosphonamidate transition-state analogues. Biochem. 1987;26:8542–8553. doi: 10.1021/bi00400a008. [DOI] [PubMed] [Google Scholar]
  • 88.Van Duyne GD, Standaert RF, Karplus PA, Schreiber SL, Clardy J. Atomic structure of FKBP-FK506, an immunophilin-immunosuppressant complex. Science. 1991;252:839–842. doi: 10.1126/science.1709302. [DOI] [PubMed] [Google Scholar]
  • 89.Miller BG, Hassell AM, Wolfenden R, Milburn MV, Short SA. Anatomy of a proficient enzyme: The structure of orotidine 5 '-monophosphate decarboxylase in the presence and absence of a potential transition state analog. Proc. Natl. Acad. Sci. U. S. A. 2000;97:2011–2016. doi: 10.1073/pnas.030409797. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 90.Wang RX, Fang XL, Lu YP, Wang SM. The PDBbind database: Collection of binding affinities for protein-ligand complexes with known three-dimensional structures. J. Med. Chem. 2004;47:2977–2980. doi: 10.1021/jm030580l. [DOI] [PubMed] [Google Scholar]
