Skip to main content
Wiley Open Access Collection logoLink to Wiley Open Access Collection
. 2016 Apr 29;36(1-2):1600013. doi: 10.1002/minf.201600013

Quantitative Structure‐activity Relationship (QSAR) Models for Docking Score Correction

Yoshifumi Fukunishi 1,, Satoshi Yamasaki 2, Isao Yasumatsu 2,3, Koh Takeuchi 1, Takashi Kurosawa 2,4, Haruki Nakamura 5
Editors: Gisbert Schneider, Kimito Funatsu, Ysushi Okuno, Dave Winkler
PMCID: PMC5297997  PMID: 28001004

Abstract

In order to improve docking score correction, we developed several structure‐based quantitative structure activity relationship (QSAR) models by protein‐drug docking simulations and applied these models to public affinity data. The prediction models used descriptor‐based regression, and the compound descriptor was a set of docking scores against multiple (∼600) proteins including nontargets. The binding free energy that corresponded to the docking score was approximated by a weighted average of docking scores for multiple proteins, and we tried linear, weighted linear and polynomial regression models considering the compound similarities. In addition, we tried a combination of these regression models for individual data sets such as IC50, Ki, and %inhibition values. The cross‐validation results showed that the weighted linear model was more accurate than the simple linear regression model. Thus, the QSAR approaches based on the affinity data of public databases should improve docking scores.

Keywords: Binding free energy, ChEMBL, Docking score, Protein-compound docking

1.  Introduction

De novo drug design is a key factor in lead optimizations and the selective optimization of side activities, and the quantitative structure‐activity relationship (QSAR) approach is a useful tool for predicting target/off‐target activities. QSAR‐based affinity predictions are useful for the drug repositioning (drug repurposing) of known approved drugs, poly‐pharmacology and the prediction of drug−drug interactions.1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 hERG‐inhibition and cytochrome P450 (CYP)‐inhibition predictions represent classical achievements in off‐target predictions, and this kind of target/off‐target prediction is known as counter screening. The recent accumulation of protein‐compound affinity data in public repositories, such as the PubChem and ChEMBL projects, has enabled us to carry out proteome‐wide target/off‐target predictions.23,24 These predictions are based on structure‐activity relationship models for multiple proteins, just as in the conventional computer‐aided drug design and virtual screening.

In a previous study, we attempted affinity and target predictions of a compound by using docking studies against multiple proteins. These trials worked well in virtual screenings. However, these methods provided only binary, active/inactive information and could not provide quantitative affinities.6,9,10 Target/off‐target predictions based on QSAR and counter screening have succeeded in many studies, including ours.4,5,7,8,11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22 However, the somewhat primitive approach of a similarity search relying on the QSAR model is not sufficiently versatile. To broaden its applicability, the similarity search has been improved by considering the similarity that is shared among different groups of compounds rather than the similarity between two compounds.11 Most QSAR models rely on descriptors with sets of two‐dimensional (2D) substructures; the most popular such descriptors are the MDL's MACCS key and Dragon‐X (Talete srl, Milano, Italy). Moreover, there have been many regression models. In our previous studies, we used a protein‐compound affinity matrix as the set of descriptors and successfully predicted CYP inhibitors and substrates.7,12

As an extension of our previous work,6,7 in the present study we developed and examined some principal component regression‐type prediction methods based on the machine‐learning score modification (MSM) method and the docking score index (DSI: protein‐compound affinity matrix) using public repository data. In the MSM method, the docking score against a target protein could be corrected by a linear combination of docking scores against multiple nontarget proteins. The DSI of a compound is a set of docking scores of the compound against multiple target and nontarget proteins. Here, the nontarget proteins are used as “probes” to check the three‐dimensional (3D) shape and distribution of the atomic charge of the compound. We applied our methods to kinases of the ChEMBL database, because kinases form a major protein family that is key to understanding and controlling cellular signal transduction, and because their structures have been studied thoroughly.25, 26, 27

2.  Methods

2.1.  Prediction Models

The present method predicts the binding energy (affinity) of given protein‐compound pairs based on the protein‐compound docking scores obtained by a docking program; this method is a descriptor‐based machine‐learning or regression method.6,7,28,29 The present method requires a learning set of 3D structures of compounds, the binding energy data between those compounds, and target proteins. Let si b, Rb a, and β be the docking score of the i‐th compound, that of the b‐th protein, and parameters, respectively. The set of {b} can include the target protein (the a‐th protein). We proposed a score modification method as follows. (Eq. (1))6

ΔGia=b=1sib·Rba+β (1)

Here, ΔGi a is the binding free energy between the i‐th compound and the a‐th protein. Proteins that are similar to the target protein could bind the ligands of the target protein. Docking scores correspond to the binding free energy, and the ensemble average should improve the accuracy. Thus eq. 1 should work, and indeed, this approximation was successfully applied to several targets.6,7,28,29 The score modification method increased the area under the curve (AUC) values of the database enrichment curves by 50 %. Namely, the AUC values of 60–70 % were improved to 80–90 % in these previous reports. In addition, there is one advantage to the other conventional QSAR models with ordinary molecular descriptors. One of the most serious problems of QSAR models is the limited range of applicable domains, since QSAR models cannot work for unexpected input data.30 If the docking score is precisely proportional to the binding free energy without computational error, ΔGia=si a. Thus, eq. 1 can work without any experimental affinity data and the problem of identifying an applicable domain is avoided.

In eq. 1, the number of parameters is equal to the number of proteins. The number of parameters can be reduced by principal component regression (PCR). Docking scores should be a form of some kinds of similarity scores (see APPENDIX A). Thus, the docking scores could be used as the descriptors for compounds. If we allow docking scores as descriptors, the docking score for a target protein is not needed in eq. 1.

In the present model, the protein‐compound binding energy ΔGi a is approximated by the PCR method based on the protein‐compound docking scores si b. As shown in Figure 1, we tried six PCR models (Models 1–6) as follows. In each model, the optimal principal component analysis (PCA) axis was selected to maximize the q value by the leave‐one‐out (LOO) cross‐validation test. The selection of the PCA axis corresponds to the factor rotation.29

Figure 1.

Figure 1

Schematic representation of each principal component regression (PCR) model.

(Model 1) Linear PCR model (Eq. (2), (3))

ΔGia=j=1Ncja·pij+b (2)
pij=b=1Npdbj·(sib-<sb>) (3)

Here, cj a, b, p, and db j are the parameter, offset parameter, principal component vector, and loading vector, respectively. The ⟨ ⟩ represents an average. The PCA of the protein‐compound docking score matrix s gives the loading vector d and the principal component vector (axis) p. The parameters c and b are determined by a multilinear regression (MLR). Np is the total number of docked proteins and N (N<Np) is determined to maximize the q‐value obtained by the LOO cross‐validations. The parameters are determined based on the learning set and then are used for prediction.

The protein‐compound docking scores si b were obtained by the program Sievgene,31 which is a protein‐ligand flexible docking program for in silico drug screening. Sievgene is a part of the myPresto system, which is available online (http://presto.protein.osaka‐u.ac.jp/myPresto4/) and is free for academic use.

(Model 2) Polynomial PCR model (Eq. (4), (5))

ΔGia=j=1Ncja·pij+k=1NjkNc2ja·pij·pik+b (4)
pij=b=1Npdbj·(sib-<sb>) (5)

Here, c2j a represents the parameter for the second‐order term. We tried only the second‐order and did not try the higher‐order polynomials. The other terms and parameters are defined exactly as in Model 1.

(Model 3) Weighted learning PCR model. (Eq. (6))

S=b=1Np(sib-sjb)2 (6)

S is the distance between the i‐th and j‐th compounds. Let Smax(i)=max{S; for all j}. The data of compounds that satisfy S<x ⋅ Smax are replicated M times. In the present study, x=0.1, 0.2, 0.3, and 0.5 were examined and M was set to 1, 2, 4, and 8.

(Model 4) Classified PCR model

In the classified PCR model, the experimental data are classified into IC50, Ki, and %inhibition data, and the simple PCR method is applied to each classified data set.

(Model 5) Combined PCR model

This model is a linear combination of the regression models made by the classified PCR model. (Eq. (7))

ΔGia=m=1NtypeCm·Δgma+c0 (7)

Here, Δg, Cm and c0 are the binding free energy obtained from the m‐th data set and the fitting parameters, respectively. The Δg is given by eq. 1, and the coefficient Cm is determined to minimize the root‐mean‐square difference between the coefficients of {c} of Δg. In the present study, the experimental data were classified into IC50, Ki, and %inhibition data (m=Ki, IC50 and %inhibition). We based this classification on the individual source of the experimental data.

(Model 6) Replica and partial‐replica PCR models

Cortes‐Ciriano et al. suggested that the use of multiple replica data sets permutated by random noise could improve the QSAR accuracy.32 In the replica PCR method, the experimental data and docking scores are replicated by the permutation of 5 % noise. Also, Steinmetz et al. suggested that the QSAR result could be improved by considering the importance of data due to reliability.33 In the partial‐replica PCR method, experimental ΔG values <−10 kcal/mol are replicated by permutation of 2 % noise, since the strong affinities of lead‐level compounds are observed by multiple experiments in many cases and should be more reliable than the weak affinities of hit‐level compounds.

2.2.  Generation of the Docking Score Index by Protein‐compound Docking

The protein‐compound docking scores si b in all models were calculated by the protein‐compound docking program Sievgene.31 Sievgene generates multiple possible conformers for each compound and keeps the structures of target proteins more or less rigid, with the exception that soft interaction forces can change the structures slightly. This docking program reconstructed about 50 % of the receptor‐compound complexes in PDB (132 in total) with an accuracy of less than 2 Å root mean square deviation (RMSD),31 which is mostly equivalent to the predictions by other docking programs. In the present study, the Sievgene program generated up to 100 conformers for each compound, and 200×200×200 grid potentials were adapted for all proteins. The pocket regions were suggested by the coordinates of the original ligands in the receptor‐compound complex structures. The details of the docking score are summarized in Appendix B (Supporting Information). It takes 3 seconds to dock one compound against one protein on a single core of the Xeon 5570 CPU (2.98 GHz).

2.3.  Data‐conversion Method

The protein‐compound binding energy ΔG is calculated from the Kd value as follows: (Eq. (8))

ΔG=kB·T·ln(Kd) (8)

where kB and T are the Boltzmann constant and temperature.

The experimental Kd and ΔG values are difficult to obtain and quite rare in public databases. On the other hand, the %inhibition, Ki and IC50 values are relatively easy to obtain and abundant in public databases such as PubChem and ChEMBL. In the present study, we assumed that Kd=Ki, since the binding affinities of the natural ligands have been reported to be much weaker than those of the reported artificial ligands in many proteins. For the %inhibition and IC50 data, the conventional approaches are adopted as follows. The %inhibition value is converted to the Ki value. Let E, S, P and I be the enzyme, substrate, product and inhibitor, respectively. The inhibition reaction is described as follows. Here, “K” represents the reaction rate. (Eq. (9))

E+SKsESK2E+PE+IKiEI (9)

When the enzyme reaction is the rate‐determining step, we have (Eq. (10))

ESE+P. (10)

The value of Ks is then derived from the density of the E, S and ES complexes as follows. (Eq. (11))

Ks=[E][S][ES] (11)

Here, the bracket [ ] represents the density of molecules. The reaction speed v is described as follows. (Eq. (12))

v=Vmax[S](1+[I]/Ki)Ks+[S] (12)

Here, Vmax is the maximum enzyme reaction speed. Let r be the residual activity; Ki is then given by (Eq. (13))

Ki=[I][S]/r-[S]Ks-1 (13)

The parameters in eq. 13 must satisfy (Eq. (14))

r<[S]/Ks1+[S]/Ks, (14)

because Ki>0. Here, the %inhibition value is (1−r)*100.

The IC50 value is converted to the Ki value by the Cheng‐Prusoff equation as follows.20,34 Here, S and Ks are the substrate and the affinity between the enzyme and the substrate. (Eq. (15))

Ki=IC501+[S]Ks (15)

Unfortunately, the exact values of [S], [I] and Ks are not explicitly described in the assay data of PubChem or ChEMBL. Thus, we checked some original experimental articles for their assay data and adopted arbitrary standard values for [S], [I] and Ks based on previous reports.27,28,35, 36, 37

3.  Data Preparation

3.1.  Probe Protein Sets with and without Kinase Structures

To generate {s} in Models 1–6 (the DSI or affinity fingerprint), we performed a protein−compound docking simulation based on the soluble protein structures registered in the Protein Data Bank (PDB). The probe protein set consisted of 600 arbitrarily selected protein structures. These structures were all protein‐ligand complex structures. Some of them were kinases. For protein sets, the complexes containing a covalent bond between the protein and ligand were removed, and all missing hydrogen atoms were added to form the all‐atom models of the proteins. All water molecules and cofactors were removed from the protein structures. The atomic charges of the proteins were the same as those in AMBER parm99.38 The docking pocket of each protein was indicated by the coordinates of the original ligand.

3.2.  Validation Test Set: Target Proteins and Compounds

The tested compounds and their assay information (compound structures, affinities against kinases) were downloaded from KinaseSARfari on the ChEMBL website (https://www.ebi.ac.uk/chembl/).24 The biochemical assay data, namely, Ki, IC50, %residual activity and/or %inhibition values of human kinase protein‐inhibitor systems, were also extracted from the bioactivity table in KinaseSARfari. Assay data with inadequate energy units or unclear energy values were excluded. The assay data for large compounds (mass weight >500 Da) were also excluded, since Sievgene is designed for the docking of small compounds with mass weights <500 Da.

As target proteins, 97 kinases with more than 50 assay data points were selected. The 3D structures of the compounds were energy‐optimized by cosgene39 with the general AMBER force field (GAFF),40 and the atomic charges were calculated by the MOPAC AM1 model using the Hgene program of the myPresto suite. Finally, 38,946 assay data points of 97 kinases and 18,491 compounds were derived. Most of the assay data were IC 50 data, with the second‐most common type being %inhibition data, followed by Ki and Kd values. These data were converted to the ΔG value by using Eqs. 8, 13 and 15. Equations 13 and 15 required the density of the substrate [S], the density of the inhibitor [I], and the reaction rate of the substrate (Ks). These [S], [I] and Ks data were not available in the ChEMBL database. For Eq. 15, we used the [S] and Ks values reported by Carna Bioscience Inc. (http://www.carnabio.com/english/), and we used the standard values ([S] : Ks=1 : 1) in place of unknown data. For Eq. 13, we used constant values for some parameters: [S]=20 μM, [I]=50 μM and Ks=1 μM. And r values greater than 0.95 and less than 0.05 were ignored, since the error of r should be around 5 % and an r value >1 gave unreasonable ΔG values. The parameter set was determined based on several reports of assays included in the ChEMBL database.

4.  Results and Discussion

4.1.   ΔG Values Obtained from Ki, IC50 and %Inhibition Values

Appendix C (Supporting Information) provides a list of the target kinases used in the present study and the number of ligands for each. The kinase names were the domain names from the KinaseSARfari database in ChEMBL.

The average and standard deviation values of the experimental ΔG values of the 97 target proteins are summarized in Table 1, and the data were classified following the data type. For some kinases, multiple experimental assay data (Ki, IC50 and %inhibition) were available for the same ligands of some proteins. The differences between each classified data type (the ΔG value calculated from Ki, IC50 and %inhibition data) and the other types of data are summarized in Table 1. The difference corresponded to the error of the experimental data converted in the present study by Eqs. 13 and 15.

Table 1.

Statistics of ΔG values (kcal/mol) converted from ChEMBL data.

Data type Average ΔG σ [a] ΔGmin [b] ΔGmax [c] RMSD [d]
Whole −9.67 2.41 −18.51 −0.55
Ki/Kd −9.30 1.69 −16.30 −1.34 1.55
IC50 −9.37 2.18 −18.51 −0.55 1.87
Activity −0.88 5.74 −9.35 −4.11 2.58
%inhibition −2.66 4.14 −9.35 −4.11 2.99

[a] The standard deviation of the whole observed data (kcal/mol). [b] The minimum ΔG value of the data set (kcal/mol). [c] The maximum ΔG value of the data set (kcal/mol). [d] The root mean square deviation (RMSD) of the multiply‐observed data for the same protein‐ligand pairs (kcal/mol)

We simulated the expected correlation coefficient between the experimental and calculated ΔG values based on the statistics summarized in Table 1. We generated a set of numbers that mimics the experimental ΔG values whose average and standard deviation were −9.6 kcal/mol and 2.5 kcal/mol, respectively. Then a random number was added as experimental error to each simulated experimental ΔG value. Also, we generated a set of numbers that mimics the calculated ΔG value by adding a random number as the computational error. The calculated correlation coefficients are summarized in Table 2. In addition, we performed a set of virtual screenings based on these simulated data. The compounds with experimental ΔG<−11 kcal/mol were selected as the active (hit) compounds, and the others were treated as decoys. Then the compounds were sorted in the order of the calculated ΔG and the receiver operating characteristic (ROC) curves were calculated. The area‐under‐the‐curve (AUC) values of the ROC curves of the simulated virtual screenings are also summarized in Table 2. A higher AUC value corresponds to better prediction, and the AUC value is always more than zero and less than 100 %. For the random screening, AUC=50 %.

Table 2.

Correlation coefficient (R) and area under the curve (AUC) of the receiver operating characteristic (ROC) curve by mathematical simulation.

Error exp [a] Error calc [b] R AUC (%)
0.58 0.82 0.90 97
0.58 1.29 0.79 94
0.58 1.51 0.74 91
0.58 1.83 0.67 88
0.70 0.91 0.88 97
0.70 1.35 0.78 94
0.70 1.55 0.73 92
0.70 1.87 0.66 88
1.16 1.30 0.79 97
1.16 1.64 0.70 93
1.16 1.81 0.66 92
1.16 2.09 0.59 88
1.39 1.51 0.75 97
1.39 1.81 0.66 94
1.39 1.97 0.62 92
1.39 2.23 0.56 88

[a] Simulated experimental error (kcal/mol). [b] Simulated prediction error (kcal/mol).

These correlation coefficients (0.6–0.7) suggested poor QSARs; on the other hand, the AUC values (AUC=80–90 %) showed good virtual screening results.

4.2.  Cross‐validation Tests of QSAR Models

Each of the compounds that gave assay data for one or more of the 97 target proteins was docked to all proteins of a protein set to generate the protein‐compound docking score matrix s. Then we adopted Models 1–6 and the LOO cross‐validation test to calculate the correlation coefficients R and Q. Figure 2 shows the schematic description of the LOO cross‐validation procedure for models 1–6. In all models, we performed the simple linear regressions and obtained an R value for each principal component axis. The axes were sorted according to their R values. Then, we performed the multiple linear/polynomial regressions adopting the top m‐axes and calculated the R values. The number of axes that gave the highest R was adopted to construct the regression model for calculating the ΔG of the query compound, and the Q value was calculated.

Figure 2.

Figure 2

Schematic representation of the leave‐one‐out (LOO) cross‐validation procedure for models 1–6. The similarity calculation and making copy of assay data (in model 6) were applied only for the weighted principle component regression (PCR) model (gray boxes). Nc is the number of compounds. For the replica and partial‐replica PCR models, the initial docking scores and set of ΔG were replicated. *: Regression type was polynomial only for the polynomial PCR model. Otherwise, the regression was a multilinear regression (MLR).

Figure 3 shows the results of the LOO cross‐validation test of three selected kinases. The correlation coefficients (R) between the experimental and predicted ΔG values are summarized in Tables 1 and 3, respectively. Also, the root mean square error (RMSE) values between the experimental and predicted ΔG values are summarized in Table 3.

Table 3.

Average correlation coefficient (R/Q) between the experimental data and the calculated data obtained by the six regression models over all 79 proteins.

Model R [a] RMSE [b] Q [c] RMSE [d]
Simple PCR model 0.81 1.17 0.63 1.58
Polynomial PCR model 0.69 1.47 0.58 1.66
Replica PCR model NR [e]
1 0.81 1.17 0.63 1.59
2 0.81 1.17 0.63 1.59
5 0.81 1.17 0.62 1.60
10 0.81 1.17 0.60 1.64
Partial‐replica PCR model NR [e]
1 0.82 1.19 0.63 1.59
2 0.82 1.16 0.63 1.59
5 0.82 3.26 0.62 1.60
10 0.82 1.19 0.60 1.63
Weighted PCR model x NR [e]
0.1 1 0.89 0.87 0.66 1.54
2 0.89 0.87 0.66 1.54
4 0.89 0.87 0.66 1.54
8 0.89 0.87 0.65 1.56
0.2 1 0.89 0.87 0.66 1.54
2 0.89 0.87 0.65 1.55
4 0.89 0.87 0.65 1.55
8 0.89 0.87 0.64 1.57
0.3 1 0.89 0.87 0.65 1.54
2 0.89 0.87 0.65 1.55
4 0.89 0.87 0.65 1.56
8 0.89 0.87 0.64 1.58
0.5 1 0.89 0.87 0.65 1.54
2 0.89 0.87 0.65 1.55
4 0.89 0.87 0.64 1.56
8 0.89 0.87 0.63 1.58
Classified PCR 0.92 0.42 0.71 0.98
Combined PCR 0.92 0.42 0.61 [f] ND [f]

[a] Average correlation coefficient between the experimental and calculated data. [b] Average root mean square deviation (RMSD) error between the experimental and calculated data (kcal/mol). [c] Average correlation coefficient between the experimental and calculated data obtained by the leave‐one‐out (LOO) cross‐validation test. [d] Average RMSD error between the experimental and calculated data obtained by the LOO cross‐validation test (kcal/mol). [e] Number of replicas. [f] In 12 cases out of 97 proteins, the root mean square deviation error (RMSE)>106.

The machine‐learning DSI and MSM methods were score modification methods in which the new docking scores were given by the linear combinations of the protein‐compound docking scores. The previous works reported the AUC values of database enrichment curves obtained by the machine‐learning DSI/MTS methods and the AUC values were about 98 % (in the original works, the AUC values were referred to as q values). When the number of active compounds is much smaller than the number of inactive compounds, the AUC of the database enrichment curve is close to the AUC of the ROC curve, and the data sets of the previous works satisfy the condition (number of actives : number of in‐actives=1 : 1000). Based on Table 2, an AUC of 98 % corresponded to an R of 0.8–0.9. The R values were close to the R values obtained by the weighted PCR models in Table 3. We must note that the data sets used in the previous studies consisted of high affinity compounds (e.g., commercial drugs) as active compounds and decoy compounds. In the present study, all compounds were nearly active, and discriminating between strong and weak active compounds is more difficult than distinguishing highly active and inactive compounds as in the previous reports. Thus, the present validation tests were much more strict than those used in the previous studies. In addition, the previous methods only realized the active/in‐active binary decision in virtual screening. On the other hand, the present methods could evaluate the binding energy value, which is essential in drug design.

The simple PCR model (Model 1) worked well, since the correlation coefficients between the experimental and calculated ΔG values were close to those obtained by the mathematical simulation data. Also, the results obtained by the simple PCR model were close to the averaged correlation coefficient and RMSE values obtained by the docking study (R=0.7 and 2–3 kcal/mol). Considering that the present model did not require the target protein structure, even the simple PCR model should be useful for rough affinity estimation in 21 % of cases in which Q>0.7 out of the 97 target proteins.

For the polynomial PCR model, the polynomial was restricted to the second order, since the high‐order polynomials require many parameters in the regression equation. This trial did not improve the correlation to the experimental data. This suggests that the linear model was sufficient in the present study.

Figure 3 shows the results obtained by the weighted PCR model. In the weighted PCR method, the molecule in the teaching data that was similar to the input molecule was copied multiple times. The similarity was calculated by the docking score in eq. 6, and x=0.1, 0.2, 0.3 and 0.5 were examined. The weighted PCR method with x=0.1 and single or double copies of similar compounds showed the best correlation to the experimental data. This method required the same data as the simple PCR model. In addition to our method, however, it is expected that many other approaches could be taken to generate the replica data. The weighted PCR model should be useful for rough affinity estimation among these models, and Q>0.7 in 37 % of cases in the present study. The simple average of Q over the 97 proteins in Table 2 was about 0.64, and the correlation coefficient of the ΔG value of the whole data was 0.76 (see Figure 3).

Figure 3.

Figure 3

Correlation between the experimental and prediction data for all 97 kinase proteins obtained by the weighted PCR model with x=0.1 and the number of replicas=1. Black dots represent the least‐squares fitting line, and the correlation coefficient is 0.76. The dotted line represents the fitted result.

The experimental data were classified into IC50, Ki and %inhibition data sets, and then the classified PCR method was applied to each. The correlations were improved compared to the other models with unclassified data. We must note that the Q values obtained by the classified PCR model cannot be simply compared to those obtained by the other models, since there were fewer classified experimental data than unclassified data.

The combined PCR method did not improve the correlation between the experimental and calculated ΔG values compared to the simple PCR model. In particular, in some cases, the combined PCR model generated unrealistic ΔG values with extreme computational error (>106 kcal/mol). In 9 cases (9 %), the RMSE values obtained by the combined PCR model were <1 kcal/mol, while only three RMSE values obtained by the weighted PCR model were <1 kcal/mol (3 %). This result suggested that a deep learning method that considers the study from which the data was sourced should work better than the method employed in the present study, and that the data manipulation should be performed carefully while considering the applicable domain of the model.

The replica PCR and partial‐replica PCR models did not work in the present study. The greater the number of replicas, the lower the correlation coefficients became. Since this study was a single trial with simple random noise, the results do not contradict those of the previous works.32,33 The duplication of data should be treated more carefully than it is in the simple duplication.

5.  Conclusions

In order to achieve docking score correction, we developed several QSAR models based on combinations of multiple docking scores by protein‐drug docking simulations and applied heterogeneous public data to these models. The prediction models employed a descriptor‐based PCR, and the compound descriptor was a set of docking scores against many nontarget proteins (DSI, protein‐compound affinity matrix or affinity fingerprint). We tried six variations of the PCR models: simple, polynomial, weighted, classified, replica PCR and combined PCR models.

Even the simple PCR model worked in some cases, but when the assay data were classified into IC50, Ki and %inhibition data, the classified PCR model worked better than the simple PCR model. The linear combination of the QSAR models (combined PCR model) did not improve the results compared to the simple PCR. Although the weighted PCR model was simple, it achieved the same results as the more complex combined PCR model. In general, the weighted PCR model should be easier and more accurate than the other models. In cases in which the original sources of the assay data are easily accessible, the classified/combined PCR models could be used for further analysis with careful data treatment. Although it was difficult to compare the present results to those obtained by previous studies, including studies using the DSI method with a linear combination of docking scores, the comparison of the R and AUC values suggested that the prediction of active compounds by the present models should be comparable to that achieved in the previous study. However, considering the difference of the datasets, the method introduced in the present study should be superior to the previous method. In addition, the present method affords ΔG prediction.

In the present study, the docking scores against target proteins were omitted. The QSAR models could be improved by the addition of the docking scores that were descriptors of the models. Thus, the present models should be effective for the correction of docking scores.

Supporting Information

The compound structures in SDF format and experimental assay data were supplied as described in the supporting information.

Abbreviations

QSAR

quantitative structure‐activity relationship

DSI

docking score index

MSM

machine‐learning score modification

MTS

multiple target screening

LOO

leave‐one‐out

PCA

principal component analysis

PCR

principal component regression

MLR

multilinear regression

AUC

area under the curve

ROC

receiver operating characteristic

Appendix A

Suppose f (={f1, f2, f3,…..}) is a set of all pharmacophores. Each pharmacophore (fi) is virtual and the total number of pharmacophores is infinite. Both the protein pocket and compound can be described by the pharmacophore. Each protein pocket p and compound c is projected into the pharmacophore space. The vector product of c1*c2 for compounds c1 and c2 gives the similarity between the compounds, and the vector product p1*p2 for pockets p1 and p2 should give the similarity between the pockets. Since c and p exist in the same space, the vector product p*c corresponds to the similarity between the compound c and pocket p, and it could correspond to the docking score.

Conflict of Interest

None declared.

Supporting information

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer reviewed and may be re‐organized for online delivery, but are not copy‐edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

Supplementary

Supplementary

Acknowledgements

This work was supported by grants from the National Institute of Advanced Industrial Science and Technology (AIST), the Japan Agency for Medical Research and Development (AMED), and the Ministry of Economy, Trade, and Industry (METI) of Japan.

Y. Fukunishi, S. Yamasaki, I. Yasumatsu, K. Takeuchi, T. Kurosawa, H. Nakamura, Mol. Inf. 2017, 36, 1600013.

References

  • 1. Yang Y., Adelstein S. J., Kassis A. I., Drug Discov. Today. 2009, 14, 147–154. [DOI] [PubMed] [Google Scholar]
  • 2. Wermuth C., J. Med. Chem. 2004, 47, 1304–1314. [Google Scholar]
  • 3. Wermuth C., Drug Discov. Today. 2006, 11, 160–164. [DOI] [PubMed] [Google Scholar]
  • 4. Nettles J. H., Jenkins J. L., Bender A., Deng Z., Davies J. W., Glick M., J. Med. Chem. 2006, 49, 6802–6810. [DOI] [PubMed] [Google Scholar]
  • 5.M. G. Nidhi, M. Glick, J. W. Davis, J. L. Jenkins, J. Chem. Inf. Model 2006, 46, 1124–1133. [DOI] [PubMed]
  • 6. Fukunishi Y., Kubota S., Nakamura H., J. Chem. Inf. Model. 2006, 46, 2071–2084. [DOI] [PubMed] [Google Scholar]
  • 7. Fukunishi Y., Hojo S., Nakamura H., J. Chem. Inf. Comput. Sci. 2006, 46, 2610–2622. [DOI] [PubMed] [Google Scholar]
  • 8. Sheridan R. P., Nam K., Maiorov V. N., McMasters D. R., Cornell W. D., J. Chem. Inf. Model. 2009, 49, 1974–1985. [DOI] [PubMed] [Google Scholar]
  • 9. Yabuuchi H., Niijima S., Takematsu H., Ida T., Hirokawa T., Hara T., Ogawa T., Minowa Y., Tsujimoto G., Okuno Y., Molecular Systems Biology. 2011, 7, 1–12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Niijima S., Shiraishi A., Okuno Y., J. Chem. Inf. Model. 2012, 52, 901–912. [DOI] [PubMed] [Google Scholar]
  • 11. Lounkine E., Keiser M. J., Whitebread S., Mikhailov D., Hamon J., Jenkins J. L., Lavan P., Weber E., Doak A. K., Côté S., Shoichet B. K., Urban L., Nature 2012, 486, 361–367. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Peragovics A., Simon Z., Brandhuber I., Jelinek B., Hari P., Hetenyi C., Czobor P., Malnasi-Csizmadia A., J. Chem. Inf. Model. 2012, 52, 1733–1744. [DOI] [PubMed] [Google Scholar]
  • 13. Simon Z., Peragovics A., Vigh-Smeller M., Csukly G., Tombor L., Yang Z., Zahoranszky-Kohalmi G., Vegner L., Jelinek B., Hari P., Hetenyi C., Bitter I., Czobor P., Malnasi-Csizmadia A., J. Chem. Inf. Model. 2012, 52, 134–145. [DOI] [PubMed] [Google Scholar]
  • 14. Cobanoglu M. C., Liu C., Hu F., Oltvai Z. N., Bahar I., J. Chem. Inf. Model. 2013, 53, 3399–3409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 15. Peragovics A., Simon Z., Tombor L., Jelinek B., Hari P., Czobor P., Malnasi-Csizmadia A., J. Chem. Inf. Model. 2013, 53, 103–113. [DOI] [PubMed] [Google Scholar]
  • 16. Perez-Nueno V. I., Ritchie D. W., J. Chem. Inf. Model. 2011, 51, 1233–1248. [DOI] [PubMed] [Google Scholar]
  • 17. Perez-Nueno V. I., Karaboga A. S., Souchet M., Ritchie D. W., J. Chem. Inf. Model. 2014, 54, 720–734. [DOI] [PubMed] [Google Scholar]
  • 18. Engin H. B., Keskin O., Nussinov R., Gursoy A., J. Chem. Inf. Model. 2012, 52, 2273–2286. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Pan Y., Chemg T., Wang Y., Bryant S. H., J. Chem. Inf. Model. 2014, 54, 407–418. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20. Tang J., Szwajda A., Shakyawar S., Xu T., Hintsanen P., Wennerberg K., Aittokallio T., J. Chem. Inf. Model. 2014, 54, 735–743. [DOI] [PubMed] [Google Scholar]
  • 21. Sheridan R. P., Chem. Inf. Model. 2014, 54, 1083–1092. [DOI] [PubMed] [Google Scholar]
  • 22. Lindh M., Svensson F., Schaal W., Zhang J., Skold C., Brandt P., Karlen A., J. Chem. Inf. Model. 2015, 50, 343–353. [DOI] [PubMed] [Google Scholar]
  • 23. Wang Y., Xiao J., Suzek T. O., Zhang J., Bryant S. H., Nucleic Acids Res. 2009, 37, W623–W633. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Gaulton A., Bellis L. J., Bentro A. P., Chambers J., Davies M., Hersey A., Light Y., McGlinchey S., Michalovich D., Al-Lazikani B., Overington P., Nucleic Acids Res. 2011, 40, D1100–D1107. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25. van Linden O. P. J., Kooistra A. J., Leurs R., de Esch I. P., de Graal C., J. Med. Chem. 2014, 57, 249–277. [DOI] [PubMed] [Google Scholar]
  • 26. Anastassiadis T., Deacon S. W., Devarajan K., Ma H., Peterson J., Nature Biotechnology. 2011, 29, 1039–1046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27. Davis M. I., Hunt J. P., Herrgard S., Ciceri P., Wodicka L. M., Pallares G., Hocker M., Treiber D. K., Zarrinkar P. P., Nature Biotechnology. 2011, 29, 1046–1052. [DOI] [PubMed] [Google Scholar]
  • 28. Fukunishi Y., Mikami Y., Takedomi K., Yamanouchi M., Shima H., Nakamura H., J. Med. Chem. 2006, 49, 523–533. [DOI] [PubMed] [Google Scholar]
  • 29. Fukunishi Y., Kubota S., Nakamura H., J. Mol. Graph. Model. 2007, 25, 633–643. [DOI] [PubMed] [Google Scholar]
  • 30. Masuda M., Kaneko H., Funatsu K., Ind. Eng. Chem. Res. 2014, 53, 8553–8564. [Google Scholar]
  • 31. Fukunishi Y., Mikami Y., Nakamura H., J. Mol. Graph. Model. 2005, 24, 34–45. [DOI] [PubMed] [Google Scholar]
  • 32. Steinmetz F. P., Madden J. C., Cronin M. T. D., J. Chem. Inf. Model. 2015, 55, 1739–1746. [DOI] [PubMed] [Google Scholar]
  • 33. Cortes-Ciriano I., Bender A., J. Chem. Inf. Model. 2015, 55, 2682–2692. [DOI] [PubMed] [Google Scholar]
  • 34. Cheng Y., Prusoff W. H., Biochem. Pharmacol. 1973, 22, 3099–3108. [DOI] [PubMed] [Google Scholar]
  • 35. Kitagawa D., Yokota K., Gouda M., Narumi Y., Ohmoto H., Nishiwaki E., Akita K., Kirii Y., Genes Cells. 2013, 18, 110–1022. [DOI] [PubMed] [Google Scholar]
  • 36. Knight Z. A., Shokat K. M., Chem. Biol. 2005, 12, 621–637. [DOI] [PubMed] [Google Scholar]
  • 37. Kufareva I., Abagyan R., J. Med. Chem. 2008, 51, 7921–7932. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.D. A. Case, T. A. Darden, T. E. Cheatham III, C. L. Simmerling, J. Wang, R. E. Duke, R. Luo, K. M. Merz, B. Wang, D. A. Pearlman, M. Crowley, S. Brozell, V. Tsui, H. Gohlke, J. Mongan, V. Hornak, G. Cui, P. Beroza, C. Schafmeister, J. W. Caldwell, W. S. Ross, P. A. Kollman, AMBER 8, UCSF, 2004.
  • 39. Fukunishi Y., Mikami Y., Nakamura H., J. Phys. Chem. B. 2003, 107, 13201–13210. [Google Scholar]
  • 40. Wang J., Wolf R. M., Caldwell J. W., Kollman P. A., Case D. A., J. Compt. Chem. 2004, 25, 1157–1174. [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

As a service to our authors and readers, this journal provides supporting information supplied by the authors. Such materials are peer reviewed and may be re‐organized for online delivery, but are not copy‐edited or typeset. Technical support issues arising from supporting information (other than missing files) should be addressed to the authors.

Supplementary

Supplementary


Articles from Molecular Informatics are provided here courtesy of Wiley

RESOURCES