Skip to main content
Frontiers in Pharmacology logoLink to Frontiers in Pharmacology
. 2018 Oct 10;9:1136. doi: 10.3389/fphar.2018.01136

Comparison of Quantitative and Qualitative (Q)SAR Models Created for the Prediction of Ki and IC50 Values of Antitarget Inhibitors

Alexey A Lagunin 1,2,*, Maria A Romanova 2, Anton D Zadorozhny 2, Natalia S Kurilenko 2, Boris V Shilov 2, Pavel V Pogodin 1, Sergey M Ivanov 1,2, Dmitry A Filimonov 1, Vladimir V Poroikov 1,*
PMCID: PMC6192375  PMID: 30364128

Abstract

Estimation of interaction of drug-like compounds with antitargets is important for the assessment of possible toxic effects during drug development. Publicly available online databases provide data on the experimental results of chemical interactions with antitargets, which can be used for the creation of (Q)SAR models. The structures and experimental Ki and IC50 values for compounds tested on the inhibition of 30 antitargets from the ChEMBL 20 database were used. Data sets with Ki and IC50 values including more than 100 compounds were created for each antitarget. The (Q)SAR models were created by GUSAR software using quantitative neighborhoods of atoms (QNA), multilevel neighborhoods of atoms (MNA) descriptors, and self-consistent regression. The accuracy of (Q)SAR models was validated by the fivefold cross-validation procedure. The balanced accuracy was higher for qualitative SAR models (0.80 and 0.81 for Ki and IC50 values, respectively) than for quantitative QSAR models (0.73 and 0.76 for Ki and IC50 values, respectively). In most cases, sensitivity was higher for SAR models than for QSAR models, but specificity was higher for QSAR models. The mean R2 and RMSE were 0.64 and 0.77 for Ki values and 0.59 and 0.73 for IC50 values, respectively. The number of compounds falling within the applicability domain was higher for SAR models than for the test sets.

Keywords: QSAR, antitarget, inhibition, adverse drug reactions, Ki, IC50, GUSAR, ChEMBL

Introduction

Adverse drug reactions (ADRs) are one of the main problems in drug discovery and clinical practice (Böhm and Cascorbi, 2016). According to some estimates, ADR is one of the leading causes of hospitalization and death in developed countries (Starfield, 2000; Kochanek et al., 2016), the second most common cause of drug attrition in later stages of clinical trials and the major reason for drug withdrawal from the market (Hornberg et al., 2014). This situation is largely due to disadvantages of traditional animal toxicological experiments and clinical trials that cannot detect all serious ADRs because of inter-species differences and their idiosyncratic nature. Therefore, additional methods including in vitro and in silico approaches are currently being developed. In silico approaches are usually based on machine learning techniques and network analyses to link several chemical and biological features of approved and withdrawn drugs to ADRs, which include molecular descriptors, known or predicted drug targets, drug-induced gene expression profiles and cell phenotypic features (Ivanov et al., 2016). These approaches allow predict dangerous ADRs in the early stages of drug development and provide insights into potential toxic mechanisms of drug candidates. It is currently accepted that the most ADRs are the consequence of unintended interactions of drugs with human protein targets and are not related to a therapeutic mechanism of action. For example, blocking HERG potassium channels in the heart causes life-threatening arrhythmias (Siramshetty et al., 2016). There are dozens of human proteins that have known relationships to ADRs, and corresponding information has accumulated in public databases (Ji et al., 2003; Zhang et al., 2007) and been described in some publications (Whitebread et al., 2005; Bowes et al., 2012). These proteins are called “antitargets” because to avoid dangerous ADRs, they should not interact with drugs. Many pharmaceutical companies use in vitro assays to measure interactions of lead compounds with “antitargets” and select the least promiscuous ones for further development. To avoid performing hundreds of experiments, such interactions can also be predicted using ligand-based structure-activity relationship analysis or docking (Ivanov et al., 2016; Simões et al., 2018). Due to accumulation of data on chemical-protein interactions and three-dimensional protein structures in public databases such as ChEMBL (Gaulton et al., 2017), PubChem (Wang et al., 2017), and PDB (Berman et al., 2000), it has become possible to predict interactions with many hundreds of human proteins, including “antitargets." There are plenty of published (Q)SAR models (Poroikov et al., 2007; Filz et al., 2008; García-Sosa and Maran, 2014; Ivanov et al., 2016) and free available web-services (Zakharov et al., 2012; Braga et al., 2015) that may perform such predictions; however, no study was found with a comparison between the accuracy of classification (SAR) and quantitative (QSAR) models created based on the same data, descriptors and mathematical algorithm. The aim of this work is the creation, validation, and accuracy estimation of SAR and QSAR models for the prediction of the inhibition of 30 antitargets using GUSAR software and data on structures and Ki and IC50 values of tested compounds from the ChEMBL 20 database. Earlier, we published a study on the creation of reasonable QSAR models by GUSAR software and the appropriate web service1 for the prediction of interaction between drug-like compounds and 18 antitargets (Zakharov et al., 2012). In this paper, we have significantly expanded the list of covered “antitargets" and significantly increased the volumes and diversity of training samples, which allowed us to expand the range of applicability of models and to obtain valuable results.

Materials and Methods

Data Sets

Structures and experimental Ki and IC50 values of compounds tested on the inhibition of 30 antitargets were extracted from the ChEMBL 20 database. The data sets with Ki and IC50 values including more than 100 compounds were created for each antitarget (Table 1). Only the records with Ki or IC50 values in nM and symbol “ = ” in the field “Relation” were extracted from ChEMBL database. During the creation of data sets of compounds interacting with receptors, we included records with compounds studied as truly antagonists and records with compounds studied on biding affinity because of we could not divided them. In spite of Ki and IC50 values indicate the affinity of a compound by a given receptor, and they do not necessarily provide functional information related with agonism or antagonism of a compound to such target we decided to include such data because antagonism of receptors may be related with Ki and IC50 values, whereas agonism to receptors are usually represented by EC50 values. Ki or IC50 values were transformed in pIC50 = −log10(IC50(M)) and pKi = −log10(Ki(M)) values. Table 1 also shows the known relations between the inhibition of antitargets and ADRs. The number of compounds with Ki values was approximately 1.5 times higher than that for IC50 values (46830 and 29678, respectively). The sets included structures of single electroneutral small (molecular weight in range from 50 to 1250 Da) organic molecules. In general, such representation of structure corresponds to the best QSAR practice (Fourches et al., 2016) implemented in the GUSAR software, which was used in our study (see below). If a compound had several experimental values for the parameter, then a median value was used. Such median values were calculated because the reference compounds usually had several experimental values, since they were tested in many experiments. Deleting such compounds reduces an important part of chemical space and significantly restricts the applicability domain of the global QSAR models. In several publications related to the creation of global QSAR models based on heterogeneous data, authors used average values (Politi et al., 2014; Cortes-Ciriano and Bender, 2015). The median value was used because it better characterizes the set of values for strongly skewed distributions. Zip file including SD files related with the appropriate target (the gene name of targets is used in a file name), and endpoint is provided in Supplementary Materials. Each SD file includes structures, ChEMBL_ID, and experimental values. For classification models and comparison of prediction results between the SAR and QSAR models, 1 μM was used as a threshold between active and inactive compounds. The sets were sorted by the ascending mode of the appropriate values. Then, successively, a number from 1 to 5 was assigned for each structure from a set. After that, the sets were divided into five unique parts according to the assigned number of structures. These parts were used for the fivefold cross-validation (fivefold CV) procedure, when each unique part was used as an external test set, and the remaining parts were used as a training set. As a result, different five training and five external test sets for Ki data and five training and five external test sets for IC50 data, including both quantitative and qualitative descriptions, were created for each antitarget.

Table 1.

Data related with antitargets and the number of compounds with Ki and IC50 values in data sets.

Target UniProt ID Chembl Target ID Ki IC50 Effects at antagonism or inhibition
Acetylcholinesterase P22303 CHEMBL220 272 2573 ↓ BP; ↓ HR; ↑ GI motility (↓ at high doses); bronchoconstriction; ↑ respiratory secretions; anaphylaxis; anorexiant; arrhythmogenic; asystole; colic; diarrhea; emetic; gastrointestinal hemorrhage; headache; hypotension; muscle weakness; nausea; neurotoxic; nightmare; respiratory failure; sialorrhea; sweating; ulcer, gastric; urticaria
Adenosine receptor A2a P29274 CHEMBL251 3258 213 Platelet aggregation; ↑ BP; nervousness (tremors, agitation); arousal; insomnia
Alpha-1A adrenergic receptor P35348 CHEMBL229 942 100 ↓ smooth muscle tone; orthostatic hypotension and ↑ HR; dizziness; impact on various aspects of sexual function; flushing; hypotension; impotence; nasal congestion; postural (orthostatic) hypotension; tachycardia; weakness
Alpha-2A adrenergic receptor P08913 CHEMBL1867 557 201 ↑ GI motility; ↑ insulin secretion; hypertension exacerbates heart failure; anxiety; depression
Beta-1 adrenergic receptor P08588 CHEMBL213 278 512 ↓ BP; ↓ HR; ↓ cardiac output; cardiotoxicity; heart failure
Beta-2 adrenergic receptor P07550 CHEMBL210 352 472 ↓ BP; increased bronchospasm
Androgen receptor P10275 CHEMBL1871 631 1054 ↓ spermatogenesis; impotence; gynecomastia, mastodynia; ↑ in breast carcinoma
Muscarinic acetylcholine receptor M1 P11229 CHEMBL216 635 544 ↓ cognitive function; ↓ gastric acid secretion; blurred vision
Muscarinic acetylcholine receptor M2 P08172 CHEMBL211 799 422 Tachycardia; bronchoconstriction; tremors
Muscarinic acetylcholine receptor M3 P20309 CHEMBL245 644 606 Constipation; blurred vision; pupil dilation; dry mouth
Cannabinoid receptor 1 P21554 CHEMBL218 1998 904 ↑ weight loss; emesis; depression
Cannabinoid receptor 2 P34972 CHEMBL253 2375 592 ↑ inflammation; ↓ bone mass
D(1A) dopamine receptor P21728 CHEMBL2056 681 106 Dyskinesia; parkinsonian symptoms (tremors); anti-emetic effects; depression; anxiety; suicidal intent
D(2) dopamine receptor P14416 CHEMBL217 3946 431 Orthostatic hypotension; drowsiness; ↑ GI motility; dyskinesia; extrapyramidal effect; sedative
Endothelin-1 receptor P25101 CHEMBL252 155 894 Teratogenicity
Histamine H1 receptor P35367 CHEMBL231 753 264 Sedation; ↓ allergic responses; ↑ body weight; dizziness; extrapyramidal effect; hypnotic; hypotension; lassitude; tinnitus; xerostomia
5-hydroxytryptamine receptor 1A P08908 CHEMBL214 2505 432 Anxiogenic
5-hydroxytryptamine receptor 1B P28222 CHEMBL1898 662 266 ↑ aggression
5-hydroxytryptamine receptor 2A P28223 CHEMBL224 1768 659 hypnotic; sedative
5-hydroxytryptamine receptor 2B P41595 CHEMBL1833 705 248 Possible cardiac effects, especially during embryonic development
Potassium voltage-gated channel subfamily H member 2 Q12809 CHEMBL240 935 4078 Prolongation of QT interval of ECG
Tyrosine-protein kinase Lck P06239 CHEMBL258 364 1322 T cell inhibition; SCID-like immunodeficiency
Amine oxidase [flavin-containing] A P21397 CHEMBL1951 342 1031 ↑ BP when combined with amines such as tyramine; drug–drug interaction potential; dizziness; sleep disturbances; nausea
Neuropeptide Y receptor type 1 P25929 CHEMBL4777 321 304 Anxiogenic
Glucocorticoid receptor P04150 CHEMBL2034 632 1086 Hypoglycemia
Delta-type opioid receptor P41143 CHEMBL236 1603 534 ↑ BP; ↑ cardiac contractility
Mu-type opioid receptor P35372 CHEMBL233 1816 663 ↑ GI motility; dyspepsia; flatulence
Sodium-dependent noradrenaline transporter P23975 CHEMBL222 1346 1371 ↑ HR; ↑ BP; ↑ locomotor activity; constipation; abuse potential
Sodium-dependent dopamine transporter Q01959 SLC6A3 1195 1183 Addictive psychostimulation; dopaminergic hyperactivity; depression; parkinsonism; attention deficit–hyperactivity disorder; psychotic disorders; seizures; dystonia; dyskinesia; acne
Sodium-dependent serotonin transporter P31645 CHEMBL228 1868 1938 ↑ GI motility; ↓ upper GI transit; ↓ plasma renin; ↑ other serotonin-mediated effects; insomnia; anxiety; nausea; sexual dysfunction

BP, blood pressure; ECG, electrocardiogram; GI, gastrointestinal; HR, heart rate; SCID, severe-combined immunodeficiency.

GUSAR Software

The (Q)SAR models were created by GUSAR software2, which used quantitative neighbourhoods of atoms (QNA), multilevel neighbourhoods of atom (MNA), and whole-molecule descriptors with self-consistent regression (Lagunin et al., 2007; Filimonov et al., 2009; Lagunin et al., 2011). QNA descriptors are calculated by two functions, P and Q. The values for P and Q for each atom i are calculated as:

Pi=Bik(Exp(12C))ikBk,
Qi=Bik(Exp(12C))ikBkAk,

where k is all other atoms in the molecule and

Ak=12(IPk+EAk),Bk=(IPkEAk)12

Here, IP is the ionization potential, EA is the electron affinity for each atom, and C is the connectivity matrix for the molecule. QNA descriptors describe each particular atom of a molecule; at the same time, each P or Q value depends on the total molecule composition and structure. Two-dimensional Chebyshev polynomials are used for approximating the functions P and Q over all atoms of the molecule. A detailed description of QNA descriptors is represented in the publication of Filimonov et al. (2009).

MNA descriptors (Filimonov et al., 1999) are based on the molecular structure representation, which includes hydrogens according to the valences and partial charges of other atoms and does not specify the types of bonds. MNA descriptors are generated as a recursively defined sequence:

  • simple •

    zero-level MNA descriptor for each atom is the mark A of the atom itself;

  • simple •

    any next-level MNA descriptor for the atom is the sub-structure notation A (D1D2...Di...),

where Di is the previous-level MNA descriptor for i–th immediate neighbor of the atom A.

The mark of the atom may include not only the atomic type but also any additional information about the atom. In particular, if the atom is not included in the ring, it is marked by “-”. The neighbor descriptors D1D2...Di... are arranged in a unique manner, for example, in lexicographic order. The iterative process of MNA descriptors generation can be continued covering first, second, and so on, neighborhoods of each atom.

For regression analysis, this molecule structure representation was transformed using the original PASS (Prediction of Activity Spectra for Substances) algorithm (Lagunin et al., 2011). This algorithm estimates the biological activity profiles for chemical compounds using MNA descriptors as input parameters. Therefore, we used the results of PASS prediction as independent variables for regression analysis. The results of PASS prediction are given as a list of biological activities, for which the difference between probabilities of being active (Pa) and inactive (Pi) was calculated. The activities from the list of predicted biological activities were randomly selected as input independent variables for regression analysis. This allows obtaining different QSAR models. GUSAR incorporates a PASS version that predicts 4130 types of biological activity. This version of PASS has a mean prediction accuracy of approximately 95% calculated by leave-one-out cross-validation procedure (Filimonov et al., 2014). The list of predictable biological activities currently includes 501 pharmacotherapeutic effects (e.g., antihypertensive, hepatoprotectant, and nootropic), 3295 mechanisms of action (e.g., 5-hydroxytryptamine antagonist, acetylcholine M1 receptor agonist, and cyclooxygenase inhibitor), 57 adverse and toxic effects (e.g., carcinogenic, mutagenic, and hematotoxic), 199 metabolic terms (e.g., CYP1A inducer, CYP1A1 inhibitor, and CYP3A4 substrate), 49 transporter proteins (e.g., P-glycoprotein 3 inhibitor, nucleoside transporters inhibitors, and proline transporter inhibitor), and 29 activities related to gene expression (e.g., TH expression enhancer, TNF expression inhibitor, and VEGF expression inhibitor). Therefore, the maximum number of independent variables for the creation of MNA models is 4130. The detailed description of realization of PASS in GUSAR is represented in the publication of Lagunin et al. (2011).

QNA and MNA descriptors do not provide information on the shape and volume of a molecule, although this information may be important for determination of structure-activity relationships. Therefore, these parameters, which are called whole-molecule descriptors, are also used in GUSAR. The whole-molecule descriptors used in GUSAR are: topological length, topological volume, lipophilicity, number of positive charges, number of negative charges, number of hydrogen bond acceptors, number of aromatic atoms, molecular weight, and number of halogen atoms. GUSAR uses estimation of the applicability domain based on different types of structural similarity using calculation of QNA and MNA descriptors (Zakharov et al., 2016).

GUSAR may provide an equation of any single (Q)SAR model (Lagunin et al., 2011). But because we used consensus (Q)SAR models from dozens or even hundreds of single (Q)SAR models, it is not possible to provide a general equation describing all selected variables. By this reason, the created consensus (Q)SAR models could not provide information about positive and negatively influencing descriptors. Instead that GUSAR shows positive and negative impact of each atom of the structure in the predicted value (Khayrullina et al., 2015). Analysis of the influence of atoms on the predicted value and the search for general relationships between the structures of active compounds interacting with antitargets is a separate task (because of each structure in the set should be analyzed), and it is beyond the scope of this publication.

Evaluation of Prediction Accuracy

The following statistical parameters were calculated for estimating the accuracy of prediction:

  • simple (1)

    Sensitivity (Sens):

    Sensitivity = TPFN+TP, where TP is true positive, and FN is false negative numbers.

  • simple (2)

    Specificity (Spec):

    Specificity = TNTN+FP, where TN is true negative, and FP is false positive numbers.

  • simple (3)
    Accuracy:
    Accuracy=TP+TNTP+TN+FP+FN
  • simple (4)
    Balanced accuracy (BA): balance between sensitivity and specificity:
    BA=Sensitivity+Specificity2
  • simple (5)
    Root mean square error (RMSE):
    RMSE=(yexpypred)2n
  • simple (6)
    R-squared, coefficient of determination:
    R2=1(yexpypred)2(yexpymean)2,

where yexp – experimental value, ypred – predicted value, and ymean – average value of experimental values in a training set.

Y-Randomization Procedure

Y-Randomization procedure is included in GUSAR software and allows to be ensuring that the developed continues QSAR models are robust and do not have the over fitting (Wold and Eriksson, 1995). In this procedure, the dependent-variable vector, Y vector (Ki or IC50 values in our case), is randomly shuffled and a new QSAR model is developed using the original independent variable matrix. It is expected that the resulting models should generally have low Q2 values. This procedure was repeated five times for each model, and then the average Q2 value was calculated.

Results and Discussion

Three hundred twenty SAR and 320 QSAR models with modified calculation of descriptors and regression coefficients were created by GUSAR software for each from five training sets (five training sets with qualitative and quantitative data for Ki or IC50 values for each target) with internal validation (five times 20% from the training set was randomly used as an internal test set; this procedure is included into GUSAR). As a result, one consensus SAR model and one consensus QSAR model were created for each training set based on the appropriate single (Q)SAR model with R2train and Q2train and average R2 calculated for internal validation sets more than 0.5. If R2 of internal validation for (Q)SAR model was less than 0.5, then the model was excluded from the final consensus model [excluding QSAR models for D(1A) and D(2) dopamine receptors, histamine H1 and 5-hydroxytryptamine 2B receptors created on the basis of IC50 data]. The final predicted values for tested compounds were calculated using a weighted average of the predictions from the obtained (Q)SAR models. Each model is based on a different set of descriptors, and its predictions for each compound were weighted according to the similarity value that was calculated during the applicability domain assessment.

After SAR and QSAR consensus models were created based on a training set, they were used for prediction of inhibition of the antitarget by compounds from the appropriate external test set. It was repeated for five training sets with Ki values and five training sets with IC50 values for each antitarget (fivefold CV procedure). The average characteristics of the created (Q)SAR models including average results of Y-randomization procedure (Q2Y–rand) are represented in Supplementary Tables S1, S2. It was appeared that all Q2Y–rand values for all QSAR models were less 0.15. The average Q2Y–rand values were from 0.026 to 0.06 and from 0.026 to 0.078 for QSAR models created based on Ki and IC50 data, respectively. It is significant less in comparison with Q2 values calculated based on original data of the training sets and displays robustness of the given models.

The plots between predicted and experimental values for the best and worst QSAR models by RMSE values calculated by fivefold cross-validation are displayed in Figure 1. The relations between predicted and experimental values for others QSAR models are within these extreme cases.

FIGURE 1.

FIGURE 1

Plots of predicted and experimental values for the best and worst QSAR models by RMSE values calculated during fivefold cross-validation procedure. (A) QSAR model for prediction of pIC50 values of compounds interacting with glucocorticoid receptor (the best QSAR model for IC50 values). (B) QSAR model for prediction of pIC50 values of compounds interacting with D(2) dopamine receptor (the worst QSAR model for IC50 values). (C) QSAR model for prediction of pKi values of compounds interacting with HERG channel (Potassium voltage-gated channel subfamily H member 2) (the best QSAR model for Ki values). (D) QSAR model for prediction of pKi values of compounds interacting with Beta-2 adrenergic receptor (the worst QSAR model for Ki values).

The statistical parameters describing accuracy of prediction and mentioned in the section “Materials and Methods” were calculated based on the prediction results given during the fivefold CV procedure for both SAR and QSAR models. To compare the accuracy of prediction of QSAR and SAR models, the quantitative results of prediction were transformed into qualitative ones according to the threshold mentioned in the section “Materials and Methods.” Statistical parameters of accuracy of prediction for SAR and QSAR models created based on Ki and IC50 data for all antitargets are represented in Supplementary Tables S3, S4, respectively. The graphical representation of statistical parameters of accuracy and their comparison are represented in Figures 2–4.

FIGURE 2.

FIGURE 2

Comparison of parameters of accuracy of prediction for SAR and QSAR models calculated by the fivefold cross-validation procedure for all antitargets. (A) Comparison of Accuracy (Acc.) and Balanced Accuracy (Bal. Acc.) between SAR and QSAR models for Ki data. (B) Comparison of Sensitivity (Sens.) and Specificity (Spec.) between SAR and QSAR models for Ki data. (C) Comparison of Accuracy (Acc.) and Balanced Accuracy (Bal. Acc.) between SAR and QSAR models for IC50 data. (D) Comparison of Sensitivity (Sens.) and Specificity (Spec.) between SAR and QSAR models for IC50 data.

Figures 2A,B show a comparison of the accuracy between SAR and QSAR models created based on Ki values. Figures 2C,D show the results given based on IC50 values. The accuracy of the QSAR models was higher in most cases than the accuracy of SAR models for both Ki and IC50 values (Figures 2A, 1C). The mean accuracy of prediction for Ki values was 0.84 and 0.87 for SAR and QSAR models, respectively. This is statistically significant difference (p < 0.05). The mean accuracy of prediction for IC50 values was 0.82 and 0.83 for SAR and QSAR models, respectively. This is statistically insignificant difference (p = 0.285). The reverse result was observed for balanced accuracy (SAR models: Ki data – 0.80, IC50 data – 0.81; QSAR models: Ki data – 0.73, IC50 data – 0.76). The difference in balanced accuracy between SAR and QSAR models is statistically significant in both cases, for Ki and for IC50 values (p < 0.05). Specificity and sensitivity were similar for SAR and QSAR models (Figures 2B, 1D). The mean value of specificity was higher for QSAR models for both Ki and IC50 data (SAR models: Ki data – 0.76, IC50 data – 0.79; QSAR models: Ki data – 0.95, IC50 data – 0.90). The mean value of sensitivity was higher for SAR models for both Ki and IC50 data (SAR models: Ki data – 0.84, IC50 data – 0.82; QSAR models: Ki data – 0.50, IC50 data – 0.61).

The analysis of values of accuracy and balanced accuracy of SAR and QSAR models (Supplementary Tables S1, S2) shows that there is a correlation between them. Figures 3A,B show a correlation between accuracy and balanced accuracy for both SAR and QSAR models created based on Ki data. Figures 3C,D show a correlation between accuracy and balanced accuracy for SAR and QSAR models created based on IC50 data. One may see that in the both cases, the correlation between accuracy of SAR and QSAR models was higher than for balanced accuracy (Figure 3). If the values correlate, it means that there is no preference between SAR and QSAR models for the appropriate criterion of accuracy. But similar accuracy is achieved by different ways in the most cases (high sensitivity or high specificity, see Figures 2B,D). One can decide what is more important in the study: find as many as possible active compounds (the models with highest sensitivity should be selected) or reduce the number of false positive prediction (the models with highest specificity should be selected). The absence of correlation between the studied parameters shows that one of methods has preference. The values above the line show that QSAR models better than SAR ones. The values below the line show that SAR models better than QSAR ones. All cases excluding one which is displayed in Figure 3C (Correlation of Accuracy between SAR and QSAR models for IC50 data) had statistically significant difference between the values of SAR and QSAR models (p < 0.05). The values of balanced accuracy is the most important criterion for estimation of accuracy of prediction because of many used datasets were unbalanced (the number of active and inactive compounds is significant different). Therefore, the given results showed that SAR models are the more preferable for the use of prediction of drug adverse reactions.

FIGURE 3.

FIGURE 3

Correlation of accuracy of prediction between SAR and QSAR models for all antitargets. (A) Correlation of Accuracy between SAR and QSAR models for Ki data. (B) Correlation of Balanced Accuracy (BA) between SAR and QSAR models for Ki data. (C) Correlation of Accuracy between SAR and QSAR models for IC50 data. (D) Correlation of Balanced Accuracy (BA) between SAR and QSAR models for IC50 data.

The other parameters of SAR and QSAR models are represented in Figure 4. Figure 4A shows the percent of compounds in applicability domain (AD) of SAR and QSAR models. The number of compounds in AD was 100% approximately for all SAR models. At the same time, the number of compounds in AD approximately for all QSAR models was less 100%. The mean value of percent of compound in AD for SAR and QSAR models was 99.9% and 98.6%, respectively. The highest present of compounds in applicability domain displays advantage and better predictive power for SAR models in comparison with QSAR models. Figure 4B shows the comparison of RMSE and R2 values for QSAR models created on Ki and IC50 data. Clear features of distribution of these characteristics cannot be seen, but in general, the mean value of R2 for QSAR models based on Ki data was higher than one for IC50 data (0.64 and 0.57, respectively). The mean RMSE value for QSAR models based on IC50 data was less than one for Ki data (0.73 and 0.77, respectively). However, if we delete the RMSE value for the QSAR model created based on Ki data for the beta-2 adrenergic receptor, the mean RMSE value also became 0.73 for the other QSAR models created based on Ki data. It means that both Ki and IC50 values can be reliably used to predict interactions with antitargets. We may compare (Q)SAR models based on Ki and IC50 values only in general view because of they were created on different number of compounds and different structures. Nevertheless, we may reveal some features of the created models. The plots with comparison of Specificity and Sensitivity of (Q)SAR models created based on Ki and IC50 data are shown on Supplementary Figure S1. These plots display that SAR models based on IC50 values have Specificity better than SAR models based on Ki data for approximately half of antitargets. The biggest difference is shown for Mu-type opioid receptor (0.34 for Ki data and 0.97 for IC50 data). SAR models based on Ki data for others antitargets have better values of Specificity. The same picture we can see for Sensitivity of SAR models. Analysis of QSAR models revealed that majority of QSAR models based on Ki data had better Specificity value, whereas majority of QSAR models based on IC50 data had better Sensitivity value. High value of Sensitivity is more important for revealing possible adverse drug reaction than high value of Sensitivity. Analysis of Accuracy and Balanced Accuracy of (Q)SAR based on IC50 and Ki data (Supplementary Figure S2) show that the most (Q)SAR models based on Ki values have better values, whereas the values of Balanced Accuracy are higher at the most of QSAR models based on IC50 values.

FIGURE 4.

FIGURE 4

Comparison of quality of (Q)SAR models for Ki and IC50 data for all antitargets. (A) Comparison of the percent of compounds in applicability domain (AD) for SAR and QSAR models; (B) Comparison of R2 and RMSE values of QSAR models.

Conclusion

The creation of SAR and QSAR models based on the same data of compounds tested as inhibitors of 30 antitargets revealed some features related to the use of qualitative and quantitative data. They are valid to (Q)SAR models related to both Ki and IC50 values. SAR models tended to have more balanced prediction results when specificity and sensitivity have the closest values in comparison with QSAR models (Figure 2). High values of specificity and low values of sensitivity in QSAR models may be explained by the fact that at the given R2 values (0.64 and 0.59), prediction results tended to lie closer to the average values of Ki or IC50 in the training set. If a threshold of 1 μM divided the training set into different proportions of active and inactive compounds, then a difference between specificity and sensitivity may occur. At the same time, despite the difference of specificity and sensitivity between SAR and QSAR models, the values of accuracy and balanced accuracy for SAR correlated with those of QSAR models (Figure 3). This indicated that the prediction results of SAR and QSAR models would complement each other and that the use of both approaches would improve the quality of assessment of interaction between ligands and antitargets.

Another conclusion is that SAR models had advantages in the applicability domain. It may be related to the fact that the use ofqualitative data gives SAR models less sensitivity to experimental errors in Ki and IC50 values.

In this study, we also displayed that the modern experimental data and methods of (Q)SAR modeling allow for the creation of rather reasonable (Q)SAR models for prediction of interaction between compounds and dozens of antitargets. The used approaches may be applied to the creation of in silico panels for estimation of “ligand-antitarget” interactions during the drug design process.

Author Contributions

AL designed the study, performed the data analysis, and wrote the manuscript with inputs of all authors. MR, AZ, NK, and BS created and validated (Q)SAR models. PP and SI created datasets and data analysis. DF and VP designed the study, analyzed the results, and wrote the manuscript.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding. This work was supported by Russian Science Foundation Grant No. 14-15-00449.

Supplementary Material

The Supplementary Material for this article can be found online at: https://www.frontiersin.org/articles/10.3389/fphar.2018.01136/full#supplementary-material

References

  1. Berman H. M., Westbrook J., Feng Z., Gilliland G., Bhat T. N., Weissig H., et al. (2000). The protein data bank. Nucleic Acids Res. 28 235–242. 10.1093/nar/28.1.235 [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Böhm R., Cascorbi I. (2016). Pharmacogenetics and predictive testing of drug hypersensitivity reactions. Front. Pharmacol. 7:396 10.3389/fphar.2016.00396 [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bowes J., Brown A. J., Hamon J., Jarolimek W., Sridhar A., Waldron G., et al. (2012). Reducing safety-related drug attrition: the use of in vitro pharmacological profiling. Nat. Rev. Drug Discov. 11 909–922. 10.1038/nrd3845 [DOI] [PubMed] [Google Scholar]
  4. Braga R. C., Alves V. M., Silva M. F., Muratov E., Fourches D., Lião L. M., et al. (2015). Pred-hERG: a novel web-accessible computational tool for predicting cardiac toxicity. Mol. Inform. 34 698–701. 10.1002/minf.201500040 [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Cortes-Ciriano I., Bender A. (2015). Improved chemical structure-activity modeling through data augmentation. J. Chem. Inf. Model. 55 2682–2692. 10.1021/acs.jcim.5b00570 [DOI] [PubMed] [Google Scholar]
  6. Filimonov D., Poroikov V., Borodina Y., Gloriozova T. (1999). Chemical similarity assessment through multilevel neighborhoods of atoms: definition and comparison with the other descriptors. J. Chem. Inf. Comput. Sci. 39 666–670. 10.1021/ci980335o [DOI] [Google Scholar]
  7. Filimonov D. A., Lagunin A. A., Gloriozova T. A., Rudik A. V., Druzhilovskii D. S., Pogodin P. V., et al. (2014). Prediction of the biological activity spectra of organic compounds using the PASS Online web resource. Chem. Heterocycl. Compd. 50 444–457. 10.1007/s10593-014-1496-1 [DOI] [Google Scholar]
  8. Filimonov D. A., Zakharov A. V., Lagunin A. A., Poroikov V. V. (2009). QNA-based ’Star Track’ QSAR approach. SAR QSAR Environ. Res. 20 679–709. 10.1080/10629360903438370 [DOI] [PubMed] [Google Scholar]
  9. Filz O., Lagunin A., Filimonov D., Poroikov V. (2008). Computer-aided prediction of QT-prolongation. SAR QSAR Environ. Res. 19 81–90. 10.1080/10629360701844183 [DOI] [PubMed] [Google Scholar]
  10. Fourches D., Muratov E., Tropsha A. (2016). Trust, but verify ii: a practical guide to chemogenomics data curation. J. Chem. Inf. Model. 56 1243–1252. 10.1021/acs.jcim.6b00129 [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. García-Sosa A. T., Maran U. (2014). Improving the use of ranking in virtual screening against HIV-1 integrase with triangular numbers and including ligand profiling with antitargets. J. Chem. Inf. Model. 54 3172–3185. 10.1021/ci500300u [DOI] [PubMed] [Google Scholar]
  12. Gaulton A., Hersey A., Nowotka M., Bento A. P., Chambers J., Mendez D., et al. (2017). The ChEMBL database in 2017. Nucleic Acids Res. 45 945–954. 10.1093/nar/gkw1074 [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Hornberg J. J., Laursen M., Brenden N., Persson M., Thougaard A. V., Toft D. B., et al. (2014). Exploratory toxicology as an integrated part of drug discovery. Part I: why and how. Drug Discov. Today 19 1131–1136. 10.1016/j.drudis.2013.12.008 [DOI] [PubMed] [Google Scholar]
  14. Ivanov S. M., Lagunin A. A., Poroikov V. V. (2016). In silico assessment of adverse drug reactions and associated mechanisms. Drug Discov. Today 21 58–71. 10.1016/j.drudis.2015.07.018 [DOI] [PubMed] [Google Scholar]
  15. Ji Z. L., Han L. Y., Yap C. W., Sun L. Z., Chen X., Chen Y. Z. (2003). Drug adverse reaction target database (DART): proteins related to adverse drug reactions. Drug Saf. 26 685–690. 10.2165/00002018-200326100-00002 [DOI] [PubMed] [Google Scholar]
  16. Khayrullina V. R., Gerchikov A. Y., Lagunin A. A., Zarudii F. S. (2015). Quantitative analysis of structure-activity relationships of tetrahydro-2H-isoindole cyclooxygenase-2 inhibitors. Biochemistry 80 74–86. 10.1134/S0006297915010095 [DOI] [PubMed] [Google Scholar]
  17. Kochanek K. D., Murphy S. L., Xu J., Tejada-Vera B. (2016). Deaths: final data for 2014. Natl. Vital Stat. Rep. 65 100–122. [PubMed] [Google Scholar]
  18. Lagunin A., Zakharov A., Filimonov D., Poroikov V. (2011). QSAR modelling of rat acute toxicity on the basis of PASS prediction. Mol. Inform. 30 241–250. 10.1002/minf.201000151 [DOI] [PubMed] [Google Scholar]
  19. Lagunin A. A., Zakharov A. V., Filimonov D. A., Poroikov V. V. (2007). A new approach to QSAR modelling of acute toxicity. SAR QSAR Environ. Res. 18 285–298. 10.1080/10629360701304253 [DOI] [PubMed] [Google Scholar]
  20. Politi R., Rusyn I., Tropsha A. (2014). Prediction of binding affinity and efficacy of thyroid hormone receptor ligands using QSAR and structure-based modeling methods. Toxicol. Appl. Pharmacol. 280 177–189. 10.1016/j.taap.2014.07.009 [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Poroikov V., Filimonov D., Lagunin A., Gloriozova T., Zakharov A. (2007). PASS: identification of probable targets and mechanisms of toxicity. SAR QSAR Environ. Res. 18 101–110. 10.1080/10629360601054032 [DOI] [PubMed] [Google Scholar]
  22. Simões R. S., Maltarollo V. G., Oliveira P. R., Honorio K. M. (2018). Transfer and multi-task learning in QSAR modeling: advances and challenges. Front. Pharmacol. 9:74. 10.3389/fphar.2018.00074 [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Siramshetty V. B., Nickel J., Omieczynski C., Gohlke B. O., Drwal M. N., Preissner R. (2016). WITHDRAWN–a resource for withdrawn and discontinued drugs. Nucleic Acids Res. 44 D1080–D1086. 10.1093/nar/gkv1192 [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Starfield B. (2000). Is US health really the best in the world? JAMA 284 483–485. 10.1001/jama.284.4.483 [DOI] [PubMed] [Google Scholar]
  25. Wang Y., Bryant S. H., Cheng T., Wang J., Gindulyte A., Shoemaker B. A., et al. (2017). Pubchem bioassay: 2017 update. Nucleic Acids Res. 45 955–963. 10.1093/nar/gkw1118 [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Whitebread S., Hamon J., Bojanic D., Urban L. (2005). Keynote review: in vitro safety pharmacology profiling: an essential tool for successful drug development. Drug Discov. Today 10 1421–1433. 10.1016/S1359-6446(05)03632-9 [DOI] [PubMed] [Google Scholar]
  27. Wold S., Eriksson L. (1995). “Statistical validation of QSAR results,” in Chemometrics Methods in Molecular Design, ed. H. van de Waterbeemd. (Weinheim: VCH; ), 309–318. [Google Scholar]
  28. Zakharov A. V., Lagunin A. A., Filimonov D. A., Poroikov V. V. (2012). Quantitative prediction of antitarget interaction profiles for chemical compounds. Chem. Res. Toxicol. 25 2378–2385. 10.1021/tx300247r [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Zakharov A. V., Varlamova E. V., Lagunin A. A., Dmitriev A. V., Muratov E. N., Fourches D., et al. (2016). QSAR modeling and prediction of drug-drug interactions. Mol. Pharm. 13 545–556. 10.1021/acs.molpharmaceut.5b00762 [DOI] [PubMed] [Google Scholar]
  30. Zhang J. X., Huang W. J., Zeng J. H., Huang W. H., Wang Y., Zhao R., et al. (2007). DITOP: drug-induced toxicity related protein database. Bioinformatics 23 1710–1712. 10.1093/bioinformatics/btm139 [DOI] [PubMed] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials


Articles from Frontiers in Pharmacology are provided here courtesy of Frontiers Media SA

RESOURCES