Author manuscript; available in PMC: 2009 Mar 15.
Published in final edited form as: Bioorg Med Chem. 2008 Jan 15;16(6):2791–2802. doi: 10.1016/j.bmc.2008.01.014

Structure-activity Relationship Analysis of N-Benzoylpyrazoles for Elastase Inhibitory Activity: A Simplified Approach Using Atom Pair Descriptors

Andrei I Khlebnikov a,*, Igor A Schepetkin b, Mark T Quinn b,*
PMCID: PMC2396487  NIHMSID: NIHMS45438  PMID: 18234502

Abstract

Previously, we utilized high throughput screening of a chemical diversity library to identify potent inhibitors of human neutrophil elastase and found that many of these compounds had N-benzoylpyrazole core structures. We also found that individual ring substituents had a significant impact on elastase inhibitory activity and compound stability. In the present study, we utilized computational structure–activity relationship (SAR) analysis of a series of 53 N-benzoylpyrazole derivatives to further optimize these lead molecules. We present an improved approach to SAR methodology based on atom pair descriptors in combination with 2-dimensional (2D) molecular descriptors. This approach utilizes the rich representation of chemical structure and leads to SAR analysis that is both accurate and intuitively easy to understand. A sequence of ANOVA, linear discriminant, and binary classification tree analyses of the molecular descriptors led to the derivation of SAR rule-based algorithms. These rules revealed that the main factors influencing elastase inhibitory activity of N-benzoylpyrazole molecules were the presence of methyl groups in the pyrazole moiety and ortho-substituents in the benzoyl radical. Furthermore, our data showed that physicochemical characteristics (energy of frontier molecular orbitals, molar refraction, lipophilicity) were not necessary for achieving good SAR, as comparable quality of SAR classification was obtained with atom pairs and 2D descriptors only. This simplified SAR approach may be useful for qualitative SAR recognition problems in a variety of data sets.

Keywords: atom pairs, molecular descriptors, structure–activity relationship, N-benzoylpyrazoles, neutrophil elastase inhibitors

1. Introduction

Neutrophil elastase (EC 3.4.21.37) is a member of the chymotrypsin family of serine proteases, which can degrade a variety of extracellular matrix proteins and proteolytically activate several matrix metalloproteinases (MMP-2, -3, and -9) [reviewed in 1,2]. Excessive neutrophil elastase activity can lead to severe pathology through the degradation of elastin and collagen in the airways, resulting in microvascular injury and interstitial edema [3]. Given the destructive potential of unregulated neutrophil elastase, it is not surprising that inhibition of elastase activity in pulmonary tissues has been considered a promising strategy to improve the outcome of pulmonary diseases [4]. Accordingly, many types of peptide and nonpeptide inhibitors of neutrophil elastase, employing both reversible and irreversible mechanisms of action, have been identified [reviewed in 5,6]. Recently, we utilized high throughput screening of a chemical diversity library containing 10,000 drug-like molecules and identified 19 potent neutrophil elastase inhibitors (Ki ≤ 1 µM) that have N-benzoylpyrazole core structures and are distinct from currently known elastase inhibitors [7].

Our analysis of N-benzoylpyrazole derivatives showed that individual ring substituents had a significant impact on elastase inhibitory activity and compound stability [7]. Thus, we suggest that further structure-activity relationship (SAR) analysis of N-benzoylpyrazoles would lead to optimization of these lead compounds and identification of improved neutrophil elastase inhibitors. Indeed, SAR and quantitative SAR (QSAR) models have been instrumental in understanding the molecular mechanism of action of receptor antagonists, their design, and virtual screening [8]. (Q)SAR refers to a broad range of computational methods, such as simple SAR and QSAR, as well as methods for chemical grouping and formalized approaches based on chemical similarity analysis [9]. (Q)SAR methodology consists of a representation of the chemical structure using molecular descriptors and a learning algorithm that relates the biological activity of a compound to its chemical structure [10]. While a variety of molecular parameters can be used in the computational methods for (Q)SAR analysis [10–12], some of these parameters are complex physicochemical or geometrical 3D descriptors whose calculation is complicated by molecular flexibility and the need for adequate sampling of conformational space. Conversely, topological indices, or 2D descriptors, obtainable from the structural formula of a compound are very attractive because of their simplicity. A reasonable compromise between ease of interpretation and ease of computation was reported by Carhart et al. [11], who introduced atom pair descriptors as features of the atomic environments of all pairs of atoms in the 2D representation of a chemical structure. Although this methodology is rather simple, only a few papers have been published where descriptors of this kind were applied in SAR analysis [12,13].

Among the known serine protease inhibitors, QSAR analysis has been performed for inhibitors of several proteases [14–16], including the analysis of peptide inhibitors of porcine pancreatic elastase [17]. However, there are currently no reported (Q)SAR models for non-peptide inhibitors of human neutrophil elastase. Here, we utilized computational SAR analysis of a large group of N-benzoylpyrazoles to further optimize these molecules as lead neutrophil elastase inhibitors. We present an improved approach to SAR methodology based on atom pair descriptors in combination with classical physicochemical and geometrical descriptors and show that this methodology can detect specific combinations of substructure patterns that confer high or low inhibitory activity against neutrophil elastase. Furthermore, we suggest that the SAR approach developed here may be widely applicable to qualitative SAR recognition problems in other data sets.

2. Results and discussion

2.1. Descriptors

In our investigation, we used atom pairs automatically generated directly from the bond connectivity of N-benzoylpyrazoles 1–53 (Table 1), as well as physicochemical and structural descriptors obtained from semiempirical calculations and from formulae of the compounds. Atom pairs have been used previously in SAR modeling [11–13,18] and have been defined using different schemes. These schemes differ mainly in the description of atom types. For example, Carhart et al. [11] defined an atom type by its chemical name, number of attached non-hydrogen atoms, and number of bonding π-electrons. A similar definition of atom types was used by Rusinko et al. [13] in their analysis of chemical libraries. In comparison, Seierstad and Agrafiotis [18] described atom types in terms of the SMARTS definition, taking into account hydrogen bond donor/acceptor, lipophilicity, and charge characteristics instead of explicit chemical notation. In most of these papers the atom pairs were used as indicator variables with Boolean values (0 or 1).

Table 1.

Structure and neutrophil elastase inhibitory activity of N-benzoylpyrazoles (a)

Compound R1 R2 R3 R4 R5 R6 R7 R8 Ki (nM)
1 H Cl H H H F H H 6
2 H H [structure image] H H Cl H H 15
3 CH3 H H H OCH3 OCH3 OCH3 H 21
4 H Cl H H NO2 H H H 24
5 [structure image] H H H H F H H 24
6 NO2 H H H H CH3 H H 28
7 H NO2 H H H H H H 34
8 [structure image] H H H H H H H 39
9 H Br H H H CH3 H H 45
10 NO2 H H H H H H H 46
11 H NO2 H CH3 H H H H 65
12 H H H H Cl Cl H H 104
13 NO2 H CH3 F H H H H 107
14 H Cl H CH3 H H H H 230
15 H Br H H CH3 H H H 250
16 H Br H CH3 H H H H 300
17 CH3 H H H H NHCOCH3 H H 300
18 H Cl H Cl H H H H 1000
19 H Br H F H H H F 1100
20 H H H CH3 H H H H 3400
21 H Br H H F F H Cl 7200
22 H Cl H H H t-butyl H H 9000
23 CH3 Cl CH3 F H H H H 9000
24 CH3 Cl CH3 H H Cl H H 10700
25 CH3 Br CH3 H H OCH3 H H 24500
26 H H H Br H H H H 29900
27 CH3 CH3 CH3 H OCH3 OCH3 OCH3 H 50900
28 H H H H H H H Cl NA (b)
29 H H H H F H H Cl NA
30 H NO2 H F H H H H NA
31 H NO2 H H Br H H H NA
32 H NO2 H Cl H Cl H H NA
33 [structure image] H H Cl H H H H NA
34 CH3 H CH3 H H Cl H H NA
35 CH3 H CH3 H H t-butyl H H NA
36 CH3 H CH3 H NO2 Cl H H NA
37 CH3 H CH3 H Br CH3 H H NA
38 CH3 H CH3 H H NHCOCH3 H H NA
39 CH3 H CH3 H OCH3 OCH3 H H NA
40 CH3 H CH3 OCH3 H H H H NA
41 CH3 H CH3 H Cl H H H NA
42 CH3 H CH3 COOH H H H H NA
43 CH3 H CH3 H H OCH3 H H NA
44 CH3 H CH3 Cl H H H H NA
45 CH3 H CH3 H CH3 NO2 H H NA
46 CH3 H CH3 H H NHCOCH2CH3 H H NA
47 CH3 Cl CH3 Cl H Cl H H NA
48 CH3 Cl CH3 H Cl Cl H H NA
49 CH3 Cl CH3 Br H H H H NA
50 CH3 Cl CH3 H H Cl NO2 H NA
51 CH3 Br CH3 H Br H H H NA
52 CH3 Br CH3 H H Cl H H NA
53 CH3 H CH3 H Cl Cl H H NA
(a) Data taken from [7].

(b) NA, not active or no inhibition seen at the highest concentration of compound tested (55 µM).

Various conventions for naming atom types are implemented in molecular modeling software. For example, MM+, AMBER, OPLS and other force fields developed for molecular mechanics computations assign very specific names to atoms of the same chemical nature depending on their environment in a molecule. For SAR analysis, we used the atom naming scheme from the MM+ force field, as implemented in HyperChem. According to this scheme, specific atom pairs are defined as T1_D_T2, where T1 and T2 are the atom types assigned by HyperChem, and D is the number of chemical bonds in the shortest path between the two atoms (see Experimental Section). HyperChem output in the HIN file format was entered directly into our CHAIN program, which generated all possible atom pairs and the frequencies of their occurrence in each of the 53 N-benzoylpyrazoles (Figure 1). These frequencies were used as the values of the corresponding atom pair descriptors. Hence, the atom pairs used had non-indicator character, which is another distinctive feature of our approach. This characteristic has some advantages over the Boolean values used previously in trend-vector analysis [11] and recursive partitioning methods [13]. For example, cyclohexane and cycloheptane have the same set of atom pairs (C4_1_C4, C4_2_C4, C4_3_C4) and therefore cannot be distinguished from each other using Boolean (indicator) values of descriptors. However, the frequencies of occurrence of these atom pairs are different in six- and seven-membered rings, resulting in different values of non-indicator descriptors for the two cyclic hydrocarbons, as well as for their derivatives.
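The counting scheme described above can be sketched in a few lines of Python. This is only an illustration and not the CHAIN program: the function names `atom_pair_counts` and `ring` are hypothetical, hydrogens are omitted, every ring carbon is simply labeled C4, and the two type labels in a pair name are written in alphabetical order (an assumption that agrees with the pair names quoted in the text).

```python
from collections import Counter, deque

def atom_pair_counts(atom_types, bonds):
    """Count T1_D_T2 atom pairs, where D is the shortest-path bond count."""
    n = len(atom_types)
    neighbors = [[] for _ in range(n)]
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    counts = Counter()
    for start in range(n):
        # Breadth-first search yields shortest bond distances from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in neighbors[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for end, d in dist.items():
            if end > start:  # count each unordered pair of atoms once
                t1, t2 = sorted((atom_types[start], atom_types[end]))
                counts[f"{t1}_{d}_{t2}"] += 1
    return counts

def ring(size, atom_type="C4"):
    """Heavy-atom graph of an unsubstituted carbocycle (hydrogens omitted)."""
    return [atom_type] * size, [(i, (i + 1) % size) for i in range(size)]

cyclohexane = dict(atom_pair_counts(*ring(6)))
cycloheptane = dict(atom_pair_counts(*ring(7)))
print(cyclohexane)   # {'C4_1_C4': 6, 'C4_2_C4': 6, 'C4_3_C4': 3}
print(cycloheptane)  # {'C4_1_C4': 7, 'C4_2_C4': 7, 'C4_3_C4': 7}
```

Both rings produce the same three pair names, so Boolean indicators would be identical, while the frequency values differ.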

Figure 1.


Numbers of unique atom pairs in 53 N-benzoylpyrazoles. The numbers are shown for each of the indicated bond separations initially generated for the 53 N-benzoylpyrazoles (light bars). Atom pairs subsequently selected by ANOVA as having significant differences between the three classes of elastase inhibitory activity are shown in dark bars.

Several examples of atom pairs are shown in Figure 2. Note that atom pair descriptors are easily interpretable in terms of standard chemical formulae. For example, C4_1_CA corresponds to the methylated aromatic ring (shown in red for Compounds 11 and 14), C3_4_C4 corresponds to the presence of a methyl group within a substituent of the pyrazole moiety (shown in red for Compound 2), and C4_4_C4 represents two methyl groups present as R1 and R3 substituents of the heterocycle (shown in red for Compounds 5 and 50). Atom pairs C3_1_NO and NO_1_O1 both correspond to nitro groups (shown in bold for Compounds 30 and 32) and can be regarded as having the same chemical origin. However, they are not completely equivalent, as NO_1_O1 represents any nitro group, including substituents in aromatic moieties, while C3_1_NO designates only nitro groups in the pyrazole ring. Four atom pairs (C3_1_C3, C3_1_N2, N2_1_N2, C3_2_C3) had equal occurrences in Compounds 1–53, as they originate from the pyrazole ring and are present in all of the compounds investigated. Such descriptors with zero variance were therefore excluded from further consideration. It should be noted that, although atom naming was taken from the MM+ force field, performing MM+ molecular mechanics optimization itself is not necessary because only bond connectivity, not geometry, matters for the atom pair calculation. Thus, the initial data for each compound might include just a sketch of the molecular formula saved in HIN format. Nevertheless, we did perform geometry optimization here with the MM+ force field and then by the semi-empirical PM3 method for the purpose of physicochemical descriptor calculations by HyperChem.
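As a minimal sketch of the zero-variance filter mentioned above, the snippet below drops every descriptor whose value is identical across all compounds. The function name `drop_constant_columns` and the counts in `descriptors` are invented for illustration and are not taken from the data set.

```python
def drop_constant_columns(table):
    """Keep only descriptors that take at least two distinct values."""
    return {name: values for name, values in table.items()
            if len(set(values)) > 1}

# Hypothetical descriptor table: one list of per-compound counts per atom pair.
descriptors = {
    "N2_1_N2": [1, 1, 1, 1],   # constant: present once in every compound
    "C3_1_C3": [2, 2, 2, 2],   # constant
    "C4_4_C4": [0, 1, 0, 1],   # varies between compounds
    "NO_1_O1": [2, 0, 0, 2],   # varies between compounds
}

kept = drop_constant_columns(descriptors)
print(sorted(kept))
```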

Figure 2.


Examples of atom pairs from the best subsets in the structures of representative N-benzoylpyrazoles. Atom pairs are depicted in red and indicated below the structure. Compound numbers correspond to those shown in Table 1.

2.2. SAR modeling

The total set of descriptors used can be divided into atom pairs, 2D descriptors obtained directly from structural formulae of the compounds, and physicochemical descriptors. The latter group of descriptors required application of additional techniques for determination, such as semi-empirical methods for orbital energy calculation and procedures for evaluation of lipophilicity and molar refractivity (see Experimental Section). Hence, we performed two separate SAR analyses for comparison, one based on the entire set of descriptors, and the other based on the set of atom pairs and 2D parameters only. If results of comparable quality could be obtained for these two types of analyses, then the less complex methodology without physicochemical characteristics would be beneficial because this approach allows easy visualization and translation of these variables into a simple “chemical” language, as the variables correspond directly to the presence of certain chemical substructures and functional groups in a molecule.

SAR analysis using atom pairs, quantum-mechanical characteristics calculated by the semi-empirical PM3 method, “classical” physicochemical descriptors (Refr, Refr(Pz), Refr(Ph), ACD/LogP, EHOMO, ELUMO), and integer variables obtained directly from structural formulae (n1, n2, n3, no, nm, np) resulted in an initial data matrix of descriptors that contained 375 columns (variables) (see Materials and Methods for variable definitions). The second analysis, utilizing atom pairs and integer variables only, resulted in an initial data matrix of descriptors containing 369 variables. Such a large number of variables requires selection of descriptors to obtain a shortened list of the molecular characteristics that are most relevant for effective SAR analysis. For example, the method of recursive partitioning can be applied when the number of variables is quite large, consisting of thousands or even millions of descriptors; however, this method does not account well for possible covariance of variables and uses a simplified descriptor selection with construction of a hierarchical tree [13,19,20]. In comparison, the more elaborate procedure of linear discriminant analysis (LDA) can account for clustering of data in a multidimensional space of interrelated variables but cannot be accomplished if there are too many descriptors. In order to exploit the advantages of LDA, we performed a sequential variable selection with application of a specified procedure at each stage.

ANOVA methodology was used in the first stage of selection to identify the variables most important for distinguishing compounds in descriptor space [21]. The set of N-benzoylpyrazoles was divided into three classes based on elastase inhibitory activity: high, medium, and not active (NA) (see Experimental Section). For each of the 375 initial columns of the data matrix, we performed an ANOVA calculation of the significance of differences between the means in these three classes, and this procedure resulted in selection of 66 descriptors whose class means differed significantly (p<0.05). This group of parameters contained four physicochemical descriptors [Refr, ACD/LogP, ELUMO, Refr(Ph)], three 2D integer variables (n1, n3, no), and 59 atom pairs. Figure 1 shows the relative numbers of atom pairs with different bond separations selected by ANOVA and initially generated by CHAIN for the 53 compounds. Atom pairs separated by 2–4 bonds seemed to be the most important for detailed SAR analysis because they better discriminated between the three activity classes (Figure 1). Only one atom pair descriptor with a 9-bond separation (CA_9_O1) was selected by ANOVA for further SAR analysis, whereas atom pairs separated by ≥10 chemical bonds were not significant, and differences in their occurrence among the three classes were negligible at a significance level of p<0.05.
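The first selection stage can be illustrated with a short, dependency-free sketch of a one-way ANOVA screen. It computes the F statistic for each descriptor across the three activity classes and keeps descriptors whose F exceeds a critical value. The function names, the toy data, and the use of a tabulated critical value in place of an exact p-value are assumptions made for this example; with three classes of three toy compounds each, the 5% critical value for (2, 6) degrees of freedom is about 5.14.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of numbers."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def select_descriptors(table, classes, f_critical):
    """Return names of descriptors whose class means differ (F > f_critical)."""
    labels = sorted(set(classes))
    selected = []
    for name, values in table.items():
        groups = [[v for v, c in zip(values, classes) if c == label]
                  for label in labels]
        if anova_f(groups) > f_critical:
            selected.append(name)
    return selected

# Toy data: nine hypothetical compounds, three per activity class.
classes = ["High"] * 3 + ["Medium"] * 3 + ["NA"] * 3
table = {
    "C4_4_C4": [0, 0, 1, 1, 1, 2, 3, 3, 4],  # class means differ strongly
    "CA_1_F":  [0, 1, 2, 0, 1, 2, 0, 1, 2],  # identical class means
}
selected = select_descriptors(table, classes, f_critical=5.14)
print(selected)
```

In practice a statistics package would report the p-value directly; the fixed critical value is used here only to keep the sketch self-contained.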

In the second stage of sequential variable selection, the LDA procedure was applied to the 66 descriptors chosen by ANOVA, since reduction of the dimension of descriptor space from 375 to 66 variables allowed analysis with this powerful methodology. At this point, the SAR calculations were branched into two routes. Route 1 LDA utilized all 66 descriptors, including physicochemical characteristics, whereas Route 2 analysis included only atom pairs and 2D descriptors obtained directly from molecular graphs of N-benzoylpyrazoles. In the latter case, Refr, ACD/LogP, ELUMO, and Refr(Ph) were discarded, and 62 variables were retained as an input for LDA. The implementation of LDA in STATISTICA 6.0 made it possible to sort out variables that were non-significant for SAR classification by this method, as the corresponding coefficients of classification functions were automatically zeroed by STATISTICA 6.0. LDA led to a further decrease in the number of important descriptors, and the resulting LDA classification matrices are shown in Table 2. LDA using Route 1 retained 25 variables [Refr, ACD/LogP, ELUMO, Refr(Ph), n1, n3, no, BR_1_C3, C3_1_C4, C3_1_CO, C3_1_NO, C4_1_CA, CA_1_F, NO_1_O1, C3_2_C4, C4_2_N2, C3_3_C4, C4_3_CO, C4_3_N2, C3_4_C4, C4_4_C4, C4_4_CO, C4_4_O1, C4_5_CA, C4_8_CL]. In comparison, Route 2 retained the same set of descriptors, with the exception of the four physicochemical variables that were initially discarded (21 variables retained). Atom pairs with one-bond separation (7 of the retained variables) prevailed among all the atom pairs selected by LDA, although several atom pair descriptors with 3-bond and 4-bond separation were also present (3 and 4 of the retained variables, respectively).

Table 2 shows that the quality of SAR classification for the 53 N-benzoylpyrazoles using 25 descriptors (Route 1) was sufficiently high, and the correct activity classification was calculated for 11 of 13 active, all 10 moderately active, and 29 of 30 inactive N-benzoylpyrazoles. In total, 50 compounds (94.3%) were classified correctly. Discarding physicochemical characteristics from the descriptor set in Route 2 resulted in a slight decrease in quality of the SAR analysis, and correct activity classification was obtained for 47 of 53 compounds (88.7% accuracy) (Table 2). Estimation of the predictive power of the LDA models by the leave-one-out (LOO) approach is shown in Supplementary Table S1 for Routes 1 and 2, where calculated classes for each compound are also presented. For both routes, a priori LOO classification gave 34 coincidences (64.2%) between experimental and predicted classes. This percentage is low, but the number of descriptors used (25 and 21 for Routes 1 and 2, respectively) is still rather high for LOO prediction of biological activity with 53 compounds. Thus, a further reduction in the number of descriptors was needed.
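The leave-one-out procedure referred to above can be sketched as follows: each compound is removed in turn, a model is fitted to the remaining compounds, and the held-out compound is then predicted. To keep the example short and dependency-free, a nearest-centroid classifier is swapped in for LDA; this substitution, the function names, and the toy data are all assumptions made for illustration and do not reproduce the STATISTICA analysis.

```python
def nearest_centroid_fit(X, y):
    """Return the mean descriptor vector (centroid) of each class."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc]
            for label, acc in sums.items()}

def nearest_centroid_predict(centroids, row):
    """Assign the class whose centroid is closest in squared distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda label: sqdist(centroids[label]))

def leave_one_out_accuracy(X, y):
    """Fraction of compounds predicted correctly when each is held out."""
    hits = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]   # all compounds except the i-th
        train_y = y[:i] + y[i + 1:]
        model = nearest_centroid_fit(train_X, train_y)
        hits += nearest_centroid_predict(model, X[i]) == y[i]
    return hits / len(X)

# Toy, well-separated data: two descriptors, two classes.
X = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]]
y = ["High", "High", "High", "NA", "NA", "NA"]
loo = leave_one_out_accuracy(X, y)
print(loo)
```

Because each held-out compound plays no part in fitting, LOO accuracy is usually lower than the accuracy computed on the full training set, and the gap tends to widen as the number of descriptors grows relative to the number of compounds.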

Table 2.

Classification matrices for linear discriminant analysis-derived SAR with 25 variables (Route 1) and 21 variables (Route 2).

Rows give the experimentally determined classification; columns give the calculated classification for Route 1 (R1) and Route 2 (R2).
Class R1-High R1-Medium R1-NA R1-Accuracy(%) R2-High R2-Medium R2-NA R2-Accuracy(%)
High 11 1 1 84.6 11 2 0 84.6
Medium 0 10 0 100.0 0 8 2 80.0
NA 0 1 29 96.7 1 1 28 93.3
Total 11 12 30 94.3 12 11 30 88.7

Diagonal entries (High/High, Medium/Medium, NA/NA) are the numbers of correctly classified compounds.
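The accuracies in Table 2 follow directly from the classification matrices. The short check below recomputes them from the tabulated counts; the function names are hypothetical and each matrix is entered with experimental classes as rows and calculated classes as columns, in the order High, Medium, NA.

```python
def overall_accuracy(matrix):
    """Percentage of compounds on the diagonal of a classification matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total

def class_accuracies(matrix):
    """Per-class percentage of correctly classified compounds (row-wise)."""
    return [100.0 * row[i] / sum(row) for i, row in enumerate(matrix)]

# Counts from Table 2 (rows: experimental class; columns: calculated class).
route1 = [[11, 1, 1], [0, 10, 0], [0, 1, 29]]
route2 = [[11, 2, 0], [0, 8, 2], [1, 1, 28]]

print(round(overall_accuracy(route1), 1))               # 94.3
print(round(overall_accuracy(route2), 1))               # 88.7
print([round(a, 1) for a in class_accuracies(route1)])  # [84.6, 100.0, 96.7]
```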

The third stage of sequential variable selection utilized the “best subset search” option of LDA, as implemented in STATISTICA 6.0. This option allows one to check possible combinations of descriptors and obtain a smaller combination that is optimal in terms of analysis misclassification or cross-validation misclassification. Because the number of variables was still high, the LDA with “best subset search” was preceded by first removing correlated descriptors from the variable sets retained after classical LDA on Routes 1 and 2 (see Experimental Section). Hence, 15 descriptors were retained for LDA with “best subset search” using Route 1 [Refr, ACD/LogP, ELUMO, Refr(Ph), n1, no, BR_1_C3, C3_1_CO, C3_1_NO, C4_1_CA, CA_1_F, NO_1_O1, C3_4_C4, C4_4_C4, C4_8_CL], and 11 of these variables were retained for analysis using Route 2 (as above, the physicochemical descriptors were excluded for Route 2). Using LDA with the “best subset search” option, we found that the optimal subsets, in terms of analysis misclassification, consisted of 7 and 6 variables for Routes 1 and 2, respectively. Linear classification functions for elastase inhibitory activity classes of High, Medium, and NA are provided below.

Route 1

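$$F_{\mathrm{High}} = -51.068 + 1.596\,\mathrm{Refr} - 17.351\,n_1 + 3.688\,n_o + 14.793\,\mathrm{C3\_1\_NO} - 4.965\,\mathrm{C4\_1\_CA} - 13.687\,\mathrm{C3\_4\_C4} + 9.722\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 1)}$$
$$F_{\mathrm{Medium}} = -53.379 + 1.628\,\mathrm{Refr} - 20.507\,n_1 + 7.365\,n_o + 9.699\,\mathrm{C3\_1\_NO} - 3.341\,\mathrm{C4\_1\_CA} - 22.905\,\mathrm{C3\_4\_C4} + 12.687\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 2)}$$
$$F_{\mathrm{NA}} = -64.130 + 1.798\,\mathrm{Refr} - 23.782\,n_1 + 6.372\,n_o + 15.213\,\mathrm{C3\_1\_NO} - 7.462\,\mathrm{C4\_1\_CA} - 24.375\,\mathrm{C3\_4\_C4} + 21.205\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 3)}$$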

Route 2

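$$F_{\mathrm{High}} = -3.041 + 0.700\,n_o + 1.323\,\mathrm{C4\_1\_CA} + 0.699\,\mathrm{CA\_1\_F} + 1.795\,\mathrm{NO\_1\_O1} + 4.944\,\mathrm{C3\_4\_C4} + 1.004\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 4)}$$
$$F_{\mathrm{Medium}} = -4.473 + 3.657\,n_o + 3.461\,\mathrm{C4\_1\_CA} + 1.529\,\mathrm{CA\_1\_F} - 0.198\,\mathrm{NO\_1\_O1} - 3.504\,\mathrm{C3\_4\_C4} + 1.899\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 5)}$$
$$F_{\mathrm{NA}} = -4.668 + 4.006\,n_o - 0.367\,\mathrm{C4\_1\_CA} - 1.351\,\mathrm{CA\_1\_F} + 1.412\,\mathrm{NO\_1\_O1} - 1.137\,\mathrm{C3\_4\_C4} + 8.206\,\mathrm{C4\_4\_C4} \quad \text{(Eq. 6)}$$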

According to the SAR rules expressed by these equations, a compound is assigned to the activity class whose classification function has the greatest value. The results presented in Table 3 and Table 4 indicate that SAR classifications of similar quality were achieved for Routes 1 and 2 (88.7% and 86.8% of the compounds, respectively, were assigned to their experimentally determined classes of elastase inhibitory activity). Thus, inclusion of physicochemical descriptors provided little improvement in the SAR results. Although the quality of LDA with “best subset search” was generally lower than that obtained by standard LDA (Table 2), use of the “best subset search” with fewer descriptors led to a substantially higher percentage of correct LOO predictions (Table 4). Indeed, accuracy of a priori LOO predictions was 75.5% and 71.7% for Routes 1 and 2, respectively, which corresponded to 40 and 38 of the 53 compounds correctly assigned to their experimental classes, with each prediction made from a training set consisting of the remaining 52 N-benzoylpyrazole derivatives. The previous stage of sequential variable selection resulted in only 64% correct LOO classifications (Supplementary Table S1). Thus, step-by-step reduction in the number of descriptors led to the variable subsets that were most critical for SAR analysis. Importantly, the resulting classification functions, expressed by Eq. 1–Eq. 3 for Route 1 and Eq. 4–Eq. 6 for Route 2, appear to predict the class of elastase inhibitory activity of a given N-benzoylpyrazole fairly accurately.
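The assignment rule can be illustrated with the Route 2 functions (Eq. 4 to Eq. 6). The sketch below evaluates the three functions for a descriptor vector and returns the class with the largest value. Two assumptions are made and should be read with care: the constants are taken as negative and terms printed without an operator are taken as subtracted, and the three descriptor vectors are hypothetical examples rather than entries computed for specific compounds.

```python
# Route 2 coefficients (Eq. 4-6); signs follow the assumption stated above.
ROUTE2 = {
    "High":   {"const": -3.041, "no": 0.700, "C4_1_CA": 1.323, "CA_1_F": 0.699,
               "NO_1_O1": 1.795, "C3_4_C4": 4.944, "C4_4_C4": 1.004},
    "Medium": {"const": -4.473, "no": 3.657, "C4_1_CA": 3.461, "CA_1_F": 1.529,
               "NO_1_O1": -0.198, "C3_4_C4": -3.504, "C4_4_C4": 1.899},
    "NA":     {"const": -4.668, "no": 4.006, "C4_1_CA": -0.367, "CA_1_F": -1.351,
               "NO_1_O1": 1.412, "C3_4_C4": -1.137, "C4_4_C4": 8.206},
}

def classification_scores(descriptors, functions=ROUTE2):
    """Evaluate each linear classification function for one compound."""
    scores = {}
    for label, coefs in functions.items():
        score = coefs["const"]
        for name, coef in coefs.items():
            if name != "const":
                score += coef * descriptors.get(name, 0)  # absent pair counts as 0
        scores[label] = score
    return scores

def classify(descriptors, functions=ROUTE2):
    """Assign the class whose classification function is largest."""
    scores = classification_scores(descriptors, functions)
    return max(scores, key=scores.get)

# Hypothetical descriptor vectors (descriptors not listed are zero).
one_aryl_fluorine = {"CA_1_F": 1}
ortho_methyl = {"no": 1, "C4_1_CA": 1}
dimethyl_pattern = {"C4_4_C4": 1}

print(classify(one_aryl_fluorine))  # High
print(classify(ortho_methyl))       # Medium
print(classify(dimethyl_pattern))   # NA
```

For the first vector the three scores are about -2.34, -2.94, and -6.02, so High is selected; a single C4_4_C4 pair raises the NA function by 8.206 and dominates the other two.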

Table 3.

Classification matrices for linear discriminant analysis-derived SAR with best subsets of 7 variables (Route 1) and 6 variables (Route 2).

Rows give the experimentally determined classification; columns give the calculated classification for Route 1 (R1) and Route 2 (R2).
Class R1-High R1-Medium R1-NA R1-Accuracy(%) R2-High R2-Medium R2-NA R2-Accuracy(%)
High 12 1 0 92.3 12 1 0 92.3
Medium 1 8 1 80.0 1 7 2 70.0
NA 0 3 27 90.0 2 1 27 90.0
Total 13 12 28 88.7 15 9 29 86.8

Diagonal entries (High/High, Medium/Medium, NA/NA) are the numbers of correctly classified compounds.

Table 4.

Results of SAR classification and leave-one-out (LOO) prediction obtained using Routes 1 and 2 for each compound based on LDA with the “best subset search” option.

Compound Experimental-class R1-Calculated R1-LOO-predicted R2-Calculated R2-LOO-predicted (R1, Route 1; R2, Route 2)
1 High High Medium High Medium
2 High High High High High
3 High High High High High
4 High High Medium High High
5 High High High High Medium
6 High High High High High
7 High High NA High High
8 High High High High High
9 High Medium Medium Medium Medium
10 High High High High High
11 High High NA High Medium
12 High High Medium High High
13 High High High High High
14 Medium Medium Medium Medium Medium
15 Medium Medium Medium Medium High
16 Medium Medium Medium Medium Medium
17 Medium High High High High
18 Medium Medium Medium NA NA
19 Medium Medium Medium Medium Medium
20 Medium Medium Medium Medium Medium
21 Medium Medium Medium Medium Medium
22 Medium Medium Medium Medium High
23 Medium NA NA NA NA
24 NA NA NA NA NA
25 NA NA NA NA NA
26 NA Medium Medium NA Medium
27 NA NA NA NA NA
28 NA Medium Medium NA Medium
29 NA Medium Medium Medium Medium
30 NA NA NA High High
31 NA NA High High High
32 NA NA NA NA NA
33 NA NA Medium NA Medium
34 NA NA NA NA NA
35 NA NA NA NA NA
36 NA NA NA NA NA
37 NA NA NA NA NA
38 NA NA NA NA NA
39 NA NA NA NA NA
40 NA NA NA NA NA
41 NA NA NA NA NA
42 NA NA NA NA NA
43 NA NA NA NA NA
44 NA NA NA NA NA
45 NA NA NA NA NA
46 NA NA NA NA NA
47 NA NA NA NA NA
48 NA NA NA NA NA
49 NA NA NA NA NA
50 NA NA NA NA NA
51 NA NA NA NA NA
52 NA NA NA NA NA
53 NA NA NA NA NA
Incorrect classifications are the entries that differ from the experimental classification.

Coefficients of the linear classification functions (see Eq. 1–Eq. 6) reflect cooperative effects of partially correlated descriptors on the values of FHigh, FMedium, and FNA, so direct interpretation of their values is complex; however, unambiguous interpretation was sometimes possible. For example, the coefficients of variables no and C4_4_C4 tended to follow the sequence NA>Medium>High (Figure 3). Consequently, higher values of these descriptors lead to larger increments for FNA than for FHigh and are unfavorable for elastase inhibitory activity.

Figure 3.


Classification function coefficients for selected descriptors. Coefficients were obtained using the best subsets of 7 variables (Route 1, light bars) and 6 variables (Route 2, dark bars) for descriptor no (Panel A) and atom pair C4_4_C4 (Panel B).

Note that other robust (Q)SAR methods exist, which provide good fitting and prediction results without the use of variable selection. Among the best of these approaches is the random forest method [22]. However, this method is not able to produce an explicit model from randomly generated trees. Rather, the random forest model can be regarded as a "black box" [22]. Since the goal of our approach was to construct interpretable and defined SAR models to predict elastase inhibitory activity of N-benzoylpyrazoles, pre-selection of variables based on activity was necessary. This resulted in the derivation of very simple and intuitively understandable SAR rules, as described below.

2.3. Classification tree analysis and derivation of simplified SAR rules

Although Eq. 1–Eq. 6 are capable of fitting and predicting elastase inhibitory activity classes of N-benzoylpyrazoles with good accuracy, simpler and intuitively understandable SAR criteria are desirable. We therefore sought to simplify the classification rules using binary classification tree analysis, as implemented in STATISTICA 6.0. Based on the 7 descriptors from the best subset selected on Route 1, we built the optimal classification tree shown in Figure 4. The tree has three splits, each defined by the condition indicated for that split. If a condition is satisfied, then the compounds are sent to the left branch of the tree; otherwise they are sent to the right branch, eventually reaching one of the terminal nodes corresponding to a certain activity class. For example, the condition C4_4_C4<0.38 indicates that compounds not containing atom pairs of this type (i.e., when the integer value of the C4_4_C4 descriptor is equal to zero) will be sent to the left branch (28 cases), while compounds containing one or more C4_4_C4 atom pairs will be sent to the right branch and immediately enter the NA terminal node. Thus, consistent with the above-mentioned observation from LDA (Figure 3B), the presence of C4_4_C4 in the chemical structure of an N-benzoylpyrazole is unfavorable for inhibitory activity, and such compounds should be designated inactive. Note that in almost all the compounds investigated, this atom pair is associated with two methyl groups R1 and R3 in the pyrazole moiety (see an example of this atom pair within the structure of Compound 50 in Figure 2). The only exception was Compound 5, where the C4_4_C4 descriptor originates from two methyl substituents in the phenyl ring (Figure 2).

Figure 4.


Classification tree for predicting elastase inhibitory properties of N-benzoylpyrazoles using Route 1. The number of N-benzoylpyrazoles that entered each node is indicated. Terminal nodes correspond to the three activity classes: High, Medium, or non-active (NA).

The logic tree for Route 1 is defined by two additional descriptors from the best subset, namely by the conditions C3_1_NO<0.64 and no<0.42 (Figure 4). Because the descriptors participating in these conditions take integer values, the presence of a C3_1_NO atom pair (i.e., a nitro group substituent of the pyrazole ring; see an example of this atom pair in Compound 30 of Figure 2) or of ortho-substituents in the benzoyl moiety sends compounds to the right branches on the two corresponding logical splits of the tree, since any non-zero positive values of C3_1_NO and no violate these conditions, respectively. Supplementary Table S2 shows SAR classification for each of the 53 N-benzoylpyrazoles analyzed by the classification tree shown in Figure 4. Taking into account the structural meaning of the atom pairs (Figure 2), this classification tree can also be presented in the form of a simple “chemical” algorithm (Scheme 1).

Scheme 1.


Note that these rules are not completely equivalent to the classification tree (Figure 4), since the C4_4_C4 atom pair can also be associated with methyl groups located in the phenyl ring, rather than the pyrazole moiety (see Figure 2, Compound 5). Using the tree, Compound 5 is classified incorrectly, whereas this N-benzoylpyrazole derivative is classified correctly by the “chemical” algorithm, which considers methyl substituents only in the pyrazole moiety. In total, the correct elastase inhibitory activity class was determined by Scheme 1 for 42 of the 53 N-benzoylpyrazoles (79.2% accuracy). While LDA with “best subset search” resulted in a higher percentage (88.7%) of correctly calculated classifications (Table 3), the simplified SAR rules are attractive because of their clarity and the possibility of translating them into standard “chemical” language.

The classification tree approach was also applied to the best subset of 6 variables selected by LDA on Route 2. An optimal tree obtained in this case considered 3 variables (Figure 5). Again, descriptors C4_4_C4 and no were important in the derived SAR rules. The third condition was based on the C4_1_CA atom pair, which is associated with the presence of alkyl groups attached to the aromatic ring (see Compounds 11 and 14 in Figure 2). In contrast, the C3_1_NO descriptor involved in the classification tree for Route 1 was not present in the best LDA subset identified with Route 2 (Eq. 4–Eq. 6). Supplementary Table S2 shows SAR classification for each of the 53 N-benzoylpyrazoles according to the tree shown in Figure 5. This tree can also be presented in the form of a simple “chemical” algorithm (Scheme 2).

Figure 5.

Classification tree for predicting elastase inhibitory properties of N-benzoylpyrazoles using Route 2. The number of N-benzoylpyrazoles that entered each node is indicated. Terminal nodes correspond to the three activity classes: High, Medium, or non-active (NA).

Scheme 2.


Supplementary Table S2 shows SAR classification for each of the 53 N-benzoylpyrazoles analyzed by Scheme 2. According to the SAR rules derived with Route 2, a compound is classified as inactive if it has non-methyl ortho-substituents (e.g., Cl, Br, F) or two methyl substituents in the pyrazole moiety. Likewise, highly active inhibitors cannot contain two methyl groups in the pyrazole heterocycle or ortho-substituents in the benzoyl radical. The Scheme 2 algorithm correctly classified 43 of 53 N-benzoylpyrazoles (81.1%), which is comparable to the 79.2% accuracy obtained on Route 1 using SAR rules from Scheme 1. In spite of the satisfactory overall percentage of correct classifications, both algorithms had lower accuracy in predicting the activity class of the 10 moderately active compounds (4 and 3 cases on Routes 1 and 2, respectively). Clearly, this is due to the relative simplicity of binary classification trees, as they represent only rough approximations to the more sophisticated LDA-derived classifications expressed by Eqs. 1–6.
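The accuracy figures quoted for the two schemes follow directly from the counts of correctly classified compounds. A minimal check, in which the helper name `percent_correct` is an illustrative assumption:

```python
def percent_correct(n_correct: int, n_total: int) -> float:
    """Percentage of correctly classified compounds, rounded to one decimal."""
    return round(100.0 * n_correct / n_total, 1)


N_COMPOUNDS = 53  # size of the N-benzoylpyrazole set

scheme1_accuracy = percent_correct(42, N_COMPOUNDS)  # Scheme 1 (Route 1)
scheme2_accuracy = percent_correct(43, N_COMPOUNDS)  # Scheme 2 (Route 2)
```

The two schemes differ by a single compound out of 53, which is why their accuracies are described as comparable.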

The simplified SAR rules did not differ significantly between Routes 1 and 2 in the quality of their results. Consequently, inclusion of physicochemical characteristics in a descriptor set does not seem to provide a significant advantage. The easily calculated 2D descriptors (atom pairs and simple structural variables) used on Route 2 were sufficient to obtain good accuracy for SAR recognition, both by LDA and by binary classification tree analysis. Importantly, these simplified SAR rules were generally consistent with our molecular modeling data 7, where we found that R1 and R3 groups in the pyrazole, as well as ortho-substituents in the benzoyl radical, prevented the proper positioning of the N-benzoylpyrazole required for interaction with the hydroxyl group of elastase Ser195, making nucleophilic attack by Ser195 on the carbonyl group of the inhibitor in the oxyanion hole unfavorable. Likewise, the presence of a nitro group enhanced the positive charge on the N-benzoylpyrazole carbonyl carbon atom, making it more susceptible to attack by the nucleophile 7.

3. Conclusions

Previously, we utilized high-throughput screening to select unique small-molecule inhibitors of neutrophil elastase, and identified a novel class of elastase inhibitors with an N-benzoylpyrazole scaffold 7. Here, we performed SAR analysis of these compounds to further define the features of these molecules important for activity and to develop a simple but accurate SAR model for predicting biological activity in future compound screening. The analysis of structure-activity relationships is an important approach for defining the critical combination of various structural and physicochemical descriptors responsible for the biological activity of a given molecule [e.g., 23,24]. In the present study, we utilized 2D atom pair descriptors together with physicochemical molecular descriptors (Route 1) or 2D parameters alone (Route 2) for SAR analysis of a series of N-benzoylpyrazoles with various levels of experimentally determined elastase inhibitory activity.

A sequence of ANOVA, linear discriminant, and binary classification tree analyses based on the molecular descriptors led to the derivation of simple SAR rules, in spite of the large number of starting variables. The SAR rules obtained by binary classification tree analysis on Routes 1 and 2 were quite consistent with our experimental activity data and molecular docking studies 7, indicating the approach can accurately predict active elastase inhibitors. We also found that physicochemical characteristics (i.e., energies of frontier molecular orbitals, molar refractions, lipophilicities) were not necessary for achieving good SAR rules, as comparable quality of SAR classification was obtained with 2D descriptors only. Thus, the use of atom pair descriptors is a valuable tool for identifying different SAR rules in high-throughput screening data sets and could provide a relatively simple classification useful for de novo design of elastase inhibitors with an N-benzoylpyrazole scaffold.

Although we applied atom pair descriptors to SAR in a set of related compounds, this approach is also applicable to chemically diverse data sets 13. We believe that our modification of the method using more specific atom typing and non-biased values of descriptors, in conjunction with sequential variable selection, will also be useful for SAR analysis in a heterogeneous series of compounds, and this issue will be addressed in future studies.

4. Materials and methods

4.1. Molecular set

The data set used in this study is a series of 53 N-benzoylpyrazoles with different levels of inhibitory activity for human neutrophil elastase. These compounds were selected by high-throughput screening of a 10,000-compound chemolibrary 7. For SAR analysis, the set of N-benzoylpyrazoles (Table 1) was divided into three activity classes according to their experimentally determined elastase inhibitory activity. Inhibitors possessing Ki≤200 nM were regarded as highly active and were placed in the activity class labeled “High” (13 compounds). N-Benzoylpyrazoles with moderate activity (200<Ki≤10000 nM) were placed in the activity class labeled “Medium” (10 compounds). Derivatives with Ki>10000 nM were considered non-active and were placed in the activity class labeled “NA” (30 compounds).
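The class assignment can be written as a simple threshold function. This is a minimal sketch: the function name `activity_class` is an illustrative assumption, while the cutoffs and class sizes are those stated above.

```python
def activity_class(ki_nm: float) -> str:
    """Map an inhibition constant Ki (in nM) to one of the three activity classes."""
    if ki_nm <= 200:
        return "High"      # Ki <= 200 nM
    if ki_nm <= 10000:
        return "Medium"    # 200 < Ki <= 10000 nM
    return "NA"            # Ki > 10000 nM, regarded as non-active


# Class sizes reported in the text; together they cover the whole data set.
CLASS_SIZES = {"High": 13, "Medium": 10, "NA": 30}
TOTAL_COMPOUNDS = sum(CLASS_SIZES.values())
```

Both boundaries are inclusive on the more active side, so a compound with Ki of exactly 200 nM falls in “High” and one with exactly 10000 nM falls in “Medium”.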

4.2. Structure encoding by atom pairs and other 2D descriptors

For the purpose of SAR analysis we used an atom pair representation of molecular structures, with each atom pair denoted as T1_D_T2, where T1 and T2 are the types of atoms in the pair, and D represents the topological distance, i.e., the number of bonds in the shortest path between these atoms in a structural formula. In our investigation, T1 and T2 were defined with the symbolic codes used in HyperChem, Version 7 (Hypercube, Inc., Gainesville, FL) for atom type representation within the MM+ force field. For example, CA, CO, and C3 codes were used for sp2-hybridized aromatic, carbonyl, and pyrazole carbon atoms, respectively. This approach allows easy generation of atom pairs directly from the output file containing the molecular structure (HIN file) built by HyperChem. The notation of atom types can be changed, if necessary, based on the force field used. For example, the codes listed above for aromatic, carbonyl, and pyrazole carbons would be altered to CA, C, and CM, respectively, if the AMBER force field were used instead of MM+ for HyperChem output. As atom pairs T1_D_T2 and T2_D_T1 are equivalent, we chose a unified definition with lexicographic order of type substrings (i.e., with T1≤T2).
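The T1_D_T2 encoding can be illustrated with a short sketch. This is not the CHAIN program and does not parse HIN files. It assumes the molecule is already available as a list of non-hydrogen atom types plus a list of bonds, and the four-atom fragment used in the example is a hypothetical toy graph, not one of Compounds 1–53.

```python
from collections import Counter, deque


def shortest_path_lengths(n_atoms, bonds, source):
    """Breadth-first search: bond counts from `source` to every reachable atom."""
    adjacency = {i: [] for i in range(n_atoms)}
    for a, b in bonds:
        adjacency[a].append(b)
        adjacency[b].append(a)
    dist = {source: 0}
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist


def atom_pair_counts(atom_types, bonds):
    """Count occurrences of each T1_D_T2 atom pair, with T1 <= T2 lexicographically."""
    counts = Counter()
    n_atoms = len(atom_types)
    for i in range(n_atoms):
        dist = shortest_path_lengths(n_atoms, bonds, i)
        for j in range(i + 1, n_atoms):
            if j not in dist:
                continue  # atoms in disconnected fragments form no pair
            t1, t2 = sorted((atom_types[i], atom_types[j]))
            counts[f"{t1}_{dist[j]}_{t2}"] += 1
    return counts


# Hypothetical linear fragment C4-C3-N2-N2 (heavy atoms only).
toy_types = ["C4", "C3", "N2", "N2"]
toy_bonds = [(0, 1), (1, 2), (2, 3)]
toy_pairs = atom_pair_counts(toy_types, toy_bonds)
```

Sorting the two type codes before building the key implements the unified T1≤T2 definition, so a pair is counted once regardless of the order in which its atoms are visited.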

All 367 unique atom pairs possible for non-hydrogen atoms in the 53 N-benzoylpyrazoles were generated. This 53×367 data matrix was automatically built by our CHAIN program, based on HIN files created in HyperChem. By convention, a matrix element at the intersection of the ith row and jth column was equal to the number of occurrences of the jth atom pair in the ith molecule. The data matrix obtained in this way for the 53 compounds contained columns with no variance for descriptors C3_1_C3, C3_1_N2, N2_1_N2, and C3_2_C3, because these atom pairs are present in all the compounds investigated at the same frequency. Thus, the corresponding columns were deleted from the matrix, resulting in a 53×363 matrix of atom pair descriptors.
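Removing the zero-variance columns amounts to dropping every descriptor that takes a single value across all compounds. A minimal sketch follows, with an illustrative function name and made-up toy counts (only the descriptor names come from the text):

```python
def drop_constant_columns(names, rows):
    """Remove descriptors whose value is identical in every row (zero variance)."""
    keep = [j for j in range(len(names)) if len({row[j] for row in rows}) > 1]
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_names, kept_rows


# Toy 3x3 matrix: rows are molecules, columns are atom pair counts (values invented).
toy_names = ["C3_1_C3", "C4_4_C4", "N2_1_N2"]
toy_rows = [
    [2, 0, 1],
    [2, 1, 1],
    [2, 2, 1],
]
kept_names, kept_rows = drop_constant_columns(toy_names, toy_rows)

# Column bookkeeping reported in the text: 367 atom pairs minus 4 constant ones.
N_ATOM_PAIR_COLUMNS = 367 - 4
```

In the toy matrix only C4_4_C4 varies between molecules, so it is the only column retained.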

In addition to atom pairs, we selected the following set of 6 additional structural 2D descriptors: number of substituents in ortho- (no) and meta- (nm) positions of the benzene ring; and numbers of substituents R1, R2, R3, R6 (Table 1) denoted as n1, n2, n3, np, respectively (integer variables). These descriptors were obtained directly from structural formulae of Compounds 1–53.

4.3. Physicochemical descriptors

The following 6 physicochemical descriptors were used: total molar refraction (Refr), lipophilicity (octanol-water partition coefficient; ACD/logP), energies of the highest occupied and lowest unoccupied molecular orbitals (EHOMO and ELUMO, respectively), and sum of refractions for substituents in the pyrazole (R1, R2, R3) and benzene (R4–R8) rings [Refr(Pz) and Refr(Ph), respectively]. Energies EHOMO and ELUMO were determined by the semi-empirical PM3 method after geometry optimization in HyperChem. The values of Refr, Refr(Pz), and Refr(Ph) were calculated with the QSAR built-in module of HyperChem. ACD/logP lipophilicities were taken from the site www.emolecules.com. The resulting data matrix of physicochemical and structural descriptors and atom pairs contained 375 columns (variables).

4.4. Data processing and derivation of SAR rules

Derivation of SAR classification was accompanied by sequential variable selection and reduction of dimensionality. In order to distinguish between variables significant and non-significant for SAR, we applied one-way analysis of variance (ANOVA) 21 using the STATISTICA 6.0 package (StatSoft, Inc., Tulsa, OK). The variables selected by ANOVA served as basic descriptors for refined classification by LDA, using the corresponding module of STATISTICA 6.0. Redundant or non-significant coefficients of the linear classification functions were automatically zeroed by STATISTICA 6.0.
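The ANOVA screening step can be sketched as follows. This is a plain-Python illustration of the one-way F statistic for a single descriptor grouped by activity class. It is not the STATISTICA implementation, and the toy descriptor values are invented.

```python
def one_way_anova_f(groups):
    """F statistic of one-way ANOVA for a list of groups of numeric values."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-group and within-group sums of squares.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    # Note: raises ZeroDivisionError if every group is internally constant.
    return ms_between / ms_within


# Invented counts of one descriptor in three hypothetical activity classes.
toy_groups = [[0, 0, 1], [1, 1, 2], [2, 2, 3]]
toy_f = one_way_anova_f(toy_groups)
```

A descriptor is retained when its F value is significant against the F distribution with (k − 1, N − k) degrees of freedom. If SciPy is available, `scipy.stats.f_oneway` returns both the statistic and the p-value.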

Although LDA methodology allows analysis with linearly dependent variables, we undertook another step of variable selection and excluded dependent correlated variables, as they originated from the same molecular feature. We found that each of 11 descriptors retained by LDA (n3, C3_1_C4, C3_2_C4, C4_2_N2, C3_3_C4, C4_3_N2, C4_4_C4, C4_4_CO, C4_5_CA, C4_3_CO, C4_4_O1) had correlation coefficients >0.8 with at least one descriptor within this same group. This correlation can be explained easily because all 11 variables are related to the presence of methyl substituents in the compounds. For example, the tetrahedral sp3-carbon (atom type C4) is involved in all atom pairs from this group. Similarly, variable n3 was also present in this group of correlated descriptors because R3 is a methyl radical in almost all N-benzoylpyrazoles containing a substituent in this position of the pyrazole ring. Thus, it was reasonable to choose one of the mutually correlated descriptors as an independent variable representing the entire group of 11 variables. For this purpose, we selected atom pair C4_4_C4 as an independent descriptor, because it had the largest sum of correlation coefficients with the remaining 10 variables from this group. Hence, the dimension of descriptor space was further reduced by 10 variables. We repeated LDA analysis focusing on the remaining descriptors and applying the “best subset search” option available in STATISTICA 6.0.
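The choice of a representative descriptor can be sketched as follows. The text does not state whether signed or absolute correlation coefficients were summed, so this sketch assumes signed Pearson coefficients. The descriptor names `x1`–`x3` and their values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    # Note: undefined (division by zero) for a constant column.
    return cov / (norm_x * norm_y)


def pick_representative(columns):
    """Return the descriptor with the largest sum of correlations to the others."""
    def correlation_sum(name):
        return sum(pearson(columns[name], columns[other])
                   for other in columns if other != name)
    return max(columns, key=correlation_sum)


# Hypothetical group of three mutually correlated descriptors.
toy_columns = {
    "x1": [0, 1, 2, 3],
    "x2": [0, 1, 2, 4],
    "x3": [0, 2, 1, 4],
}
representative = pick_representative(toy_columns)
```

In this toy group `x2` has the largest correlation sum (about 0.98 + 0.89), so it would stand in for the other two descriptors, just as C4_4_C4 represents the 11 methyl-related variables.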

To make the results as easy to interpret as possible, we also built simplified SAR rules with the use of binary classification tree methodology 25. Starting from variables of the best subsets selected as described above, classification trees were built with STATISTICA 6.0 using discriminant-based univariate splits with estimated prior probabilities and equal misclassification costs for classes 25,26.

Supplementary Material

Supplementary data associated with this report consist of 1) the results of SAR classification and leave-one-out (LOO) prediction obtained for each N-benzoylpyrazole with the initial (classic) LDA analysis and 2) an illustration of simplified SAR rules based on classification tree analysis for each compound. Supplementary data associated with this article can be found, in the online version, at…‥

Acknowledgments

This work was supported in part by Department of Defense grant W9113M-04-1-0001, National Institutes of Health grant RR020185, and the Montana State University Agricultural Experimental Station. The U.S. Army Space and Missile Defense Command, 64 Thomas Drive, Frederick, MD 21702 is the awarding and administering acquisition office. The content of this report does not necessarily reflect the position or policy of the U.S. Government.

Footnotes

Publisher's Disclaimer: This is a PDF file of an unedited manuscript that has been accepted for publication. As a service to our customers we are providing this early version of the manuscript. The manuscript will undergo copyediting, typesetting, and review of the resulting proof before it is published in its final citable form. Please note that during the production process errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.

References

  • 1. Owen CA, Campbell EJ. J. Leukoc. Biol. 1999;65:137. doi: 10.1002/jlb.65.2.137.
  • 2. Dollery CM, Owen CA, Sukhova GK, Krettek A, Shapiro SD, Libby P. Circulation. 2003;107:2829. doi: 10.1161/01.CIR.0000072792.65250.4A.
  • 3. Carden D, Xiao F, Moak C, Willis BH, Robinson-Jackson S, Alexander S. Am. J. Physiol. 1998;275:H385–H392. doi: 10.1152/ajpheart.1998.275.2.H385.
  • 4. Barnes PJ, Stockley RA. Eur. Respir. J. 2005;25:1084. doi: 10.1183/09031936.05.00139104.
  • 5. Chughtai B, O'Riordan TG. J. Aerosol Med. 2004;17:289. doi: 10.1089/jam.2004.17.289.
  • 6. Tremblay GM, Janelle MF, Bourbonnais Y. Curr. Opin. Investig. Drugs. 2003;4:556.
  • 7. Schepetkin IA, Khlebnikov AI, Quinn MT. J. Med. Chem. 2007;50:4928. doi: 10.1021/jm070600+.
  • 8. Andricopulo AD, Montanari CA. Mini. Rev. Med. Chem. 2005;5:585. doi: 10.2174/1389557054023224.
  • 9. Worth AP, Bassan A, De BJ, Gallegos SA, Netzeva T, Patlewicz G, Pavan M, Tsakovska I, Eisenreich S. SAR QSAR. Environ. Res. 2007;18:111. doi: 10.1080/10629360601054255.
  • 10. Buttingsrud B, Ryeng E, King RD, Alsberg BK. J. Comput. Aided Mol. Des. 2006;20:361. doi: 10.1007/s10822-006-9058-y.
  • 11. Carhart RE, Smith DH, Venkataraghavan R. J. Chem. Inf. Comput. Sci. 1985;25:64.
  • 12. Gute BD, Basak SC. SAR QSAR. Environ. Res. 2006;17:37. doi: 10.1080/10659360600560933.
  • 13. Rusinko A3, Farmen MW, Lambert CG, Brown PL, Young SS. J. Chem. Inf. Comput. Sci. 1999;39:1017. doi: 10.1021/ci9903049.
  • 14. Greenidge PA, Merette SA, Beck R, Dodson G, Goodwin CA, Scully MF, Spencer J, Weiser J, Deadman JJ. J. Med. Chem. 2003;46:1293. doi: 10.1021/jm021028j.
  • 15. Frecer V, Kabelac M, De NP, Pricl S, Miertus S. J. Mol. Graph. Model. 2004;22:209. doi: 10.1016/S1093-3263(03)00161-X.
  • 16. Li X, Zhang W, Qiao X, Xu X. Bioorg. Med. Chem. 2007;15:220. doi: 10.1016/j.bmc.2006.09.074.
  • 17. Nomizu M, Iwaki T, Yamashita T, Inagaki Y, Asano K, Akamatsu M, Fujita T. Int. J. Pept. Protein Res. 1993;42:216. doi: 10.1111/j.1399-3011.1993.tb00135.x.
  • 18. Seierstad M, Agrafiotis DK. Chem. Biol. Drug Des. 2006;67:284. doi: 10.1111/j.1747-0285.2006.00379.x.
  • 19. Blower P, Fligner M, Verducci J, Bjoraker J. J. Chem. Inf. Comput. Sci. 2002;42:393. doi: 10.1021/ci0101049.
  • 20. Rusinko A3, Young SS, Drewry DH, Gerritz SW. Comb. Chem. High Throughput Screen. 2002;5:125. doi: 10.2174/1386207024607383.
  • 21. Lindman HR. 1974
  • 22. Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP. J. Chem. Inf. Comput. Sci. 2003;43:1947. doi: 10.1021/ci034160g.
  • 23. Tong W, Welsh WJ, Shi L, Fang H, Perkins R. Environ. Toxicol. Chem. 2003;22:1680. doi: 10.1897/01-198.
  • 24. Raevsky OA. Mini. Rev. Med. Chem. 2004;4:1041. doi: 10.2174/1389557043402964.
  • 25. Breiman L, Friedman JH, Olshen RA, Stone CJ. 1984
  • 26. Loh WY, Shih YS. Statistica Sinica. 1997;7:815.
