Abstract
A secondary development program was built in this work to compute the physicochemical properties of DPP-IV inhibitors. Starting from the computed molecular descriptors, a two-stage feature selection method called mRMR-BFS (minimum redundancy maximum relevance-backward feature selection) was adopted. Support vector regression (SVR) was then used to build a model that maps DPP-IV inhibitors to their inhibitory activity. The squared correlation coefficients for the leave-one-out cross-validation (LOOCV) of the training set and for the test set are 0.815 and 0.884, respectively. An online server for predicting the inhibitory activity pIC50 of DPP-IV inhibitors, as described in this paper, is presented in Section 3.4.
1. Introduction
The incretin hormones glucagon-like peptide-1 (GLP-1) and glucose-dependent insulinotropic polypeptide (GIP) are endogenous peptides that stimulate glucose-dependent insulin secretion [1]. One of the important roles of dipeptidyl peptidase IV (DPP-IV, also written DPP-4) [2] is the rapid inactivation of GLP-1 and GIP. Inhibition of DPP-4 increases the levels of endogenous intact circulating GLP-1 and GIP. Consequently, DPP-4 inhibitors, or gliptins, have recently been regarded as a promising approach for the treatment of type-2 diabetes mellitus.
In recent years, multiple small-molecule DPP-4 inhibitors have been reported [3, 4], and the development of a structurally diverse collection of DPP-4 inhibitors is an active research area [5–8]. Computational and mathematical approaches have been widely employed in quantitative structure-activity relationship (QSAR) analysis [9–13]. Using statistical methods, Paliwal et al. carried out QSAR analyses on a dataset of 47 pyrrolidine analogs acting as DPP-IV inhibitors [14]. Murugesan et al. used comparative molecular field analysis (CoMFA) and comparative molecular similarity indices analysis (CoMSIA) to analyze the structural requirements of the DPP-IV active site [15]. Gao et al. developed a novel 3D-QSAR model to assist the rational design of novel, potent, and selective pyrrolopyrimidine DPP-4 inhibitors [16]. Several other efforts have applied computational and mathematical approaches to small-molecule DPP-4 inhibitors. In our previous study [17], we used a quantum chemistry method [18] to optimize a series of DPP-IV inhibitors and built a 2D-QSAR model that predicts the inhibitory activity of small molecules with satisfactory results. However, calculating the molecular descriptors adopted in that 2D-QSAR model is time consuming.
In view of this, we set out to devise an efficient method for predicting the activity of small molecules from the physical and chemical properties of the compounds.
According to the general development trend [19, 20] and recent research progress [21–31], the following procedures should be considered when establishing a powerful statistical predictor for a biological system: (i) a valid benchmark dataset is constructed or selected to train and test the predictor; (ii) the samples are formulated with effective mathematical descriptors that contribute to the prediction; (iii) a powerful algorithm is introduced or developed to perform the prediction; (iv) cross-validation tests are used to estimate the performance of the predictor; (v) a user-friendly online server that is accessible to the public is established for the predictor. In this study, we describe how each of these steps was handled for predicting the DPP-IV inhibitory activity pIC50 from the physicochemical properties computed by our program.
2. Materials and Methods
2.1. Data Preparation
The dataset used in the present work contains 48 pyrrolidine amide derivatives, a diverse series of DPP-IV inhibitors with known IC50 values collected from the literature [32, 33]. The detailed structures are documented in the Supplementary Materials (available at http://dx.doi.org/10.1155/2013/798743). Figure 1 shows the common scaffold on which all of the compounds under investigation are based.
Figure 1.
Molecular structure of cyanopyrrolidine amides as DPP-IV inhibitors.
How the molecules are described is an important problem in establishing a statistical model. In this study, the molecular descriptors for the 48 molecules were calculated by the secondary development software built on the Calculator Plugins, a product of ChemAxon [34]. ChemAxon is a company that provides chemical software development platforms and desktop applications for the biotechnology and pharmaceutical industries [35].
2.2. Introduction to the Program
Calculating descriptors for many small molecules through the MarvinSketch graphical interface or the JChem for Excel program is not very convenient. ChemAxon exposes its Calculator Plugins through an API, which our lab members studied carefully and tested repeatedly. The calculation results were compared with those of the Gaussian 09 [18], JChem for Excel [34], HyperChem 7.5 [20, 36], and Dragon [37] programs. By invoking the Calculator Plugins from the Java language, we developed a convenient customized batch calculation program (the secondary development software) for small-molecule descriptors.
This program provides a tree-style selection box in which the user can visually choose the molecular descriptors to calculate (as shown in Figure 2; the command-line version does not provide descriptor selection). The molecular structures are built with the GaussView 5.0 package [38, 39] and saved as MOL-format files. The command-line version of the program is commonly run on a Linux server with a command similar to the following:
java -jar JChemCmd.jar Molecules Pathway Result.csv Method.xml
Figure 2.

The program interface for the computation of molecular descriptors.
2.3. Model Validation
2.3.1. Dataset
The full dataset was divided into a training set (36 compounds) and a test set (12 compounds). All samples were ranked by activity, and every fourth sample was extracted to form the test set.
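The ranked split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `split_every_fourth`, the `offset` parameter (which position within each group of four is held out), and the toy pIC50 values are hypothetical and are not taken from the paper.

```python
def split_every_fourth(activities, step=4, offset=3):
    """Rank samples by activity and hold out every `step`-th ranked sample.

    `offset` selects which position inside each group of `step` goes to the
    test set. It is an assumption: the text only states that every fourth
    sample was extracted.
    """
    # Indices of the samples, ordered by increasing activity.
    ranked = sorted(range(len(activities)), key=lambda i: activities[i])
    test = [idx for pos, idx in enumerate(ranked) if pos % step == offset]
    train = [idx for pos, idx in enumerate(ranked) if pos % step != offset]
    return train, test


# Toy pIC50 values (hypothetical, not the paper's data).
pic50 = [7.0, 6.6, 8.7, 7.4, 8.1, 6.9, 7.8, 7.2]
train_idx, test_idx = split_every_fourth(pic50)
print("train:", train_idx, "test:", test_idx)
```

Applied to 48 ranked compounds, such a scheme yields the 36/12 partition reported here, and it keeps the test set spread over the whole activity range.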
2.3.2. Leave-One-Out Cross-Validation (LOOCV) and Predictive Validation
In this study, leave-one-out cross-validation (LOOCV) [40, 41] was used to assess the prediction quality on the training set. In LOOCV, each sample is held out in turn and predicted by a model built from all of the remaining samples.
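As a minimal sketch of the LOOCV loop, the Python code below holds out one sample at a time and predicts it from the rest. The learner is deliberately a 1-nearest-neighbour regressor, chosen only to keep the example dependency-free; it is a stand-in and not the SVR used in this work. All function names and data are hypothetical.

```python
def loocv_predictions(X, y, fit, predict):
    """Return one held-out prediction per sample (leave-one-out)."""
    preds = []
    for i in range(len(X)):
        # Build the model on every sample except sample i.
        X_train = X[:i] + X[i + 1:]
        y_train = y[:i] + y[i + 1:]
        model = fit(X_train, y_train)
        preds.append(predict(model, X[i]))
    return preds


# Stand-in learner (NOT the paper's SVR): 1-nearest-neighbour regression.
def fit_1nn(X_train, y_train):
    return list(zip(X_train, y_train))


def predict_1nn(model, x):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    # On ties, the first stored neighbour wins.
    return min(model, key=lambda pair: dist2(pair[0], x))[1]


# Toy one-descriptor data set (hypothetical values).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [6.5, 7.0, 7.5, 8.0]
cv_pred = loocv_predictions(X, y, fit_1nn, predict_1nn)
print(cv_pred)
```

Passing `fit` and `predict` as arguments keeps the validation loop independent of the learner, so the same loop could wrap an SVR implementation.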
2.3.3. Fitting and Predictive Performances of Models
The fitting and predictive performances of the model were measured by the squared correlation coefficient ($q^2$) and the root mean square error (RMSE) for both the training set and the external test set. They are defined as follows:
$$
q^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - y_{\mathrm{mean}}\right)^2}, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2} \tag{1}
$$
where $y_i$ and $\hat{y}_i$ are the actual and predicted pIC50 values of sample $i$, respectively, $y_{\mathrm{mean}}$ is the average pIC50 value of the samples, and $N$ is the number of samples in the set being evaluated.
2.4. Methods
Because some features are redundant, descriptors must be selected before a suitable model is established; this selection plays an important role in constructing the final model. In this work, the mRMR-BFS method (minimum redundancy maximum relevance-backward feature selection) [42, 43] was used to select the molecular descriptors, and the support vector regression (SVR) model was built on the feature selection results.
2.4.1. mRMR-BFS Algorithm
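The mRMR (minimum-redundancy maximum-relevance) algorithm was introduced by Ding and Peng [44] and is commonly used for feature selection. It ranks features with a score function that rewards maximum relevance to the target and minimum redundancy with the already selected features. The score function is defined as follows:
$$
\max_{f_j \in S_n}\left[ I(f_j, c) - \frac{1}{m}\sum_{f_i \in S_m} I(f_j, f_i) \right], \quad j = 1, 2, \ldots, n \tag{2}
$$
where $f_j \in S_n$, $f_i \in S_m$, and $S_m = S - S_n$; $S$ is the complete feature set, $S_m$ is the set of $m$ already selected features, $S_n$ is the set of $n$ features still to be selected, and $c$ denotes the target variable. The mutual information $I(x, y)$ is as follows:
$$
I(x, y) = \iint p(x, y)\,\log\frac{p(x, y)}{p(x)\,p(y)}\,dx\,dy \tag{3}
$$
where $p(x, y)$, $p(x)$, and $p(y)$ are the probability density functions.
More details about the mRMR algorithm can be found in [44, 45].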
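To make the ranking concrete, the following Python sketch implements a greedy mRMR ordering in its common difference form (relevance minus mean redundancy). Two points are assumptions rather than details given here: the descriptors are taken to be already discretized, so that mutual information can be estimated from empirical frequencies, and the feature names and values are hypothetical toy data.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """I(x, y) in nats for two discrete sequences, from empirical frequencies."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(a,b) * log(p(a,b) / (p(a) p(b))), with all probabilities as counts / n.
    return sum((c / n) * math.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def mrmr_rank(features, target):
    """Greedy mRMR ordering of the feature names in `features` (a dict)."""
    remaining = list(features)
    selected = []
    while remaining:
        def score(name):
            relevance = mutual_information(features[name], target)
            if not selected:
                return relevance
            redundancy = sum(mutual_information(features[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized descriptors; f2 duplicates f1 on purpose.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "f1": [0, 0, 0, 1, 1, 1, 1, 1],
    "f2": [0, 0, 0, 1, 1, 1, 1, 1],
    "f3": [0, 0, 1, 0, 1, 1, 0, 1],
}
ranking = mrmr_rank(features, target)
print(ranking)
```

In this toy case the duplicated feature is pushed to the end of the ranking even though its relevance equals that of the top feature, which is the behaviour the redundancy term is meant to produce.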
To further improve the predictor and the feature selection, backward feature selection (BFS) was applied to the result of mRMR. The 50 most important variables were obtained from the mRMR procedure. The BFS-selected feature set $S_s$ is initialized with all features in $S$:
$$
S_s = S \tag{4}
$$
Starting from the mRMR-selected feature subset $S_s$, the next BFS-selected feature set is obtained by the following steps:
(i) For each feature $f_k \in S_s$, form the candidate feature set $S_C = S_s - \{f_k\}$; build an SVR model on each $S_C$ and evaluate it by the LOOCV method.
(ii) Select the feature $f$ whose removal from $S_s$ gives the lowest RMSE.
(iii) Remove $f$ from $S_s$ to form the next BFS-selected feature set.
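The backward elimination steps can be sketched as follows in Python. Several choices here are assumptions made for illustration: a 1-nearest-neighbour regressor replaces the SVR so that the example needs no external library, the loop runs down to a single feature and then returns the subset with the lowest LOOCV RMSE seen along the way (the stopping rule is not specified in the text), and the data are hypothetical.

```python
import math


def loocv_rmse(X, y, cols):
    """LOOCV RMSE of a 1-NN regressor restricted to the columns in `cols`.

    The 1-NN learner is a stand-in; the paper evaluates an SVR at this point.
    """
    sq_err = 0.0
    for i in range(len(X)):
        best_d, best_y = None, None
        for j in range(len(X)):
            if j == i:
                continue  # leave sample i out
            d = sum((X[i][c] - X[j][c]) ** 2 for c in cols)
            if best_d is None or d < best_d:
                best_d, best_y = d, y[j]
        sq_err += (y[i] - best_y) ** 2
    return math.sqrt(sq_err / len(X))


def backward_feature_selection(X, y, cols, min_features=1):
    """Repeatedly drop the feature whose removal gives the lowest LOOCV RMSE."""
    current = list(cols)
    history = [(list(current), loocv_rmse(X, y, current))]
    while len(current) > min_features:
        # One candidate set per feature f_k: the current set without f_k.
        candidates = [[c for c in current if c != k] for k in current]
        current = min(candidates, key=lambda s: loocv_rmse(X, y, s))
        history.append((list(current), loocv_rmse(X, y, current)))
    # Assumed stopping rule: best subset visited on the elimination path.
    return min(history, key=lambda item: item[1])


# Toy data: column 0 determines y, column 1 is uninformative (hypothetical).
X = [[0.0, 5.0], [1.0, 0.0], [2.0, 4.0], [3.0, 1.0], [4.0, 3.0], [5.0, 2.0]]
y = [6.0, 6.5, 7.0, 7.5, 8.0, 8.5]
best_cols, best_rmse = backward_feature_selection(X, y, [0, 1])
print(best_cols, best_rmse)
```

Each elimination round refits the model once per remaining feature and once per held-out sample, so the cost grows quickly with the number of features; restricting BFS to the mRMR-ranked subset keeps this manageable.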
2.4.2. SVM (Support Vector Machine)
Vapnik and co-workers developed the SVM algorithm, a supervised machine-learning method used for classification and regression analysis. Because it embodies the structural risk minimization principle, the SVM shows good overall generalization performance and is well suited to problems with small sample sets. In this work, the SVM was applied to regression (SVR). Details of the algorithm can be found in [46]. The algorithm was run with the software package Weka 3.6.7 [47, 48].
3. Results and Discussion
3.1. Selection of Features
First, the mRMR method was applied to rank all 75 features according to their mRMR scores. Second, the backward feature selection (BFS) algorithm based on SVR was used to search for feature combinations. Because different machine learning methods lead to different results, several robust methods, namely the nearest-neighbor algorithm (NNA), the support vector machine (SVM with an RBF kernel function), and AdaBoost, were each used to find an optimal feature subset under leave-one-out cross-validation. Based on the LOOCV results, we adopted the SVM as the prediction engine in this study.
Table 1 lists the optimal subset obtained with the two-stage feature selection method, mRMR-BFS. The six features in the optimal subset fall into four categories (following the classification of the Calculator Plugins [49]): elemental analysis, geometry, topology, and others (pKa). The geometry and topology descriptors, which account for four of the six features, are the most important in this work. Both are related to the size of the molecule, which indicates that the size of the cyanopyrrolidine amide derivatives plays a main role in the inhibitory activity.
Table 1.
Symbols for molecular descriptors involved in the model.
| Molecular descriptor | Type | Description |
|---|---|---|
| OComposition | Elemental analysis functions | O Composition |
| MaximalProjectionArea | Geometry | Calculates the maximal projection area |
| MinimalProjectionArea | Geometry | Calculates the minimal projection area |
| BasicpKa | pKa | Constant denoting basic pKa |
| RingBondCount | Topology | Ring bond count |
| AliphaticRingCount | Topology | Aliphatic ring count |
3.2. Results of Computation
In this work, $q^2_{\mathrm{train}}$, $q^2_{\mathrm{train\text{-}CV}}$, and $q^2_{\mathrm{test}}$ denote the squared correlation coefficients for the training set, the cross-validation of the training set, and the external test set, respectively. Likewise, $\mathrm{RMSE}_{\mathrm{train}}$, $\mathrm{RMSE}_{\mathrm{train\text{-}CV}}$, and $\mathrm{RMSE}_{\mathrm{test}}$ denote the corresponding root mean square errors.
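The final model was built by SVR with the Gaussian kernel function (RBF) and the parameters $C = 2.0$, $\varepsilon = 0.05$, and $\gamma = 1.0$. The Gaussian kernel function (RBF) is given as follows:
$$
K(x, x_i) = \exp\left(-\gamma \left\lVert x - x_i \right\rVert^2\right) \tag{5}
$$
The model based on the above parameters with the original data has the following form:
$$
\mathrm{pIC50} = \sum_{i=1}^{n_{SV}} \beta_i \exp\left(-\gamma \left\lVert x - x_i \right\rVert^2\right) + b \tag{6}
$$
where $\beta_i$ is the Lagrange coefficient of the support vector $x_i$, $n_{SV}$ is the number of support vectors, and $b$ is the bias term.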
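How a prediction follows from an RBF-kernel SVR can be illustrated with a short Python sketch of the standard decision function, a weighted sum of kernel values plus a bias. The support vectors, coefficients, and bias below are hypothetical placeholders, not the fitted values of this model; only $\gamma = 1.0$ is taken from the text.

```python
import math


def rbf_kernel(x, z, gamma=1.0):
    """Gaussian (RBF) kernel exp(-gamma * ||x - z||^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))


def svr_predict(x, support_vectors, betas, bias, gamma=1.0):
    """Standard SVR decision function: sum_i beta_i * K(x, x_i) + bias."""
    return sum(beta * rbf_kernel(x, sv, gamma)
               for sv, beta in zip(support_vectors, betas)) + bias


# Hypothetical placeholders (NOT the coefficients of the paper's model).
support_vectors = [[0.0, 0.0], [1.0, 0.0]]
betas = [0.5, -0.25]
bias = 7.0

prediction = svr_predict([0.0, 0.0], support_vectors, betas, bias, gamma=1.0)
print(round(prediction, 3))
```

Because the kernel decays with squared distance, a query molecule far from every support vector in descriptor space receives a prediction close to the bias alone, which is one reason descriptors are normally scaled before such a model is fitted.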
The experimental versus predicted pIC50 values based on the SVR model for the training set and the test set are shown in Figure 3. The values of $q^2_{\mathrm{train}}$, $q^2_{\mathrm{train\text{-}CV}}$, and $q^2_{\mathrm{test}}$ were 0.953, 0.815, and 0.884, respectively, and the values of $\mathrm{RMSE}_{\mathrm{train}}$, $\mathrm{RMSE}_{\mathrm{train\text{-}CV}}$, and $\mathrm{RMSE}_{\mathrm{test}}$ were 0.123, 0.247, and 0.193, respectively. Figure 3 illustrates that the regression line is appropriate not only for the fitted pIC50 values of the training set but also for the predicted pIC50 values of the external test set. Table 2 shows the experimental and calculated values for the training set and the test set. From Figure 3 and Table 2, it can be concluded that the predicted values are in good agreement with the experimental ones. Figure 4 shows the dispersion plot of the residuals for the training and test sets. The residuals are randomly dispersed around the zero line in Figure 4, which means that the model is appropriate for the data.
Figure 3.

Predicted versus experimental pIC50 for the training (circles for fitting and triangles for CV, respectively) and test (stars) sets.
Table 2.
Experimental and predicted pIC50 for the training and test sets.
| No. | pIC50(exp) | pIC50(Pred) | pIC50(LOOCV) |
|---|---|---|---|
| 1 | 7.00 | 7.11 | 7.17 |
| 2T | 7.20 | 7.30 | — |
| 3 | 7.35 | 7.36 | 7.33 |
| 4 | 7.33 | 7.23 | 7.16 |
| 5T | 7.01 | 7.00 | — |
| 6 | 7.14 | 7.04 | 6.92 |
| 7 | 7.14 | 7.03 | 6.84 |
| 8 | 6.71 | 7.01 | 7.14 |
| 9T | 6.64 | 6.80 | — |
| 10 | 7.06 | 7.13 | 7.14 |
| 11 | 6.91 | 7.01 | 7.28 |
| 12 | 6.62 | 6.73 | 6.89 |
| 13 | 6.60 | 6.70 | 6.78 |
| 14T | 6.85 | 6.73 | — |
| 15 | 6.67 | 6.70 | 6.70 |
| 16 | 6.60 | 6.70 | 6.70 |
| 17 | 6.94 | 6.86 | 6.86 |
| 18 | 6.74 | 6.79 | 6.79 |
| 19T | 6.52 | 6.73 | — |
| 20 | 8.70 | 8.27 | 8.18 |
| 21 | 8.30 | 8.34 | 8.34 |
| 22 | 7.46 | 7.39 | 7.39 |
| 23 | 7.40 | 7.50 | 7.43 |
| 24T | 8.22 | 8.24 | — |
| 25 | 8.15 | 8.25 | 8.57 |
| 26 | 8.30 | 8.24 | 8.25 |
| 27 | 8.05 | 8.13 | 8.14 |
| 28 | 8.22 | 8.11 | 8.05 |
| 29 | 8.15 | 8.05 | 7.90 |
| 30T | 8.00 | 7.78 | — |
| 31 | 7.66 | 7.77 | 8.11 |
| 32T | 8.15 | 7.80 | — |
| 33 | 7.82 | 7.93 | 8.17 |
| 34T | 7.77 | 7.54 | — |
| 35T | 7.51 | 7.46 | — |
| 36 | 8.10 | 8.00 | 7.85 |
| 37 | 7.72 | 7.82 | 8.00 |
| 38T | 7.43 | 7.09 | — |
| 39 | 7.96 | 7.93 | 7.93 |
| 40 | 8.10 | 8.17 | 8.18 |
| 41 | 7.51 | 7.40 | 7.30 |
| 42 | 7.92 | 7.89 | 7.89 |
| 43 | 7.51 | 7.47 | 7.47 |
| 44 | 7.92 | 7.93 | 7.93 |
| 45 | 7.80 | 7.70 | 7.55 |
| 46 | 7.60 | 7.76 | 7.84 |
| 47 | 7.85 | 7.75 | 7.26 |
| 48T | 7.89 | 7.98 | — |
T indicates the test samples.
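As a worked check, the test-set statistics can be recomputed from the twelve test rows of Table 2. The Python sketch below assumes the usual definitions $q^2 = 1 - \mathrm{SS}_{\mathrm{res}}/\mathrm{SS}_{\mathrm{tot}}$ and $\mathrm{RMSE} = \sqrt{\mathrm{SS}_{\mathrm{res}}/N}$, with the mean taken over the test set itself; the function names are hypothetical. Because the tabulated values are rounded to two decimals, the result can only be expected to match the reported 0.884 and 0.193 approximately.

```python
import math


def q2(y_true, y_pred):
    """Squared correlation coefficient: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    ss_tot = sum((a - mean) ** 2 for a in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root mean square error."""
    ss_res = sum((a - p) ** 2 for a, p in zip(y_true, y_pred))
    return math.sqrt(ss_res / len(y_true))


# Test rows of Table 2 (samples 2, 5, 9, 14, 19, 24, 30, 32, 34, 35, 38, 48).
exp_test = [7.20, 7.01, 6.64, 6.85, 6.52, 8.22,
            8.00, 8.15, 7.77, 7.51, 7.43, 7.89]
pred_test = [7.30, 7.00, 6.80, 6.73, 6.73, 8.24,
             7.78, 7.80, 7.54, 7.46, 7.09, 7.98]

q2_test = q2(exp_test, pred_test)
rmse_test = rmse(exp_test, pred_test)
print(round(q2_test, 2), round(rmse_test, 2))  # roughly 0.88 and 0.19
```

The agreement with the reported test-set values, up to the rounding of the table entries, is consistent with these definitions being the ones used for the statistics in this section.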
Figure 4.

Dispersion plot of the residuals for the training and test sets.
3.3. Analysis of the New Method
The secondary development program developed in this work was used to establish a robust model with $q^2_{\mathrm{train}} = 0.953$, $q^2_{\mathrm{train\text{-}CV}} = 0.815$, and $q^2_{\mathrm{test}} = 0.884$. To validate the generalization and reliability of the descriptors obtained with our secondary development program, the structures of the same training and test compounds were also optimized at the HF/6-31G* level of theory with the Gaussian program, and 1262 descriptors were computed with the HyperChem 7.5 program [20], the JChem for Excel package [34], and the Dragon program [37]. A robust and reliable model was obtained from these descriptors with $q^2_{\mathrm{train}} = 0.969$, $q^2_{\mathrm{train\text{-}CV}} = 0.868$, and $q^2_{\mathrm{test}} = 0.891$. The statistical comparisons are summarized in Table 3.
Table 3.
Comparative statistical parameters obtained by the secondary development program and the Gaussian program concerning the same compounds.
| Program | $q^2_{\mathrm{train}}$ | $q^2_{\mathrm{train\text{-}CV}}$ | $q^2_{\mathrm{test}}$ |
|---|---|---|---|
| The secondary development program developed in this work | 0.953 | 0.815 | 0.884 |
| Gaussian, HyperChem 7.5, JChem for Excel package, Dragon | 0.969 | 0.868 | 0.891 |
With the secondary development program, it takes less than 30 minutes per molecule to go from structure optimization to the computation of descriptors. In contrast, more than 36 hours were needed with the Gaussian-based procedure. These results show that the computing speed is greatly improved by the secondary development program, while the statistical parameters of the models are comparable to those obtained with the Gaussian method. Therefore, the secondary development program is helpful not only for saving descriptor computation time but also for making effective QSPR models available online in the future.
In a benchmark test, support vector regression (SVR) was compared with multiple linear regression (MLR) and the back propagation-artificial neural network (BP-ANN) on $q^2_{\mathrm{train\text{-}CV}}$. The statistical comparisons are shown in Table 4. According to Table 4, SVR has the best generalization ability of the three methods in our work.
Table 4.
$q^2_{\mathrm{train\text{-}CV}}$ of different methods.
| Method | SVR | BP-ANN | MLR |
|---|---|---|---|
| $q^2_{\mathrm{train\text{-}CV}}$ | 0.815 | 0.761 | 0.721 |
3.4. The Online Web Server
Since user-friendly and publicly accessible online servers represent the trend for developing more useful models or predictors, we established a web server for predicting the DPP-IV inhibitory activity pIC50 at http://chemdata.shu.edu.cn:8080/QSARPrediction/index.jsp.
The web server allows users to upload the MOL-format file of a molecule and returns the prediction given by the model of our mRMR-BFS-SVR method. During this process, the Calculator Plugins [49] of ChemAxon are invoked by the background program. The most notable characteristic of the server is that users need to do nothing except upload the file of the unknown small molecule; they then obtain the predicted result after a short wait. This is a clear advance over our previous work [17, 20, 36].
4. Conclusions
In this paper, a secondary development program was proposed as an efficient and fast means of calculating molecular descriptors. The mRMR-BFS method was adopted for feature selection, and SVR was used to construct the model that maps DPP-IV inhibitors to their inhibitory activity. The $q^2_{\mathrm{train}}$, $q^2_{\mathrm{train\text{-}CV}}$, and $q^2_{\mathrm{test}}$ of the model are 0.953, 0.815, and 0.884, respectively. These results are comparable to those obtained with the Gaussian method. A web server that provides a quick way to predict the DPP-IV inhibitory activity pIC50 of unknown small molecules from their MOL-format files was established with our secondary development program at http://chemdata.shu.edu.cn:8080/QSARPrediction/index.jsp. The work thus offers a user-friendly and rapid approach whose accuracy is close to that of the Gaussian method.
Supplementary Material
A full list of the structures and molecular descriptors of the compounds is available in the Supplementary Materials.
Acknowledgments
This study was supported by the National Science Foundation of China (20973108, 20902056), the Shanghai Education Committee Project (11ZZ83), and the Leading Academic Discipline Project of Shanghai Municipal Education Commission, China (J50101). The authors also acknowledge ChemAxon for their excellent products.
References
- 1. Kim MH, Lee MK. The incretins and pancreatic beta-cells: use of glucagon-like peptide-1 and glucose-dependent insulinotropic polypeptide to cure type 2 diabetes mellitus. Korean Diabetes Journal. 2010;34(1):2–9. doi: 10.4093/kdj.2010.34.1.2.
- 2. Sarashina A, Sesoko S, Nakashima M, et al. Linagliptin, a dipeptidyl peptidase-4 inhibitor in development for the treatment of type 2 diabetes mellitus: a phase I, randomized, double-blind, placebo-controlled trial of single and multiple escalating doses in healthy adult male Japanese subjects. Clinical Therapeutics. 2010;32(6):1188–1204. doi: 10.1016/j.clinthera.2010.06.004.
- 3. Augustyns K, Van der Veken P, Haemers A. Inhibitors of proline-specific dipeptidyl peptidases: DPP IV inhibitors as a novel approach for the treatment of type 2 diabetes. Expert Opinion on Therapeutic Patents. 2005;15(10):1387–1407.
- 4. Weber AE. Dipeptidyl peptidase IV inhibitors for the treatment of diabetes. Journal of Medicinal Chemistry. 2004;47(17):4135–4141. doi: 10.1021/jm030628v.
- 5. Edmondson SD, Mastracchio A, Mathvink RJ, et al. (2S,3S)-3-amino-4-(3,3-difluoropyrrolidin-1-yl)-N,N-dimethyl-4-oxo-2-(4-[1,2,4]triazolo[1,5-a]-pyridin-6-ylphenyl)butanamide: a selective α-amino amide dipeptidyl peptidase IV inhibitor for the treatment of type 2 diabetes. Journal of Medicinal Chemistry. 2006;49(12):3614–3627. doi: 10.1021/jm060015t.
- 6. Duffy JL, Kirk BA, Wang L, et al. 4-Aminophenylalanine and 4-aminocyclohexylalanine derivatives as potent, selective, and orally bioavailable inhibitors of dipeptidyl peptidase IV. Bioorganic and Medicinal Chemistry Letters. 2007;17(10):2879–2885. doi: 10.1016/j.bmcl.2007.02.066.
- 7. Xu J, Wei L, Mathvink RJ, et al. Discovery of potent, selective, and orally bioavailable oxadiazole-based dipeptidyl peptidase IV inhibitors. Bioorganic and Medicinal Chemistry Letters. 2006;16(20):5373–5377. doi: 10.1016/j.bmcl.2006.07.061.
- 8. Xu J, Wei L, Mathvink R, et al. Discovery of potent, selective, and orally bioavailable pyridone-based dipeptidyl peptidase-4 inhibitors. Bioorganic and Medicinal Chemistry Letters. 2006;16(5):1346–1349. doi: 10.1016/j.bmcl.2005.11.052.
- 9. Garcia TS, Honório KM. Two-dimensional quantitative structure-activity relationship studies on bioactive ligands of peroxisome proliferator-activated receptor δ. Journal of the Brazilian Chemical Society. 2011;22(1):65–72.
- 10. García GC, Luque Ruiz I, Gómez-Nieto MÁ. Analysis and study of molecule data sets using snowflake diagrams of weighted maximum common subgraph trees. Journal of Chemical Information and Modeling. 2011;51(6):1216–1232. doi: 10.1021/ci100484z.
- 11. Jana D, Halder AK, Adhikari N, Maiti MK, Mondal C, Jha T. Chemometric modeling and pharmacophore mapping in coronary heart disease: 2-arylbenzoxazoles as cholesteryl ester transfer protein inhibitors. MedChemComm. 2011;2(9):840–852.
- 12. Kovalishyn V, Tanchuk V, Charochkina L, Semenuta I, Prokopenko V. Predictive QSAR modeling of phosphodiesterase 4 inhibitors. Journal of Molecular Graphics and Modelling. 2012;32:32–38. doi: 10.1016/j.jmgm.2011.10.001.
- 13. Niu B, Su Q, Yuan XC, Lu W, Ding J. QSAR study on 5-lipoxygenase inhibitors based on support vector machine. Medicinal Chemistry. 2012;8(6):1108–1116. doi: 10.2174/1573406411208061108.
- 14. Paliwal S, Seth D, Yadav D, Yadav R, Paliwal S. Development of a robust QSAR model to predict the affinity of pyrrolidine analogs for dipeptidyl peptidase IV (DPP-IV). Journal of Enzyme Inhibition and Medicinal Chemistry. 2011;26(1):129–140. doi: 10.3109/14756361003777057.
- 15. Murugesan V, Sethi N, Prabhakar YS, Katti SB. CoMFA and CoMSIA of diverse pyrrolidine analogues as dipeptidyl peptidase IV inhibitors: active site requirements. Molecular Diversity. 2011;15(2):457–466. doi: 10.1007/s11030-010-9267-0.
- 16. Gao YD, Feng D, Sheridan RP, et al. Modeling assisted rational design of novel, potent, and selective pyrrolopyrimidine DPP-4 inhibitors. Bioorganic and Medicinal Chemistry Letters. 2007;17(14):3877–3879. doi: 10.1016/j.bmcl.2007.04.106.
- 17. Yang XY, Li MJ, Su Q, Wu M, Gu T, Lu W. QSAR studies on pyrrolidine amides derivatives as DPP-IV inhibitors for type 2 diabetes. Medicinal Chemistry Research. 2013.
- 18. Peng S, Jian-Wei Z, Peng Z, Lin X. QSPR modeling of bioconcentration factor of nonionic compounds using Gaussian processes and theoretical descriptors derived from electrostatic potentials on molecular surface. Chemosphere. 2011;83(8):1045–1052. doi: 10.1016/j.chemosphere.2011.01.063.
- 19. Gu T, Lu W, Bao X, Chen N. Using support vector regression for the prediction of the band gap and melting point of binary and ternary compound semiconductors. Solid State Sciences. 2006;8(2):129–136.
- 20. Zhu J, Lu W, Liu L, Gu T, Niu B. Classification of Src kinase inhibitors based on support vector machine. QSAR and Combinatorial Science. 2009;28(6-7):719–727.
- 21. Kovalishyn V, Aires-de-Sousa J, Ventura C, Elvas Leitão R, Martins F. QSAR modeling of antitubercular activity of diverse organic compounds. Chemometrics and Intelligent Laboratory Systems. 2011;107(1):69–74.
- 22. Xing L, Goulet R, Johnson K. Statistical analysis and compound selection of combinatorial libraries for soluble epoxide hydrolase. Journal of Chemical Information and Modeling. 2011;51(7):1582–1592. doi: 10.1021/ci200123y.
- 23. Kar S, Deeb O, Roy K. Development of classification and regression based QSAR models to predict rodent carcinogenic potency using oral slope factor. Ecotoxicology and Environmental Safety. 2012;82:85–95. doi: 10.1016/j.ecoenv.2012.05.013.
- 24. Niu B, Yuan XC, Roeper P, et al. HIV-1 protease cleavage site prediction based on two-stage feature selection method. Protein and Peptide Letters. 2013;20(3):290–298. doi: 10.2174/0929866511320030007.
- 25. Niu B, Cai YD, Lu WC, Li GZ, Chou KC. Predicting protein structural class with AdaBoost Learner. Protein and Peptide Letters. 2006;13(5):489–492. doi: 10.2174/092986606776819619.
- 26. Niu B, Jin YH, Feng KY, et al. Predicting membrane protein types with bagging learner. Protein and Peptide Letters. 2008;15(6):590–594. doi: 10.2174/092986608784966921.
- 27. Niu B, Jin YH, Feng KY, Lu WC, Cai YD, Li GZ. Using AdaBoost for the prediction of subcellular location of prokaryotic and eukaryotic proteins. Molecular Diversity. 2008;12(1):41–45. doi: 10.1007/s11030-008-9073-0.
- 28. Niu B, Jin Y, Lu L, et al. Prediction of interaction between small molecule and enzyme using AdaBoost. Molecular Diversity. 2009;13(3):313–320. doi: 10.1007/s11030-009-9116-1.
- 29. Niu B, Jin Y, Lu W, Li G. Predicting toxic action mechanisms of phenols using AdaBoost Learner. Chemometrics and Intelligent Laboratory Systems. 2009;96(1):43–48.
- 30. Niu B, Lu L, Liu L, et al. HIV-1 protease cleavage site prediction based on amino acid property. Journal of Computational Chemistry. 2009;30(1):33–39. doi: 10.1002/jcc.21024.
- 31. Su Q, Lu WC, Niu B, Liu X, Gu TH. Classification of the toxicity of some organic compounds to tadpoles (Rana Temporaria) through integrating multiple classifiers. Molecular Informatics. 2011;30(8):672–675. doi: 10.1002/minf.201000129.
- 32. Lu IL, Lee SJ, Tsu H, et al. Glutamic acid analogues as potent dipeptidyl peptidase IV and 8 inhibitors. Bioorganic and Medicinal Chemistry Letters. 2005;15(13):3271–3275. doi: 10.1016/j.bmcl.2005.04.051.
- 33. Tsai TY, Hsu T, Chen CT, et al. Rational design and synthesis of potent and long-lasting glutamic acid-based dipeptidyl peptidase IV inhibitors. Bioorganic and Medicinal Chemistry Letters. 2009;19(7):1908–1912. doi: 10.1016/j.bmcl.2009.02.061.
- 34. Weber L. JChem base - chemAxon. Chemistry World. 2008;5(10):65–66.
- 35. 2013, http://www.chemaxon.com/
- 36. Yang SS, Lu WC, Gu TH, Yan LM, Li GZ. QSPR study of n-octanol/water partition coefficient of some aromatic compounds using support vector regression. QSAR and Combinatorial Science. 2009;28(2):175–182.
- 37. Todeschini T. Milano Chemometrics and QSAR Research Group. Milan, Italy: University of Milano-Bicocca; 2004. Dragon 5.0: software for molecular descriptors.
- 38. Mukherjee V, Singh K, Singh NP, Yadav RA. Quantum chemical determination of molecular geometries and interpretation of FTIR and Raman spectra for 2,4,5- and 3,4,5-tri-fluoro-benzonitriles. Spectrochimica Acta A. 2008;71(4):1571–1580. doi: 10.1016/j.saa.2008.06.017.
- 39. Chen Y, Yi Z, Chen SJ, Luo JS, Yi YG, Tang YJ. Study of density functional theory for surface-enhanced raman spectra of p-aminothiophenol. Spectroscopy and Spectral Analysis. 2011;31(11):2952–2955.
- 40. Zhang T. A leave-one-out cross validation bound for kernel methods with applications in learning. Computational Learning Theory Proceedings. 2001;2111:427–443.
- 41. Yuan J, Li YM, Liu CL, Zha XF. Leave-one-out cross-validation based model selection for manifold regularization. Advances in Neural Networks (Lecture Notes in Computer Science). 2010;6063:457–464.
- 42. Kompany-Zareh M. An improved QSPR study of the toxicity of aliphatic carboxylic acids using genetic algorithm. Medicinal Chemistry Research. 2009;18(2):143–157.
- 43. Goodarzi M, Dejaegher B, Vander Heyden Y. Feature selection methods in QSAR studies. Journal of AOAC International. 2012;95(3):636–651. doi: 10.5740/jaoacint.sge_goodarzi.
- 44. Ding C, Peng H. Minimum redundancy feature selection from microarray gene expression data. Proceedings of the IEEE Bioinformatics Conference; August 2003; pp. 185–205.
- 45. He Z, Zhang J, Shi XH, et al. Predicting drug-target interaction networks based on functional groups and biological features. PLoS ONE. 2010;5(3):e9603. doi: 10.1371/journal.pone.0009603.
- 46. Üstün B, Melssen WJ, Buydens LMC. Facilitating the application of Support Vector Regression by using a universal Pearson VII function based kernel. Chemometrics and Intelligent Laboratory Systems. 2006;81(1):29–40.
- 47. Frank E, Hall M, Trigg L, Holmes G, Witten IH. Data mining in bioinformatics using Weka. Bioinformatics. 2004;20(15):2479–2481. doi: 10.1093/bioinformatics/bth261.
- 48. Chen L, Lu L, Feng K, et al. Multiple classifier integration for the prediction of protein structural classes. Journal of Computational Chemistry. 2009;30(14):2248–2254. doi: 10.1002/jcc.21230.
- 49. 2013, http://www.chemaxon.com/products/calculator-plugins/

