Skip to main content
Computational and Mathematical Methods in Medicine logoLink to Computational and Mathematical Methods in Medicine
. 2020 Nov 10;2020:8858489. doi: 10.1155/2020/8858489

Succinylation Site Prediction Based on Protein Sequences Using the IFS-LightGBM (BO) Model

Lu Zhang 1, Min Liu 1, Xinyi Qin 1, Guangzhong Liu 1,
PMCID: PMC7673955  PMID: 33224267

Abstract

Succinylation is an important posttranslational modification of proteins, which plays a key role in protein conformation regulation and cellular function control. Many studies have shown that succinylation modification on protein lysine residue is closely related to the occurrence of many diseases. To understand the mechanism of succinylation profoundly, it is necessary to identify succinylation sites in proteins accurately. In this study, we develop a new model, IFS-LightGBM (BO), which utilizes the incremental feature selection (IFS) method, the LightGBM feature selection method, the Bayesian optimization algorithm, and the LightGBM classifier, to predict succinylation sites in proteins. Specifically, pseudo amino acid composition (PseAAC), position-specific scoring matrix (PSSM), disorder status, and Composition of k-spaced Amino Acid Pairs (CKSAAP) are firstly employed to extract feature information. Then, utilizing the combination of the LightGBM feature selection method and the incremental feature selection (IFS) method selects the optimal feature subset for the LightGBM classifier. Finally, to increase prediction accuracy and reduce the computation load, the Bayesian optimization algorithm is used to optimize the parameters of the LightGBM classifier. The results reveal that the IFS-LightGBM (BO)-based prediction model performs better when it is evaluated by some common metrics, such as accuracy, recall, precision, Matthews Correlation Coefficient (MCC), and F-measure.

1. Introduction

Posttranslational modification (PTM) is the chemical modification of the precursor protein after translation, such as the addition of a small molecule protein or the introduction of a functional group, so that the inactive precursor protein can obtain biological functions. There are many forms of posttranslational modifications of proteins, such as ubiquitination, glutarylation, sumoylation, palmitoylation, acetylation, and methylation. Succinylation is PTM that occurs on lysine. Lysine is an α-amino acid encoded by codons AAA and AAG and is easily modified [1]. Succinylation is a broadly conserved protein posttranslational modification that exists in prokaryotic and eukaryotic cells and can coordinate various biological processes [24]. Compared with the methylation and acetylation that occur on lysine, succinylation will cause more substantial changes in the chemical structure of lysine [5]. And in a variety of cell functions, including metabolism and epigenetic regulation, succinylated proteins are involved.

Some studies have shown that abnormalities and variations of succinylation associated with the pathogenesis of many diseases, including tumors [610], cardiac metabolic diseases [11, 12], liver metabolic diseases [13], and nervous system diseases [7, 14, 15]. Therefore, understanding succinylation and identifying the site of succinylation will help determine the pathogenesis of related diseases and develop targeted drugs [16].

Nowadays, many biological experimental methods have been developed to detect succinylated proteins or succinylation sites, such as the high-performance liquid chromatography assays, spectrophotometric assays, and radioactive chemical labeling [17, 18]. However, it is an arduous work on detecting protein succinylation through experiments, which inevitably waste lots of time and money. Machine learning, in contrast, has the advantage of performing a large number of experiments in a short time and has not been affected and restricted by external conditions, which are helpful on recognizing the succinylation sites. Zhao et al. [19] built a predictor called SucPred based on SVM using position amino acid weight composition, van der Waals volume normalized, grouped weight-based encoding, and autocorrelation functions. By using SVM, Xu et al. [20] developed iSuc-PseAAC that implemented a composition of pseudo amino acid (PseAAC) scheme. And then, Xu et al. [21] developed another predictor named SuccFind, which considered several amino acid-based composition encodings, including amino acid composition (AAC), k-spaced amino acid pairs (CKSAAP), and amino acid index (AAindex). Hasan et al. [1] proposed the approach SuccinSite predictor with the RF classifier by integrating multiple sequence features. Ning et al. [22] built a predictor called PSuccE based on SVM using amino acid composition (AAC), binary encoding (BE), physicochemical property (PCP), and grey pseudo amino acid composition (GPAAC).

Although many methods for predicting succinylation sites based on machine learning have been proposed, there is still much room for improvement based on machine learning. Firstly, the protein sequences' feature information is not clearly illuminated for predicting the effects of succinylation site interaction. Secondly, multi-information fusion produces high-dimensional feature vectors, which brings redundancy and noise information. There is an urgent need to use an efficient feature selection method to rank the importance of features and select the best feature subset from them. Finally, the development of experimental technology produced a large amount of succinylation data. How to make full use of experimental data to design effective prediction algorithms is very necessary.

In this study, we propose an IFS-LightGBM (BO) prediction framework based on machine learning to identify succinylation sites of lysines in protein sequences. Four sequence-based features are firstly used to represent the peptides: (1) pseudo amino acid composition (PseAAC), (2) position-specific scoring matrix (PSSM), (3) disorder status, and (4) Composition of k-spaced Amino Acid Pairs (CKSAAP). Secondly, the LightGBM method is used to prioritize the 2501-D feature vector, which obtains from these four sequence-based features. Thirdly, we combine the incremental feature selection (IFS) method and the machine learning methods to perform effective feature fusion and then determine the optimal feature subset, which can eliminate the noise and redundancy information in the original feature vector. Finally, the Bayesian optimization (BO) algorithm is used to optimize the parameters of the LightGBM classifier. This work provides not only a better understanding of the sequence characteristics of protein succinylation modification but also an effective algorithm for directly predicting the succinylation sites in proteins.

2. Materials and Methods

2.1. Dataset

In this work, the training data were collected from dbPTM [23, 24] (http://dbptm.mbc.nctu.edu.tw/index.php), which integrated published literatures, public resources, and a total of eleven biological databases related to PTMs. We obtained 2599 protein sequences including 5049 experimentally verified lysine succinylation sites and 5526 nonsuccinylation sites remained from dbPTM. The data in the dataset used a window size of 2r + 1 to extract the corresponding peptide fragment with lysine (K): “1” was a lysine (K) which was extracted as the central site of a polypeptide segment; “r” was equal to 10, which meant that 10 AA (amino acid) residues were selected from the upstream and downstream of lysine; and finally, a polypeptide segment with a length of 21 was obtained. Among them, the positive sample took the succinylated residue as the central site.

2.2. Description and Representation of Samples

Encode each polypeptide segment as a numeric vector and input it as a feature into the model. This is the most critical step in building an effective predictive model. Therefore, it is necessary to use high-quality sequence encoding methods to generate features that can effectively predict succinylation sites. In this study, we use four types of amino acid feature encoding methods including CKSAAP, disorder, pseudo amino acid composition, and PSSM.

2.2.1. CKSAAP Encoding

The CKSAAP is one of the most classic encoding methods which has been widely used by people in bioinformatics tasks [1, 2531]. In this study, we use polypeptide segments of length 21. Take AxA as an example, whose space number k is equal to 1. For the polypeptide segment which is composed mainly of 20 basic amino acids (i.e., A, R, D, C, ..., W, Y, V), when k = 0, the residue pairs that we need to extract are AA, AR, AD,..., VV, namely, there have a total of (20 × 20) = 400 amino acid pairs.

The following formula is used to calculate the feature vector:

NA,ANtotal,NA,RNtotal,NA,DNtotal,,NV,VNtotal400, (1)

where Ni,j represents the number of amino acid pairs i, j with a distance of k, Ntotal is the total number of k-spaced residue pairs in the fragment. In this study, the optimal kmax is set as 4 (i.e., k = 0, 1, 2, 3, 4); then, the Ntotal will be 20, 19, 18, 17, and 16, which is obtained by calculating Ntotal = Lk − 1. Thus, a polypeptide segment with a length of 21 is transformed into a 2000-dimensional (400 × 5) AA composition feature vector.

2.2.2. Disorder

One of the important indicators to measure the degree of protein structure is its inherent disorder. A large number of research results show that the inherent disorder of proteins plays a very important role in the prediction of protein structure and function [3234]. In order to characterize this property, we use the VSL2B [35] program to predict the disorder score value. By running the tool, two types of results will be obtained, namely, qualitative and quantitative. The quantitative result is a value in the range [0, 1]. It is generally believed that 0.5 is the boundary value for distinguishing order and disorder. If the scoring result exceeds 0.5, it is considered disorder, otherwise order. For a polypeptide segment with a length of 21, the disorder features with 21-dimensional will eventually be obtained.

2.2.3. Pseudo Amino Acid Composition (PseAAC)

In order to avoid complete loss of sequence-order information, the PseAAC [36] was proposed by Chou. PseAAC is an extended form of AAC, and it can identify salient hidden information [3739]. In this work, Type-1 PseAAC is used as the dominating sequence representations. Let A be a polypeptide segment with the length of L, and Ri(i = 1, 2, ⋯, L) is the ith residue of A:

A=R1,R2RL. (2)

The Type-1 PseAAC is a function of A, which produces a (20 + λ)-D vector. The mathematical formulation of Type-1 PseAAC is

PA=p1,p2,p3p20,p21,p22,p23p20+λT, (3)

where the first 20 features that can be obtained based on the composition of a given polypeptide segment, hydrophobicity, side-chain mass, and hydrophilicity of AAs are used to calculate the remaining λ features [38]. The pn of P can be computed by Equations (4) and (5):

pn=fni=120fi+wj=1λbj1n20,wbn20i=120fi+wj=1λbj20<nλ+20, (4)
bλ=i=1Lλ1/3H1RiH1Ri+λ2+H2RiH2Ri+λ2+MRiMRi+λ2Lλ, (5)

where fi reflects the occurrence frequency of the 20 standard amino acids, bj is the jth tier sequence correlation factor, ω is the weight balance parameter for the sequence order effect, and λ is the lag parameter. H1(Ri) and H2(Ri) are normalized values of hydrophilicity and hydrophobicity, respectively. The normalized values of the side-chain mass of Ri are M(Ri). Therefore, when λ = 20 and w = 0.05, each polypeptide segment is transformed into 40 dimensions AA composition feature vector.

2.2.4. PSSM (Position-Specific Scoring Matrix)

Use PSI-BLAST [40] software to search against the SWISS-PROT database to obtain protein evolution information, where the BLAST local protein database is an authoritative protein database which is established by the University of Geneva and the European Bioinformatics Institute (EBI) [41]. By running the PSI-BLAST tool, two L × 20(L = 21) matrices can be obtained. The first one is the position-specific scoring matrix (PSSM), which is used to represent the conversation scores of 20 standard AAs occurring at specific sequence positions during evolution. The second one is the position-specific frequency matrix (PSFM), which contains the frequency of occurrence for a given amino acid at a specific sequence position. For the PSSM, we expand it into a 440-dimensional vector (420 + 20) in the following way. Firstly:

FPSSM=S1A,S1R,,S1V,,SLA,SLR,,SLV. (6)

In addition, FPSSM−mean features can be extracted from the PSSM:

FPSSMmean=i=1LSiALi=1LSiRLi=1LSiVLT, (7)

where L is the length of each polypeptide segment, Sij(i = 1, ⋯, L; j = 1, ⋯, 20) is the values in the matrix. The features of PSSM are achieved by the integration of FPSSM−mean and FPSSM.

We combine four characterizations to obtain a total of a 2501-D feature vector from each polypeptide segment. The distribution of the 2501 features is listed in Table 1. Then, several kinds of feature selection method will be used to rank the 2501-dimensional features according to their importance.

Table 1.

Distribution of 2501-dimensional features on four typical feature types.

Features Feature description Feature dims
CKSAAP Short sequence motif information of polypeptide segments 2000
Disorder Protein intrinsic disorder score 21
PseAAC Physicochemical characteristics of amino acid factors 40
PSSM Evolutionary information of amino acid residues 440
Total 2501

2.3. Feature Selection Method

Before models are developed, because the dataset may have features that are unrelated to target value or noise interfered, it is necessary to select the optimal feature subset through feature selection, so as to reduce the dimension of feature space to further decrease the risk of overfitting. At the same time, the generalization performance and the prediction ability of the models can be further improved when irrelevant features are removed ahead of training. In this article, the LightGBM feature selection method is used to select independent variables that are used to form the optimal feature subset.

LightGBM is an effective GBDT implementation algorithm, which is designed by Microsoft Research Asia [42]. For the LightGBM algorithm, two kinds of the importance type are contained: one is the “split” and the other is “gain.” “Split” reflects the number of times the feature is used in a model. When building the boosted trees, the important features are used more frequently, and the rest are used to improve on the residuals. In this study, the importance type is “gain.” Different from the “split,” the “gain” measures the actual decrease in node impurity. The feature rankings of gain-based importance can be obtained after LightGBM fitting [43].

2.4. Incremental Feature Selection Method

The combination of the IFS method and feature selection method is helpful to select the optimal feature subset. First, the feature selection method is used to construct the feature list. Specifically, different feature selection methods will firstly generate the importance scores of all features in light of the calculation criteria, then arrange the features in descending order according to the importance scores, and finally construct a corresponding feature list based on the importance scores, where the feature list is denoted as F = [f1, f2, ⋯, fL](1 ≤ L ≤ 2501). Next, incremental feature selection (IFS) [44] makes a series of feature subsets. IFS is a process of building an increasing number of feature subsets by gradually adding features. In this study, the incremental step length of the IFS method is set to 1, thereby constructing a series of feature subsets which can be denoted as FSm1, FSm2, ⋯, FSmL, in which the feature subset of FSm1 is constructed using the features ranked first in the feature list, and FSm2 has the top 1 and top 2 features, namely, FSmi = [f1, f2, ⋯fi](1 ≤ iL). Then, the generated feature subsets are sequentially input into the classifier, which use the 10-fold cross-validation method to evaluate. Finally, when the fitness function of the predictor uses a certain feature subset to reach its maximum value, its corresponding feature subset is the best feature subset.

2.5. LightGBM

LightGBM is an algorithm that has been successfully applied in the field of classification, ranking, and many other ML tasks [43, 4548]. It is an ensemble model based on a decision tree algorithm. In order to reduce memory usage and increase the training speed, the Histogram Algorithm is used in LightGBM, which tries to discretize every feature (successive floating-point eigenvalues) in the dataset into k small bins. After that, these bins are used to construct the histogram with width k. Once all the samples are traversed, the histogram will accumulate the required statistics that are gradients and number of samples' in each bin. After accumulating these statistics, the optimal segmentation point can be found based on the maximum gain provided when k bins are divided into two parts.

Besides Histogram Algorithm, LightGBM proposes gradient-based one-side sampling (GOSS), exclusive feature bunding (EFB), and leaf-wise growth methods to further improve computational efficiency without hurting the accuracy. When finding the best split in GOSS, the absolute value of samples' gradients is firstly used to sort the data instances. Then, preserve data instances with higher gradients, and randomly sample instances with low gradients. Finally, GOSS calculates the variance gain V~jd of feature j to give the best split node.

V~jd=1nxiArgi+1a/bxiBlgi2nljd+xiArgi+1a/bxiBlgi2nljd,Al=xiA:xijd,Ar=xiA:xij>d,Bl=xiA:xijd,Br=xiA:xij>d,nljd=IxiAlBl,nrjd=IxiArBr, (8)

where A is the instance set with high gradients and B is the instance set with low gradients. gi is the gradients of each sample, a is the sampling ratio of large gradient data, b is the sampling ratio of small gradient data, and d is the number of iterations.

Furthermore, EFB can speed up the training process of GBDT via bundling exclusive features into a single “big” feature. Compared with the level-wise growth strategy, the leaf-wise growth strategy will choose the leaf with max delta loss to grow, which can reduce more errors and obtain better accuracy under the same splitting times.

2.6. Model Construction and Performance Evaluation

For convenience, the succinylation site prediction method proposed in this study is called IFS-LightGBM (BO). The use of flowcharts can more intuitively show the internal mechanism of model construction. Therefore, we draw Figure 1 to show the overall framework of IFS-LightGBM (BO). The framework is implemented using PyCharm 2020 and Python 3.6. The experiment is conducted on a computer with 2.30 GHz, 32.0 GB RAM, and Windows operating system.

Figure 1.

Figure 1

The overall framework of IFS-LightGBM (BO) for succinylation site prediction.

The specific steps of IFS-LightGBM (BO) for the succinylation site prediction are described as follows:

  1. Dataset. The dataset is used to predict succinylation sites, and it is divided into two categories: positive samples containing succinylation sites and negative samples without succinylation sites; the numbers are 5049 and 5526, respectively

  2. Feature Extraction and Feature Selection. In this study, four feature extraction methods are used: PseAAC, disorder, PSSM, and CKSAAP, which transform the protein sequence signal into a numerical signal. For PSSM, two kinds of encoding methods are used, namely, averaged by column and expanded by row. Then, these four feature extraction methods are fused to predict the succinylation site. As what follows, the LightGBM method is used to prioritize the features and generate the feature list

  3. IFS-LightGBM (BO) Model Construction. The IFS method builds a series of feature subsets based on the feature list, and then, the generated feature subsets are input into the LightGBM classifier in turn. The 10-fold cross-validation is used to evaluate the predictive ability of the classifier. When the performance of the classifier reaches the best, the corresponding feature subset is the optimal feature subset. The selected best feature subset is used as the input features of the model, and then, the BO algorithm is used to optimize the hyperparameters of IFS-LightGBM (BO). Five measurement metrics including ACC and F-measure are used to evaluate model performance

Protein PTM site prediction is essentially a binary classification problem, which can be measured using the notation in Table 2.

Table 2.

Performance table for instances labeled with a class label A.

True label A True not A
Predicted label A True positive (TP) False positive (FP)
Predicted not A False negative (FN) True negative (TN)

TP and FP represent the numbers of true positive and false positive, and the numbers of true negative and false negative are represented by TN and FN, respectively. To evaluate the prediction performance of our proposed method, five measures including accuracy (ACC), recall, precision, Matthews Correlation Coefficient (MCC), and F-measure are used [49]. These metrics are calculated as follows:

Accuracy=TP+TNTP+FP+TN+FN,Recallsensitivity=TPTP+FN,Precision=TPTP+FP,MCC=TP×TNFP×FNTP+FPTP+FNTN+FPTN+FN,Fmeasure=2×Precision×RecallPrecision+Recall. (9)

ACC describes the proportion of samples that is correctly predicted, and its value ranges from 0 to 1, where 1 indicates the best prediction. Recall is the ability to identify positive cases, and precision denotes the class agreement of the data labels with the positive labels given by the classifier. MCC is a correlation coefficient describing the relationship between actual classification and predicted classification. F-measure can combine recall and precision into a single class-specific accuracy, which is selected as the key measurement in this study.

3. Results and Discussion

3.1. Integrated Optimal Feature Extraction

CKSAAP characterizes the short sequence motif information of polypeptide segments, PSSM reflects evolutionary information, disorder features reflect natively disordered residues recognized by VSL2B [35], and PseAAC reflects the physicochemical information of polypeptide segments. Therefore, it is possible that the integration of such four kinds of encodings will characterize the sequence and structural features of polypeptide segments better. But at the same time, the dataset after feature fusion may have features that are unrelated to the target value, so the LightGBM ranking algorithm will be used to sort the fused features and generate a feature list, and then, the IFS method will be used to select the optimal feature subset. This section uses 10-fold cross-validation to evaluate the performance of the model. The corresponding results are shown in Table 3. “All (IFS)” represents the fusion of the four feature extraction methods and the dimensionality reduction through the IFS and LightGBM feature selection method. “All” represents the fusion of the four feature extraction methods without dimensionality reduction.

Table 3.

Predictive metrics of different feature extraction methods.

Features Dimensional ACC Recall MCC Precision F-measure
PseAAC 40 0.7061 0.6970 0.4113 0.6904 0.6937
Disorder 21 0.5709 0.5906 0.1434 0.5469 0.5679
CKSAAP 2000 0.6964 0.6849 0.3916 0.6810 0.6829
PSSM 440 0.6985 0.6576 0.3950 0.6947 0.6756
All 2501 0.7253 0.7053 0.4492 0.7153 0.7103
All (IFS) 2501 0.7360 0.7223 0.4708 0.7240 0.7232

It can be seen that when using the PseAAC, disorder, CKSAAP, PSSM, and “All” feature extraction methods, the F-measure scores are 69.37%, 56.79%, 68.29%, 67.56%, and 71.03%, respectively. The prediction F-measure of “All (IFS)” is 72.32%, which is 1.29% higher than that of “All” and 15.53% higher than that of disorder. We can clearly see that when the four feature extraction methods are fused, as well as the IFS and LightGBM are used to select the optimal feature subset, the prediction F-measure is significantly improved. Moreover, we can see that the prediction accuracy of “All (IFS)” is 2.99%, 16.51%, 3.96%, 3.75%, and 1.07% higher than those of PseAAC, disorder, CKSAAP, PSSM, and “All.” In summary, fusing the four feature extraction methods and using IFS and LightGBM to select the optimal feature subset can improve the prediction performance of protein succinylation sites. Hence, we combine the four feature extraction methods of PseAAC, disorder, CKSAAP, and PSSM to characterize information of each polypeptide segment.

3.2. Results of the IFS and Feature Selection Methods

Using multiple feature extraction methods can better characterize polypeptide segments, but at the same time, it will increase the risk of feature redundancy. In order to select the optimal features, the combination of the IFS method and different feature selection methods will be introduced. Meanwhile, in order to clearly reflect the superiority of the LightGBM feature selection method, we also use ReliefF [50], LinearSVR [51], XGBoost [52], and ANOVA [53] to find the optimal feature subset.

It can be seen from Table 4 that for the dataset, different feature selection methods have a great impact on the accuracy of succinylation site prediction. Among them, the LightGBM feature selection method enables the classifier to obtain the best prediction performance. When the top 351 features are selected as the feature subset, the ACC, recall, precision, MCC, and F-measure are 73.60%, 72.23%, 72.40%, 47.08%, and 72.32%, respectively. The F-measure of LightGBM is 0.51%, 0.73%, 0.04%, and 0.56% higher than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. The ACC is 0.41%, 0.56%, 0.09%, and 0.44% higher than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. The number of optimal features of LightGBM is 1963, 1756, 212, and 852 less than that of ReliefF, LinearSVR, XGBoost, and ANOVA, respectively. Furthermore, we also try the dimensionality reduction methods of the principal component analysis (PCA) [54] and t-Distributed Stochastic Neighbor Embedding (t-SNE) [55] without using the IFS method. As we can see from Table 4, when PCA is used to select the best feature subset, the F-measure value of the model is 0.6746, which is 4.86% less than LightGBM and 4.13% less than LinearSVR. Meanwhile, whent-SNE is used to select the best feature subset, the F-measure value of the model is 0.5218, which is 20.14% less than LightGBM and 19.41% less than LinearSVR. The experimental results indicate that although they consume less time, their predictive performances are worse than using other feature selection methods which combined with IFS, when processing the data in this study. Thus, we choose the way in combining IFS with different feature selection methods to select the optimal feature subset.

Table 4.

Predictive metrics of different feature selection methods.

Feature selection methods Optimal subsets ACC Recall MCC Precision F-measure
LightGBM 351 0.7360 0.7223 0.4708 0.7240 0.7232
ReliefF 2314 0.7319 0.7150 0.4626 0.7211 0.7181
LinearSVR 2107 0.7304 0.7114 0.4594 0.7204 0.7159
XGBoost 563 0.7351 0.7231 0.4692 0.7224 0.7228
ANOVA 1203 0.7316 0.7142 0.4620 0.7211 0.7176
t-SNE 3 0.5526 0.5112 0.1019 0.5328 0.5218
PCA 1517 0.6952 0.6617 0.3885 0.6880 0.6746

To further analyze, we draw the IFS curve plots of the predicted values for each feature selection method (i.e., LightGBM, ReliefF, LinearSVR, XGBoost, and ANOVA) as shown in Figure 2. It can be seen from Figure 2(b) that the LightGBM achieves the optimal F-measure result through a subset with 351 features, while the ReliefF, LinearSVR, XGBoost, and ANOVA achieve the optimal F-measure value when using the subset of the top 2314, 2107, 563, and 1203 features in their own feature list, respectively. Meanwhile, the LightGBM achieves the optimal ACC result using the subset of the top 351 features, while the ReliefF, LinearSVR, XGBoost, and ANOVA achieve the optimal ACC value when using the subset of the top 1465, 2130, 421, and 1203 features in their own feature list, respectively. In this section, the feature lists are different because these feature lists are constructed through different feature selection methods, but the feature subsets generated based on the feature lists will be input to the same classifier. The number of features used in the optimal feature subset determined by the LightGBM method is the least, while its corresponding ACC and F-measure scores are the highest. Conversely, the best feature subset of the LinearSVR contains the largest number of features, and its prediction accuracy and F-measure are smaller. Given the above, the prediction effect of the LightGBM feature selection method is better than ReliefF, LinearSVR, XGBoost, and ANOVA. Thus, we use the LightGBM feature selection method to select the best feature subset.

Figure 2.

Figure 2

(a) Describes the IFS curve of the ACC value of each feature selection method, and (b) describes the IFS curve of the F-measure value of each feature selection method. According to the constructed dataset described in Materials and Methods, the IFS curve shows the trend of (a) ACC value and (b) F-measure value of the five feature selection methods as the number of input features increases. The five feature selection methods are LightGBM, ReliefF, LinearSVR, XGBoost, and ANOVA.

3.3. Selection of Classification Algorithms

The choice of classifier plays a crucial role in constructing an effective prediction model of succinylation sites. According to the discussion in Section 3.1, the 2501-dimensional feature vector will be obtained by fusing four feature extraction methods including PseAAC, disorder, CKSAAP, and PSSM. According to the analysis in Section 3.2, LightGBM will be used as the feature selection method and combined with the IFS method to construct the optimal feature subset. For the sake of reflecting the superiority of the LightGBM classifier in predicting succinylation sites clearly, Random Forest (RF) [56], ExtraTree (ET) [57], Gradient Boosting Decision Tree (GBDT) [58], k-nearest neighbor (kNN) [59], XGBoost, and Naïve Bayes (NB) [60] algorithms will be introduced and compared with it. The number of neighbors in kNN is 5, the number of base decision tree in RF is 100, the number of iterations of LightBGM, XGBoost, and GBDT is 100. It can be seen from Table 5 that when the LightGBM classifier uses a subset of the top 351 features in the LightGBM feature list, the F-measure value can achieve the best classification performance, which is 0.7% higher than GBDT and 13.64% higher than ET. Meanwhile, the ACC value of the LightGBM classifier is 0.7360, which is 1.78%, 13.5%, 0.93%, 7.64%, 0.89%, and 4.82% higher than that of RF, ET, GBDT, kNN, XGBoost, and NB, respectively.

Table 5.

Predictive metrics of respective feature subsets classified by each classifier.

Classifier Optimal subsets ACC Recall MCC Precision F-measure
LightGBM 351 0.7360 0.7223 0.4708 0.7240 0.7232
RF 37 0.7182 0.6795 0.4345 0.7158 0.6972
ET 33 0.6010 0.5934 0.2013 0.5804 0.5868
GBDT 94 0.7267 0.7223 0.4528 0.7102 0.7162
kNN 32 0.6596 0.7209 0.3258 0.6242 0.6691
XGBoost 427 0.7271 0.7114 0.4529 0.7154 0.7134
NB 805 0.6878 0.7724 0.3867 0.6444 0.7026

Moreover, Figure 3 depicts the IFS curves of the ACC value and F-measure value of each classifier. It can be seen from the curve in the figure that when a subset of the top 351 features in the LightGBM feature list is used by the LightGBM classifier, both the F-measure and ACC achieve the optimal classification performance. For RF, ET, GBDT, KNN, XGBoost, and NB algorithms, the highest F-measure values are obtained when they use the top 37, 33, 94, 32, 427, and 805 features in the LightGBM feature list, respectively. And when separately using the top 41, 9, 131, 16, 427, and 44 features in the LightGBM feature list, they can obtain the highest ACC.

Figure 3.

Figure 3

IFS curve describing the ACC value and F-measure value of each classifier. On the basis of the dataset constructed as described in Materials and Methods, the IFS curve shows the trend of the (a) ACC values and (b) F-measure values of the seven classifiers as the number of input features increases. The seven classification algorithms are LightGBM, Random Forest, ExtraTree, GBDT, kNN, XGBoost, and NB.

To prove that the LightGBM classifier is better than the other six machine learning algorithms, we conduct 100 times 10-fold cross-validation on different classifiers by setting different random seed numbers for the cross-validation method, in which the optimal subsets constructed by IFS are acted as the trainset for training different classifiers. To further measure the performances of the seven machine learning methods, we calculate the mean of maximum ACC (Max_ACC mean), the mean of maximum F-measure (Max_F-measure mean), the standard deviation of the maximum ACC (Max_ACC std), and the standard deviation of the maximum F-measure (Max_F-measure std) of each classifier. The results are listed in Tables 6 and 7, respectively. As shown in Tables 6 and 7, the prediction performance of the LightGBM classifier is better than the other six machine learning methods because its “Max_F-measure mean” and “Max_ACC mean” are higher than the RF, ET, GBDT, kNN, XGBoost, and NB algorithms. The “Max_F-measure std” and “Max_ACC std” of NB are smaller than other algorithms, which show that it has a relatively stable prediction performance. It can be seen from Figure 4 that the results of F-measure basically fit the normal distribution, and the continuities of the evaluation metric distributions of the RF, GBDT, kNN, XGBoost, and NB are better than the LightGBM. And the maximum F-measure of the 100 experiments of the LightGBM method is concentrated between 0.712 and 0.718. In conclusion, the LightGBM model has a better predictive performance than the other six models. So, we use LightGBM as the classifier. Figure 4 describes the histogram of the results of the 100 experiments.

Table 6.

The average and standard deviation of the maximum ACC of different classifiers.

Classifier Max_ACC mean Max_ACC std Max of Max_ACC
LightGBM 0.7290 0.0027 0.7360
RF 0.7148 0.0023 0.7203
ET 0.5960 0.0049 0.6081
GBDT 0.7250 0.0021 0.7296
kNN 0.6614 0.0021 0.6661
XGBoost 0.7177 0.0032 0.7271
NB 0.6923 0.0008 0.6944

Table 7.

The average and standard deviation of the maximum F-measure of different classifiers.

Classifier Max_F-measure mean Max_F-measure std Max of Max_F-measure
LightGBM 0.7145 0.0029 0.7232
RF 0.6932 0.0027 0.6989
ET 0.5718 0.0060 0.5868
GBDT 0.7128 0.0022 0.7194
kNN 0.6677 0.0020 0.6730
XGBoost 0.7034 0.0035 0.7134
NB 0.7017 0.0009 0.7039

Figure 4.

Figure 4

The maximum F-measure value distribution obtained by the different classifiers for 100 random tests is shown. The x-coordinate represents the maximum F-measure value, and the y-coordinate represents the number of corresponding intervals. The seven classifiers are (a) LightGBM, (b) Random Forest, (c) ExtraTree, (d) GBDT, (e) kNN, (f) XGBoost, and (g) NB.

3.4. Parameter Tuning and Performance Analysis

In order to further improve the performance of the LightGBM model, the Bayesian optimization (BO) algorithm is used to optimize the model parameters. Bayesian optimization is an extremely powerful method, which uses a surrogate function to estimate the noisy, expensive black-box functions. The core idea of the Bayesian hyperparameter optimization algorithm is to establish a probabilistic model that defines a distribution over objective functions from the input space to the objective of interest [61]. BO uses prior knowledge to approach the relatively cheap posterior distribution and then infers where to explore the next optimum hyperparameter combination according to the distribution. In this study, the Gaussian process (GP) approach is selected as a surrogate model and Expected Improvement (EI) is selected as an acquisition function.

For the surrogate model hyperparameters Θ, let D = {(xn, yn)n=1N} be the observations of input-target pairs, σ2(x; D, Θ) be the predictive variance function, the predictive mean value is μ(x; D, Θ), and define

aPIx;D,Θ=Φγx,γx=fxbestμx;D,Θσx;D,Θ,aEIx;D,Θ=σx;D,ΘγxΦγx+ϕγx, (10)

where f(xbest) is the lowest observed value, Φ and ϕ are the standard cumulative and normal density, respectively [62].

The Bayesian optimization (BO) algorithm is used to optimize some critical hyperparameters in the LightGBM classifier as shown in Table 8. The F-measure between training values and predictive values of 10-fold cross-validation is defined as the fitness function evaluation of hyperparameter optimization of the LightGBM classifier. In order to distinguish from the LightGBM without hyperparameter optimized, the classifier which is optimized by the BO algorithm is called IFS-LightGBM (BO).

Table 8.

Hyperparameter optimization results of IFS-LightGBM (BO).

Hyperparameters Meanings Search ranges Optimal values
learning_rate Learning rate (0.01, 1.0) 0.0274
max_depth Maximum depth of the tree (1, 50) 20
max_bin The max number of bins that feature values will be bucketed in (10, 100) 10
reg_alpha L1 regularization (1e-9, 1.0) 0.9647
boosting_type Training method gbdt; goss; rf; dart goss
num_leaves Number of leaf nodes (1, 50) 11
n_estimators Number of iterations (100, 600) 600

For the IFS-LightGBM (BO) model which uses the optimal feature subset with the 351 features, the parameters learning_rate, max_depth, max_bin, reg_alpha, boosting_type, num_leaves, and n_estimators need to be optimized by using the BO algorithm. The IFS-LightGBM (BO) model with the highest F-measure can be implemented when learning_rate is 0.0274, max_depth is 20, max_bin is 10, reg_alpha is 0.9647, boosting_type is goss, num_leaves is 11, and n_estimators is 600, which gave F-measure of 0.7255. For the sake of reflecting the superiority of the BO algorithm in tuning parameters clearly, we also employ a grid search (GS) method to optimize parameters of the LightGBM classifier, and the optimization process of the hyperparameters is shown in Table 9. Table 10 lists other measurement results of IFS-LightGBM (BO), LightGBM, and IFS-LightGBM (GS). In Tables 10 and 11, “LightGBM (GS)” represents the model with the GS method; “LightGBM” represents the model without hyperparameter optimization.

Table 9.

Hyperparameter optimization results of LightGBM (GS).

Hyperparameters Meanings Search ranges Optimal values
learning_rate Learning rate (0.01, 1.0) 0.1
max_depth Maximum depth of the tree (1, 50) 15
max_bin The max number of bins that feature values will be bucketed in (10, 100) 20
reg_alpha L1 regularization (1e-9, 1.0) 1e-5
boosting_type Training method gbdt; goss; rf; dart gbdt
num_leaves Number of leaf nodes (4, 50) 36
n_estimators Number of iterations (100, 600) 450

Table 10.

Performance comparison of IFS-LightGBM (BO), LightGBM, and LightGBM (GS).

Classifier Optimal subsets ACC Recall MCC Precision F-measure
IFS-LightGBM (BO) 351 0.7392 0.7219 0.4771 0.7291 0.7255
LightGBM 351 0.7360 0.7223 0.4708 0.7240 0.7232
LightGBM (GS) 351 0.7377 0.7190 0.4740 0.7282 0.7235

Table 11.

The average and standard deviation of the maximum F-measure of IFS-LightGBM (BO), LightGBM, and LightGBM (GS).

Classifier Max_F-measure mean Max_F-measure std Max of Max_F-measure
IFS-LightGBM (BO) 0.7229 0.0028 0.7292
LightGBM 0.7145 0.0029 0.7232
LightGBM (GS) 0.7194 0.0027 0.7259

To confirm that the performance of the LightGBM classifier has been further improved by the BO algorithm, we conduct 100 times 10-fold cross-validation on the IFS-LightGBM (BO) model. Meanwhile, the LightGBM classifier and LightGBM (GS) model are also evaluated by 10-fold cross-validation 100 times. Table 11 shows the average and standard deviation of the maximum F-measure of three models for 100 random tests, which are IFS-LightGBM (BO), LightGBM, and LightGBM (GS). As can be seen from Table 11, the “Max_F-measure mean” of the IFS-LightGBM (BO) model is higher than the LightGBM (GS) model and LightGBM classifier. Meanwhile, compared with the grid search method, the Bayesian optimization algorithm has less time consumption in searching the optimal parameters for the LightGBM classifier. Therefore, the Bayesian optimization algorithm is used to optimize parameters of LightGBM to construct the optimal model.

Next, we will further analyze the model output results. On the basis of the output from Section 3.2, we find that when we use the subset of the top 351 features in the LightGBM feature list, the values of F-measure and ACC reach the highest point. It can be seen from Figure 5(a) that among the top 100 features in the feature list, the numbers of PseAAC, disorder, CKSAAP, and PSSM are 38, 16, 9, and 37, respectively. The evolution information (i.e., PSSM) and the physicochemical information (i.e., PseAAC) account for 38% and 37% of the total. Among the top 101-200 features in the feature list, the numbers of PseAAC, disorder, CKSAAP, and PSSM are 2, 5, 29, and 64, respectively. At the same time, among the top 201-351 features in the feature list, the numbers of CKSAAP and PSSM are 46 and 105, respectively.

Figure 5.

Figure 5

Distributions of the top 351 features in the LightGBM feature list on four feature types. Figures (a), (b), and (c) show the distribution of the four types of features ranked in the top 100, top 101-200, and top 201-351 in the feature list, respectively.

4. Conclusions

The study of succinylation and its sites plays an important role in determining the pathogenesis of related diseases and the development of targeted drugs. In this study, we propose a model IFS-LightGBM (BO) based on machine learning for the prediction of succinylation sites. For the dataset, the PseAAC, disorder, PSSM, and CKSAAP four feature extraction methods are used to extract the sequence information and physicochemical information of polypeptide segments. At the same time, the IFS method and different feature selection methods are introduced and combined with the LightGBM classifier to eliminate redundant and noise information to determine the best feature subset. Through comparison, we find that compared with ReliefF, LinearSVR, XGBoost, and ANOVA methods, the LightGBM feature selection method can search for the optimal feature subset faster, and its corresponding model performance evaluation metrics are better. Finally, the BO algorithm is used to adjust the parameters of the LightGBM classifier to establish the best model. The results show that the IFS-LightGBM (BO) model is a very effective way to predict succinylation sites because of owing ACC of 0.7392, recall of 0.7219, MCC of 0.4771, precision of 0.7291, and F-measure of 0.7255.

Acknowledgments

This study was supported by the Natural Science Foundation of Shanghai (17ZR1412500) and the Science and Technology Commission of Shanghai Municipality (STCSM) (18dz2271000).

Data Availability

The data used to support the findings of this study are from previously reported studies and public database, which have been cited.

Conflicts of Interest

The authors declare no conflict of interest.

References

  • 1.Hasan M., Yang S., Zhou Y., Mollah N. H. SuccinSite: a computational tool for the prediction of protein succinylation sites by exploiting the amino acid patterns and properties. Molecular BioSystems. 2016;12:786–795. doi: 10.1039/c5mb00853k. [DOI] [PubMed] [Google Scholar]
  • 2.Tan M., Peng C., Anderson K. A., et al. Lysine glutarylation is a protein posttranslational modification regulated by SIRT5. Cell Metabolism. 2014;19(4):605–617. doi: 10.1016/j.cmet.2014.03.014. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Weinert B. T., Schölz C., Wagner S. A., et al. Lysine succinylation is a frequently occurring modification in prokaryotes and eukaryotes and extensively overlaps with acetylation. Cell Reports. 2013;4(4):842–851. doi: 10.1016/j.celrep.2013.07.024. [DOI] [PubMed] [Google Scholar]
  • 4.Xie Z., Dai J., Dai L., et al. Lysine succinylation and lysine malonylation in histones. Molecular & Cellular Proteomics. 2012;11(5):100–107. doi: 10.1074/mcp.M111.015875. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 5.Li X., Hu X., Wan Y., et al. Systematic identification of the lysine succinylation in the protozoan parasite Toxoplasma gondii. Journal of Proteome Research. 2014;13(12):6087–6095. doi: 10.1021/pr500992r. [DOI] [PubMed] [Google Scholar]
  • 6.Yokoyama A., Katsura S., Sugawara A. Biochemical analysis of histone succinylation. Biochemistry Research International. 2017;2017:7. doi: 10.1155/2017/8529404. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Gibson G. E., Xu H., Chen H. L., Chen W., Denton T. T., Zhang S. Alpha-ketoglutarate dehydrogenase complex-dependent succinylation of proteins in neurons and neuronal cell lines. Journal of Neurochemistry. 2015;134(1):86–96. doi: 10.1111/jnc.13096. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 8.Smestad J., Erber L., Chen Y., Maher L. J., III Chromatin succinylation correlates with active gene expression and is perturbed by defective TCA cycle metabolism. iScience. 2018;2:63–75. doi: 10.1016/j.isci.2018.03.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Wang C., Zhang C., Li X., et al. CPT1A-mediated succinylation of S100A10 increases human gastric cancer invasion. Journal of Cellular and Molecular Medicine. 2019;23(1):293–305. doi: 10.1111/jcmm.13920. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10.Chen X.‐. F., Tian M.‐. X., Sun R.‐. Q., et al. SIRT 5 inhibits peroxisomal ACOX 1 to prevent oxidative damage and is downregulated in liver cancer. EMBO Reports. 2018;19(5) doi: 10.15252/embr.201745124. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11.Sadhukhan S., Liu X., Ryu D., et al. Metabolomics-assisted proteomics identifies succinylation and SIRT5 as important regulators of cardiac function. Proceedings of the National Academy of Sciences of the United States of America. 2016;113(16):4320–4325. doi: 10.1073/pnas.1519858113. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Boylston J. A., Sun J., Chen Y., Gucek M., Sack M. N., Murphy E. Characterization of the cardiac succinylome and its role in ischemia–reperfusion injury. Journal of Molecular and Cellular Cardiology. 2015;88:73–81. doi: 10.1016/j.yjmcc.2015.09.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13.Cheng Y., Hou T., Ping J., Chen G., Chen J. Quantitative succinylome analysis in the liver of non-alcoholic fatty liver disease rat model. Proteome Science. 2016;14(1) doi: 10.1186/s12953-016-0092-y. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14.Liu L., Peritore C., Ginsberg J., Shih J., Arun S., Donmez G. Protective role of SIRT5 against motor deficit and dopaminergic degeneration in MPTP-induced mice model of Parkinson’s disease. Behavioural Brain Research. 2015;281:215–221. doi: 10.1016/j.bbr.2014.12.035. [DOI] [PubMed] [Google Scholar]
  • 15.Lutz M. I., Milenkovic I., Regelsberger G., Kovacs G. G. Distinct patterns of sirtuin expression during progression of Alzheimer’s disease. NeuroMolecular Medicine. 2014;16(2):405–414. doi: 10.1007/s12017-014-8288-8. [DOI] [PubMed] [Google Scholar]
  • 16.Zhu D., Hou L., Hu B., et al. Crosstalk among proteome, acetylome and succinylome in colon cancer HCT116 cell treated with sodium dichloroacetate. Scientific Reports. 2016;6(1, article 37478) doi: 10.1038/srep37478. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17.Machida Y., Chiba T., Takayanagi A., et al. Common anti-apoptotic roles of parkin and α-synuclein in human dopaminergic cells. Biochemical and. Biophysical. Research Communications. 2005;332(1):233–240. doi: 10.1016/j.bbrc.2005.04.124. [DOI] [PubMed] [Google Scholar]
  • 18.Lind C., Gerdes R., Hamnell Y., et al. Identification of S-glutathionylated cellular proteins during oxidative stress and constitutive metabolism by affinity purification and proteomic analysis. Archives of Biochemistry and Biophysics. 2002;406(2):229–240. doi: 10.1016/S0003-9861(02)00468-X. [DOI] [PubMed] [Google Scholar]
  • 19.Zhao X., Ning Q., Chai H., Ma Z. Accurate in silico identification of protein succinylation sites using an iterative semi-supervised learning technique. Journal of Theoretical Biology. 2015;374:60–65. doi: 10.1016/j.jtbi.2015.03.029. [DOI] [PubMed] [Google Scholar]
  • 20.Xu Y., Ding Y.-X., Ding J., Lei Y.-H., Wu L.-Y., Deng N.-Y. iSuc-PseAAC: predicting lysine succinylation in proteins by incorporating peptide position-specific propensity. Scientific Reports. 2015;5(1) doi: 10.1038/srep10184. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Xu H.-D., Shi S.-P., Wen P.-P., Qiu J.-D. SuccFind: a novel succinylation sites online prediction tool via enhanced characteristic strategy. Bioinformatics. 2015;31(23):3748–3750. doi: 10.1093/bioinformatics/btv439. [DOI] [PubMed] [Google Scholar]
  • 22.Ning Q., Zhao X., Bao L., Ma Z., Zhao X. Detecting succinylation sites from protein sequences using ensemble support vector machine. BMC Bioinformatics. 2018;19(1):p. 237. doi: 10.1186/s12859-018-2249-4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23.Huang K.-Y., Lee T. Y., Kao H. J., et al. dbPTM in 2019: exploring disease association and cross-talk of post-translational modifications. Nucleic Acids Research. 2019;47(D1):D298–D308. doi: 10.1093/nar/gky1074. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Huang K.-Y., Su M. G., Kao H. J., et al. dbPTM 2016: 10-year anniversary of a resource for post-translational modification of proteins. Nucleic Acids Research. 2016;44(D1):D435–D446. doi: 10.1093/nar/gkv1240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Chen K., Kurgan L. A., Ruan J. Prediction of flexible/rigid regions from protein sequences using k-spaced amino acid pairs. BMC Structural Biology. 2007;7(1):p. 25. doi: 10.1186/1472-6807-7-25. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Chen Z., Chen Y. Z., Wang X. F., Wang C., Yan R. X., Zhang Z. Prediction of ubiquitination sites by using the composition of k-spaced amino acid pairs. PLoS One. 2011;6(7, article e22930) doi: 10.1371/journal.pone.0022930. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Chen Y.-Z., Tang Y.-R., Sheng Z.-Y., Zhang Z. Prediction of mucin-type O-glycosylation sites in mammalian proteins using the composition of k-spaced amino acid pairs. BMC Bioinformatics. 2008;9(1):p. 101. doi: 10.1186/1471-2105-9-101. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 28.Chen Z., Zhou Y., Zhang Z., Song J. Towards more accurate prediction of ubiquitination sites: a comprehensive review of current methods, tools and features. Briefings in Bioinformatics. 2015;16(4):640–657. doi: 10.1093/bib/bbu031. [DOI] [PubMed] [Google Scholar]
  • 29.Chen K., Kurgan L. A., Ruan J. Prediction of protein structural class using novel evolutionary collocation-based sequence representation. Journal of Computational Chemistry. 2008;29(10):1596–1604. doi: 10.1002/jcc.20918. [DOI] [PubMed] [Google Scholar]
  • 30.Chen Z., Zhou Y., Song J., Zhang Z. hCKSAAP_UbSite: improved prediction of human ubiquitination sites by exploiting amino acid pattern and properties. Biochimica et Biophysica Acta (BBA) - Proteins and Proteomics. 2013;1834(8):1461–1467. doi: 10.1016/j.bbapap.2013.04.006. [DOI] [PubMed] [Google Scholar]
  • 31.Hasan M., Zhou Y., Lu X., Li J., Song J., Zhang Z. Computational identification of protein pupylation sites by using profile-based composition of k-spaced amino acid pairs. PLoS One. 2015;10(6, article e0129635) doi: 10.1371/journal.pone.0129635. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Ferron F., Longhi S., Canard B., Karlin D. A practical overview of protein disorder prediction methods. Proteins. 2006;65(1):1–14. doi: 10.1002/prot.21075. [DOI] [PubMed] [Google Scholar]
  • 33.Noivirt-Brik O., Prilusky J., Sussman J. L. Assessment of disorder predictions in CASP8. Proteins. 2009;77(S9):210–216. doi: 10.1002/prot.22586. [DOI] [PubMed] [Google Scholar]
  • 34.Liu M., Liu G. Prediction of citrullination sites on the basis of mRMR method and SNN. Combinatorial Chemistry & High Throughput Screening. 2020;22(10):705–715. doi: 10.2174/1386207322666191129113508. [DOI] [PubMed] [Google Scholar]
  • 35.Peng K., Radivojac P., Vucetic S., Dunker A. K., Obradovic Z. Length-dependent prediction of protein intrinsic disorder. BMC Bioinformatics. 2006;7(1):p. 208. doi: 10.1186/1471-2105-7-208. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Zhou X.-B., Chen C., Li Z.-C., Zou X.-Y. Using Chou’s amphiphilic pseudo-amino acid composition and support vector machine for prediction of enzyme subfamily classes. Journal of Theoretical Biology. 2007;248(3):546–551. doi: 10.1016/j.jtbi.2007.06.001. [DOI] [PubMed] [Google Scholar]
  • 37.Zhao W., Li G. P., Wang J., Zhou Y. K., Gao Y., Du P. F. Predicting protein sub-Golgi locations by combining functional domain enrichment scores with pseudo-amino acid compositions. Journal of Theoretical Biology. 2019;473:38–43. doi: 10.1016/j.jtbi.2019.04.025. [DOI] [PubMed] [Google Scholar]
  • 38.Yang Z., Wang J., Zheng Z., Bai X. A new method for recognizing cytokines based on feature combination and a support vector machine classifier. Molecules. 2018;23(8):p. 2008. doi: 10.3390/molecules23082008. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Arif M., Hayat M., Jan Z. iMem-2LSAAC: a two-level model for discrimination of membrane proteins and their types by extending the notion of SAAC into chou’s pseudo amino acid composition. Journal of Theoretical Biology. 2018;442:11–21. doi: 10.1016/j.jtbi.2018.01.008. [DOI] [PubMed] [Google Scholar]
  • 40.Altschul S. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research. 1997;25(17):3389–3402. doi: 10.1093/nar/25.17.3389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.The UniProt Consortium. UniProt: the universal protein knowledgebase. Nucleic Acids Research. 2017;45:D158–D169. doi: 10.1093/nar/gkw1099. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Ke G., Meng Q., Finley T., et al. LightGBM: a highly efficient gradient boosting decision tree. Advances in neural information processing systems; 2017; Long Beach, CA, USA. pp. 3146–3154. [Google Scholar]
  • 43.Zhou K., Hu Y., Pan H., et al. Fast prediction of reservoir permeability based on embedded feature selection and LightGBM using direct logging data. Measurement Science and Technology. 2020;31(4, article 045101) doi: 10.1088/1361-6501/ab4a45. [DOI] [Google Scholar]
  • 44.Chen L., Pan X. Y., Zhang Y. H., Liu M., Huang T., Cai Y. D. Classification of widely and rarely expressed genes with recurrent neural network. Computational and Structural Biotechnology Journal. 2019;17:49–60. doi: 10.1016/j.csbj.2018.12.002. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Gao H., Ye Z., Dong J., et al. Predicting drug/phospholipid complexation by the lightGBM method. Chemical Physics Letters. 2020;747, article 137354 doi: 10.1016/j.cplett.2020.137354. [DOI] [Google Scholar]
  • 46.Chen C., Zhang Q., Ma Q., Yu B. LightGBM-PPI: predicting protein-protein interactions through LightGBM with multi-information fusion. Chemometrics and Intelligent Laboratory Systems. 2019;191:54–64. doi: 10.1016/j.chemolab.2019.06.003. [DOI] [Google Scholar]
  • 47.Zheng K., Wang L., You Z.-H. CGMDA: an approach to predict and validate microRNA-disease associations by utilizing chaos game representation and LightGBM. IEEE Access. 2019;7:133314–133323. doi: 10.1109/ACCESS.2019.2940470. [DOI] [Google Scholar]
  • 48.Liang W., Luo S., Zhao G., Wu H. Predicting hard rock pillar stability using GBDT, XGBoost, and LightGBM algorithms. Mathematics. 2020;8(5):p. 765. doi: 10.3390/math8050765. [DOI] [Google Scholar]
  • 49.Sokolova M., Lapalme G. A systematic analysis of performance measures for classification tasks. Information Processing & Management. 2009;45(4):427–437. doi: 10.1016/j.ipm.2009.03.002. [DOI] [Google Scholar]
  • 50.Spolaor N., Cherman E. A., Monard M. C., Lee H. D. ReliefF for multi-label feature selection. 2013 Brazilian Conference on Intelligent Systems; 2013; Fortaleza, Brazil. pp. 6–11. [DOI] [Google Scholar]
  • 51.Fan R.-E., Chang K.-W., Hsieh C.-J., Wang X.-R., Lin C.-J. LIBLINEAR: a library for large linear classification. Journal of Machine Learning Research. 2008;4:1871–1874. [Google Scholar]
  • 52.Chen T., Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016; San Francisco California USA. pp. 785–794. [DOI] [Google Scholar]
  • 53.Balbed M. A. M., Ahmad S. M. S., Shakil A. ANOVA-based feature analysis and selection in HMM-based offline signature verification system. 2009 Innovative Technologies in Intelligent Systems and Industrial Applications; 2009; Monash, Malaysia. pp. 66–69. [DOI] [Google Scholar]
  • 54.Timmerman M. E. Principal component analysis. Journal of the American Statistical Association. 2003;98(464):1082–1083. doi: 10.1198/jasa.2003.s308. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.van der Maaten L., Hinton G. Visualizing data using T-SNE. Journal of Machine Learning Research. 2008;9:2579–2605. [Google Scholar]
  • 56.Breiman L. Random forests. Machine Learning. 2001;45:5–32. [Google Scholar]
  • 57.Geurts P., Ernst D., Wehenkel L. Extremely randomized trees. Machine Learning. 2006;63(1):3–42. doi: 10.1007/s10994-006-6226-1. [DOI] [Google Scholar]
  • 58.Friedman J. H. Greedy function approximation: a gradient boosting machine. The Annals of Statistics. 2001;29(5):1189–1536. doi: 10.1214/aos/1013203450. [DOI] [Google Scholar]
  • 59.Keller J. M., Gray M. R., Givens J. A. A fuzzy K-nearest neighbor algorithm. IEEE Transactions on Systems,, Man, and Cybernetics. 1985;SMC-15(4):580–585. doi: 10.1109/TSMC.1985.6313426. [DOI] [Google Scholar]
  • 60.Friedman N., Geiger D., Goldszmidt M. Bayesian network classifiers. Machine Learning. 1997;29(2/3):131–163. doi: 10.1023/A:1007465528199. [DOI] [Google Scholar]
  • 61.Snoek J., Rippel O., Swersky K., et al. Scalable bayesian optimization using deep neural networks. International conference on machine learning; 2015; Lille, France. pp. 2171–2180. [Google Scholar]
  • 62.Moriconi R., Kumar K. S. S., Deisenroth M. P. High-dimensional Bayesian optimization with projections using quantile Gaussian processes. Optimization Letters. 2020;14(1):51–64. doi: 10.1007/s11590-019-01433-w. [DOI] [Google Scholar]

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

The data used to support the findings of this study are from previously reported studies and public database, which have been cited.


Articles from Computational and Mathematical Methods in Medicine are provided here courtesy of Wiley

RESOURCES