Abstract
Protein phosphorylation is catalyzed by kinases, which regulate many aspects of cell growth, movement, and death. Identifying phosphorylation site-specific kinase-substrate relationships (ssKSRs) is important for understanding cellular dynamics and provides a fundamental basis for further disease-related research and drug design. Although several computational methods have been developed, most of them rely mainly on the local sequence around phosphorylation sites and on protein-protein interactions (PPIs) to construct the prediction model. Because phosphorylation is a complicated process involved in various biological mechanisms, this information alone is not sufficient for accurate prediction. In this study, we propose a new computational approach named KSRPred for ssKSR prediction, which introduces novel phosphorylation site-kinase network (pSKN) profiles that efficiently incorporate the relationships between various protein kinases and phosphorylation sites. The experimental results show that the pSKN profiles improve the prediction performance when combined with local sequence and PPI information. Furthermore, a comparison with existing ssKSR prediction tools demonstrates that KSRPred significantly improves the prediction performance.
1. Introduction
As one of the most common posttranslational modifications (PTMs) [1, 2], phosphorylation plays an important role in the regulation of many cellular processes, such as signal transduction, translation, and transcription [3]. Phosphorylation is catalyzed by protein kinases and usually leads to a functional change of the target protein (substrate), for example, by altering its cellular location, its enzyme activity, or its association with other proteins [4, 5]. In humans, nearly 75% of all proteins can be modified by protein kinases [6]. Abnormal activity of protein kinases often causes disease, especially cancer, in which protein kinases regulate many aspects of cell growth, movement, and death [2, 7, 8]. Identification of potential site-specific kinase-substrate relationships (ssKSRs) is therefore important for understanding cellular dynamics and provides a fundamental basis for further disease-related research and drug design.
To this end, several experimental methods, including low-throughput [9, 10] and high-throughput [11–13] biological techniques, have been developed to discover phosphorylation sites and their corresponding kinases. However, low-throughput experimental identification proceeds one site at a time, which is both time-consuming and expensive. Although thousands of phosphorylation sites can be identified by high-throughput mass spectrometry (HTP-MS) techniques [13] in a single experiment [11, 12], it is still difficult to determine which kinase is responsible for the phosphorylation of an observed site. Therefore, in large-scale phosphoproteomics studies, there is a large gap between the phosphorylation sites identified and the protein kinases known to act on them, which greatly hampers the study and elucidation of the mechanism of protein phosphorylation in signalling pathways.
Over the past few decades, several computational methods [14–19] have been put forward to address this problem, and most of them are based mainly on sequence information. For example, Zou et al. [20] developed a web server named PKIS, which adopts the composition of monomer spectrum (CMS) to encode the local sequence and constructs the model with support vector machines (SVMs). Similarly, Damle and Mohanty [15] developed an automated program called PhosNetConstruct, which predicts target kinases for a substrate protein by analyzing domain-specific kinase-substrate relationships derived from HMM profiles obtained from multiple sequence alignments of related proteins. In addition, some recent methods [17, 19] use protein-protein interactions (PPIs) to filter potential false positives and further improve performance. For example, Linding et al. [17] developed a web server named NetworKIN, which is based on known sequence motifs extracted from Scansite and NetPhosK and uses the biological context of substrates as a filter to reduce false positives. Meanwhile, to discover the potential protein kinases of unannotated phosphorylation sites, Song et al. developed the software package iGPS [19], which extends the GPS 2.0 [21] algorithm with an interaction filter.
Although these methods have achieved success, phosphorylation is a complicated process that is usually involved in various biological mechanisms. Consequently, the information adopted in the existing methods may not fully determine the corresponding protein kinase. It is well known that one protein kinase can catalyze the phosphorylation of multiple sites and that one phosphorylation site can be phosphorylated by multiple protein kinases [22–24]. For example, CDK2 can phosphorylate T8, T179, and S213 of protein SMAD3 (P84022), S567 of protein RB1 (P06400), and many other phosphorylation sites [25, 26]. Likewise, S315 of protein TP53 (P04637) can be phosphorylated by AURKA, CDK1, CDK2, and other kinases [27, 28]. The relationships between various protein kinases and phosphorylation sites may carry valuable functional information about protein phosphorylation, which would be helpful for ssKSR prediction in practice.
Inspired by this observation, we propose a novel computational method, KSRPred, for ssKSR prediction. It introduces phosphorylation site-kinase network (pSKN) profiles that efficiently incorporate the relationships between various protein kinases and phosphorylation sites. The method is based on the framework of kernel ridge regression [29, 30], which can effectively integrate pSKN profiles with other useful information, including local sequences and PPIs. The experimental results show that the pSKN profiles improve the prediction performance when combined with local sequence and PPI information. Furthermore, we compare KSRPred with widely used ssKSR prediction tools, and the results indicate that the proposed method achieves better or comparable prediction performance.
2. Materials and Methods
2.1. Data Collection and Preprocessing
In this study, we employ a dataset of experimentally verified human phosphorylation sites with their corresponding kinases, which includes 6,839 verified sites and 389 kinases with 9,480 known ssKSRs extracted from Phospho.ELM [31] and the latest PhosphoSitePlus [32]. Following Xu et al. [33] and Wang et al. [34], we use BlastClust with a 70% threshold to remove substrate redundancy. Since iGPS [19], PKIS [20], and NetworKIN [17] use Phospho.ELM as training data, phosphorylation sites that appear in both the training and testing data would lead to an overestimate of the prediction performance. For a fair comparison with the existing tools, we therefore extract from the nonredundant dataset an independent test dataset of 1,000 phosphorylation sites that are not deposited in Phospho.ELM [31], and we use the rest as the training dataset. For a specific kinase, the verified sites modified by this kinase are taken as positive samples, and the other verified sites are used as negative samples [35]. To achieve reliable results [15, 36], we construct models only for kinases that have at least 15 positive samples, which yields 103 kinases. Detailed information on these kinases is summarized in Table S1 (see Supplementary Material available online at https://doi.org/10.1155/2017/1826496).
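The per-kinase labelling rule described above can be sketched as follows. This is a minimal illustration rather than the authors' pipeline: the function name `build_kinase_labels`, the site identifiers, and the toy threshold used in the example are assumptions introduced here, while the default of 15 positives follows the text.

```python
from collections import defaultdict

def build_kinase_labels(ssksrs, all_sites, min_positives=15):
    """Return {kinase: {site: 0/1}} for kinases with enough positive sites.

    ssksrs    : iterable of (site_id, kinase) pairs with experimental support
    all_sites : iterable of every verified site_id in the dataset
    """
    positives = defaultdict(set)
    for site, kinase in ssksrs:
        positives[kinase].add(site)

    all_sites = list(all_sites)
    labels = {}
    for kinase, pos_sites in positives.items():
        if len(pos_sites) < min_positives:
            continue  # too few positives to train a model for this kinase
        # Sites verified for this kinase are positives (1);
        # every other verified site is a negative (0).
        labels[kinase] = {s: int(s in pos_sites) for s in all_sites}
    return labels

# Toy data with hypothetical site identifiers
pairs = [("P1_S10", "CDK2"), ("P1_T20", "CDK2"), ("P2_S5", "SRC")]
sites = ["P1_S10", "P1_T20", "P2_S5"]
toy = build_kinase_labels(pairs, sites, min_positives=2)
```

In the toy call only CDK2 passes the (lowered) threshold, so SRC is dropped and the SRC site becomes a negative sample for CDK2.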
2.2. The Sequence Kernel Similarity
A local sequence of 15 amino acids is extracted around each phosphorylation site, comprising the site itself together with 7 upstream and 7 downstream residues. We compute the sequence similarity of two phosphorylation sites using the BLOSUM62 matrix, an amino acid substitution matrix that reflects the similarities among the 20 types of amino acids and is commonly used to calculate sequence similarity [37]. The similarity between two phosphorylation sites si and sj is calculated as follows:
$$S_{seq}(s_i, s_j) = \sum_{k=1}^{15} \mathrm{BLOSUM62}\left(s_i(k), s_j(k)\right) \tag{1}$$
where BLOSUM62(si(k), sj(k)) is the similarity score between the kth amino acid of si and the kth amino acid of sj given by the BLOSUM62 matrix. Applying this operation to all pairs of phosphorylation sites, we construct a similarity matrix denoted as Sseq. To ensure that the values of Sseq lie in the range [0, 1], min-max normalization is then performed: Kseq(i, j) = (Sseq(i, j) − min Sseq)/(max Sseq − min Sseq), where min Sseq and max Sseq are the smallest and largest entries of Sseq. The matrix Kseq is used as the kernel similarity matrix of phosphorylation sites at the sequence level.
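As an illustration of the position-wise scoring and min-max normalization described above, the following sketch uses only the A, S, and T entries of BLOSUM62 and 3-residue toy windows; a real application would need the full 20 × 20 matrix and 15-residue windows. The function names and the handling of the degenerate case (all scores equal) are assumptions made for this example.

```python
# Subset of BLOSUM62 restricted to A, S and T (the matrix is symmetric).
BLOSUM62_SUBSET = {
    ("A", "A"): 4, ("S", "S"): 4, ("T", "T"): 5,
    ("A", "S"): 1, ("A", "T"): 0, ("S", "T"): 1,
}

def pair_score(a, b, matrix=BLOSUM62_SUBSET):
    """Look up a symmetric substitution score."""
    return matrix[(a, b)] if (a, b) in matrix else matrix[(b, a)]

def sequence_similarity(si, sj, matrix=BLOSUM62_SUBSET):
    """Sum of position-wise substitution scores for two equal-length windows."""
    if len(si) != len(sj):
        raise ValueError("windows must have equal length")
    return sum(pair_score(a, b, matrix) for a, b in zip(si, sj))

def sequence_kernel(windows, matrix=BLOSUM62_SUBSET):
    """Min-max normalised similarity matrix with entries in [0, 1]."""
    n = len(windows)
    s = [[sequence_similarity(windows[i], windows[j], matrix) for j in range(n)]
         for i in range(n)]
    lo = min(min(row) for row in s)
    hi = max(max(row) for row in s)
    if hi == lo:
        # Degenerate case (all scores equal): treated as fully similar here.
        return [[1.0] * n for _ in range(n)]
    return [[(v - lo) / (hi - lo) for v in row] for row in s]

windows = ["AST", "ASS", "TTT"]  # toy 3-residue windows
k_seq = sequence_kernel(windows)
```

Because the raw scores on the diagonal differ between windows, the normalized diagonal is not necessarily 1; only the largest raw score maps to 1 and the smallest to 0.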
2.3. The PPI Kernel Similarity
The PPI information of substrates is extracted from STRING [38], a comprehensive yet quality-controlled collection of protein-protein associations. Since these associations are derived from high-throughput experimental data, from the mining of databases and literature, and from predictions based on genomic context analysis [38], we follow Butland et al. [39] and Jafari et al. [40] and use a medium (0.4) confidence cut-off to filter the associations. In this way, 18,836 proteins that interact with the 2,162 nonredundant substrates are obtained. We compute the PPI similarity between two substrates using the Jaccard index [41]: Sppi(pi, pj) = |Jpi ∩ Jpj|/|Jpi ∪ Jpj|, where Jpi and Jpj are the sets of interaction partners of substrates pi and pj, respectively. Applying this operation to all substrate pairs, we construct a similarity matrix denoted as Sppi. Some substrates have more than one phosphorylation site; such sites belong to the same substrate and therefore share the same PPI information [42]. The similarity matrix Kppi of phosphorylation sites is thus obtained by assigning to each pair of sites the similarity of their substrates. The matrix Kppi is used as the kernel similarity matrix of phosphorylation sites at the substrate level.
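The Jaccard computation and the way sites inherit the similarity of their substrates can be sketched as below. The identifiers, the function names, and the convention of returning 0 when both partner sets are empty are assumptions made for this example; the partner sets would in practice come from the filtered STRING associations.

```python
def jaccard(a, b):
    """Jaccard index |a & b| / |a | b| of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def ppi_site_kernel(site_to_substrate, partners):
    """Site-level PPI kernel: each site inherits the partner set of its substrate."""
    sites = list(site_to_substrate)
    return [[jaccard(partners[site_to_substrate[si]],
                     partners[site_to_substrate[sj]])
             for sj in sites] for si in sites]

# Hypothetical toy data: two substrates, three phosphorylation sites
partners = {"SUB1": {"A", "B", "C"}, "SUB2": {"B", "C", "D"}}
site_to_substrate = {"SUB1_S10": "SUB1", "SUB1_T20": "SUB1", "SUB2_S5": "SUB2"}
k_ppi = ppi_site_kernel(site_to_substrate, partners)
```

Two sites on the same substrate receive a similarity of 1, while sites on SUB1 and SUB2 share two of four partners and receive 0.5.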
2.4. Construction of pSKN Profiles and Kernel Similarity
The relationships between various kinases and phosphorylation sites can be expressed as a bipartite network (Figure 1), from which we extract the novel pSKN profiles. Formally, we denote the phosphorylation site set as Xs = {s1, s2,…, sns} and the kinase set as Xk = {k1, k2,…, knk}; the relationships between kinases and phosphorylation sites can then be described as a bipartite network G(Xs, Xk, E), where E = {eij : si ∈ Xs, kj ∈ Xk}. A link is drawn between si and kj when the phosphorylation site si has a known relationship with the kinase kj. This bipartite network can be represented by an ns × nk adjacency matrix Y, where yij = 1 if si and kj are linked and yij = 0 for all other (unknown) phosphorylation site-kinase pairs. To incorporate pSKN profiles into the prediction, we construct a kernel similarity matrix from them using the Gaussian kernel function (i.e., RBF). The similarity between two phosphorylation sites si and sj is calculated as follows:
$$K_{net}(s_i, s_j) = \exp\left(-\gamma_s \lVert y_{s_i} - y_{s_j} \rVert^2\right) \tag{2}$$
where ysi and ysj represent the ith and jth rows of the adjacency matrix Y, respectively. The kernel bandwidth is controlled by the parameter γs, which is obtained by normalizing a new bandwidth parameter γs′ by the average number of known relationships per phosphorylation site (i.e., per row of Y). The formula for the calculation of γs is
$$\gamma_s = \frac{\gamma_s'}{\frac{1}{n_s}\sum_{i=1}^{n_s} \lVert y_{s_i} \rVert^2} \tag{3}$$
Applying this operation to all phosphorylation site pairs, we construct a similarity matrix denoted as Knet. The similarity matrix Knet is considered as kernel similarity matrix of phosphorylation sites calculated from relationship level.
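A Gaussian kernel over the rows of Y can be sketched as follows. Two points are assumptions of this example and are not stated in the text: the bandwidth is taken to be γs′ divided by the mean squared row norm of Y (the average number of known kinases per site), and γs′ defaults to 1.0. The function name is hypothetical.

```python
import math

def gaussian_profile_kernel(Y, gamma_prime=1.0):
    """Gaussian kernel over the rows of a binary site-by-kinase matrix Y.

    Assumed normalisation: gamma = gamma_prime / (mean squared row norm of Y).
    """
    n = len(Y)
    mean_sq_norm = sum(sum(v * v for v in row) for row in Y) / n
    if mean_sq_norm == 0:
        raise ValueError("Y contains no known relationships")
    gamma = gamma_prime / mean_sq_norm
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            # Squared Euclidean distance between the two site profiles
            sq_dist = sum((a - b) ** 2 for a, b in zip(Y[i], Y[j]))
            K[i][j] = math.exp(-gamma * sq_dist)
    return K

# Toy adjacency matrix: 3 sites (rows) x 2 kinases (columns)
Y = [[1, 0],
     [1, 0],
     [0, 1]]
k_net = gaussian_profile_kernel(Y)
```

Sites with identical profiles (rows 0 and 1) have similarity 1, and the similarity decays with the number of kinases on which two profiles disagree.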
2.5. Kernel Ridge Classifier
To our knowledge, kernel ridge regression (KRR) is widely used in the field of bioinformatics [43–45], and existing studies [44] show that KRR and SVM have similar classification accuracy. In this study, we test these two algorithms on our dataset and find that KRR is comparable to or slightly better than SVM. Therefore, we choose KRR to construct the prediction model.
Formally, given a training dataset T = {(x1, y1),…, (xn, yn)}, where xi ∈ Rm and yi ∈ {0, 1}, the basic idea of KRR is to map the data into a higher dimensional space ℋ (also called the feature space) through a mapping Φ and then to find a linear regression function on the mapped training set {(Φ(x1), y1),…, (Φ(xn), yn)}, which corresponds to a nonlinear regression in the original input space [46]. The linear ridge regression problem consists in minimizing the following cost:
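$$J(\omega) = \sum_{i=1}^{n} \left(y_i - \omega^{T}\Phi(x_i)\right)^2 + \lambda \lVert \omega \rVert^2 \tag{4}$$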
where λ is a regularization parameter used to control the trade-off between the bias and variance of the estimate. By setting the derivative of this cost function to zero [47], we obtain the optimal solution $\omega^{*} = \phi(\phi^{T}\phi + \lambda I_n)^{-1}Y$, where ϕ = [Φ(x1),…, Φ(xn)] denotes the matrix whose columns are the mapped training samples. Therefore, for a new unlabeled sample x, the predicted label y (i.e., $y = \omega^{*T}\Phi(x)$) can be calculated by the following formula:
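$$y = Y^{T}\left(K + \lambda I_n\right)^{-1}k_x, \qquad K_{ij} = K(x_i, x_j), \qquad (k_x)_i = K(x_i, x) \tag{5}$$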
where Y is the vector of values yi and $K(x_i, x_j) = \Phi(x_i)^{T}\Phi(x_j)$ is the kernel function.
In this study, we develop three similarity kernels, namely, sequence similarity kernel, PPI similarity kernel, and pSKN similarity kernel, from different data sources. In order to make full use of these kernels, we follow van Laarhoven et al. [48] and define a custom kernel function. The formula is defined as follows:
$$K = \sum_{\varphi \in \{seq,\, ppi,\, net\}} \eta_{\varphi} K_{\varphi} \tag{6}$$
For the results reported in our evaluation, the unweighted average is adopted, that is, ηφ = 1/3 for φ ∈ {seq, ppi, net}. Using (5) and (6), we can construct the corresponding model and make predictions for unlabeled phosphorylation sites. The model is implemented with the scikit-learn library (version 0.18) [49] in Python.
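The combination of the three kernels and the dual form of kernel ridge regression can be sketched with NumPy as follows. This is an illustrative sketch, not the authors' scikit-learn implementation: the function names, the toy kernel values, and the regularization value λ = 1.0 are assumptions. The dual coefficients solve (K + λI)α = Y, and a new sample is scored as the sum of αi K(xi, x).

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Weighted sum of equally sized kernel matrices (unweighted average by default)."""
    kernels = [np.asarray(k, dtype=float) for k in kernels]
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    return sum(w * k for w, k in zip(weights, kernels))

def krr_fit(K_train, y, lam=1.0):
    """Solve (K + lam * I) alpha = y for the dual coefficients alpha."""
    K_train = np.asarray(K_train, dtype=float)
    n = K_train.shape[0]
    return np.linalg.solve(K_train + lam * np.eye(n), np.asarray(y, dtype=float))

def krr_predict(K_test_train, alpha):
    """Scores for test samples; row i holds K(x_test_i, x_train_j) for all j."""
    return np.asarray(K_test_train, dtype=float) @ alpha

# Toy 2-site example with made-up kernel values
K_seq = [[1.0, 0.2], [0.2, 1.0]]
K_ppi = [[1.0, 0.5], [0.5, 1.0]]
K_net = [[1.0, 0.8], [0.8, 1.0]]
K = combine_kernels([K_seq, K_ppi, K_net])  # unweighted average of the three
alpha = krr_fit(K, y=[1, 0], lam=1.0)       # site 0 positive, site 1 negative
scores = krr_predict(K, alpha)              # scores for the two training sites
```

With scikit-learn, a comparable model could plausibly be obtained from `sklearn.kernel_ridge.KernelRidge` with `kernel="precomputed"`, although the text does not state which estimator was used.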
2.6. Performance Evaluation
Following previous works [50, 51], we use 10-fold cross-validation to evaluate the prediction performance of the classifier. The receiver operating characteristic (ROC) curve and the area under the curve (AUC), averaged over the 10 folds, are used to summarize performance. In addition, to make the evaluation reliable and fair, several commonly used measures are adopted: specificity (Sp), sensitivity (Sn), Matthews correlation coefficient (MCC), F-measure (F1), and precision (Pre). They are defined as follows:
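$$\begin{aligned}
Sn &= \frac{TP}{TP + FN}, \qquad Sp = \frac{TN}{TN + FP}, \qquad Pre = \frac{TP}{TP + FP}, \\
F1 &= \frac{2 \times Pre \times Sn}{Pre + Sn}, \\
MCC &= \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}
\end{aligned} \tag{7}$$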
TP and TN represent the numbers of positive and negative sites that are correctly predicted (true positives and true negatives, respectively), while FN and FP represent the numbers of positive and negative sites that are wrongly predicted (false negatives and false positives, respectively). It is noteworthy that MCC remains informative when the numbers of positive and negative samples are significantly imbalanced, so it is used here as a balanced measure of quality.
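The standard definitions of these measures can be computed from the four confusion-matrix counts as in the sketch below. The function name and the convention of returning 0.0 when a denominator is zero are assumptions made for this example.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Sn, Sp, Pre, F1 and MCC from confusion-matrix counts (0.0 when undefined)."""
    def ratio(num, den):
        return num / den if den else 0.0

    sn = ratio(tp, tp + fn)    # sensitivity (recall)
    sp = ratio(tn, tn + fp)    # specificity
    pre = ratio(tp, tp + fp)   # precision
    f1 = ratio(2 * pre * sn, pre + sn)
    mcc = ratio(tp * tn - fp * fn,
                math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"Sn": sn, "Sp": sp, "Pre": pre, "F1": f1, "MCC": mcc}

# Imbalanced toy example: 10 positive and 100 negative sites
m = confusion_metrics(tp=8, tn=90, fp=10, fn=2)
```

In this toy case Sn is 0.8 and Sp is 0.9, yet precision is only 8/18 because negatives outnumber positives tenfold, which mirrors the modest Pre values that can accompany high Sn on imbalanced data.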
3. Results
3.1. Evaluation of pSKN Profiles
In this study, we employ novel pSKN profiles to predict ssKSRs. To confirm the effectiveness of the pSKN profiles, we compare the proposed method with and without pSKN profiles on the basis of local sequence information. The prediction performance of these two variants is evaluated on the training dataset using 10-fold cross-validation. Here, we take the kinases GSK3B, PLK1, P38A (MAPK14), and CDK2 as examples to illustrate the predictive performance, as shown in Figure 2. The proposed method with pSKN profiles shows higher prediction accuracy. For example, for GSK3B, the AUC value of the proposed method trained with local sequences only is 82.2%. After adding pSKN profiles, the AUC value rises to 87.2%, an improvement of 5.0 percentage points. Likewise, for PLK1, adding pSKN profiles increases the AUC by 7.2 percentage points over the method using local sequences only. Moreover, Figure S1 displays the ROC curves of the three most pleiotropic protein kinases (i.e., PKCA, PKACA, and CK2A1), from which we can draw a consistent conclusion. Taking PKCA as an example, the AUC value of the proposed method with pSKN profiles is 90.3%, which is 5.0 percentage points higher than that of the method with local sequences only.
Additionally, following previous works [19, 20, 52], measurements such as Sp, Sn, F1, Pre, and MCC are also adopted to ensure the reliability of the performance evaluation. The measurements are evaluated at medium (Sp = 90.0%) and high (Sp = 95.0%) stringency levels. Table 1 displays the Sn, F1, Pre, and MCC values of different kinases at the medium stringency level. The proposed method with pSKN profiles outperforms the sequence-only method in almost all cases. For example, for PKCA, the Sn, MCC, F1, and Pre values are 69.5%, 39.8%, 40.5%, and 28.6%, which are 11.6, 7.1, 5.6, and 3.6 percentage points higher than those of the method using local sequences only. Moreover, Table S2 displays the Sn, MCC, F1, and Pre values at the high stringency level, from which we can draw a consistent conclusion. Overall, these results show that pSKN profiles can significantly improve the prediction performance for different kinases.
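Evaluating at a fixed stringency level amounts to choosing a score cutoff that reaches the target specificity and then reading off the sensitivity at that cutoff. The sketch below shows one simple way to do this; the function name, the tie-handling convention (a site is called positive when its score is at least the cutoff), and the toy scores are assumptions made for this example.

```python
def sensitivity_at_specificity(pos_scores, neg_scores, target_sp=0.90):
    """Return (cutoff, Sn, Sp) for the lowest cutoff reaching the target specificity.

    A site is predicted positive when its score is >= cutoff.
    """
    candidates = sorted(set(pos_scores) | set(neg_scores))
    candidates.append(float("inf"))  # fallback cutoff that rejects every site
    for cutoff in candidates:
        tn = sum(s < cutoff for s in neg_scores)
        sp = tn / len(neg_scores)
        if sp >= target_sp:
            tp = sum(s >= cutoff for s in pos_scores)
            return cutoff, tp / len(pos_scores), sp

# Toy prediction scores (made up)
neg = [0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.35, 0.4, 0.45, 0.7]
pos = [0.3, 0.6, 0.8, 0.9]
cutoff, sn, sp = sensitivity_at_specificity(pos, neg, target_sp=0.90)
```

Here nine of the ten negatives score below 0.6, so that cutoff gives Sp = 0.9, and three of the four positives are recovered (Sn = 0.75).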
Table 1.
Kinases | Methods | AUC | Sn | MCC | F1 | Pre |
---|---|---|---|---|---|---|
CDK2 | Seq | 88.0% | 55.9% | 35.8% | 40.5% | 31.8% |
CDK2 | pSKN | 91.2% | 72.2% | 46.7% | 49.4% | 37.5% |
CDK2 | Full | 93.4% | 83.1% | 53.6% | 54.8% | 40.9% |
CK2A1 | Seq | 93.0% | 83.4% | 50.1% | 49.8% | 35.5% |
CK2A1 | pSKN | 94.3% | 86.1% | 51.8% | 51.0% | 36.2% |
CK2A1 | Full | 94.4% | 88.4% | 53.1% | 52.0% | 36.8% |
FYN | Seq | 93.3% | 74.1% | 24.6% | 17.4% | 9.9% |
FYN | pSKN | 94.6% | 83.5% | 28.1% | 19.4% | 11.0% |
FYN | Full | 95.5% | 84.7% | 28.5% | 19.7% | 11.1% |
GSK3B | Seq | 82.2% | 51.7% | 21.0% | 19.4% | 11.9% |
GSK3B | pSKN | 87.2% | 68.5% | 28.9% | 24.9% | 15.2% |
GSK3B | Full | 89.3% | 73.8% | 31.4% | 26.6% | 16.2% |
P38A | Seq | 81.2% | 43.2% | 16.7% | 16.2% | 10.0% |
P38A | pSKN | 87.9% | 69.2% | 29.0% | 24.8% | 15.1% |
P38A | Full | 90.5% | 75.3% | 31.8% | 26.7% | 16.2% |
PKACA | Seq | 90.1% | 70.5% | 41.5% | 42.5% | 30.5% |
PKACA | pSKN | 91.9% | 77.2% | 45.5% | 45.7% | 32.4% |
PKACA | Full | 93.0% | 81.0% | 47.8% | 47.4% | 33.5% |
PKCA | Seq | 85.3% | 57.9% | 32.7% | 34.9% | 25.0% |
PKCA | pSKN | 90.3% | 69.5% | 39.8% | 40.5% | 28.6% |
PKCA | Full | 91.5% | 80.2% | 46.2% | 45.3% | 31.6% |
PLK1 | Seq | 79.1% | 48.0% | 20.8% | 20.7% | 13.2% |
PLK1 | pSKN | 86.3% | 62.6% | 28.3% | 26.1% | 16.5% |
PLK1 | Full | 89.7% | 80.4% | 37.2% | 32.4% | 20.3% |
SRC | Seq | 94.5% | 88.3% | 51.1% | 49.3% | 34.2% |
SRC | pSKN | 96.1% | 86.4% | 50.1% | 48.5% | 33.7% |
SRC | Full | 97.2% | 92.9% | 53.8% | 51.2% | 35.3% |
Recently, several studies [17, 19] have used PPI information to filter false positive predictions, which improves the precision of the prediction results at the cost of reduced sensitivity [19]. We therefore test the full method, which integrates pSKN profiles, local sequence, and PPI information, to examine the ability of KSRPred to incorporate PPI information. The AUC values and the other measurements at medium and high stringency levels are listed in Table 1 and Table S2, respectively. As can be seen, for most kinases, the proposed method improves not only the precision of the prediction results but also the corresponding sensitivity, which indicates that it makes better use of PPI information than the existing methods [17, 19]. Taking P38A as an example, the AUC value of the full method increases to 90.5%, which is 2.6 percentage points higher than that of the method with pSKN profiles. In addition, the Sn, MCC, F1, and Pre values at the medium stringency level (Sp = 90.0%) are improved by 6.1, 2.8, 1.9, and 1.1 percentage points, respectively. The performance of other kinases is displayed in Table S3.
3.2. Comparison with the Existing ssKSRs Prediction Tools
In the previous section, we verified the effectiveness of pSKN profiles. In this section, we use the independent test dataset to compare KSRPred with four widely used ssKSR prediction tools, namely, NetPhosK [53], iGPS [19], NetworKIN [17], and PKIS [20]. Here, we take four kinases that can be predicted by these tools as examples, and the corresponding ROC curves are displayed in Figure 3. The proposed method is generally superior to the existing tools. For example, for P38A, the AUC value of KSRPred is 90.9%, which is 12.4%, 18.7%, 16.7%, and 9.3% higher than those of NetPhosK, iGPS, NetworKIN, and PKIS, respectively. Likewise, for SRC, the AUC value of KSRPred is 4.4%, 30.1%, 48.5%, and 7.6% higher than those of NetPhosK, iGPS, NetworKIN, and PKIS, respectively.
In addition to the AUC values, the measurements (i.e., Sn, F1, Pre, and MCC) at medium and high stringency levels are also adopted to evaluate the performance. The Sn, MCC, F1, and Pre values of the five methods at the high and medium stringency levels are shown as bar charts in Figure 4, and the details are listed in Table S4. The experimental results show that KSRPred achieves the best performance in almost all circumstances in comparison with the existing tools. For example, for SRC at the high stringency level, the Sn, MCC, F1, and Pre values of KSRPred are higher by 42.9%, 28.1%, 24.0%, and 14.8% than those of iGPS and by 50.0%, 33.4%, 28.9%, and 18.3% than those of PKIS, respectively. Similarly, compared with NetPhosK and NetworKIN, the Sn, MCC, F1, and Pre values of KSRPred are improved by 42.9%, 28.1%, 24.0%, and 14.8% and by 87.5%, 66.5%, 60.5%, and 45.2%, respectively. We further analyze the results for this kinase and find that, at the high stringency level, some phosphorylation sites are correctly assigned by KSRPred but not by the existing tools. For example, Y53 of AKAP8 (O43823) is phosphorylated by SRC and is correctly assigned by our method but is not predicted by the existing tools. In summary, these results suggest that KSRPred achieves better or comparable performance compared with the existing ssKSR prediction tools. In addition, Figure S2 compares the performance of the proposed method without pSKN profiles with that of NetPhosK and iGPS. The result shows that KSRPred without pSKN profiles also performs better than these two tools. Taking P38A as an example, the AUC achieved by KSRPred without pSKN profiles is 7.8% and 14.1% higher than those of NetPhosK and iGPS, respectively.
3.3. Detailed Analysis of the Prediction Results
After confirming the advantages of the proposed method, we conduct a detailed analysis of the prediction results. The top-ranked predictions are the most important in practice, because they are the ones used for proteome-wide screening and systematic examination [42]. This requires a computational method with a low false positive rate. Hence, we compare the numbers of correctly retrieved ssKSRs at different percentiles. For each percentile p%, we count the number of true ssKSRs among the top-ranked p% × 1,000 predictions, where 1,000 is the number of phosphorylation sites in the independent test dataset. Taking P38A as an example, the results at five percentiles (1%, 2%, 5%, 10%, and 15% of the total number of phosphorylation sites) are compared in Figure 5. At all percentiles, KSRPred retrieves more true positive predictions than NetPhosK, NetworKIN, iGPS, and PKIS.
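The percentile-based count can be sketched as follows: rank the sites by predicted score for one kinase and count how many known ssKSRs fall in the top p% of the ranking. The function name, the rounding of the cut position, and the toy scores and labels are assumptions made for this example.

```python
def true_hits_in_top(scores, labels, percent):
    """Count true ssKSRs among the top-ranked percent% of predictions."""
    k = int(round(len(scores) * percent / 100.0))
    # Rank sites from highest to lowest predicted score
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    return sum(label for _, label in ranked[:k])

# Toy example: 10 sites, 3 of which are known substrates of the kinase
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
hits_20 = true_hits_in_top(scores, labels, percent=20)
hits_50 = true_hits_in_top(scores, labels, percent=50)
```

With 1,000 test sites, the 1% percentile corresponds to the 10 highest-scoring predictions and the 15% percentile to the top 150.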
In addition, because experimental verification is difficult, a computational method should also be able to detect unknown ssKSRs [42]. In view of this, we analyze the top 20 predicted potential phosphorylation sites. Taking CDK2 as an example, the detailed information on these phosphorylation sites is listed in Table 2. By mining the literature, we find that some of these predictions have been confirmed as phosphorylation sites catalyzed by this kinase. For example, Leng et al. [54] reported that CDK2 can phosphorylate the S964 site of protein RBL1 (P28749). Likewise, according to the UniProtKB database, this kinase can phosphorylate the S975 site of protein RBL1 (P28749) (http://www.uniprot.org/uniprot/P28749#ptm_processing). These findings suggest that KSRPred not only has a low false positive rate but is also able to discover unknown ssKSRs, which could be helpful for subsequent experimental verification.
Table 2.
Ranking | UniProtKB | Protein name | Site | Score |
---|---|---|---|---|
1 | Q08999 | RBL2 | S1035 | 0.4707 |
2 | P28749 | RBL1 | S964 | 0.4672 |
3 | P28749 | RBL1 | T369 | 0.4481 |
4 | Q08999 | RBL2 | S672 | 0.4403 |
5 | P28749 | RBL1 | S975 | 0.4389 |
6 | Q08999 | RBL2 | T401 | 0.4313 |
7 | Q9UQ35 | SRRM2 | T1413 | 0.3952 |
8 | P49736 | MCM2 | S31 | 0.3736 |
9 | Q9Y5N6 | ORC6 | T195 | 0.3717 |
10 | P24928 | POLR2A | S1878 | 0.3663 |
11 | Q15910 | EZH2 | T487 | 0.3653 |
12 | Q9UQ35 | SRRM2 | T866 | 0.3553 |
13 | O15446 | CD3EAP | S285 | 0.3505 |
14 | P24928 | POLR2A | S1920 | 0.3495 |
15 | P24928 | POLR2A | S1934 | 0.3492 |
16 | Q02539 | HIST1H1A | S183 | 0.3488 |
17 | P10276 | RARA | S77 | 0.3425 |
18 | Q5TKA1 | LIN9 | T96 | 0.3412 |
19 | Q9P1Z0 | ZBTB4 | T983 | 0.3347 |
20 | P49736 | MCM2 | T59 | 0.3338 |
4. Discussions and Conclusions
Phosphorylation, which is catalyzed by protein kinases, plays a significant role in a wide range of cellular processes, and many phosphorylation-related diseases are closely related to kinases. Prediction of ssKSRs is important for understanding the phosphorylation process and provides a fundamental basis for further studies of cell dynamics and for drug design. However, traditional experimental methods are costly and time-consuming, so it is important to develop effective computational methods to predict ssKSRs. Although several computational methods for ssKSR prediction have been proposed, they usually use only local sequence and PPI information, which is not sufficient for accurate prediction. In this study, we present pSKN profiles that efficiently incorporate the relationships between various kinases and phosphorylation sites. Using these pSKN profiles, the performance of the proposed method is significantly improved. Meanwhile, we use PPIs extracted from the STRING database as the substrate feature, and the experimental results show that the proposed method makes better use of this information than existing methods (e.g., iGPS and NetworKIN). Furthermore, through the analysis of potential phosphorylation sites, we find that some highly ranked predictions have been confirmed as phosphorylation sites catalyzed by the corresponding kinases, which suggests that KSRPred is effective in discovering new potential ssKSRs for experimental validation and in elucidating the molecular mechanism of protein phosphorylation.
Although the proposed method has shown good ability for ssKSR prediction, there is still much room for improvement. It is well known that the quantity of training data plays a crucial role in the performance of machine learning methods [55, 56], and the performance is expected to improve further when more training data become available. Additionally, kinases have corresponding family information, and studies [33, 36] have shown that this information is useful for ssKSR prediction. In this study, we do not consider kinase family information, which can be integrated into the proposed method in future work. Moreover, the PPI dataset used in this study comes from the STRING database, and many other publicly available PPI databases, for example, MINT [57] and I2D [58], could be included to further improve the performance of the proposed method. Furthermore, as kinase-catalyzed phosphorylation is a complex biological process affected by various mechanisms, incorporating more relevant functional information may also enhance the performance of ssKSR prediction. Finally, the pSKN profiles are extracted from the relationships between kinases and phosphorylation sites, and the experimental results show that this information effectively improves the prediction performance. However, experimentally verified relationships between kinases and phosphorylation sites are still comparatively rare. Hence, the performance of KSRPred is expected to improve further as more relationships become available.
Supplementary Material
Acknowledgments
This work was supported by the National Natural Science Foundation of China (no. 61471331, no. 61571414, and no. 61101061).
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
References
- 1.Lou Y., Yao J., Zereshki A. NEK2A interacts with MAD1 and possibly functions as a novel integrator of the spindle checkpoint signaling. The Journal of Biological Chemistry. 2004;279(19):20049–20057. doi: 10.1074/jbc.M314205200. [DOI] [PubMed] [Google Scholar]
- 2.Singh C. R., Curtis C., Yamamoto Y., et al. Eukaryotic translation initiation factor 5 is critical for integrity of the scanning preinitiation complex and accurate control of GCN4 translation. Molecular and Cellular Biology. 2005;25(13):5480–5491. doi: 10.1128/MCB.25.13.5480-5491.2005. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Cohen P. The origins of protein phosphorylation. Nature Cell Biology. 2002;4(5):E127–E130. doi: 10.1038/ncb0502-e127. [DOI] [PubMed] [Google Scholar]
- 4.Hunter T. Signaling—2000 and beyond. Cell. 2000;100(1):113–127. doi: 10.1016/S0092-8674(00)81688-8. [DOI] [PubMed] [Google Scholar]
- 5.Zhou F.-F., Xue Y., Chen G.-L., Yao X. GPS: A novel group-based phosphorylation predicting and scoring method. Biochemical and Biophysical Research Communications. 2004;325(4):1443–1448. doi: 10.1016/j.bbrc.2004.11.001. [DOI] [PubMed] [Google Scholar]
- 6.Sharma K., D'Souza R. C. J., Tyanova S., et al. Ultradeep Human Phosphoproteome Reveals a Distinct Regulatory Nature of Tyr and Ser/Thr-Based Signaling. Cell Reports. 2014;8(5):1583–1594. doi: 10.1016/j.celrep.2014.07.036. [DOI] [PubMed] [Google Scholar]
- 7.Bajpai M. Fostamatinib, a Syk inhibitor prodrug for the treatment of inflammatory diseases. IDrugs. 2009;12(3):174–185.
- 8.Manning G., Whyte D. B., Martinez R., Hunter T., Sudarsanam S. The protein kinase complement of the human genome. Science. 2002;298(5600):1912–1934. doi: 10.1126/science.1075762.
- 9.Lin Z., Zhang P.-W., Zhu X., et al. Phosphatidylinositol 3-kinase, protein kinase C, and MEK1/2 kinase regulation of dopamine transporters (DAT) require N-terminal DAT phosphoacceptor sites. Journal of Biological Chemistry. 2003;278(22):20162–20170. doi: 10.1074/jbc.M209584200.
- 10.Salinas M., Wang J., Rosa De Sagarra M., et al. Protein kinase Akt/PKB phosphorylates heme oxygenase-1 in vitro and in vivo. FEBS Letters. 2004;578(1-2):90–94. doi: 10.1016/j.febslet.2004.10.077.
- 11.Han G., Ye M., Liu H., et al. Phosphoproteome analysis of human liver tissue by long-gradient nanoflow LC coupled with multiple stage MS analysis. Electrophoresis. 2010;31(6):1080–1089. doi: 10.1002/elps.200900493.
- 12.Song C., Ye M., Han G., et al. Reversed-phase-reversed-phase liquid chromatography approach with high orthogonality for multidimensional separation of phosphopeptides. Analytical Chemistry. 2010;82(1):53–56. doi: 10.1021/ac9023044.
- 13.Villén J., Beausoleil S. A., Gerber S. A., Gygi S. P. Large-scale phosphorylation analysis of mouse liver. Proceedings of the National Academy of Sciences of the United States of America. 2007;104(5):1488–1493. doi: 10.1073/pnas.0609836104.
- 14.Blom N., Sicheritz-Pontén T., Gupta R., Gammeltoft S., Brunak S. Prediction of post-translational glycosylation and phosphorylation of proteins from the amino acid sequence. Proteomics. 2004;4(6):1633–1649. doi: 10.1002/pmic.200300771.
- 15.Damle N. P., Mohanty D. Deciphering kinase-substrate relationships by analysis of domain-specific phosphorylation network. Bioinformatics. 2014;30(12):1730–1738. doi: 10.1093/bioinformatics/btu112.
- 16.Kumar N., Mohanty D. Identification of substrates for Ser/Thr kinases using residue-based statistical pair potentials. Bioinformatics. 2010;26(2):189–197. doi: 10.1093/bioinformatics/btp633.
- 17.Linding R., Jensen L. J., Pasculescu A., et al. NetworKIN: A resource for exploring cellular phosphorylation networks. Nucleic Acids Research. 2008;36(1):D695–D699. doi: 10.1093/nar/gkm902.
- 18.Saunders N. F. W., Kobe B. The Predikin webserver: improved prediction of protein kinase peptide specificity using structural information. Nucleic Acids Research. 2008;36:W286–W290. doi: 10.1093/nar/gkn279.
- 19.Song C., et al. Systematic analysis of protein phosphorylation networks from phosphoproteomic data. Molecular & Cellular Proteomics. 2012;11:1070–1083. doi: 10.1074/mcp.M111.012625.
- 20.Zou L., Wang M., Shen Y., Liao J., Li A., Wang M. PKIS: Computational identification of protein kinases for experimentally discovered protein phosphorylation sites. BMC Bioinformatics. 2013;14(1, article 247) doi: 10.1186/1471-2105-14-247.
- 21.Xue Y., Ren J., Gao X., Jin C., Wen L., Yao X. GPS 2.0, a tool to predict kinase-specific phosphorylation sites in hierarchy. Molecular & Cellular Proteomics. 2008;7:1598–1608. doi: 10.1074/mcp.M700574-MCP200.
- 22.Alessi D., Kozlowski M. T., Weng Q.-P., Morrice N., Avruch J. 3-phosphoinositide-dependent protein kinase 1 (PDK1) phosphorylates and activates the p70 S6 kinase in vivo and in vitro. Current Biology. 1998;8(2):69–81. doi: 10.1016/S0960-9822(98)70037-5.
- 23.Coba M. P., et al. Neurotransmitters drive combinatorial multistate postsynaptic density networks. Science Signaling. 2009 doi: 10.1126/scisignal.2000102.
- 24.Venerando A., Cesaro L., Pinna L. A. From phosphoproteins to phosphoproteomes: A historical account. FEBS Journal. 2017;284:1936–1951. doi: 10.1111/febs.14014.
- 25.Harbour J. W., Luo R. X., Dei Santi A., Postigo A. A., Dean D. C. Cdk phosphorylation triggers sequential intramolecular interactions that progressively block Rb functions as cells move through G1. Cell. 1999;98(6):859–869. doi: 10.1016/S0092-8674(00)81519-6.
- 26.Matsuura I., Denissova N. G., Wang G., He D., Long J., Liu F. Cyclin-dependent kinases regulate the antiproliferative function of Smads. Nature. 2004;430(6996):226–231. doi: 10.1038/nature02650.
- 27.Katayama H., Sasai K., Kawai H., et al. Phosphorylation by aurora kinase A induces Mdm2-mediated destabilization and inhibition of p53. Nature Genetics. 2004;36(1):55–62. doi: 10.1038/ng1279.
- 28.Luciani M. G., Hutchins J. R. A., Zheleva D., Hupp T. R. The C-terminal regulatory domain of p53 contains a functional docking site for cyclin A. Journal of Molecular Biology. 2000;300(3):503–518. doi: 10.1006/jmbi.2000.3830.
- 29.He J., Ding L., Jiang L., Ma L. Kernel ridge regression classification. Proceedings of the International Joint Conference on Neural Networks (IJCNN '14); July 2014; pp. 2263–2267.
- 30.Murphy K. P. Machine Learning: A Probabilistic Perspective. The MIT Press; 2012.
- 31.Dinkel H., Chica C., Via A., et al. Phospho.ELM: a database of phosphorylation sites-update 2011. Nucleic Acids Research. 2011;39(1):D261–D267. doi: 10.1093/nar/gkq1104.
- 32.Hornbeck P. V., Kornhauser J. M., Tkachev S., et al. PhosphoSitePlus: A comprehensive resource for investigating the structure and function of experimentally determined post-translational modifications in man and mouse. Nucleic Acids Research. 2012;40(1):D261–D270. doi: 10.1093/nar/gkr1122.
- 33.Xu X., Li A., Zou L., Shen Y., Fan W., Wang M. Improving the performance of protein kinase identification via high dimensional protein-protein interactions and substrate structure data. Molecular BioSystems. 2014;10(3):694–702. doi: 10.1039/c3mb70462a.
- 34.Wang M., Li C., Chen W., Wang C. Prediction of PK-specific phosphorylation site based on information entropy. Science in China, Series C: Life Sciences. 2008;51(1):12–20. doi: 10.1007/s11427-008-0012-1.
- 35.Wang M., Jiang Y., Xu X. A novel method for predicting post-translational modifications on serine and threonine sites by using site-modification network profiles. Molecular BioSystems. 2015;11(11):3092–3100. doi: 10.1039/c5mb00384a.
- 36.Li A., Xu X., Zhang H., Wang M. Kinase identification with supervised laplacian regularized least squares. PLoS ONE. 2015;10(10) doi: 10.1371/journal.pone.0139676.
- 37.Li A., Wang L., Shi Y., Wang M., Jiang Z., Feng H. Phosphorylation site prediction with a modified k-Nearest Neighbor algorithm and BLOSUM62 matrix. Proceedings of the 27th Annual International Conference of the Engineering in Medicine and Biology Society (IEEE-EMBS '05); September 2005; pp. 6075–6078.
- 38.Szklarczyk D., Franceschini A., Kuhn M., et al. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Research. 2011;39(1):D561–D568. doi: 10.1093/nar/gkq973.
- 39.Butland G., Babu M., Greenblatt J., Emili A. eSGA: E. coli Synthetic Genetic Array analysis. Protocol Exchange. 2008;5:789–795. doi: 10.1038/nprot.2008.167.
- 40.Jafari M., Nickchi P., Safari A., Tazehkand S. J., Mirzaie M. IMAN: Interlog protein network reconstruction. Matching and ANalysis. 2016.
- 41.Bass J. I. F., Diallo A., Nelson J., Soto J. M., Myers C. L., Walhout A. J. M. Using networks to measure similarity between genes: association index selection. Nature Methods. 2013;10(12):1169–1176. doi: 10.1038/nmeth.2728.
- 42.Xu X., Wang M. Inferring disease associated phosphorylation sites via random walk on multi-layer heterogeneous network. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2016;13(5):836–844. doi: 10.1109/TCBB.2015.2498548.
- 43.Giguère S., Marchand M., Laviolette F., Drouin A., Corbeil J. Learning a peptide-protein binding affinity predictor with kernel ridge regression. BMC Bioinformatics. 2013;14, article 82 doi: 10.1186/1471-2105-14-82.
- 44.Statnikov A., Henaff M., Narendra V., et al. A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome. 2013;1(1, article 11) doi: 10.1186/2049-2618-1-11.
- 45.Tamez-Pena J., Gonzalez P., Schreyer E., Totterman S. Structural biomarkers predict onset of knee pain: data from the osteoarthritis initiative. Osteoarthritis and Cartilage. 2012;20:p. S34. doi: 10.1016/j.joca.2012.02.561.
- 46.Trinh D. H., Wong M., Rocchisani J.-M., Pham C. D., Dibos F. Medical image denoising using Kernel ridge regression. Proceedings of the 18th IEEE International Conference on Image Processing (ICIP '11); 2011; pp. 1597–1600.
- 47.Saunders C., Gammerman A., Vovk V. Ridge Regression Learning Algorithm in Dual Variables. Proceedings of the Fifteenth International Conference on Machine Learning; 1998.
- 48.van Laarhoven T., Nabuurs S. B., Marchiori E. Gaussian interaction profile kernels for predicting drug-target interaction. Bioinformatics. 2011;27(21):3036–3043. doi: 10.1093/bioinformatics/btr500.
- 49.Pedregosa F., et al. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research. 2011;12:2825–2830.
- 50.Yang N. Systems and Computational Biology - Molecular and Cellular Experimental Systems. 2011.
- 51.Yang Z. R. Predicting sulfotyrosine sites using the random forest algorithm with significantly improved prediction accuracy. BMC Bioinformatics. 2009;10, article 361 doi: 10.1186/1471-2105-10-361.
- 52.Xue Y., Li A., Wang L., Feng H., Yao X. PPSP: prediction of PK-specific phosphorylation site with Bayesian decision theory. BMC Bioinformatics. 2006;7, article 163 doi: 10.1186/1471-2105-7-163.
- 53.Miller M. L., Ponten T. S., Petersen T. N., Blom N. NetPhosK - Prediction of kinase-specific phosphorylation from sequence and sequence-derived features. FEBS Journal. 2005:272–111.
- 54.Leng X., Noble M., Adams P. D., Qin J., Harper J. W. Reversal of growth suppression by p107 via direct phosphorylation by cyclin D1/cyclin-dependent kinase 4. Molecular and Cellular Biology. 2002;22(7):2242–2254. doi: 10.1128/MCB.22.7.2242-2254.2002.
- 55.Burian S. J., Durrans S. R., Nix S. J., Pitt R. E. Training artificial neural networks to perform rainfall disaggregation. Journal of Hydrologic Engineering. 2001;6(1):43–51. doi: 10.1061/(asce)1084-0699(2001)6:1(43).
- 56.Hughes E. J., Lewis M., Alabaster C. M., Soldani L. F. Automatic target recognition: Problems of data separability and decision making. Proceedings of the IET Seminar on High Resolution Imaging and Target Classification; November 2005; pp. 29–37.
- 57.Licata L., Briganti L., Peluso D., et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Research. 2012;40(1):D857–D861. doi: 10.1093/nar/gkr930.
- 58.Brown K. R., Jurisica I. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biology. 2007;8(5, article R95) doi: 10.1186/gb-2007-8-5-r95.