predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

Sabit Ahmed; Afrida Rahman; Md Al Mehedi Hasan; Md Khaled Ben Islam; Julia Rahman; Shamim Ahmad

doi:10.1371/journal.pone.0249396

PLoS One. 2021 Apr 1;16(4):e0249396. doi: 10.1371/journal.pone.0249396

Editor: Ozlem Keskin

PMCID: PMC8016359 PMID: 33793659

Abstract

Post-translational modification (PTM) involves covalent modification after the biosynthesis process and plays an essential role in the study of cell biology. Lysine phosphoglycerylation is a newly discovered reversible type of PTM that affects glycolytic enzyme activities and is responsible for a wide variety of diseases, such as heart failure, arthritis, and degeneration of the nervous system. Our goal is to computationally characterize potential phosphoglycerylation sites in order to understand their functionality and causality more accurately. In this study, a novel computational tool, referred to as predPhogly-Site, has been developed to predict phosphoglycerylation sites in proteins. It utilizes the probabilistic sequence-coupling information among the amino acid residues near phosphoglycerylation sites, along with a variable cost adjustment for the skewed training dataset, to enhance prediction performance. It has achieved around 99% accuracy with more than 0.96 MCC and 0.97 AUC in both 10-fold cross-validation and an independent test. Moreover, the standard deviation in 10-fold cross-validation is almost negligible. This performance indicates that predPhogly-Site remarkably outperforms the existing prediction tools and can be used as a promising predictor, preferably through its web interface at http://103.99.176.239/predPhogly-Site.

Introduction

Post-translational modifications (PTMs) refer to specific events after the translation stage, in which specific functional groups are covalently added to a protein [1]. These modifications have enormous impacts on biological processes and proteomic analysis, including cellular signal transduction, subcellular localization, protein folding, and protein degradation, and are also responsible for various kinds of diseases [2]. Therefore, accurate identification and effective comprehension of PTM sites are significant for basic research in disease detection, prevention, and drug development [3]. Among the 20 standard amino acid residues of cellular proteins, modifications at the lysine residue (K) are commonly known as lysine PTMs or K-PTMs. According to the literature, lysine residues can undergo several such covalent modifications, including acetylation, crotonylation, ubiquitination, phosphoglycerylation, glycation, methylation, butyrylation, succinylation, and biotinylation [4–8].

Lysine phosphoglycerylation is a reversible post-translational modification, newly discovered in mouse liver and human cells [8, 9]. 3-Phosphoglyceryl-lysine (pgK) forms when the primary glycolytic intermediate 1,3-bisphosphoglycerate (1,3-BPG) reacts with particular lysine residues [8, 10]. A wide variety of diseases, including heart failure, arthritis, and various types of neurodegenerative disorders, can be caused by phosphoglycerylation. Metabolic labeling with substantial glucose indicates that it can be derived from glucose metabolism [9]. It has significant effects on glycolytic enzyme activities and can accumulate in cells with high glucose exposure [11]. A potential feedback mechanism that contributes to the creation and redirection of glycolytic intermediates to specific biosynthetic pathways has also been established [8–11]. Given the crucial role of phosphoglycerylation in such biological processes, an effective way to characterize its functional aspects is to identify phosphoglycerylation sites with higher efficacy. Although high-throughput experimental procedures for characterizing phosphoglycerylation sites are known to achieve higher accuracy, computational methods are gaining popularity as an effective alternative because they save labor, time, and cost.

Recent studies on identifying phosphoglycerylation sites have introduced several computational tools, such as Phogly-PseAAC [9], CKSAAP_PhoglySite [8], iPGK-PseAAC [12], and Bigram-PGK [11]. The first applies a KNN-based predictor to pseudo amino acid features [9], whereas the second implements a fuzzy SVM-based predictor on a feature set of k-spaced amino acid pairs [8]. iPGK-PseAAC uses the pairwise coupling technique with an SVM classifier [11, 12]. The most recently developed predictor, Bigram-PGK, employs an SVM with evolutionary information of the sequences for performance improvement [11]. Among these four predictors, only Bigram-PGK can predict phosphoglycerylation sites with an AUC higher than 0.90. However, its overall performance needs further improvement on the other metrics before it can serve as a complementary technique for phosphoglycerylation site identification.

Constructing an efficient predictor requires extracting informative patterns associated with phosphoglycerylation. In this study, we introduce a novel computational tool, predPhogly-Site, for predicting phosphoglycerylation sites by blending vectorized sequence-coupling information with PseAAC [3, 13–16]. After generating the necessary features from the protein sequences adopted from Bigram-PGK [11], a cost-sensitive SVM classifier [14, 17–19] is used to predict phosphoglycerylation sites while mitigating the class imbalance in the benchmark dataset. The workflow of the proposed predictor is shown in Fig 1. To validate the statistical significance of the results, 10-fold cross-validation was repeated ten times, and the average of each evaluation metric is reported in the Results section. The proposed predictor, predPhogly-Site, achieved better prediction performance than all the existing predictors. Its specificity, sensitivity, precision, accuracy, MCC, and AUC are 99.97%, 100%, 99.20%, 99.97%, 99.58%, and 99.99%, respectively. These promising results indicate that predPhogly-Site can be used as a high-throughput supporting tool for phosphoglycerylation site prediction.

As highlighted in a series of recently published predictors [3, 6, 14, 19–23], developing an efficient predictor in computational biology should follow Chou’s five-step guidelines [14, 24, 25]: i) generating an acceptable benchmark dataset for training and testing the system, ii) formulating the sequences using proper mathematical representations, iii) developing a prediction approach or introducing a robust prediction algorithm, iv) conducting rigorous cross-validation tests to evaluate predictive accuracy, and v) providing an accessible and easy-to-use web-server. Following these steps, the materials, methods, results, and analysis are discussed in the sections below.

Materials and methods

Dataset

In this study, verified annotations of phosphoglycerylation sites were obtained from CPLM version 2.0 [26], a reliable repository of post-translational modifications at lysine residues, and the corresponding protein sequences were retrieved from the UniProt knowledge-base [27] for developing the prediction model. Subsequently, redundant sequences were discarded at a 40% similarity cutoff using CD-HIT [28] to avoid bias in performance evaluation, as this level of redundancy removal is widely accepted [11, 24, 29, 30]. As a result, a total of 91 non-redundant proteins were retained for constructing the benchmark dataset. These proteins contained 111 experimentally annotated phosphoglycerylated sites and 3249 non-phosphoglycerylated sites, identical to the dataset of the most recent predictor, Bigram-PGK [11] (see Table 1). The benchmark dataset, containing the protein sequences and site positions, is given in S1 File. An overview of the dataset preparation as part of the prediction model development is presented in Fig 1. To verify that the positive and negative sites in the obtained dataset differ significantly, the distributions of amino acid residues around the phosphoglycerylated and non-phosphoglycerylated sites were analyzed visually with WebLogo [31] (see Fig 2A and 2B).

Table 1. Summary of the non-redundant phosphoglycerylation dataset.

Similarity threshold	No. of non-redundant proteins	Phosphoglycerylated sites	Non-phosphoglycerylated sites
40%	91	111	3249

To demonstrate the viability of the proposed predictor predPhogly-Site on new proteins, an independent test set was constructed from recent phosphoglycerylation sites that were entirely absent from the benchmark dataset used for model development. Protein sequences with recent phosphoglycerylation sites were collected from the PLMD database [32] (version 3.0), an upgraded version of the CPLM database [26] released nearly three years later with many newly discovered PTM sites. To ensure that no training proteins appeared in the independent test set, we considered only those proteins with verified phosphoglycerylation sites that were added to the PLMD repository well after the creation of the benchmark dataset. In this way, we obtained 33 proteins with 41 phosphoglycerylated sites and 1334 non-phosphoglycerylated sites for the independent test (available as S2 File). Furthermore, the absence of the test sites from the benchmark dataset was verified manually to avoid accidental bias in performance benchmarking.

Feature construction

To formulate the phosphoglycerylation site sequences more meticulously and comprehensively, Chou’s scheme [9, 13, 33] was adopted. According to this scheme, a potential phosphoglycerylation site containing sequence fragment could be expressed as:

\begin{matrix} Θ_{ζ} (K) = Q_{1} Q_{2} \dots Q_{ζ - 1} Q_{ζ} K Q_{ζ + 1} Q_{ζ + 2} \dots Q_{2 ζ - 1} Q_{2 ζ} \end{matrix}

(1)

where Q₁ to Q_ζ denote the upstream (leftward) residues and Q_ζ+1 to Q_2ζ denote the downstream (rightward) amino acid residues, ζ is an integer, and the central ‘K’ indicates lysine [14]. Furthermore, the peptide sequences Θ_ζ(K) can be categorized into two types, $Θ_{ζ}^{+} (K)$ and $Θ_{ζ}^{-} (K)$, where the former denotes a phosphoglycerylated peptide and the latter a non-phosphoglycerylated peptide with a lysine residue at its center [9, 14]. The sliding window method [9] was adopted to segment the protein sequences with different window sizes, where ζ = 1, 2, 3, …, 32. Based on the MCC value, the window size was selected as (2ζ + 1) = 29 with ζ = 14 (i.e. 14 upstream and 14 downstream amino acid residues). It should be mentioned that only window sizes less than 65 were considered because of constraints on protein sequence length [11]. With a sequence fragment of window size 29, Eq (1) can be expressed as:

\begin{matrix} Θ (K) = Q_{1} Q_{2} \dots Q_{13} Q_{14} K Q_{15} Q_{16} \dots Q_{27} Q_{28} \end{matrix}

(2)

During segmentation, to make all site sequences equal in length, missing amino acids at the sequence termini were filled with the dummy residue ‘X’ [9, 34]. As a result, the phosphoglycerylation dataset took the following form:

\begin{matrix} S_{ζ} (K) = S_{ζ}^{+} (K) \cup S_{ζ}^{-} (K) \end{matrix}

(3)

where the positive subset $S_{ζ}^{+} (K)$ could contain only $Θ_{ζ}^{+} (K)$ samples, while the negative subset $S_{ζ}^{-} (K)$ could contain only $Θ_{ζ}^{-} (K)$ samples with their center residue K. All the segmented sequences with the expression of Eqs (2) and (3) are provided in S1 File.
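The segmentation expressed by Eqs (2) and (3) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the toy sequence, and the annotated position are all hypothetical, and only the window size (ζ = 14) and the ‘X’ padding follow the description above.

```python
from typing import List, Set, Tuple

ZETA = 14   # residues on each side of the central lysine (window = 2*ZETA + 1 = 29)
PAD = "X"   # dummy residue used where the window runs past a sequence terminus


def extract_lysine_windows(sequence: str, zeta: int = ZETA) -> List[Tuple[int, str]]:
    """Return (1-based position, peptide) for every lysine in `sequence`."""
    seq = sequence.upper()
    padded = PAD * zeta + seq + PAD * zeta
    windows = []
    for i, residue in enumerate(seq):
        if residue == "K":
            # The residue at index i of `seq` sits at index i + zeta of `padded`,
            # so the slice [i, i + 2*zeta + 1) is centred on it.
            windows.append((i + 1, padded[i:i + 2 * zeta + 1]))
    return windows


def split_by_label(windows: List[Tuple[int, str]],
                   positive_positions: Set[int]) -> Tuple[List[str], List[str]]:
    """Split the peptides into the positive and negative subsets of Eq (3)."""
    positive = [pep for pos, pep in windows if pos in positive_positions]
    negative = [pep for pos, pep in windows if pos not in positive_positions]
    return positive, negative


# Hypothetical toy protein; the annotated site (position 5) is invented for illustration.
toy_sequence = "MAGAKLLSVKT"
toy_windows = extract_lysine_windows(toy_sequence)
toy_positive, toy_negative = split_by_label(toy_windows, positive_positions={5})
```

Padding the whole sequence once, instead of handling each terminus separately, keeps every peptide at exactly 29 residues with the lysine at index 14.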

For extracting pertinent features hidden in amino acid sequences, different sequence encoding methods such as amino acid composition, pseudo amino acid composition were used initially. However, in the proposed predictor predPhogly-Site, the vectorized sequence-coupled model [3, 14–16, 35] has been incorporated into general PseAAC [3, 14, 33, 35–39] to extract features from the phosphoglycerylation sites conserving the sequence pattern information. According to this conception, the peptide sample in Eq (2) can be expressed as:

\begin{matrix} Θ (K) = Θ^{+} (K) - Θ^{-} (K) \end{matrix}

(4)

where,

\begin{matrix} Θ^{+} (K) = \begin{bmatrix} Θ^{+} (Q_{1} | Q_{2}) \\ Θ^{+} (Q_{2} | Q_{3}) \\ ⋮ \\ Θ^{+} (Q_{13} | Q_{14}) \\ Θ^{+} (Q_{14}) \\ Θ^{+} (Q_{15}) \\ Θ^{+} (Q_{16} | Q_{15}) \\ ⋮ \\ Θ^{+} (Q_{27} | Q_{26}) \\ Θ^{+} (Q_{28} | Q_{27}) \end{bmatrix}, & Θ^{-} (K) = \begin{bmatrix} Θ^{-} (Q_{1} | Q_{2}) \\ Θ^{-} (Q_{2} | Q_{3}) \\ ⋮ \\ Θ^{-} (Q_{13} | Q_{14}) \\ Θ^{-} (Q_{14}) \\ Θ^{-} (Q_{15}) \\ Θ^{-} (Q_{16} | Q_{15}) \\ ⋮ \\ Θ^{-} (Q_{27} | Q_{26}) \\ Θ^{-} (Q_{28} | Q_{27}) \end{bmatrix} \end{matrix}

(5)

where Θ⁺(Q₁|Q₂) denotes the conditional probability of amino acid Q₁ at the leftmost position given that its adjacent right neighbor is Q₂, and the same applies to the remaining leftward positions [24]. Similarly, Θ⁺(Q₂₈|Q₂₇) denotes the conditional probability of amino acid Q₂₈ at the rightmost position given that its adjacent left neighbor is Q₂₇, and so forth. In contrast, Θ⁺(Q₁₄) and Θ⁺(Q₁₅) are non-conditional probabilities, because K is the adjoining neighbor of both Q₁₄ and Q₁₅ [3, 6, 14, 15, 24]. To calculate Θ⁺(Q₁₄) and Θ⁺(Q₁₅), we first count the occurrences of the given amino acid at positions 14 and 15, respectively, in the set of phosphoglycerylated peptides [15]. Each count is then divided by the total number of amino acids occurring at the corresponding position. Accordingly, Θ⁻(K) in Eq (5) and its probabilistic components can be derived in the same way from the set of non-phosphoglycerylated peptides. Several studies on the vectorized sequence-coupling model [3, 13, 15, 16] provide a fuller description of how these probabilities are calculated from a dataset. Finally, a 28-dimensional feature vector was obtained using Eqs (4) and (5) for each potential phosphoglycerylated and non-phosphoglycerylated sample.
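A minimal Python sketch of the encoding in Eqs (4) and (5) follows. It assumes that each conditional probability is estimated as a relative frequency (the number of peptides showing the residue pair at the two positions, divided by the number showing the conditioning neighbor) and that an unseen neighbor yields 0; the text above does not spell out these estimation details, and the class name, function names, and toy peptides are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Sequence

ZETA = 14  # flanking residues per side; the central K sits at index ZETA of a peptide


def flanks(peptide: str) -> str:
    """Drop the central lysine, leaving Q1..Q28 as a 28-character string."""
    return peptide[:ZETA] + peptide[ZETA + 1:]


def partner_index(i: int) -> Optional[int]:
    """0-based index of the conditioning neighbour of position i, or None.

    Q1..Q13 are conditioned on their right neighbour, Q16..Q28 on their left
    neighbour, and Q14 and Q15 (adjacent to K) are non-conditional.
    """
    if i < ZETA - 1:
        return i + 1
    if i > ZETA:
        return i - 1
    return None


class CouplingModel:
    """Relative-frequency estimates from one subset (positive or negative)."""

    def __init__(self, peptides: Sequence[str]) -> None:
        self.total = len(peptides)
        self.single = [Counter() for _ in range(2 * ZETA)]
        self.pair = [Counter() for _ in range(2 * ZETA)]
        for peptide in peptides:
            q = flanks(peptide)
            for i, residue in enumerate(q):
                self.single[i][residue] += 1
                j = partner_index(i)
                if j is not None:
                    self.pair[i][(residue, q[j])] += 1

    def prob(self, q: str, i: int) -> float:
        j = partner_index(i)
        if j is None:  # non-conditional probability at positions 14 and 15
            return self.single[i][q[i]] / self.total if self.total else 0.0
        seen = self.single[j][q[j]]  # how often the neighbour occurs at position j
        return self.pair[i][(q[i], q[j])] / seen if seen else 0.0


def encode(peptide: str, pos: CouplingModel, neg: CouplingModel) -> List[float]:
    """28-dimensional vector: positive-subset minus negative-subset probabilities."""
    q = flanks(peptide)
    return [pos.prob(q, i) - neg.prob(q, i) for i in range(2 * ZETA)]


# Hypothetical toy peptides (29 residues each), invented purely for illustration.
toy_positive = ["A" * 14 + "K" + "A" * 14, "A" * 14 + "K" + "C" * 14]
toy_negative = ["C" * 14 + "K" + "C" * 14]
pos_model, neg_model = CouplingModel(toy_positive), CouplingModel(toy_negative)
toy_vector = encode(toy_positive[0], pos_model, neg_model)
```

Because every component is a difference of two probabilities, each feature lies in [−1, 1], with positive values marking residue contexts that are more typical of phosphoglycerylated peptides.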

For better visualization of the sequence-coupling effects at different positions of a sample, we stored all possible conditional probability values extracted from the positive subset, i.e. Θ⁺(Q₁|Q₂) to Θ⁺(Q₁₃|Q₁₄) and Θ⁺(Q₁₆|Q₁₅) to Θ⁺(Q₂₈|Q₂₇), in one data frame (available in S3 File) and the non-conditional probability values for each amino acid residue extracted from the positive subset, i.e. Θ⁺(Q₁₄) and Θ⁺(Q₁₅), in another data frame (available in S4 File) using the Pandas library [40], where the columns represent the sample positions and the rows represent the amino acid residues. It should be mentioned that, for each position of a formulated sample, there can be 21 × 21 = 441 possible conditional probability values (including the dummy residue ′X′) and 21 non-conditional probability values [15]. Similarly, the conditional and non-conditional probability values extracted from the negative subset are stored in two separate data frames and provided in S3 and S4 Files, respectively. Fig 3A depicts, for the positive subset, the conditional probability values of amino acid residue ′A′ given that its right neighbor is any of the 21 amino acid residues at sample positions 1 to 13, and the conditional probability values of any of the 21 amino acid residues given that the left neighbor is ′A′ at sample positions 16 to 28. Fig 3B depicts the same quantities calculated from the negative subset. The non-conditional probability values of the 21 amino acid residues at sample positions 14 and 15 are illustrated in Fig 4A for the positive subset and in Fig 4B for the negative subset.

Prediction method and addressing data imbalance

The phosphoglycerylation site prediction problem defined in the previous section is a classification problem. Among the statistical learning algorithms that are widely used for developing bioinformatic prediction models, such as k-nearest neighbor [41] and random forest [42], the support vector machine (SVM) [43, 44] is one of the most dominant and successful [24, 45]. However, its structural risk minimization is prone to bias, because the majority class [24, 46] dominates the classification weight. As the set of phosphoglycerylation peptides was highly skewed (i.e. the ratio between positive and negative peptides was approximately 1:29), this could directly affect the training of the classification model. Inspired by the success of biasing the internal decision function during training, as highlighted in recent research [8, 14, 17, 19], different penalty costs C⁺ and C⁻ were assigned to phosphoglycerylated and non-phosphoglycerylated sites, respectively, to address the imbalance. Therefore, a cost-sensitive SVM was applied as the core learning algorithm for prediction model development, which can be formulated as:

\begin{matrix} \min_{w, ξ} \frac{1}{2} {‖ w ‖}^{2} + C^{+} \sum_{k = 1}^{q} ξ_{k} + C^{-} \sum_{k = q + 1}^{n} ξ_{k} \end{matrix}

(6)

(Subject to: Y_k(w·φ(X_k) + a) ≥ 1 − ξ_k for all k = 1, 2, …, n)

where the training set is denoted by {(X_k, Y_k), k = 1, 2, …, n}; the first q samples are taken to be the positive samples (i.e. Y_k = 1, k = 1, 2, …, q) and the rest the negative samples (i.e. Y_k = −1, k = q + 1, q + 2, …, n). The non-linear feature mapping and the slack variables are denoted by φ(X) and ξ_k (k = 1, 2, …, n), respectively [45, 47]. In our experiments, the Gaussian RBF was adopted as the kernel function, which can be described as Υ(X_k, X_j) = φ(X_k)^T φ(X_j) = exp(−γ‖X_k − X_j‖²), where γ > 0. To separate positive and negative samples effectively despite the class imbalance, the misclassification costs $C^{+} = \frac{C * n}{2 * q}$ and $C^{-} = \frac{C * n}{2 * (n - q)}$ were assigned to phosphoglycerylated and non-phosphoglycerylated sites, respectively.
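As a worked illustration of these costs, the following sketch evaluates C⁺ and C⁻ for the benchmark dataset (q = 111 positive and n − q = 3249 negative samples) together with the RBF kernel. The base cost C = 1 and the function names are assumptions made for the example; the snippet reproduces only the arithmetic, not the SVM training itself.

```python
import math
from typing import Sequence, Tuple


def class_costs(C: float, n: int, q: int) -> Tuple[float, float]:
    """Return (C_plus, C_minus) for n samples, of which q are positive."""
    c_plus = C * n / (2 * q)
    c_minus = C * n / (2 * (n - q))
    return c_plus, c_minus


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Gaussian RBF kernel exp(-gamma * ||x - y||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


q = 111            # phosphoglycerylated sites in the benchmark dataset (Table 1)
n = 111 + 3249     # all benchmark samples
c_plus, c_minus = class_costs(C=1.0, n=n, q=q)   # C = 1 is an assumed base cost

# Total penalty budget carried by each class: C * n / 2 for both.
positive_budget = c_plus * q
negative_budget = c_minus * (n - q)
```

With C = 1 this gives C⁺ ≈ 15.14 and C⁻ ≈ 0.52, so a misclassified positive sample is penalized about 29 times more heavily than a misclassified negative one, and the two classes carry equal total weight. The same weights, n / (2 × class count), are what scikit-learn computes for `class_weight='balanced'` in a two-class problem, although the text does not state which SVM library was used.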

Formulation of evaluation metrics

To objectively assess the prediction performance of predPhogly-Site, we used five widely used statistical metrics: accuracy (ACC), sensitivity (Sn), specificity (Sp), precision (Pre), and the Matthews correlation coefficient (MCC) [20, 24, 30, 45, 47–52]. These metrics can be defined in terms of the true positive (TP), false positive (FP), true negative (TN), and false negative (FN) predictions made by the predictor as follows:

\begin{matrix} Sn = \frac{TP}{TP + FN} \\ Sp = \frac{TN}{TN + FP} \\ Precision = \frac{TP}{TP + FP} \\ ACC = \frac{TP + TN}{TP + TN + FP + FN} \\ MCC = \frac{(TP \times TN) - (FP \times FN)}{\sqrt{(TP + FP) (TP + FN) (TN + FP) (TN + FN)}} \end{matrix}

(7)

To the best of our knowledge, the state-of-the-art phosphoglycerylation site predictors [8, 9, 11, 12] have also estimated their performance with these metrics. Thus, performance assessment using these metrics was essential for a fair comparative benchmarking. In addition, we considered the area under the ROC curve (AUC) [24, 53] alongside MCC to illustrate the stability and robustness of the prediction model.
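The metrics of Eq (7) can be computed directly from the four confusion-matrix counts, as in the sketch below. The counts used in the example (TP = 39, FN = 2, FP = 1, TN = 1333) are not stated in the paper; they are an inference, namely the integer counts that appear consistent with the predPhogly-Site row of Table 5 for the 41 positive and 1334 negative independent test sites. The function name is hypothetical.

```python
import math
from typing import Dict


def evaluate(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute Sn, Sp, Pre, ACC, and MCC as defined in Eq (7)."""

    def ratio(numerator: float, denominator: float) -> float:
        # Return 0.0 instead of raising when a denominator is empty.
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "Sn": ratio(tp, tp + fn),
        "Sp": ratio(tn, tn + fp),
        "Pre": ratio(tp, tp + fp),
        "ACC": ratio(tp + tn, tp + tn + fp + fn),
        "MCC": ratio(tp * tn - fp * fn, mcc_denominator),
    }


# Inferred (assumed) counts for the independent test: 41 positives, 1334 negatives.
independent = evaluate(tp=39, fp=1, tn=1333, fn=2)
rounded = {name: round(value, 4) for name, value in independent.items()}
```

Under these assumed counts the rounded values are Sn = 0.9512, Sp = 0.9993, Pre = 0.975, ACC = 0.9978, and MCC = 0.9619, in agreement with Table 5.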

Validation of the proposed model

To evaluate the statistical significance of a novel predictor’s anticipated performance, three validation schemes are widely used: k-fold cross-validation, the jackknife test, and the independent test [14, 24]. Although the jackknife test always yields a unique result for a given dataset and is therefore highly desirable, researchers prefer k-fold cross-validation for validating their PTM prediction models because it reduces the computational cost of model development [8, 45]. Moreover, the existing phosphoglycerylation site predictors, except Phogly-PseAAC [9], validated their anticipated accuracy using k-fold cross-validation. Even the most recent predictor, Bigram-PGK [11], was validated using 10-fold cross-validation and compared with the existing predictors on that basis. Therefore, 10-fold cross-validation was adopted to develop and validate our proposed predictor predPhogly-Site. However, as 10-fold cross-validation involves some arbitrariness, as highlighted in [9, 24], it was executed 10 times to validate the stability. To find the best-performing predictor, a set of prediction models was generated for the hyperparameters C and γ within the grid C = {2⁰, 2¹, 2², …, 2⁸} and γ = {2⁻¹, 2⁻², 2⁻³, …, 2⁻⁸}. Using 10-fold cross-validation with 10 repeats, the model with the optimal hyperparameters C and γ was selected according to its AUC (see Table 2).
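The hyperparameter grid described above contains 9 × 8 = 72 candidate (C, γ) pairs. The sketch below enumerates the grid and selects the pair with the highest AUC; the AUC values in the example are placeholders invented for illustration, and the function name is hypothetical.

```python
from itertools import product
from typing import Dict, Tuple

C_GRID = [2.0 ** e for e in range(0, 9)]        # 2^0, 2^1, ..., 2^8
GAMMA_GRID = [2.0 ** -e for e in range(1, 9)]   # 2^-1, 2^-2, ..., 2^-8
GRID = list(product(C_GRID, GAMMA_GRID))        # every (C, gamma) combination


def select_best(auc_by_setting: Dict[Tuple[float, float], float]) -> Tuple[float, float]:
    """Return the (C, gamma) pair with the highest AUC (first one wins on ties)."""
    return max(auc_by_setting, key=auc_by_setting.get)


# Placeholder AUC values, NOT results from the paper: every setting scores 0.5
# except one, so that the selection logic can be demonstrated.
placeholder_auc = {setting: 0.5 for setting in GRID}
placeholder_auc[(2.0 ** 0, 2.0 ** -2)] = 0.9
best_C, best_gamma = select_best(placeholder_auc)
```

In practice each grid point would be scored by the cross-validated AUC of an SVM trained with that setting, which is the step this sketch leaves out.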

Table 2. Selected parameters of 10-fold cross validation (10 iterations).

Iteration	1^st	2^nd	3^rd	4^th	5^th
C	2⁰	2⁰	2⁰	2⁰	2⁰
γ	2⁻¹	2⁻²	2⁻²	2⁻²	2⁻²
Iteration	6^th	7^th	8^th	9^th	10^th
C	2¹	2²	2²	2⁰	2⁰
γ	2⁻¹	2⁻²	2⁻²	2⁻²	2⁻²

The 10-iterations of 10-fold cross-validation were performed according to the following steps:

Step 1: Extract the sequence-coupled features from the segmented sequences provided in S1 File using Eqs (4) and (5).
Step 2: Divide the extracted dataset randomly into 10 disjoint sets.
Step 3: Select 1 set as test set and utilize the remaining 9 sets as training set.
Step 4: Train the RBF kernel based SVM predictor with the training set using the optimal hyperparameters (C, γ) of the respective iteration (see Table 2).
Step 5: Perform prediction on the test set.
Step 6: Repeat steps 3 to 5 until all 10 sets have been used for testing.
Step 7: Merge the prediction outputs and measure the performance with Eq 7.
Step 8: Repeat steps 1 to 7 10 times.
Step 9: Measure the average performance of 10 repetitions with corresponding standard deviations.
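
The repeated cross-validation loop in Steps 2 to 9 can be sketched as follows. This is an illustrative skeleton under stated assumptions, not the authors' code: feature extraction (Step 1) is assumed to have been done already, the classifier is passed in as a hypothetical `fit_predict` callable, and the demonstration uses a simple nearest-centroid stand-in on invented one-dimensional data instead of the RBF-kernel SVM.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int, rng: random.Random) -> List[List[int]]:
    """Step 2: shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i::k] for i in range(k)]


def repeated_cross_validation(X: Sequence, y: Sequence[int], fit_predict: Callable,
                              score: Callable, k: int = 10, repeats: int = 10,
                              seed: int = 0) -> Tuple[float, float]:
    """Steps 2-9: return the mean and standard deviation of `score` over repeats."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):                            # Step 8
        merged = [None] * len(y)
        for test_idx in k_fold_indices(len(y), k, rng):  # Steps 3 and 6
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            predictions = fit_predict([X[i] for i in train_idx],
                                      [y[i] for i in train_idx],
                                      [X[i] for i in test_idx])  # Steps 4 and 5
            for i, label in zip(test_idx, predictions):
                merged[i] = label                        # Step 7: merge outputs
        scores.append(score(y, merged))
    return statistics.mean(scores), statistics.stdev(scores)  # Step 9


def nearest_centroid(train_X, train_y, test_X):
    """Stand-in classifier (NOT the paper's SVM): label by the nearer class mean."""
    mu_pos = statistics.mean(x[0] for x, t in zip(train_X, train_y) if t == 1)
    mu_neg = statistics.mean(x[0] for x, t in zip(train_X, train_y) if t == -1)
    return [1 if abs(x[0] - mu_pos) <= abs(x[0] - mu_neg) else -1 for x in test_X]


def accuracy(truth, predicted):
    return sum(t == p for t, p in zip(truth, predicted)) / len(truth)


# Invented toy data: 40 one-dimensional samples, the upper half labelled positive.
toy_X = [[float(i)] for i in range(40)]
toy_y = [1 if i >= 20 else -1 for i in range(40)]
mean_acc, std_acc = repeated_cross_validation(toy_X, toy_y, nearest_centroid, accuracy)
```

Merging the fold predictions before scoring, as in Step 7, yields one metric value per repetition, so the reported standard deviation reflects variation across the 10 random partitions.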

The predictive decision-making workflow of predPhogly-Site is available at https://github.com/Sabit-Ahmed/predPhogly-Site as a git repository. For additional validation, an independent test was performed on a set of recent phosphoglycerylation sites. It will be discussed thoroughly in the next section.

Results and discussions

Performance of predPhogly-Site

In this work, we employed an SVM with variable cost adjustments [14, 19, 24] to suppress the imbalance between phosphoglycerylated and non-phosphoglycerylated sites. The radial basis kernel function [14, 22, 24] was used to separate the samples by transforming them to a higher-dimensional feature space. The average results of the considered performance measures, with their standard deviations over 10 repeats, are presented in Table 3. As shown in Table 3, the proposed prediction model predicted phosphoglycerylation sites with 99.97% accuracy. In addition, its sensitivity, specificity, MCC, and AUC all exceeded 99%. Moreover, the standard deviations were almost negligible for all the measures. For constructing the proposed predictor predPhogly-Site to be deployed as a web service, the benchmark dataset and the hyperparameters with the highest AUC in the 10 repetitions (i.e. C = 2⁰ and γ = 2⁻²) were used. An overview of establishing predPhogly-Site is depicted in Fig 1.

Table 3. Cross-validation performance of predPhogly-Site on the benchmark dataset.

Predictor	Sp	Sn	Pre	ACC	MCC	AUC
predPhogly-Site	0.9997 ± 0.0001	1.00 ± 0.00	0.9920 ± 0.0027	0.9997 ± 0.0001	0.9958 ± 0.0014	0.9999 ± 0.00

Comparative analysis of cross-validation performance

To evaluate the effectiveness of the proposed predictor, predPhogly-Site, we compared it with four state-of-the-art phosphoglycerylation site predictors: Phogly-PseAAC [9], CKSAAP_PhoglySite [8], iPGK-PseAAC [12], and Bigram-PGK [11]. Among these predictors, the first three, i.e. Phogly-PseAAC, CKSAAP_PhoglySite, and iPGK-PseAAC, were benchmarked on the same phosphoglycerylation site dataset, which was prepared by Xu et al. [9]. Predictions from Phogly-PseAAC and iPGK-PseAAC can be obtained through their web interfaces, and those from CKSAAP_PhoglySite through its Matlab interface, but the most recent predictor, Bigram-PGK, offers no such access. However, the Bigram-PGK study had collected prediction results from these accessible predictors for its benchmark dataset and reported comparative outcomes for all the considered performance metrics. Thus, to conduct a fair comparison with all these predictors, our primary benchmark dataset, which was not resampled as Bigram-PGK’s was, was submitted to the web servers of Phogly-PseAAC and iPGK-PseAAC to obtain their predictions, while CKSAAP_PhoglySite’s predictions were obtained through its Matlab interface. The performance of Phogly-PseAAC, CKSAAP_PhoglySite, and iPGK-PseAAC on the benchmark dataset constructed for this study was then measured on the same validation set used for evaluating predPhogly-Site (see Section “Validation of the proposed model”). As we adopted a different technique for handling the data imbalance and could not obtain predictions from Bigram-PGK on our benchmark dataset, Table 4 also includes the measures reported in Bigram-PGK’s experimental findings [11]. As shown in Table 4 and Fig 5, predPhogly-Site achieved a significant improvement over Phogly-PseAAC, CKSAAP_PhoglySite, and iPGK-PseAAC on the same benchmark dataset used in this study. It remarkably outperformed these predictors in sensitivity, specificity, overall accuracy, and AUC. For instance, predPhogly-Site exceeded 99% in sensitivity, specificity, precision, overall accuracy, MCC, and AUC.

Table 4. Cross-validation performance of the existing prediction systems.

Predictor	Sp	Sn	Pre	ACC	MCC	AUC
iPGK-PseAAC	0.9846	0.4595	0.5050	0.9673	0.4648	0.7220
iPGK-PseAAC^*	0.9864	0.4555	0.9548	0.8119	0.5692	0.7230
CKSAAP_PhoglySite	0.8941	0.8288	0.2110	0.8920	0.3845	0.8615
CKSAAP_PhoglySite^*	0.9420	0.8285	0.8765	0.9043	0.7818	0.8854
Phogly-PseAAC	0.7064	0.6937	0.0747	0.7060	0.1550	0.7000
Phogly-PseAAC^*	0.7193	0.6927	0.5518	0.7102	0.3951	0.7062
Bigram-PGK^*	0.8973	0.9642	0.8253	0.9193	0.8330	0.9306
predPhogly-Site	0.9997	1.00	0.9920	0.9997	0.9958	0.9999

* Corresponds to the experimental findings reported by the Bigram-PGK study [11].

However, the performance of the most recent predictor, Bigram-PGK [11], was relatively higher than that of the other existing predictors on most of the metrics. It obtained a sensitivity of 96.42%, an accuracy of 91.93%, an MCC of 83.30%, and an AUC of 93.06% on the dataset used in the Bigram-PGK study [11]. As demonstrated in Table 4, our proposed predictor predPhogly-Site outperformed Bigram-PGK [11] by 3.58 percentage points in sensitivity, 8.04 in accuracy, 16.28 in MCC, and 6.93 in AUC. Furthermore, the effectiveness of predPhogly-Site over the recent predictors, including Bigram-PGK [11], is demonstrated in Fig 5.

It can be observed that iPGK-PseAAC [12] obtained a comparatively high specificity and precision of 98.64% and 95.48%, respectively, on Bigram-PGK’s [11] resampled dataset. Our proposed predictor, predPhogly-Site, improved on these by 1.33 and 3.72 percentage points in specificity and precision, respectively. The results in both Table 4 and Fig 5 indicate that predPhogly-Site can identify phosphoglycerylation sites more effectively than any of the existing predictors.

It is worth mentioning that, among these predictors, Phogly-PseAAC [9] employs the position-specific amino acid propensity, which reflects the position-wise occurrence frequency of each amino acid, together with the K-Nearest Neighbor (KNN) algorithm for prediction; CKSAAP_PhoglySite [8] uses the composition of k-spaced amino acid pairs with a fuzzy SVM; iPGK-PseAAC [12] applies the pairwise coupling technique with a posterior probability-based SVM; and Bigram-PGK [11] uses an SVM engine with a combination of the position-specific scoring matrix and profile bigrams for performance improvement.

It is worth considering why our proposed predictor predPhogly-Site achieved such superior performance. One reason is the effective representation of the phosphoglycerylation modification through the sequence-coupling model among the amino acid residues via conditional probabilities (see Figs 3 and 4). Suppressing the imbalance between phosphoglycerylated and non-phosphoglycerylated sites with an SVM that uses different error costs also boosted the performance.

Precision measures how believable a system is when it reports that a peptide sample is phosphoglycerylated. According to Eq (7), precision depends strongly on the number of false positives, and fewer false positives result in higher precision. In the Bigram-PGK [11] study, the dataset contained only 111 positive samples and 224 negative samples after applying the k-nearest neighbor cleaning treatment [11], so the experimental findings on the resampled dataset might not reflect the false positive rate properly. Moreover, the existing predictors, i.e. iPGK-PseAAC, CKSAAP_PhoglySite, and Phogly-PseAAC, might not handle the real-world imbalance of the dataset appropriately. Hence, when we uploaded the benchmark dataset containing 111 positive instances and 3249 negative instances (see Table 1) to the web or Matlab interfaces of the existing predictors, the false positive rates came out higher, resulting in lower precision than the experimental findings reported by the Bigram-PGK study (see Table 4). In contrast, our proposed predictor obtained a much lower false positive rate and hence a higher precision, as well as higher sensitivity and specificity, owing to the cost-sensitive SVM used for imbalance management. Considering all the performance measurements in this study, it can be concluded that predPhogly-Site could be a high-throughput tool for predicting phosphoglycerylation sites more precisely.
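The dependence of precision on the class ratio can be made explicit with a short calculation. Since TP = Sn × P and FP = (1 − Sp) × N for P positive and N negative samples, precision equals Sn·P / (Sn·P + (1 − Sp)·N). The sketch below applies this identity to the sensitivity and specificity of CKSAAP_PhoglySite on our benchmark dataset (Table 4). The second calculation, which holds the same Sn and Sp fixed on a 111:224 dataset, is a hypothetical scenario for illustration only and is not a result reported by any of the studies.

```python
def expected_precision(sn: float, sp: float, n_pos: int, n_neg: int) -> float:
    """Precision implied by sensitivity, specificity, and the class sizes."""
    true_positives = sn * n_pos
    false_positives = (1.0 - sp) * n_neg
    return true_positives / (true_positives + false_positives)


# CKSAAP_PhoglySite on the full benchmark dataset (Sn and Sp from Table 4).
full_dataset = expected_precision(sn=0.8288, sp=0.8941, n_pos=111, n_neg=3249)

# Hypothetical: the same Sn and Sp applied to a resampled 111:224 dataset.
resampled_dataset = expected_precision(sn=0.8288, sp=0.8941, n_pos=111, n_neg=224)
```

The first value is about 0.211, matching the precision listed for CKSAAP_PhoglySite in Table 4, whereas the hypothetical resampled case gives about 0.795. An identical false positive rate therefore produces a far lower precision once all 3249 negative sites are included.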

Independent test

The existing phosphoglycerylation site predictors, particularly the most recent one, assessed their models using 10-fold cross-validation. However, some researchers [54–57] have highlighted the necessity of an independent test for assessing a prediction model in addition to k-fold (e.g. k = 5, 10) cross-validation. Thus, an independent test was conducted to further evaluate our proposed model predPhogly-Site on an independent set of phosphoglycerylation sites. The same independent test set was uploaded to the web servers of iPGK-PseAAC, Phogly-PseAAC, and predPhogly-Site to obtain their prediction results, while the prediction results of CKSAAP_PhoglySite were obtained from its Matlab interface. The predictive performance of predPhogly-Site and the other predictors is summarized in Table 5. As Bigram-PGK [11] had no established web-server, we could not report its performance on the independent test set.

Table 5. Prediction performance in Independent test.

Predictor	Sp	Sn	Pre	ACC	MCC	AUC
iPGK-PseAAC	0.9738	0.2927	0.2553	0.9535	0.2494	0.6332
Phogly-PseAAC	0.6837	0.6829	0.0622	0.6836	0.1329	0.6833
CKSAAP_PhoglySite	0.8823	0.7561	0.1649	0.8785	0.3161	0.8192
predPhogly-Site	0.9993	0.9512	0.9750	0.9978	0.9619	0.9752


As shown in Table 5, predPhogly-Site predicted independent phosphoglycerylation sites with specificity, sensitivity, precision, accuracy, MCC and AUC of 99.93%, 95.12%, 97.50%, 99.78%, 96.19% and 97.52%, respectively, which were almost identical to the cross-validation performance reported in Table 4. According to the experimental results in Table 5 and the ROC curve illustrated in Fig 6, the proposed predictor predPhogly-Site achieved a significant improvement over its counterparts in terms of all the evaluation metrics.
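The relationships among the metrics in Table 5 can be checked with a short sketch. It makes two observations, neither of which is a statement of how the authors computed their values. First, the tabulated AUC equals (Sn + Sp)/2 to within rounding for every row, which is the area under a ROC curve that has a single operating point. Second, one set of confusion-matrix counts reproduces the rounded predPhogly-Site row. Those counts (TP = 39, FN = 2, FP = 1, TN = 1333) are hypothetical values chosen only because they are consistent with the table; the actual composition of the independent test set is not stated in this section. The standard definitions of the metrics are assumed.

```python
import math

# Rounded values copied from Table 5: (Sp, Sn, Pre, ACC, MCC, AUC).
TABLE5 = {
    "iPGK-PseAAC":       (0.9738, 0.2927, 0.2553, 0.9535, 0.2494, 0.6332),
    "Phogly-PseAAC":     (0.6837, 0.6829, 0.0622, 0.6836, 0.1329, 0.6833),
    "CKSAAP_PhoglySite": (0.8823, 0.7561, 0.1649, 0.8785, 0.3161, 0.8192),
    "predPhogly-Site":   (0.9993, 0.9512, 0.9750, 0.9978, 0.9619, 0.9752),
}


def metrics(tp, fn, fp, tn):
    """Standard binary metrics from confusion-matrix counts, rounded to 4 places."""
    sp = tn / (tn + fp)
    sn = tp / (tp + fn)
    pre = tp / (tp + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    auc = (sn + sp) / 2  # area under a ROC curve with one operating point
    return tuple(round(v, 4) for v in (sp, sn, pre, acc, mcc, auc))


# Gap between the tabulated AUC and (Sn + Sp) / 2 for each predictor.
auc_gaps = {
    name: abs((sp + sn) / 2 - auc)
    for name, (sp, sn, _pre, _acc, _mcc, auc) in TABLE5.items()
}

# Hypothetical counts (assumption): chosen to match the rounded table row.
example_row = metrics(tp=39, fn=2, fp=1, tn=1333)

for name, gap in auc_gaps.items():
    print(f"{name:18s} |AUC - (Sn+Sp)/2| = {gap:.5f}")
print("metrics from hypothetical counts:", example_row)
```

The gaps stay within the rounding precision of the table, which suggests that the AUC values were obtained from binary predictions rather than from continuous scores, although the section itself does not say so.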

Web-server

To make the tool accessible without requiring users to implement the method themselves, an easy-to-use web-server for predPhogly-Site has been developed. It can be accessed at http://103.99.176.239/predPhogly-Site. Users can submit one or more query protein sequences directly on the web-server as text input in FASTA format, or upload them as a batch file. Detailed guidelines on how to use the web-server, as well as its working mechanism, can also be found there. After a query protein or a batch is submitted, it may take a few moments to get the prediction result, depending on the availability of server resources. Finally, predPhogly-Site generates its output according to the mode of submission: if protein sequences are submitted in the input box, the predictions are shown on the result page; otherwise, they are sent to the user by email.
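As an illustration of the kind of preprocessing a FASTA submission implies, the sketch below parses FASTA text and extracts a fixed-size window around every lysine (K) residue. It is not the web-server's code. The function names `parse_fasta` and `lysine_windows`, the default half-width of 10 (giving a window of 21 residues), and the padding symbol `X` for windows that run past the ends of the protein are all assumptions made for the example.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dictionary."""
    records, header = {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:].strip()
            records[header] = []
        elif header is not None:
            records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def lysine_windows(sequence, half_width=10, pad="X"):
    """Return (1-based position, window) for every lysine in the sequence.

    Each window has 2 * half_width + 1 residues with the lysine at its
    center; positions beyond either terminus are filled with `pad`.
    """
    padded = pad * half_width + sequence + pad * half_width
    size = 2 * half_width + 1
    return [
        (i + 1, padded[i:i + size])
        for i, residue in enumerate(sequence)
        if residue == "K"
    ]


# Hypothetical query, using a short half-width so the windows are easy to read.
query = ">example_protein\nMKAKLLK\n"
for name, seq in parse_fasta(query).items():
    for position, window in lysine_windows(seq, half_width=2):
        print(name, position, window)
```

Each extracted window is one candidate site; a predictor would then score every window independently.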

Conclusion

In this study, a novel computational tool, predPhogly-Site, has been developed for identifying phosphoglycerylation sites in proteins with higher accuracy by utilizing the coupling effects in a sequence. It exploits probabilistic sequence pattern information with variable cost adjustment in the classifier’s decision function to achieve higher predictive performance than the existing phosphoglycerylation site predictors. It has achieved significant performance improvement not only in 10-fold cross-validation, which has been used as the benchmarking technique for the existing predictors, but also in an independent test. Moreover, it has achieved almost identical performance in both 10-fold cross-validation and the independent test, which demonstrates its stability. In the 10-fold cross-validation test, it has achieved more than 0.99 in both AUC and MCC, and in the independent test, it has achieved nearly 0.97 in the corresponding measures. These experimental outcomes demonstrate that predPhogly-Site is highly promising compared to the existing state-of-the-art phosphoglycerylation site predictors. It is expected to become a high-throughput computational tool for PTM researchers for fast exploration of lysine modifications. Experimental scientists can also benefit from this web-based tool without going through its mathematical and implementation details. To further improve the performance and usability of this prediction tool, multiple types of post-translational modification with heterogeneous data will be incorporated simultaneously in future work, along with prediction interpretation support.

Supporting information

S1 File. Benchmark dataset.

The phosphoglycerylated proteins as well as the segmented sequences with respective protein ID and positions have been provided.

(PDF)


S2 File. Independent test dataset.

Proteins which have been recently added to the PLMD database and completely unknown to the proposed system.

(PDF)


S3 File. All possible combinations of the conditional probability values derived from the positive and negative subset.

(XLSX)


S4 File. The non-conditional probability values of 21 amino acids derived from the positive and negative subset.

(XLSX)


Data Availability

All relevant data are within the paper and its Supporting information files.

Funding Statement

The author(s) received no specific funding for this work.

References

1. Saraswathy N, Ramalingam P. Concepts and techniques in genomics and proteomics. Elsevier; 2011.
2. McDowell G, Philpott A. New insights into the role of ubiquitylation of proteins. In: International review of cell and molecular biology. vol. 325. Elsevier; 2016. p. 35–88.
3. Qiu WR, Sun BQ, Xiao X, Xu ZC, Chou KC. iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics. 2016;32(20):3116–3123. doi:10.1093/bioinformatics/btw380
4. Freiman RN, Tjian R. Regulating the regulators: lysine modifications make their mark. Cell. 2003;112(1):11–17. doi:10.1016/S0092-8674(02)01278-3
5. Reddy HM, Sharma A, Dehzangi A, Shigemizu D, Chandra AA, Tsunoda T. GlyStruct: glycation prediction using structural properties of amino acid residues. BMC bioinformatics. 2019;19(13):55–64. doi:10.1186/s12859-018-2547-x
6. Jia J, Liu Z, Xiao X, Liu B, Chou KC. iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Analytical biochemistry. 2016;497:48–56. doi:10.1016/j.ab.2015.12.009
7. Xu Y, Chou KC. Recent progress in predicting posttranslational modification sites in proteins. Current topics in medicinal chemistry. 2016;16(6):591–603. doi:10.2174/1568026615666150819110421
8. Ju Z, Cao JZ, Gu H. Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou’s general PseAAC. Journal of Theoretical Biology. 2016;397:145–150. doi:10.1016/j.jtbi.2016.02.020
9. Xu Y, Ding YX, Ding J, Wu LY, Deng NY. Phogly-PseAAC: prediction of lysine phosphoglycerylation in proteins incorporating with position-specific propensity. Journal of Theoretical Biology. 2015;379:10–15. doi:10.1016/j.jtbi.2015.04.016
10. Moellering RE, Cravatt BF. Functional lysine modification by an intrinsically reactive primary glycolytic metabolite. Science. 2013;341(6145):549–553. doi:10.1126/science.1238327
11. Chandra A, Sharma A, Dehzangi A, Shigemizu D, Tsunoda T. Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix. BMC molecular and cell biology. 2019;20(2):1–9. doi:10.1186/s12860-019-0240-1
12. Liu LM, Xu Y, Chou KC. iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC. Medicinal Chemistry. 2017;13(6):552–559. doi:10.2174/1573406413666170515120507
13. Chou KC. Prediction of signal peptides using scaled window. Peptides. 2001;22(12):1973–1979. doi:10.1016/S0196-9781(01)00540-X
14. Hasan MAM, Ahmad S. mLysPTMpred: Multiple Lysine PTM Site Prediction Using Combination of SVM with Resolving Data Imbalance Issue. Natural Science. 2018;10(9):370–384. doi:10.4236/ns.2018.109035
15. Chou KC. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. Journal of Biological Chemistry. 1993;268(23):16938–16948. doi:10.1016/S0021-9258(19)85285-7
16. Chou KC. Prediction of human immunodeficiency virus protease cleavage sites in proteins. Analytical biochemistry. 1996;233(1):1–14. doi:10.1006/abio.2000.4757
17. Veropoulos K, Campbell C, Cristianini N, et al. Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on AI. vol. 55; 1999. p. 60.
18. Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PloS one. 2011;6(9). doi:10.1371/journal.pone.0024756
19. Hasan MAM, Ahmad S, Molla MKI. iMulti-HumPhos: a multi-label classifier for identifying human phosphorylated proteins using multiple kernel learning based support vector machines. Molecular BioSystems. 2017;13(8):1608–1618. doi:10.1039/C7MB00180K
20. Ju Z, Wang SY. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou’s general pseudo amino acid composition. Gene. 2018;664:78–83. doi:10.1016/j.gene.2018.04.055
21. Ju Z, He JJ. Prediction of lysine propionylation sites using biased SVM and incorporating four different sequence features into Chou’s PseAAC. Journal of Molecular Graphics and Modelling. 2017;76:356–363. doi:10.1016/j.jmgm.2017.07.022
22. Hasan MAM, Li J, Ahmad S, Molla MKI. predCar-site: Carbonylation sites prediction in proteins using support vector machine with resolving data imbalanced issue. Analytical biochemistry. 2017;525:107–113. doi:10.1016/j.ab.2017.03.008
23. Bao W, Yang B, Huang DS, Wang D, Liu Q, Chen YH, et al. IMKPse: Identification of protein malonylation sites by the key features into general PseAAC. IEEE Access. 2019;7:54073–54083. doi:10.1109/ACCESS.2019.2900275
24. Hasan MA, Ben Islam MK, Rahman J, Ahmad S. Citrullination Site Prediction by Incorporating Sequence Coupled Effects into PseAAC and Resolving Data Imbalance Issue. Current Bioinformatics. 2020;15(3):235–245. doi:10.2174/1574893614666191202152328
25. Qiu WR, Xiao X, Lin WZ, Chou KC. iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach. BioMed research international. 2014;2014. doi:10.1155/2014/947416
26. Liu Z, Wang Y, Gao T, Pan Z, Cheng H, Yang Q, et al. CPLM: a database of protein lysine modifications. Nucleic acids research. 2014;42(D1):D531–D536. doi:10.1093/nar/gkt1093
27. Consortium U. UniProt: a worldwide hub of protein knowledge. Nucleic acids research. 2019;47(D1):D506–D515. doi:10.1093/nar/gky1049
28. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. doi:10.1093/bioinformatics/btl158
29. Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics. 2020;112(1):859–866. doi:10.1016/j.ygeno.2019.05.027
30. Ning Q, Ma Z, Zhao X. dForml (KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. Journal of theoretical biology. 2019;470:43–49. doi:10.1016/j.jtbi.2019.03.011
31. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome research. 2004;14(6):1188–1190. doi:10.1101/gr.849004
32. Xu H, Zhou J, Lin S, Deng W, Zhang Y, Xue Y. PLMD: An updated data resource of protein lysine modifications. Journal of Genetics and Genomics. 2017;44(5):243–250. doi:10.1016/j.jgg.2017.03.007
33. Du P, Wang X, Xu C, Gao Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Analytical biochemistry. 2012;425(2):117–119. doi:10.1016/j.ab.2012.03.015
34. Qiu WR, Sun BQ, Xiao X, Xu ZC, Chou KC. iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget. 2016;7(28):44310. doi:10.18632/oncotarget.10027
35. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of theoretical biology. 2011;273(1):236–247. doi:10.1016/j.jtbi.2010.12.024
36. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21(1):10–19. doi:10.1093/bioinformatics/bth466
37. Ju Z, He JJ. Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou’s general PseAAC. Journal of Molecular Graphics and Modelling. 2017;77:200–204. doi:10.1016/j.jmgm.2017.08.020
38. Min JL, Xiao X, Chou KC. iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed research international. 2013;2013. doi:10.1155/2013/701317
39. Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC. iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PloS one. 2014;9(8):e105018. doi:10.1371/journal.pone.0105018
40. Reback J, McKinney W, jbrockmendel, den Bossche JV, Augspurger T, Cloud P, et al. pandas-dev/pandas: Pandas 1.2.0rc0; 2020. doi:10.5281/zenodo.4311557
41. Wang D, Liu D, Yuchi J, He F, Jiang Y, Cai S, et al. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Research. 2020. doi:10.1093/nar/gkaa275
42. Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Frontiers in Bioengineering and Biotechnology. 2020;8. doi:10.3389/fbioe.2020.00134
43. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273–297. doi:10.1023/A:1022627411411
44. Vapnik V. The nature of statistical learning theory. Springer science & business media; 2013.
45. Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics. 2020;112(1):859–866. doi:10.1016/j.ygeno.2019.05.027
46. Zhang L, Tan B, Liu T, Sun X. Classification study for the imbalanced data based on Biased-SVM and the modified over-sampling algorithm. In: Journal of Physics: Conference Series. vol. 1237. IOP Publishing; 2019. p. 022052.
47. Ju Z, He JJ. Prediction of lysine glutarylation sites by maximum relevance minimum redundancy feature selection. Analytical biochemistry. 2018;550:1–7. doi:10.1016/j.ab.2018.04.005
48. Al-Barakati HJ, Saigo H, Newman RH, et al. RF-GlutarySite: a random forest based predictor for glutarylation sites. Molecular omics. 2019;15(3):189–204. doi:10.1039/C9MO00028C
49. Wu M, Yang Y, Wang H, Xu Y. A deep learning method to more accurately recall known lysine acetylation sites. BMC bioinformatics. 2019;20(1):49. doi:10.1186/s12859-019-2632-9
50. Jia C, Zhang M, Fan C, Li F, Song J. Formator: predicting lysine formylation sites based on the most distant undersampling and safe-level synthetic minority oversampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019. doi:10.1109/TCBB.2019.2957758
51. Yu J, Shi S, Zhang F, Chen G, Cao M. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics. 2019;35(16):2749–2756. doi:10.1093/bioinformatics/bty1043
52. Qu K, Han K, Wu S, Wang G, Wei L. Identification of DNA-binding proteins using mixed feature representation methods. Molecules. 2017;22(10):1602. doi:10.3390/molecules22101602
53. Malebary SJ, Rehman MSu, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou’s 5-step rule. PloS one. 2019;14(11):e0223993. doi:10.1371/journal.pone.0223993
54. Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics. 2018;34(24):4223–4231. doi:10.1093/bioinformatics/bty522
55. Adilina S, Farid DM, Shatabda S. Effective DNA binding protein prediction by using key features via Chou’s general PseAAC. Journal of theoretical biology. 2019;460:64–78. doi:10.1016/j.jtbi.2018.10.027
56. Thapa N, Chaudhari M, McManus S, Roy K, Newman RH, Saigo H, et al. DeepSuccinylSite: a deep learning based approach for protein succinylation site prediction. BMC bioinformatics. 2020;21:1–10. doi:10.1186/s12859-020-3342-z
57. Liu K, Cao L, Du P, Chen W. im6A-TS-CNN: identifying N6-methyladenine site in multiple tissues by using convolutional neural network. Molecular Therapy-Nucleic Acids. 2020. doi:10.1016/j.omtn.2020.07.034

PLoS One. doi: 10.1371/journal.pone.0249396.r001

Decision Letter 0

Ozlem Keskin

14 Dec 2020

PONE-D-20-30897

predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

PLOS ONE

Dear Dr. Ahmed,

Thank you for submitting your manuscript to PLOS ONE. After careful consideration, we feel that it has merit but does not fully meet PLOS ONE’s publication criteria as it currently stands. Therefore, we invite you to submit a revised version of the manuscript that addresses the points raised during the review process.

Both reviewers have major concerns about the paper. These include the preparation of the training and test sets and the methods. All points need to be clarified. The writing and organisation of the paper also need clarification.

Please submit your revised manuscript by Jan 28 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

If you would like to make changes to your financial disclosure, please include your updated statement in your cover letter. Guidelines for resubmitting your figure files are available below the reviewer comments at the end of this letter.

If applicable, we recommend that you deposit your laboratory protocols in protocols.io to enhance the reproducibility of your results. Protocols.io assigns your protocol its own identifier (DOI) so that it can be cited independently in the future. For instructions see: http://journals.plos.org/plosone/s/submission-guidelines#loc-laboratory-protocols

We look forward to receiving your revised manuscript.

Kind regards,

Ozlem Keskin

Academic Editor

PLOS ONE

Journal Requirements:

When submitting your revision, we need you to address these additional requirements.

1. Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf


Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. Is the manuscript technically sound, and do the data support the conclusions?

The manuscript must describe a technically sound piece of scientific research with data that supports the conclusions. Experiments must have been conducted rigorously, with appropriate controls, replication, and sample sizes. The conclusions must be drawn appropriately based on the data presented.

Reviewer #1: Yes

Reviewer #2: Yes

**********

2. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Have the authors made all data underlying the findings in their manuscript fully available?

The PLOS Data policy requires authors to make all data underlying the findings described in their manuscript fully available without restriction, with rare exception (please refer to the Data Availability Statement in the manuscript PDF file). The data should be provided as part of the manuscript or its supporting information, or deposited to a public repository. For example, in addition to summary statistics, the data points behind means, medians and variance measures should be available. If there are restrictions on publicly sharing data—e.g. participant privacy or use of data from a third party—those must be specified.

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Is the manuscript presented in an intelligible fashion and written in standard English?

PLOS ONE does not copyedit accepted manuscripts, so the language in submitted articles must be clear, correct, and unambiguous. Any typographical or grammatical errors should be corrected at revision, so please note any specific errors here.

Reviewer #1: Yes

Reviewer #2: No

**********

5. Review Comments to the Author

Please use the space provided to explain your answers to the questions above. You may also include additional comments for the author, including concerns about dual publication, research ethics, or publication ethics. (Please upload your review as an attachment if it exceeds 20,000 characters)

Reviewer #1: 1. The Authors have provided the training set in S1 File. It is not clear how a training set be fixed if 10-fold cross-validation is carried out where there was 10 disjoint test sets and then it was run for a total of 10 iterations. The train and test sets will be randomized each time cross-validation is carried out.

2. The independent test set is said to have 33 proteins. Are these proteins part of the 91 protein sequences obtained after applying the CD-HIT tool? How did the authors obtain the protein sequences for independent test set?

3. The Eq (4) is not very clear and why is there a minus sign between the two terms?

4. How exactly were the conditional as well as the non-conditional probabilities obtained?

5. The work has mentioned under the section ‘Comparative analysis of cross-validation performance’ that the predictions were obtained by submitting the benchmark dataset to the webservers, and also to the Matlab interface for one of the predictors. This however is not apparent as the same result from what Bigram-PGK obtained for those methods/predictors have been tabulated in Table 4. For fair comparison, the cross validation for other methods should also be undertaken for 10 iterations and averaged, while maintaining the same train and test sets in the respective iterations as with the predPhogly-Site method.

6. How did the Authors obtain their result on the independent test set? Was it obtained from the predPhogly-Site webserver? If not, how exactly was it obtained?

7. Why was the CKSAAP_PhoglySite left out in the Independent test? It was mentioned before that predictions could be obtained using its Matlab interface.

8. Calculation of AUC might not be possible if the probability scores are not present. If so, how did the Authors obtain AUC for iPGK-PseAAC and Phogly-PseAAC in Table 5?

9. If the predPhogly-Site is performing better in all the metrics, why is ‘almost all the metrics’ mentioned?

10. In conclusion, the Authors have mentioned that predPhogly-Site has been developed using ‘only primary sequence information’. This begs the question on how the conditional and non-conditional probabilities were obtained. If those probabilities were obtained using a tool such as PSI-BLAST toolbox (position specific scoring matrix of probabilities), it would be incorrect to use that sentence.

11. The Authors are requested to provide algorithm with train and test sets so that the method can be replicated in-order to verify the results.

12. Various PTMs are mentioned in the introduction, it would also be useful to mention about Glycation [PMID: 30717650].

Reviewer #2: Authors describe a computational tool predPhogly-Site for predicting phosphoglycerylation sites from protein sequences. predPhogly-Site extracts features from the protein sequences by a method adopted from the recent Bigram-PGK method and prediction is performed using support vector machines (SVM). The SVM is modified to handle class imbalance between positive dataset and negative dataset which inherently exist in this problem. The training dataset is taken from Bigram-PGK method in which the bias is avoided. The performance of predPhogly-Site tool is compared with those of the existing methods and predPhogly-Site tool comes out to outperform all of the existing methods. predPhogly-Site tool is also tested with an independent test set and again, its performance was the best. The tool is publicly available.

The method described in this paper has already been established and published. The data used in this study is same with Bigram-PGK. I presume that the only addition is the use of modified SVM to handle imbalanced data and this technique was previously formulated and published. The independent test set is also added. Finally, the performance seems to be better than those of existing methods and a web server is available for public access.

The structure of the manuscript should be improved.

Tables and Figures should be improved and better explained. For example, it is very hard to understand what Table 2 tells to the reader. Figures are given in low-resolution and the reader cannot understand them as they are. They should be better explained in the text and in their captions.

Sometimes the vocabulary is inconsistent. For example, “position-specific features” is used only once. What is “formulated samples”? In the beginning of “Feature Construction”, the authors use the terms “positive sites” and “negative sites”; then, these terms are not anymore used in that section.

In the lines 108-110, window size is written to be selected as 21 stating that it is based on preliminary analysis, however no reference is given for the preliminary analysis.

The authors may want to make a table of existing methods in which the columns may include dataset, method, performance values, etc for each method.

Spelling and typographical errors

Line 54

“1.00%” should be “100%”

The attained performance of predPhogly-Site in terms of accuracy, specificity, sensitivity, precision, MCC, and AUC are 99.86%, 99.86%, 1.00%, 95.94%, 97.88%, and 99.93%, respectively.

Lines 95-96

“are” is missing (but also it is difficult to understand the sentence)

“The non-existence of recent test sites also manually verified for avoiding accidental bias in performance benchmarking.”

**********

6. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

[NOTE: If reviewer comments were submitted as an attachment file, they will be attached to this email and accessible via the submission site. Please log into your account, locate the manuscript record, and check for the action link "View Attachments". If this link does not appear, there are no attachment files.]

While revising your submission, please upload your figure files to the Preflight Analysis and Conversion Engine (PACE) digital diagnostic tool, https://pacev2.apexcovantage.com/. PACE helps ensure that figures meet PLOS requirements. To use PACE, you must first register as a user. Registration is free. Then, login and navigate to the UPLOAD tab, where you will find detailed instructions on how to use the tool. If you encounter any issues or have any questions when using PACE, please email PLOS at figures@plos.org. Please note that Supporting Information files do not need this step.

PLoS One. 2021 Apr 1;16(4):e0249396. doi: 10.1371/journal.pone.0249396.r002

Author response to Decision Letter 0

30 Dec 2020

Points raised by the academic editor:

Please ensure that your manuscript meets PLOS ONE's style requirements, including those for file naming. The PLOS ONE style templates can be found at

https://journals.plos.org/plosone/s/file?id=wjVg/PLOSOne_formatting_sample_main_body.pdf

and

https://journals.plos.org/plosone/s/file?id=ba62/PLOSOne_formatting_sample_title_authors_affiliations.pdf

Answer: Thank you very much for pointing out some issues regarding our manuscript. We have gone through the given links and changed our manuscript accordingly including the file naming format.

Points raised by the reviewers:

Reviewer #1:

1. The Authors have provided the training set in S1 File. It is not clear how a training set be fixed if 10-fold cross-validation is carried out where there was 10 disjoint test sets and then it was run for a total of 10 iterations. The train and test sets will be randomized each time cross-validation is carried out.

Answer: Thank you very much for raising this issue. The benchmark dataset provided in S1 File was used for 10-fold cross-validation by following the steps described in the section “Validation of the proposed model”. Furthermore, we repeated the procedure 10 times to obtain a more stable performance measurement. Whereas the previous study (i.e. Bigram-PGK) conducted 10-fold cross-validation only once, we report the average performance over 10 iterations of 10-fold cross-validation. In each iteration, the train and test sets were randomized as described in the same section. Later, after obtaining the best hyperparameters and statistical measures, the entire benchmark dataset was used to train the model deployed on the web-server.
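The repeated cross-validation protocol described in this answer can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `repeated_stratified_kfold`, the stratification by class, and the fixed random seed are assumptions. Only the class sizes (111 positives and 3249 negatives) and the 10 × 10-fold design come from the text.

```python
import random


def repeated_stratified_kfold(labels, k=10, repeats=10, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), reshuffling in every repeat."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    for rep in range(repeats):
        folds = [[] for _ in range(k)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal samples round-robin so each fold keeps the class ratio.
            for pos, idx in enumerate(shuffled):
                folds[pos % k].append(idx)
        for fold in range(k):
            test_idx = sorted(folds[fold])
            train_idx = sorted(
                i for other in range(k) if other != fold for i in folds[other]
            )
            yield rep, fold, train_idx, test_idx


# Benchmark class sizes from the text: 111 positives and 3249 negatives.
labels = [1] * 111 + [0] * 3249
splits = list(repeated_stratified_kfold(labels))

# A model would be trained on train_idx and scored on test_idx for each of
# the 100 splits; the reported value is the mean of those scores.
print(len(splits), "train/test splits generated")
```

Because the generator is seeded, the same sequence of splits can be regenerated and reused when several predictors are compared on identical folds.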

2. The independent test set is said to have 33 proteins. Are these proteins part of the 91 protein sequences obtained after applying the CD-HIT tool? How did the authors obtain the protein sequences for independent test set?

Answer: No, the 33 proteins were not part of the 91 protein sequences. They were collected from the Protein Lysine Modifications Database (PLMD). As mentioned in Section “Dataset”, we considered the proteins that were added to this repository well after the creation of Bigram-PGK’s dataset. Furthermore, we cross-checked those proteins manually and ensured that they do not exist in the benchmark dataset.

3. The Eq (4) is not very clear and why is there a minus sign between the two terms?

Answer: According to the sequence-coupling model described in [1], [2], [3], Eq (4) subtracts the conditional and non-conditional probability values of the negative subset from the corresponding values of the positive subset, which is why a minus sign appears between the two terms.
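The subtraction described in this answer can be illustrated with a small sketch. It is not the manuscript's Eq (4), which is not reproduced in this review history. It assumes the layout commonly used with the vectorized sequence-coupling model: each flanking residue is conditioned on its neighbour one step closer to the central lysine, and the two residues adjacent to K use non-conditional probabilities. The function names and the toy peptides are hypothetical.

```python
def coupling_probs(peptide, subset):
    """Probability of each flanking residue of `peptide`, estimated from `subset`.

    Assumed layout: residues next to the central K get a non-conditional
    probability; all others are conditioned on the neighbour closer to K.
    """
    centre = len(peptide) // 2
    probs = []
    for i, res in enumerate(peptide):
        if i == centre:
            continue  # the central K itself contributes no feature
        j = i + 1 if i < centre else i - 1  # neighbour one step closer to K
        if j == centre:
            total = len(subset)
            hits = sum(1 for p in subset if p[i] == res)
        else:
            total = sum(1 for p in subset if p[j] == peptide[j])
            hits = sum(1 for p in subset if p[j] == peptide[j] and p[i] == res)
        probs.append(hits / total if total else 0.0)
    return probs

def sequence_coupling_features(peptide, positives, negatives):
    """Positive-subset probabilities minus negative-subset probabilities."""
    pos = coupling_probs(peptide, positives)
    neg = coupling_probs(peptide, negatives)
    return [a - b for a, b in zip(pos, neg)]

# Toy 5-residue windows centred on K (illustrative data only).
positives = ["AAKCC", "AGKCD"]
negatives = ["AAKDD", "GAKCD"]
features = sequence_coupling_features("AAKCC", positives, negatives)
print(features)
```

A positive component means the residue pattern at that position is more typical of the positive subset, and a negative component means it is more typical of the negative subset. That sign is what a classifier can exploit.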

4. How exactly were the conditional as well as the non-conditional probabilities obtained?

Answer: We have provided a detailed discussion of how to calculate the conditional and non-conditional probability values from any dataset in the section “Feature construction”. Additionally, we have provided references to a few established predictors in which the vectorized sequence-coupling model [1], [2], [3], [4], [5] has been adopted. The notations of Eq (4) and Eq (5) have been changed to an easier-to-understand form in the revised manuscript.

5. The work has mentioned under the section ‘Comparative analysis of cross-validation performance’ that the predictions were obtained by submitting the benchmark dataset to the web-servers and also to the Matlab interface for one of the predictors. This however is not apparent as the same result from what Bigram-PGK obtained for those methods/predictors have been tabulated in Table 4. For a fair comparison, the cross-validation for other methods should also be undertaken for 10 iterations and averaged, while maintaining the same train and test sets in the respective iterations as with the predPhogly-Site method.

Answer: We have addressed this issue and reported the corresponding prediction results on both the dataset used in predPhogly-Site and Bigram-PGK in Section “Comparative analysis of cross-validation performance”. As we could not obtain the prediction outcomes from Bigram-PGK, we have compared the other three predictors’ (i.e. CKSAAP_PhoglySite, iPGK-PseAAC and Phogly-PseAAC) performance with the same benchmark dataset used in the predPhogly-Site study maintaining the same train and test folds on each iteration. Additionally, we have included the prediction performance of each predictor obtained by Bigram-PGK with different notations in Table 4.

6. How did the Authors obtain their result on the independent test set? Was it obtained from the predPhogly-Site webserver? If not, how exactly was it obtained?

Answer: The independent test result was obtained from the predPhogly-Site web-server and has been mentioned in Section “Independent test”.

7. Why was the CKSAAP_PhoglySite left out in the Independent test? It was mentioned before that predictions could be obtained using its Matlab interface.

Answer: As mentioned in the supporting information of CKSAAP_PhoglySite, its execution requires a 32-bit Matlab package. We could not obtain the 32-bit software at the time of our manuscript submission. We have since obtained it and reported the prediction performance of CKSAAP_PhoglySite on the training set and the independent test set in Table 4 and Table 5, respectively.

8. Calculation of AUC might not be possible if the probability scores are not present. If so, how did the Authors obtain AUC for iPGK-PseAAC and Phogly-PseAAC in Table 5?

Answer: We have followed a similar procedure utilized by Bigram-PGK for calculating AUC. This approach could be found at https://github.com/abelavit/Bigram-PGK, a GitHub link provided by Bigram-PGK.

9. If the predPhogly-Site is performing better in all the metrics, why is ‘almost all the metrics’ mentioned?

Answer: As we could not report the performance of CKSAAP_PhoglySite at that time, we could not state that our predictor performed the best among all the predictors. This issue has been resolved after obtaining the prediction performance from the CKSAAP_PhoglySite predictor and the text “almost all the metrics” has been corrected.

Answer: The probability values were calculated according to the vectorized sequence-coupling model proposed by K.C. Chou [1], [2], [3], [4], [5]. Eq (4) and Eq (5) were written according to this formulation. We have tried to simplify the equations and have explained them both in the manuscript, in Section “Feature construction”, and in the response to question no. 3. The choice of words in Section “Conclusion” might have been inappropriate, so we have reworded it in the revised manuscript.

11. The authors are requested to provide the algorithm with train and test sets so that the method can be replicated in order to verify the results.

Answer: Thank you for your suggestion. Based on your suggestion, the steps of the cross-validation procedure have been provided in Section “Validation of the proposed model”.

12. Various PTMs are mentioned in the introduction, it would also be useful to mention about Glycation [PMID: 30717650].

Answer: Recognizing the importance of glycation as a PTM, we have mentioned it with a proper reference. Thank you for providing valuable suggestions that helped to improve our revised manuscript.

References:

1. Chou, Kuo-Chen. "A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins." Journal of Biological Chemistry 268.23 (1993): 16938-16948.

2. Chou, Kuo-Chen. "Prediction of human immunodeficiency virus protease cleavage sites in proteins." Analytical biochemistry 233.1 (1996): 1-14.

3. Qiu, Wang-Ren, et al. "iPTM-mLys: identifying multiple lysine PTM sites and their different types." Bioinformatics 32.20 (2016): 3116-3123.

4. Chou, Kuo‐Chen. "Prediction of protein cellular attributes using pseudo‐amino acid composition." Proteins: Structure, Function, and Bioinformatics 43.3 (2001): 246-255.

5. Chou, Kuo-Chen. "Some remarks on protein attribute prediction and pseudo amino acid composition." Journal of theoretical biology 273.1 (2011): 236-247.

Reviewer #2:

The structure of the manuscript should be improved.

In the lines 108-110, window size is written to be selected as 21 stating that it is based on preliminary analysis, however no reference is given for the preliminary analysis.

The authors may want to make a table of existing methods in which the columns may include dataset, method, performance values, etc for each method.

Spelling and typographical errors

Line 54

“1.00%” should be “100%”

The attained performance of predPhogly-Site in terms of accuracy, specificity, sensitivity, precision, MCC, and AUC are 99.86%, 99.86%, 1.00%, 95.94%, 97.88%, and 99.93%, respectively.

Lines 95-96

“are” is missing (but also it is difficult to understand the sentence)

“The non-existence of recent test sites also manually verified for avoiding accidental bias in performance benchmarking.”

Answer: Thank you very much for addressing some important issues and providing your valuable suggestions associated with them. We have tried to alleviate these issues in the revised version of our manuscript.

1. The structure of the manuscript should be improved.

Answer: We have tried our best to improve the overall structure of our manuscript. The tables and figures have been reviewed, and we have tried to provide a proper explanation with each caption. The column names of Table 2 have been modified so that they better explain their purpose. In the revised manuscript, we provide high-resolution figures that were further checked with PACE, which helps to ensure that figures meet PLOS requirements.

2. Sometimes the vocabulary is inconsistent. For example, “position-specific features” is used only once. What is “formulated samples”? In the beginning of “Feature Construction”, the authors use the terms “positive sites” and “negative sites”; then, these terms are not anymore used in that section.

Answer: We have revised the manuscript and tried our best to reduce the inconsistency of vocabulary throughout the manuscript. The addressed issues have been resolved by using proper sets of words, especially, at the beginning of Section “Feature construction”. In addition to that, we would like to inform you that, we have obtained a set of formulated samples after adopting Chou’s scheme for sample formulation.

3. In the lines 108-110, window size is written to be selected as 21 stating that it is based on preliminary analysis, however no reference is given for the preliminary analysis.

Answer: Thank you very much for raising such an important point. Previously, we had applied the sliding window method with window sizes of 3, 5, 7, 9, …, 21, and achieved much higher performance at a window size of 21. Based on your concern, we experimented further to find the optimal window size and found that a window size of 29 gives the most promising results. We have reported the improved performance and updated the manuscript to reflect all the issues you raised.
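A sliding-window extraction of this kind can be sketched as below. This is an illustration under stated assumptions rather than the authors' implementation: the function name `extract_windows`, the use of a dummy residue `X` to pad windows that run past the ends of the protein, and the toy sequence are all hypothetical.

```python
def extract_windows(sequence, window=29, pad="X", target="K"):
    """Return (1-based position, segment) for every target residue.

    Each segment has odd length `window` with the target residue at its
    centre; positions beyond the sequence ends are filled with `pad`.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd so that K sits at the centre")
    half = window // 2
    padded = pad * half + sequence + pad * half
    segments = []
    for pos, residue in enumerate(sequence):
        if residue == target:
            # padded[pos + half] is sequence[pos], so this slice is centred on it.
            segments.append((pos + 1, padded[pos:pos + window]))
    return segments

# Toy sequence and a small window, chosen only to keep the output readable.
toy_segments = extract_windows("MKAAK", window=5)
print(toy_segments)
```

Searching over window sizes then amounts to repeating feature extraction and cross-validation for each odd `window` value and comparing the averaged metrics.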

4. The authors may want to make a table of existing methods in which the columns may include dataset, method, performance values, etc for each method.

Answer: Thank you for the suggestion. We have tried to follow your suggestion by providing the performance of each method on the dataset constructed in the predPhogly-Site study and on the dataset of Bigram-PGK in Table 4 with distinct notations.

5. Spelling and typographical errors

Line 54

“1.00%” should be “100%”

The attained performance of predPhogly-Site in terms of accuracy, specificity, sensitivity, precision, MCC, and AUC are 99.86%, 99.86%, 1.00%, 95.94%, 97.88%, and 99.93%, respectively.

Answer: We have corrected the spelling and typographical error in Line 54. In addition to that, we have tried to find out these types of errors throughout the manuscript and made proper corrections.

6. Lines 95-96

“are” is missing (but also it is difficult to understand the sentence)

“The non-existence of recent test sites also manually verified for avoiding accidental bias in performance benchmarking.”

Answer: We have made corrections in Lines 95-96 and provided easier to understand explanation. We would like to thank you once again for reviewing our manuscript and providing the necessary suggestions.

Attachment

Submitted filename: Response to Reviewers.docx

PLoS One. doi: 10.1371/journal.pone.0249396.r003

Decision Letter 1

Ozlem Keskin

20 Jan 2021

PONE-D-20-30897R1

predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

PLOS ONE

Dear Dr. Ahmed,

Please submit your revised manuscript by Mar 06 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Ozlem Keskin

Academic Editor

PLOS ONE

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

1. If the authors have adequately addressed your comments raised in a previous round of review and you feel that this manuscript is now acceptable for publication, you may indicate that here to bypass the “Comments to the Author” section, enter your conflict of interest statement in the “Confidential to Editor” section, and submit your "Accept" recommendation.

Reviewer #1: (No Response)

Reviewer #2: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

Reviewer #2: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

Reviewer #2: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

Reviewer #2: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

Reviewer #2: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors have addressed the majority of the comments. However, a few comments were not addressed properly.

The authors have not provided the algorithm with train and test sets so that the method can be replicated in order to verify the results. The proposed method has achieved very high results, so it is crucial to check and verify those results using the algorithm the authors have used.

New Comment: The authors are requested to check the precision calculation, as precision seems to be quite low for the other methods in Table 4 when compared to what Bigram-PGK has reported for those methods. The other metrics appear to be similar.

Also, the references [1]-[5] are unnecessary and do not add any information.

Reviewer #2: The authors have appropriately answered my comments. The manuscript structure and content have improved and I support publishing the work.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

Reviewer #2: No

PLoS One. 2021 Apr 1;16(4):e0249396. doi: 10.1371/journal.pone.0249396.r004

Author response to Decision Letter 1

8 Feb 2021

Points raised by the reviewers:

Reviewer #1:

1. The authors have addressed the majority of the comments. However, a few comments were not addressed properly.

Answer: Thank you very much for your feedback. We have tried to address the comments that might not have been covered in our previous submission. We have updated the flowchart to give more insight into the overall procedure of constructing the predPhogly-Site predictor; it is referred to in the “Introduction” section as well as in the “Dataset” section. The steps of our system are as follows. First, we constructed the benchmark dataset from the CPLM database. The phosphoglycerylated proteins, as well as the segmented sequences with their respective protein IDs and positions, are provided as File S1. Second, we extracted the sequence-coupling features from the segmented sequences given in File S1. We then used a 10 times 10-fold cross-validation scheme to train and evaluate our SVM-based predictor with the optimal hyperparameters. The step-by-step guidelines are discussed in the “Validation of the proposed model” section and the optimal hyperparameters are given in Table 2. Finally, we performed an independent test in which the test dataset contains newly added proteins (available as File S2) that are completely unknown to the predPhogly-Site predictor. We hope that our system is now reproducible and that our results can be checked and verified easily.

2. The authors are requested to check the Precision calculation as it seems to be quite low for the other methods in Table 4 when compared to what Bigram-PGK has reported for the methods. The other metrics appear to be similar.

Answer: Thank you for your suggestion regarding the precision calculation. We have rechecked the calculation and explained the issue in Section “Comparative analysis of cross-validation performance”. The dataset used in the Bigram-PGK study contained only 111 positive samples and 224 negative samples; on that dataset, the false-positive rates of the existing predictors (i.e. iPGK-PseAAC, CKSAAP_PhoglySite, and Phogly-PseAAC) were lower and their precision was higher. The Bigram-PGK dataset does not reflect real future test data, in which the numbers of positive and negative sites will normally not be balanced. Moreover, the existing predictors might not handle this real-world imbalance appropriately. As a result, they yielded high false-positive rates and low precision on our benchmark dataset, which contains 111 positive instances and 3249 negative instances.
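The effect of class imbalance on precision can be made concrete with a short calculation. The sensitivity and specificity of 0.90 used here are hypothetical values chosen for illustration, not results reported for any predictor; only the class counts (111 positives, 224 or 3249 negatives) come from the answer above.

```python
def precision(sensitivity, specificity, n_pos, n_neg):
    """Precision = TP / (TP + FP) implied by fixed rates and class counts."""
    true_pos = sensitivity * n_pos
    false_pos = (1.0 - specificity) * n_neg
    return true_pos / (true_pos + false_pos)

# Hypothetical operating point, identical on both datasets.
sens, spec = 0.90, 0.90

balanced = precision(sens, spec, n_pos=111, n_neg=224)     # Bigram-PGK class counts
imbalanced = precision(sens, spec, n_pos=111, n_neg=3249)  # benchmark class counts

print(f"precision with 224 negatives:  {balanced:.3f}")
print(f"precision with 3249 negatives: {imbalanced:.3f}")
```

Even at an unchanged false-positive rate, the same predictor shows a much lower precision once negatives outnumber positives by roughly 29 to 1, because the absolute number of false positives grows with the size of the negative class. A higher false-positive rate on the larger dataset would lower precision further.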

3. Also, the references [1]-[5] are unnecessary and do not add any information.

Answer: We have removed a few of the references among [1]-[5] according to your suggestion. However, a few of them have been retained, as these references could be vital for further research on phosphoglycerylation sites.

Reviewer #2: The authors have appropriately answered my comments. The manuscript structure and content have improved and I support publishing the work.

Answer: Thank you very much for your feedback.

Attachment

Submitted filename: Response to Reviewers.docx

PLoS One. doi: 10.1371/journal.pone.0249396.r005

Decision Letter 2

Ozlem Keskin

3 Mar 2021

PONE-D-20-30897R2

predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

PLOS ONE

Dear Dr. Ahmed,

Please submit your revised manuscript by Apr 17 2021 11:59PM. If you will need more time than this to complete your revisions, please reply to this message or contact the journal office at plosone@plos.org. When you're ready to submit your revision, log on to https://www.editorialmanager.com/pone/ and select the 'Submissions Needing Revision' folder to locate your manuscript file.

Please include the following items when submitting your revised manuscript:

A rebuttal letter that responds to each point raised by the academic editor and reviewer(s). You should upload this letter as a separate file labeled 'Response to Reviewers'.
A marked-up copy of your manuscript that highlights changes made to the original version. You should upload this as a separate file labeled 'Revised Manuscript with Track Changes'.
An unmarked version of your revised paper without tracked changes. You should upload this as a separate file labeled 'Manuscript'.

We look forward to receiving your revised manuscript.

Kind regards,

Ozlem Keskin

Academic Editor

PLOS ONE

Journal Requirements:

Please review your reference list to ensure that it is complete and correct. If you have cited papers that have been retracted, please include the rationale for doing so in the manuscript text, or remove these references and replace them with relevant current references. Any changes to the reference list should be mentioned in the rebuttal letter that accompanies your revised manuscript. If you need to cite a retracted article, indicate the article’s retracted status in the References list and also include a citation and full reference for the retraction notice.

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: (No Response)

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Partly

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: No

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: Major Revision.

Previous comments to the authors are not adequately addressed and hence I am summarising it again here.

The authors have stated that the result can be replicated by going through various sections of the paper such as ‘Introduction’, ‘Dataset’, ‘Validation of the proposed model’, etc. However, for the review process, the authors are requested to put their dataset and code in a repository from which the reviewers can easily verify the result without having to re-implement the method from scratch.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: No

PLoS One. 2021 Apr 1;16(4):e0249396. doi: 10.1371/journal.pone.0249396.r006

Author response to Decision Letter 2

8 Mar 2021

Journal Requirements:

Answer: Thank you very much for the valuable suggestion. We have revised and corrected the reference list according to the journal requirements. Please note that two of the references were removed from our previously submitted manuscript at the suggestion of one of the reviewers. The reviewer indicated that the first five references were irrelevant and unimportant, even though that list contained some vital references, and it was not clear to us why they seemed insignificant. We therefore removed only reference numbers three and four, as requested. The full citations of the removed references are listed below.

3. Weissman JD, Raval A, Singer DS. Assay of an intrinsic acetyltransferase activity of the transcriptional coactivator CIITA. In: Methods in enzymology. vol. 370. Elsevier; 2003. p. 378–386.

4. Chou KC. Impacts of bioinformatics to medicinal chemistry. Medicinal chemistry. 2015;11(3):218–234.

Points raised by the reviewers:

Reviewer #1: Major Revision.

1. Previous comments to the authors are not adequately addressed and hence I am summarising it again here.

Answer: Thank you for your kind response. We would like to inform you that a git repository containing the source code of predPhogly-Site is available at https://github.com/Sabit-Ahmed/predPhogly-Site, which could aid the reviewers in verifying the performance obtained by the predPhogly-Site predictor. All the relevant information i.e. the benchmark dataset, independent test dataset, extracted features, source code are provided either as supporting information or git repository and the results can be reproduced and verified. A web-server is also available at http://103.99.176.239/predML-Site for further verification.

Attachment

Submitted filename: Response to Reviewers.docx

PLoS One. doi: 10.1371/journal.pone.0249396.r007

Decision Letter 3

Ozlem Keskin

18 Mar 2021

predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

PONE-D-20-30897R3

Dear Dr. Ahmed,

We’re pleased to inform you that your manuscript has been judged scientifically suitable for publication and will be formally accepted for publication once it meets all outstanding technical requirements.

Within one week, you’ll receive an e-mail detailing the required amendments. When these have been addressed, you’ll receive a formal acceptance letter and your manuscript will be scheduled for publication.

An invoice for payment will follow shortly after the formal acceptance. To ensure an efficient process, please log into Editorial Manager at http://www.editorialmanager.com/pone/, click the 'Update My Information' link at the top of the page, and double check that your user information is up-to-date. If you have any billing related questions, please contact our Author Billing department directly at authorbilling@plos.org.

If your institution or institutions have a press office, please notify them about your upcoming paper to help maximize its impact. If they’ll be preparing press materials, please inform our press team as soon as possible -- no later than 48 hours after receiving the formal acceptance. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information, please contact onepress@plos.org.

Kind regards,

Ozlem Keskin

Academic Editor

PLOS ONE

Additional Editor Comments (optional):

Reviewers' comments:

Reviewer's Responses to Questions

Comments to the Author

Reviewer #1: All comments have been addressed

**********

2. Is the manuscript technically sound, and do the data support the conclusions?

Reviewer #1: Yes

**********

3. Has the statistical analysis been performed appropriately and rigorously?

Reviewer #1: Yes

**********

4. Have the authors made all data underlying the findings in their manuscript fully available?

Reviewer #1: Yes

**********

5. Is the manuscript presented in an intelligible fashion and written in standard English?

Reviewer #1: Yes

**********

6. Review Comments to the Author

Reviewer #1: The authors addressed my concerns; therefore, I recommend accepting the paper in its current form.

**********

7. PLOS authors have the option to publish the peer review history of their article. If published, this will include your full peer review and any attached files.

If you choose “no”, your identity will remain anonymous but your review may still be made public.

Do you want your identity to be public for this peer review? For information about this choice, including consent withdrawal, please see our Privacy Policy.

Reviewer #1: Yes: Alok Sharma

PLoS One. doi: 10.1371/journal.pone.0249396.r008

Acceptance letter

Ozlem Keskin

22 Mar 2021

PONE-D-20-30897R3

predPhogly-Site: Predicting phosphoglycerylation sites by incorporating probabilistic sequence-coupling information into PseAAC and addressing data imbalance

Dear Dr. Ahmed:

I'm pleased to inform you that your manuscript has been deemed suitable for publication in PLOS ONE. Congratulations! Your manuscript is now with our production department.

If your institution or institutions have a press office, please let them know about your upcoming paper now to help maximize its impact. If they'll be preparing press materials, please inform our press team within the next 48 hours. Your manuscript will remain under strict press embargo until 2 pm Eastern Time on the date of publication. For more information please contact onepress@plos.org.

If we can help with anything else, please email us at plosone@plos.org.

Thank you for submitting your work to PLOS ONE and supporting open access.

Kind regards,

PLOS ONE Editorial Office Staff

on behalf of

Dr. Ozlem Keskin

Academic Editor

PLOS ONE

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Supplementary Materials

S1 File. Benchmark dataset.

The phosphoglycerylated proteins as well as the segmented sequences with respective protein ID and positions have been provided.

(PDF)

S2 File. Independent test dataset.

Proteins which have been recently added to the PLMD database and completely unknown to the proposed system.

(PDF)

S3 File. All possible combinations of the conditional probability values derived from the positive and negative subset.

(XLSX)

S4 File. The non-conditional probability values of 21 amino acids derived from the positive and negative subset.

(XLSX)

Data Availability Statement

All relevant data are within the paper and its Supporting information files.

1. Saraswathy N, Ramalingam P. Concepts and techniques in genomics and proteomics. Elsevier; 2011.

2. McDowell G, Philpott A. New insights into the role of ubiquitylation of proteins. In: International review of cell and molecular biology. vol. 325. Elsevier; 2016. p. 35–88.

3. Qiu WR, Sun BQ, Xiao X, Xu ZC, Chou KC. iPTM-mLys: identifying multiple lysine PTM sites and their different types. Bioinformatics. 2016;32(20):3116–3123. doi:10.1093/bioinformatics/btw380

4. Freiman RN, Tjian R. Regulating the regulators: lysine modifications make their mark. Cell. 2003;112(1):11–17. doi:10.1016/S0092-8674(02)01278-3

5. Reddy HM, Sharma A, Dehzangi A, Shigemizu D, Chandra AA, Tsunoda T. GlyStruct: glycation prediction using structural properties of amino acid residues. BMC bioinformatics. 2019;19(13):55–64. doi:10.1186/s12859-018-2547-x

6. Jia J, Liu Z, Xiao X, Liu B, Chou KC. iSuc-PseOpt: identifying lysine succinylation sites in proteins by incorporating sequence-coupling effects into pseudo components and optimizing imbalanced training dataset. Analytical biochemistry. 2016;497:48–56. doi:10.1016/j.ab.2015.12.009

7. Xu Y, Chou KC. Recent progress in predicting posttranslational modification sites in proteins. Current topics in medicinal chemistry. 2016;16(6):591–603. doi:10.2174/1568026615666150819110421

8. Ju Z, Cao JZ, Gu H. Predicting lysine phosphoglycerylation with fuzzy SVM by incorporating k-spaced amino acid pairs into Chou’s general PseAAC. Journal of Theoretical Biology. 2016;397:145–150. doi:10.1016/j.jtbi.2016.02.020

9. Xu Y, Ding YX, Ding J, Wu LY, Deng NY. Phogly-PseAAC: prediction of lysine phosphoglycerylation in proteins incorporating with position-specific propensity. Journal of Theoretical Biology. 2015;379:10–15. doi:10.1016/j.jtbi.2015.04.016

10. Moellering RE, Cravatt BF. Functional lysine modification by an intrinsically reactive primary glycolytic metabolite. Science. 2013;341(6145):549–553. doi:10.1126/science.1238327

11. Chandra A, Sharma A, Dehzangi A, Shigemizu D, Tsunoda T. Bigram-PGK: phosphoglycerylation prediction using the technique of bigram probabilities of position specific scoring matrix. BMC molecular and cell biology. 2019;20(2):1–9. doi:10.1186/s12860-019-0240-1

12. Liu LM, Xu Y, Chou KC. iPGK-PseAAC: identify lysine phosphoglycerylation sites in proteins by incorporating four different tiers of amino acid pairwise coupling information into the general PseAAC. Medicinal Chemistry. 2017;13(6):552–559. doi:10.2174/1573406413666170515120507

13. Chou KC. Prediction of signal peptides using scaled window. Peptides. 2001;22(12):1973–1979. doi:10.1016/S0196-9781(01)00540-X

14. Hasan MAM, Ahmad S. mLysPTMpred: Multiple Lysine PTM Site Prediction Using Combination of SVM with Resolving Data Imbalance Issue. Natural Science. 2018;10(9):370–384. doi:10.4236/ns.2018.109035

15. Chou KC. A vectorized sequence-coupling model for predicting HIV protease cleavage sites in proteins. Journal of Biological Chemistry. 1993;268(23):16938–16948. doi:10.1016/S0021-9258(19)85285-7

16. Chou KC. Prediction of human immunodeficiency virus protease cleavage sites in proteins. Analytical biochemistry. 1996;233(1):1–14. doi:10.1006/abio.2000.4757

17. Veropoulos K, Campbell C, Cristianini N, et al. Controlling the sensitivity of support vector machines. In: Proceedings of the international joint conference on AI. vol. 55; 1999. p. 60.

18. Lin WZ, Fang JA, Xiao X, Chou KC. iDNA-Prot: identification of DNA binding proteins using random forest with grey model. PloS one. 2011;6(9). doi:10.1371/journal.pone.0024756

19. Hasan MAM, Ahmad S, Molla MKI. iMulti-HumPhos: a multi-label classifier for identifying human phosphorylated proteins using multiple kernel learning based support vector machines. Molecular BioSystems. 2017;13(8):1608–1618. doi:10.1039/C7MB00180K

20. Ju Z, Wang SY. Prediction of citrullination sites by incorporating k-spaced amino acid pairs into Chou’s general pseudo amino acid composition. Gene. 2018;664:78–83. doi:10.1016/j.gene.2018.04.055

21. Ju Z, He JJ. Prediction of lysine propionylation sites using biased SVM and incorporating four different sequence features into Chou’s PseAAC. Journal of Molecular Graphics and Modelling. 2017;76:356–363. doi:10.1016/j.jmgm.2017.07.022

[pone.0249396.ref022] 22. Hasan MAM, Li J, Ahmad S, Molla MKI. predCar-site: Carbonylation sites prediction in proteins using support vector machine with resolving data imbalanced issue. Analytical biochemistry. 2017;525:107–113. 10.1016/j.ab.2017.03.008 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref023] 23. Bao W, Yang B, Huang DS, Wang D, Liu Q, Chen YH, et al. IMKPse: Identification of protein malonylation sites by the key features into general PseAAC. IEEE Access. 2019;7:54073–54083. 10.1109/ACCESS.2019.2900275 [DOI] [Google Scholar]

[pone.0249396.ref024] 24. Hasan MA, Ben Islam MK, Rahman J, Ahmad S. Citrullination Site Prediction by Incorporating Sequence Coupled Effects into PseAAC and Resolving Data Imbalance Issue. Current Bioinformatics. 2020;15(3):235–245. 10.2174/1574893614666191202152328 [DOI] [Google Scholar]

[pone.0249396.ref025] 25. Qiu WR, Xiao X, Lin WZ, Chou KC. iMethyl-PseAAC: identification of protein methylation sites via a pseudo amino acid composition approach. BioMed research international. 2014;2014. 10.1155/2014/947416 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref026] 26. Liu Z, Wang Y, Gao T, Pan Z, Cheng H, Yang Q, et al. CPLM: a database of protein lysine modifications. Nucleic acids research. 2014;42(D1):D531–D536. 10.1093/nar/gkt1093 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref027] 27. Consortium U. UniProt: a worldwide hub of protein knowledge. Nucleic acids research. 2019;47(D1):D506–D515. 10.1093/nar/gky1049 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref028] 28. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–1659. 10.1093/bioinformatics/btl158 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref029] 29. Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics. 2020;112(1):859–866. 10.1016/j.ygeno.2019.05.027 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref030] 30. Ning Q, Ma Z, Zhao X. dForml (KNN)-PseAAC: Detecting formylation sites from protein sequences using K-nearest neighbor algorithm via Chou’s 5-step rule and pseudo components. Journal of theoretical biology. 2019;470:43–49. 10.1016/j.jtbi.2019.03.011 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref031] 31. Crooks GE, Hon G, Chandonia JM, Brenner SE. WebLogo: a sequence logo generator. Genome research. 2004;14(6):1188–1190. 10.1101/gr.849004 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref032] 32. Xu H, Zhou J, Lin S, Deng W, Zhang Y, Xue Y. PLMD: An updated data resource of protein lysine modifications. Journal of Genetics and Genomics. 2017;44(5):243–250. 10.1016/j.jgg.2017.03.007 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref033] 33. Du P, Wang X, Xu C, Gao Y. PseAAC-Builder: A cross-platform stand-alone program for generating various special Chou’s pseudo-amino acid compositions. Analytical biochemistry. 2012;425(2):117–119. 10.1016/j.ab.2012.03.015 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref034] 34. Qiu WR, Sun BQ, Xiao X, Xu ZC, Chou KC. iHyd-PseCp: Identify hydroxyproline and hydroxylysine in proteins by incorporating sequence-coupled effects into general PseAAC. Oncotarget. 2016;7(28):44310. 10.18632/oncotarget.10027 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref035] 35. Chou KC. Some remarks on protein attribute prediction and pseudo amino acid composition. Journal of theoretical biology. 2011;273(1):236–247. 10.1016/j.jtbi.2010.12.024 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref036] 36. Chou KC. Using amphiphilic pseudo amino acid composition to predict enzyme subfamily classes. Bioinformatics. 2005;21(1):10–19. 10.1093/bioinformatics/bth466 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref037] 37. Ju Z, He JJ. Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou’s general PseAAC. Journal of Molecular Graphics and Modelling. 2017;77:200–204. 10.1016/j.jmgm.2017.08.020 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref038] 38. Min JL, Xiao X, Chou KC. iEzy-Drug: A web server for identifying the interaction between enzymes and drugs in cellular networking. BioMed research international. 2013;2013. 10.1155/2013/701317 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref039] 39. Xu Y, Wen X, Wen LS, Wu LY, Deng NY, Chou KC. iNitro-Tyr: Prediction of nitrotyrosine sites in proteins with general pseudo amino acid composition. PloS one. 2014;9(8):e105018. 10.1371/journal.pone.0105018 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref040] 40.Reback J, McKinney W, jbrockmendel, den Bossche JV, Augspurger T, Cloud P, et al. pandas-dev/pandas: Pandas 1.2.0rc0; 2020. Available from: 10.5281/zenodo.4311557. [DOI]

[pone.0249396.ref041] 41. Wang D, Liu D, Yuchi J, He F, Jiang Y, Cai S, et al. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Research. 2020;. 10.1093/nar/gkaa275 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref042] 42. Lv Z, Zhang J, Ding H, Zou Q. RF-PseU: A Random Forest Predictor for RNA Pseudouridine Sites. Frontiers in Bioengineering and Biotechnology. 2020;8. 10.3389/fbioe.2020.00134 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref043] 43. Cortes C, Vapnik V. Support-vector networks. Machine learning. 1995;20(3):273–297. 10.1023/A:1022627411411 [DOI] [Google Scholar]

[pone.0249396.ref044] 44. Vapnik V. The nature of statistical learning theory. Springer science & business media; 2013. [Google Scholar]

[pone.0249396.ref045] 45. Ju Z, Wang SY. Prediction of lysine formylation sites using the composition of k-spaced amino acid pairs via Chou’s 5-steps rule and general pseudo components. Genomics. 2020;112(1):859–866. 10.1016/j.ygeno.2019.05.027 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref046] 46.Zhang L, Tan B, Liu T, Sun X. Classification study for the imbalanced data based on Biased-SVM and the modified over-sampling algorithm. In: Journal of Physics: Conference Series. vol. 1237. IOP Publishing; 2019. p. 022052.

[pone.0249396.ref047] 47. Ju Z, He JJ. Prediction of lysine glutarylation sites by maximum relevance minimum redundancy feature selection. Analytical biochemistry. 2018;550:1–7. 10.1016/j.ab.2018.04.005 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref048] 48. Al-Barakati HJ, Saigo H, Newman RH, et al. RF-GlutarySite: a random forest based predictor for glutarylation sites. Molecular omics. 2019;15(3):189–204. 10.1039/C9MO00028C [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref049] 49. Wu M, Yang Y, Wang H, Xu Y. A deep learning method to more accurately recall known lysine acetylation sites. BMC bioinformatics. 2019;20(1):49. 10.1186/s12859-019-2632-9 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref050] 50. Jia C, Zhang M, Fan C, Li F, Song J. Formator: predicting lysine formylation sites based on the most distant undersampling and safe-level synthetic minority oversampling. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2019;. 10.1109/TCBB.2019.2957758 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref051] 51. Yu J, Shi S, Zhang F, Chen G, Cao M. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics. 2019;35(16):2749–2756. 10.1093/bioinformatics/bty1043 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref052] 52. Qu K, Han K, Wu S, Wang G, Wei L. Identification of DNA-binding proteins using mixed feature representation methods. Molecules. 2017;22(10):1602. 10.3390/molecules22101602 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref053] 53. Malebary SJ, Rehman MSu, Khan YD. iCrotoK-PseAAC: Identify lysine crotonylation sites by blending position relative statistical features according to the Chou’s 5-step rule. PloS one. 2019;14(11):e0223993. 10.1371/journal.pone.0223993 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref054] 54. Li F, Li C, Marquez-Lago TT, Leier A, Akutsu T, Purcell AW, et al. Quokka: a comprehensive tool for rapid and accurate prediction of kinase family-specific phosphorylation sites in the human proteome. Bioinformatics. 2018;34(24):4223–4231. 10.1093/bioinformatics/bty522 [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref055] 55. Adilina S, Farid DM, Shatabda S. Effective DNA binding protein prediction by using key features via Chou’s general PseAAC. Journal of theoretical biology. 2019;460:64–78. 10.1016/j.jtbi.2018.10.027 [DOI] [PubMed] [Google Scholar]

[pone.0249396.ref056] 56. Thapa N, Chaudhari M, McManus S, Roy K, Newman RH, Saigo H, et al. DeepSuccinylSite: a deep learning based approach for protein succinylation site prediction. BMC bioinformatics. 2020;21:1–10. 10.1186/s12859-020-3342-z [DOI] [PMC free article] [PubMed] [Google Scholar]

[pone.0249396.ref057] 57. Liu K, Cao L, Du P, Chen W. im6A-TS-CNN: identifying N6-methyladenine site in multiple tissues by using convolutional neural network. Molecular Therapy-Nucleic Acids. 2020;. 10.1016/j.omtn.2020.07.034 [DOI] [PMC free article] [PubMed] [Google Scholar]
