Bioinformatics. 2024 Oct 10;40(11):btae602. doi: 10.1093/bioinformatics/btae602

Sitetack: a deep learning model that improves PTM prediction by using known PTMs

Clair S Gutierrez, Alia A Kassim, Benjamin D Gutierrez, Ronald T Raines
Editor: Macha Nikolski
PMCID: PMC11552626  PMID: 39388212

Abstract

Motivation

Post-translational modifications (PTMs) increase the diversity of the proteome and are vital to organismal life and therapeutic strategies. Deep learning has been used to predict PTM locations. Still, limitations in datasets and their analyses compromise success.

Results

We evaluated the use of known PTM sites in prediction via sequence-based deep learning algorithms. For each PTM, known locations of that PTM were encoded as a separate amino acid before sequences were encoded via word embedding and passed into a convolutional neural network that predicts the probability of that PTM at a given site. Without labeling known PTMs, our models are on par with others. With labeling, however, we improved significantly upon extant models. Moreover, knowing PTM locations can increase the predictability of a different PTM. Our findings highlight the importance of PTMs for the installation of additional PTMs. We anticipate that including known PTM locations will enhance the performance of other proteomic machine learning algorithms.

Availability and implementation

Sitetack is available as a web tool at https://sitetack.net; the source code, representative datasets, instructions for local use, and select models are available at https://github.com/clair-gutierrez/sitetack.

1 Introduction

Post-translational modifications (PTMs) are chemical alterations to proteins following their biosynthesis (Walsh et al. 2005, Ryšlavá et al. 2013). Typically catalyzed by enzymes, PTMs expand the size of the human proteome by orders of magnitude and are critical to the proper function of cells and organisms (Karve and Cheema 2011, Schaffert and Carter 2020, Ebert et al. 2022, Pan and Chen 2022, Hu et al. 2024). Despite advances in mass spectrometry and other experimental methods, the large-scale analysis of PTMs remains challenging and costly (Hou et al. 2020, Hermann et al. 2022). Machine learning-based methods, particularly deep learning methods, have been useful for predicting the location of some PTMs from amino acid sequences alone (Ramazi and Zahiri 2021). Many models have been developed in recent years (Wang et al. 2017, 2019, 2020, Huang et al. 2018, Luo et al. 2019, Wu et al. 2019, Yan et al. 2023). Still, the accuracy of prediction is not highly reliable for the vast majority of PTMs.

We reasoned that including the location of known PTMs along with sequence information could improve PTM prediction. Our approach has biochemical precedent. For example, crosstalk is well-known between phosphorylation and O-linked glycosylation at the same and proximal serine or threonine residues (Hart et al. 2011, van der Laarse et al. 2018). Kinase-recognition sites themselves often contain phosphorylated residues (Johnson et al. 2023). Likewise, a variety of PTMs can affect the ubiquitination of a protein (Barbour et al. 2023). More broadly, PTM crosstalk pairs have been compiled from 82 human proteins and found to correlate with sequence co-evolution (Huang et al. 2018).

Despite the importance of PTMs in directing other PTMs, computational methods that deploy these data for PTM prediction are rare and of limited scope (Chang et al. 2018). Moreover, we are not aware of a systematic evaluation of how PTMs influence other PTM sites of the same type. Here, we present “Sitetack” and show that its inclusion of known PTMs improves the prediction of other PTM sites significantly.

2 Materials and methods

2.1 Benchmark datasets

In this study, we deployed several previously curated PTM datasets that had been used to train different machine-learning models. Many of our models used the various PTM datasets compiled by Xu and coworkers (Wang et al. 2020) that were obtained from www.musite.net (Wang et al. 2017, 2019, 2020). For O-glycosylation specifically, we used the OGP dataset from Cao and coworkers (Huang et al. 2021) because the dataset from Xu and coworkers (Wang et al. 2020) was relatively small. In addition, to obtain a large dataset for O-GlcNAc O-glycosylation, we used a dataset processed from the O-GlcNAc site atlas (Ma et al. 2021) that was used by Jia, Ma, and coworkers (Hu et al. 2024) in their prediction model. For N-glycosylation, the specificity for the N-X-S/T sequon led us to use two additional datasets from KC and coworkers (Pakhrin et al. 2021), which curated positive and negative examples specifically for that sequon. For kinase-specific models, the datasets from Ishihama and coworkers (Sugiyama et al. 2019) were processed to remove tyrosine sites and sites not in reference proteome UP000005640. Of these, only datasets containing over 500 sites were used, to provide sufficient training data. Specific information on dataset processing and on the biases present in these datasets is discussed in depth in Supplementary Section S2.

To evaluate model performance, we split each dataset at random into training, validation, and test sets in a ratio of 80%:10%:10%. Though the composition of these sets varied with the specific models and validation runs, the test set was always completely distinct from the training set for a given evaluation. To further validate models, 10-fold Monte Carlo cross-validation was performed on the best models by resplitting the datasets, retraining, and retesting separately. Model statistics were reported as the average of the 10 trials, with error reported as the standard deviation across trials.
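The splitting procedure can be sketched as follows. This is a minimal illustration, not the Sitetack source: the function names and the use of Python's `random` module are assumptions.

```python
import random

def split_dataset(sites, seed):
    """Randomly split a list of PTM sites 80%:10%:10% into train/val/test.
    One call corresponds to one Monte Carlo cross-validation run."""
    rng = random.Random(seed)
    shuffled = sites[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]       # test never overlaps train
    return train, val, test

def monte_carlo_splits(sites, n_runs=10):
    """Resplit the dataset afresh for each of the 10 Monte Carlo runs."""
    return [split_dataset(sites, seed) for seed in range(n_runs)]
```

Statistics would then be averaged over the 10 runs, with the standard deviation across runs reported as the error.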

2.2 Data representation

For a given protein sequence, subsequences (k-mers) of length k centered at possible PTM sites were generated with various window sizes. The window size k was optimized by retraining the CNN model for serine or threonine phosphorylation with different values of k (Supplementary Section S4.1). Because of the number of models that needed to be trained, k was not re-optimized for subsequent models; instead, a size of k = 53 was used throughout. For each type of PTM in models that contained PTM information, known PTMs of the same type were encoded as a separate amino acid, represented symbolically as “@” (and “&” if another type of amino acid residue could undergo the modification). For all datasets, the 20 canonical amino acids were encoded in a standard manner, along with “-” to denote no amino acid (for sites near the ends of a protein) and “U” for selenocysteine. The amino acids were then vectorized, with each amino acid corresponding to a different number. These integer vectors were then mapped, via a word embedding learned during training, to dense vectors of length 21 before passage into the main model. For example, the sequence “ACD” would become <1, 2, 3> and, after word embedding, a tensor of dimension [3, 21]. For additional information, see Supplementary Section S1.7.
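The encoding can be sketched in NumPy. The alphabet ordering and the random embedding table below are illustrative (the real embedding is learned during training); only the vocabulary contents and the 21-dimensional embedding width come from the text.

```python
import numpy as np

# Vocabulary: 20 canonical amino acids, "-" (no residue), "U" (selenocysteine),
# "@" for a known PTM of the type being predicted, and "&" for a second
# modifiable residue type.  The index ordering here is an assumption.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-U@&"
INDEX = {aa: i + 1 for i, aa in enumerate(ALPHABET)}  # 1-based, so "ACD" -> <1, 2, 3>

def vectorize(kmer):
    """Map a k-mer (with known PTMs already rewritten as '@') to integer codes."""
    return np.array([INDEX[aa] for aa in kmer])

def embed(codes, table):
    """Look up each integer code in an embedding table with rows of length 21."""
    return table[codes]  # shape [k, 21]

rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(ALPHABET) + 1, 21))  # stand-in for learned weights

codes = vectorize("ACD")               # array([1, 2, 3]), matching the example above
dense = embed(codes, embedding_table)  # shape (3, 21)
```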

Given that there are many more negative examples than positive examples, for each model we randomly selected the same number of negative sites as positive sites, except for N-X-S/T sequon glycosylation, which already had a relatively balanced dataset. During 10-fold Monte Carlo cross-validation, negatives were re-sampled for each run, so the negative datasets used in each validation run differed substantially. In addition, to handle homology between the k-mers themselves in our main models, we ensured that no sequence in any test set was identical to one in the cognate training or validation sets. For each model, we evaluated sequence-identity cutoffs down to 40% on the k-mers (which is stricter than a whole-protein cutoff) using CD-HIT (Li and Godzik 2006, Fu et al. 2012); results are in Supplementary Section S11. Most models suffered only minor performance reductions at lower sequence-identity cutoffs.
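The undersampling step can be sketched as follows; the function name is a hypothetical helper, not part of the Sitetack codebase.

```python
import random

def balance_negatives(positives, negatives, seed):
    """Randomly undersample negative sites to match the number of positives.
    Called afresh for each Monte Carlo run, so negative sets differ between runs."""
    rng = random.Random(seed)
    return positives, rng.sample(negatives, len(positives))
```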

The human-only datasets were curated to contain only human proteins by taking only those in UniProt reference proteome UP000005640 for Homo sapiens (Bateman et al. 2023). For the all-organism datasets, because interrogating all proteins from all organisms is impractical at that scale, negatives were taken from unmodified sites in proteins that bore the PTM. In this case, “reference proteomes” were generated by obtaining the sequences of the modified proteins from UniProt.

2.3 Model architecture

The general deep learning architecture for this work is shown in Fig. 1. Several other architectures were tried, including LSTM- and attention-based models (Supplementary Fig. S4.2). In short, the general model used in this study took the vectorized sequences of length 53, embedded them as a word embedding, then passed them through a 2D convolution layer, a max pooling layer, and two dense layers with a dropout of 0.1 in between. ReLU activation was used for all layers except the output, which used a sigmoid.
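A minimal Keras sketch of this pipeline follows. The filter count, kernel length, pooling size, and hidden width are placeholders (the tuned values are in the Supplementary Information), and the vocabulary size assumes the alphabet described in Section 2.2 plus a padding index.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_sitetack_cnn(vocab_size=25, k=53, embed_dim=21,
                       n_filters=32, kernel_len=5, hidden=64):
    """Sketch of the Fig. 1 architecture: word embedding -> 2D convolution ->
    max pooling -> two dense layers with dropout 0.1 -> sigmoid output.
    Hyperparameters here are illustrative, not the paper's tuned values."""
    inputs = layers.Input(shape=(k,), dtype="int32")
    x = layers.Embedding(vocab_size, embed_dim)(inputs)      # learned word embedding
    x = layers.Reshape((k, embed_dim, 1))(x)                 # add channel axis for 2D conv
    x = layers.Conv2D(n_filters, (kernel_len, embed_dim),
                      activation="relu",
                      kernel_regularizer=tf.keras.regularizers.l2(1e-6))(x)
    x = layers.MaxPooling2D(pool_size=(4, 1))(x)             # pool along the sequence axis
    x = layers.Flatten()(x)
    x = layers.Dense(hidden, activation="relu")(x)
    x = layers.Dropout(0.1)(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)       # P(site is modified)
    return tf.keras.Model(inputs, outputs)
```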

Figure 1. Overall framework of the Sitetack model for PTM prediction.

2.4 Training

Model parameters were tuned using the human protein phosphorylation dataset (Supplementary Fig. S4.3). All other models were trained using the Adam optimizer (Kingma and Ba 2014) (learning rate = 0.001) with binary cross-entropy as the loss function. Early stopping, monitoring the validation-set loss with a patience of 15, was used to prevent overfitting. If early stopping was not triggered, then models could train for up to 400 epochs with a batch size of 100. In addition, kernel regularization was applied at the convolutional layer via L2 regularization with a weight of 1 × 10−6.
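This training setup can be expressed in Keras as below. The function name and the `epochs`/`batch_size` parameters (defaulting to the values in the text) are illustrative; the model is assumed to be any Keras binary classifier, with the L2 kernel regularization applied when the convolutional layer is built rather than here.

```python
import tensorflow as tf

def compile_and_fit(model, x_train, y_train, x_val, y_val,
                    epochs=400, batch_size=100):
    """Training configuration from Section 2.4: Adam (lr = 0.001),
    binary cross-entropy, and early stopping on validation loss
    with a patience of 15 epochs."""
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.AUC()],
    )
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=15, restore_best_weights=True)
    return model.fit(x_train, y_train,
                     validation_data=(x_val, y_val),
                     epochs=epochs, batch_size=batch_size,
                     callbacks=[early_stop], verbose=0)
```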

2.5 Interpretability using integrated gradients

To interpret the results of Sitetack predictions, we used the post hoc interpretability method of integrated gradients (Sundararajan et al. 2017) on the input word-embedding tensors and summed them at each position to obtain the position-summed integrated gradients (PSIG). Model gradients were obtained using GradientTape from TensorFlow, and integrated gradients were calculated using a Riemann approximation of the integral and then summed over each position:

$$\mathrm{PSIG}_i \approx \sum_{j=1}^{n} \left( x_{i,j} - x'_{i,j} \right) \times \sum_{k=1}^{m} \frac{\partial F\!\left( x' + \frac{k}{m}\left( x - x' \right) \right)}{\partial x_{i,j}} \times \frac{1}{m}$$

where x = input tensor (word embedding sequence vector), i,j = features (i: sequence position, j: word embedding position), m = number of steps in Riemann integral approximation, n = length of word embedding vector, and x′ = baseline tensor.
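The Riemann approximation above can be sketched in NumPy. Here `grad_fn` stands in for the TensorFlow GradientTape call; for a linear model the approximation is exact at any number of steps, which makes the sketch easy to verify.

```python
import numpy as np

def psig(grad_fn, x, baseline, m=50):
    """Position-summed integrated gradients via a Riemann approximation.

    grad_fn(x) returns dF/dx with the same [k, d] shape as x (sequence
    length k, embedding width d).  Gradients are averaged along the straight
    path from the baseline to x, scaled by (x - baseline), then summed over
    the embedding axis to give one attribution per sequence position.
    """
    diff = x - baseline
    grads = np.zeros_like(x)
    for step in range(1, m + 1):
        grads += grad_fn(baseline + (step / m) * diff)
    ig = diff * grads / m          # integrated gradients, shape [k, d]
    return ig.sum(axis=1)          # position-summed, shape [k]
```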

The resulting tensors (before position summation) were clustered using k-means clustering with 10 clusters, a number determined through the elbow method. Clusters were visualized via t-SNE. The sequence-frequency plot for a cluster was generated using WebLogo (Crooks et al. 2004) from all input k-mer sequences in that cluster, with the k-mer center set to zero and the sequence truncated to ±10 residues on each side.
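The clustering step can be sketched with scikit-learn; the random matrix below stands in for real attribution tensors of shape [53 × 21], flattened per input.

```python
import numpy as np
from sklearn.cluster import KMeans

# Cluster the per-input attribution tensors (flattened, pre-summation) into
# 10 groups, as in Section 2.5.  Simulated data replaces real integrated
# gradients for illustration.
rng = np.random.default_rng(0)
ig_tensors = rng.normal(size=(200, 53 * 21))
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(ig_tensors)
```

Visualization with t-SNE (`sklearn.manifold.TSNE`) and WebLogo plotting would follow, but are omitted here.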

3 Results

3.1 PTM data improves model performance

Given the success of convolutional neural network (CNN) architectures, we employed a simple CNN-based deep learning model for testing. Although other architectures were tested, including long short-term memory (LSTM) models, attention-based models, and combinations thereof (Supplementary Fig. S4.2), a simple CNN tended to perform well and have reasonable training times. The overall model architecture used for this study is shown in Fig. 1. For each dataset, we trained two separate models with identical architectures, one that contained other known PTM locations (of the same type of PTM) and one that did not. The former is unique to this work. This training was done using known datasets (Wang et al. 2020) for thirteen PTMs: phosphorylation of serine and threonine, phosphorylation of tyrosine, N-glycosylation of asparagine, O-glycosylation of serine and threonine, ubiquitination of lysine, SUMOylation of lysine, acetylation of lysine, methylation of lysine, methylation of arginine, pyroglutamylation of glutamine, palmitoylation of cysteine, hydroxylation of proline, and hydroxylation of lysine (Supplementary Fig. S1.2). For O-glycosylation, we trained additional subtype-specific models in which the glycan was either GlcNAc or GalNAc, using datasets taken from O-GlcNAcAtlas (Ma et al. 2021) and OGP (Huang et al. 2021). In addition, models were trained on datasets that include PTMs known to exist only in humans or in any organism.

For a majority of the models we trained, we observed an increase in prediction performance when known PTM locations were encoded as separate amino acids. This increase was most striking for phosphorylation, which went from an AUC of 0.881 to 0.931 for the human-only model and 0.891 to 0.928 for the all-organism model. Select ROC curves are shown in Fig. 2, and additional ROC curves can be found in Supplementary Sections S6.1, S6.2, S7.2, S7.3, S8.2, and S8.3. Detailed performance metrics for all trained models are shown in Supplementary Figs S4.4, S5.1, S5.2, and S7.1.

Figure 2. Prediction of three representative PTMs with and without the use of labels: serine or threonine phosphorylation, N-linked glycosylation, and proline hydroxylation. (A) AUC curves. (B) AUPRC curves.

Performance enhancements correlated with the size of the dataset of known PTMs. PTMs with larger datasets of known PTMs had larger improvements in prediction performance when those known PTMs were labeled a priori. An exception was N-glycosylation, which had a large dataset and did not have improved performance when known sites were included (Fig. 2). This anomaly was likely due to the simplicity of the N-glycosylation recognition sequence, so we examined N-X-S/T sequon-specific models explicitly. It is a relatively easy task to predict if a site can be N-glycosylated, but it is more difficult to distinguish which of the N-X-S/T sequons are actually N-glycosylated. For this task, we trained two sets of models using data from NGlyDE (Pitti et al. 2019) and NGlycositeAtlas (Sun et al. 2019), which were compiled by others (Pakhrin et al. 2021). We saw an improvement in performance upon the inclusion of known N-glycosylation sites, indicating that known sites could be helpful in this task (Table 1 and Supplementary Fig. S8.1). Still, the best models for this task use structural inputs, and these models can perform better than our purely sequence-based models, suggesting that other factors are at play beyond nearby PTMs (Pakhrin et al. 2021, 2023).

Table 1. Comparison of PTM predictions for all-organism datasets by Sitetack and other models.(a)

| PTM type (residue) | Subtype | Sitetack − PTMs (AUC/AUPRC) | Sitetack + PTMs (AUC/AUPRC) | Comparator (AUC/AUPRC) | Comparator model |
| --- | --- | --- | --- | --- | --- |
| Phosphorylation (S,T) | – | 0.869/0.853 | 0.928(b)/0.922(b) | 0.896/0.329 | MusiteDeep (Wang et al. 2020) |
| Phosphorylation (Y) | – | 0.892/0.856 | 0.901/0.869 | 0.958/0.864 | MusiteDeep (Wang et al. 2020) |
| N-Glycosylation (N) | – | 0.982/0.969 | 0.989(b)/0.985(b) | 0.993/0.937 | MusiteDeep (Wang et al. 2020) |
| O-Glycosylation (S,T) | – | 0.947/0.933 | 0.960/0.944 | 0.943/0.539 | MusiteDeep (Wang et al. 2020) |
| Ubiquitination (K) | – | 0.888/0.801 | 0.923(b)/0.862(b) | 0.804/0.279 | MusiteDeep (Wang et al. 2020) |
| SUMOylation (K) | – | 0.957/0.930 | 0.966/0.945 | 0.990/0.881 | MusiteDeep (Wang et al. 2020) |
| Acetylation (K) | – | 0.863/0.761 | 0.904(b)/0.836(b) | 0.978/0.858 | MusiteDeep (Wang et al. 2020) |
| Methylation (K) | – | 0.965/0.905 | 0.968/0.917 | 0.951/0.850 | MusiteDeep (Wang et al. 2020) |
| Methylation (R) | – | 0.928/0.870 | 0.945(b)/0.906(b) | 0.941/0.844 | MusiteDeep (Wang et al. 2020) |
| Pyroglutamylation (Q) | – | 0.961/0.941 | 0.980/0.978 | 0.979/0.947 | MusiteDeep (Wang et al. 2020) |
| Palmitoylation (C) | – | 0.949/0.924 | 0.972/0.949 | 0.961/0.922 | MusiteDeep (Wang et al. 2020) |
| Hydroxylation (P) | – | 0.901/0.856 | 0.980(b)/0.968(b) | 0.732/0.627 | MusiteDeep (Wang et al. 2020) |
| Hydroxylation (K) | – | 0.918/0.968 | 0.947(b)/0.876(b) | 0.982/0.930 | MusiteDeep (Wang et al. 2020) |
| N-Glycosylation (N) | Sequon-specific | 0.656/0.482 | 0.687/0.521 | 0.907/0.857 | DeepNGlyPred (Sun et al. 2019, Pakhrin et al. 2021) |
| O-Glycosylation (S,T) | – | 0.829/0.832 | 0.953/0.963 | 0.983/0.915(c) | OGP SVM (Huang et al. 2021) |
| O-Glycosylation (S,T) | GalNAc | 0.859/0.872 | 0.963/0.968 | – | OGP SVM (Huang et al. 2021) |
| O-Glycosylation (S,T) | GlcNAc | 0.766/0.767 | 0.879/0.890 | 0.828/– | O-GlcNAcPRED-DL (Ma et al. 2021, Hu et al. 2024) |

(a) Representative models for each PTM were chosen, and Sitetack was trained on the same datasets as those models, with and without known PTM locations. Values in bold typeface refer to the largest AUC and AUPRC value for each PTM among the three models: Sitetack − PTMs, Sitetack + PTMs, and a comparator. For results using a test set reduced to 80% sequence similarity, see Supplementary Figs S11.2–S11.5.

(b) A significant difference (P < 0.05) exists between the Sitetack models with and without known PTM locations.

(c) This model was trained on an unspecified subset of OGP.

3.2 Comparison with existing methods

We compared the AUC and AUPRC of Sitetack, with and without known PTMs, to those of representative models trained on similar datasets. Because we used the same datasets as MusiteDeep, and few other models cover a similar breadth of PTMs, our main comparison was with MusiteDeep. In general, for our models trained without known PTM locations, MusiteDeep outperformed Sitetack, which is not surprising given its more complex architecture. Yet, when known PTM sites were included, Sitetack outperformed MusiteDeep (AUC or AUPRC) for most PTMs (Table 1, Supplementary Section S5.2). For O-glycosylation, we tried several other models with a larger dataset or subtype-specific models. For those models, Sitetack with labels generally outperformed the corresponding models (Table 1, Supplementary Section S7.1). For the GalNAc-specific models, there were no existing models trained on a similar dataset. Nonetheless, including known PTM locations increased prediction reliability, and partitioning of the GalNAc sites was beneficial compared to an analysis of the overall OGP dataset.

3.3 Proximity of PTMs to other PTMs

To understand why nearby PTM sites improve model performance, we examined the frequency of nearby PTM sites at a target residue in a given dataset. The results followed three patterns, as shown in Fig. 3 (all PTMs are shown in Supplementary Section S3). In the first pattern, evident with N-glycosylation, we found a relatively even distribution of PTMs at different distances from the site of interest. In this pattern, there is little benefit to knowing the location of nearby PTMs; such PTMs likely arise from random chance rather than from the substrate recognition necessary for N-glycosylation. The second pattern is illustrated best by the hydroxylation of proline, which has a strongly repetitive pattern. The emergence of this pattern is not surprising because collagen is the most common protein with hydroxyproline and follows a repeating Xaa-Yaa-Gly triplet in which Yaa is hydroxyproline 38% of the time (Shoulders and Raines 2009). Here, the inclusion of nearby PTMs improves prediction only slightly, likely because the pattern is simple enough for a model to learn on its own. The third pattern is illustrated by the phosphorylation of serine and threonine, where nearby phosphorylation events become less frequent as distance from a given site increases. In such cases, including known nearby PTMs in the dataset improved predictions significantly. We hypothesize that enzymes that catalyze these modifications are influenced by amino acid residues that have already undergone a PTM.
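The frequency analysis above can be sketched as follows; the k-mers are assumed to use the "@" encoding from Section 2.2, and the function name is a hypothetical helper.

```python
import numpy as np

def nearby_ptm_frequency(kmers, ptm_char="@", window=10):
    """Fraction of k-mers with a known PTM ('@') at each offset from the
    center residue (offset 0, the candidate site itself, is excluded).
    Periodic peaks would reflect repeats like collagen's Xaa-Yaa-Gly.
    The window must fit within the k-mer length."""
    center = len(kmers[0]) // 2
    counts = np.zeros(2 * window + 1)
    for kmer in kmers:
        for offset in range(-window, window + 1):
            if offset != 0 and kmer[center + offset] == ptm_char:
                counts[offset + window] += 1
    return counts / len(kmers)
```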

Figure 3. Frequency of nearby PTM sites for three representative PTMs: serine or threonine phosphorylation, asparagine N-glycosylation, and proline hydroxylation. (A) All-organism datasets. (B) Human-only datasets. (C) Interpretability of the human serine or threonine phosphorylation dataset using integrated gradients. Integrated gradients were used to cluster k-mers; for select clusters (1, 2, and 6), the PSIG was superimposed over the amino acid frequency at each position.

To understand how our model uses nearby PTMs, we developed an interpretability module for Sitetack using integrated gradients (Sundararajan et al. 2017) summed over each position of the input. With this module, we can see which residues in an input are important to the model for its final prediction, based on the intensity of the integrated gradients at each position. To illustrate the utility of this module, we evaluated the independent test set of phosphorylation (S,T) in humans with our interpretability module and clustered the position-summed integrated gradients (PSIG) for each input sequence using t-SNE and k-means clustering with 10 clusters (Fig. 3C) (Lloyd 1982, van der Maaten and Hinton 2008). Of the 10 clusters, eight had an enrichment of phosphoserine or phosphothreonine at some position; we chose three examples and plotted the average PSIG against the amino acid frequency in the k-mers for each cluster, as shown in Fig. 3C. In cluster 2, we see a sharp peak in the PSIG at the +1 position that correlates with the presence of proline, as reported previously (Lu et al. 2002, Yan et al. 2023), showing that our model and interpretability module pick up known motifs. In clusters 1 and 6, we see similar peaks in the PSIG that correspond to the presence of phosphorylated residues in the context of acidic residues (cluster 1) or bulky charged residues (cluster 6). These results illustrate that our model recognizes distinct amino acid motifs and that phosphoserine and phosphothreonine are important components of phosphorylation motifs. Finally, our model not only gives insight into motifs that might be important for phosphorylation, but also provides a method to explain predictions that could inspire and enlighten mechanistic studies.

3.4 Extension to kinase-specific models

The human kinome contains >500 kinases (Zhang et al. 2021). Accordingly, the identity of the kinase that catalyzes a particular phosphorylation event is useful information. Toward this goal, we developed kinase-specific models (Wang et al. 2017, Yang et al. 2021) to determine whether or not nearby phosphorylation sites were important for prediction at the enzymatic level.

The datasets are necessarily much smaller for a particular kinase than for phosphorylation in general. We envisioned that a reasonable approach would be to predict sites that might be phosphorylated in a protein of interest and then to see which kinase might be responsible for the phosphorylation of each highly scoring site. We were not interested in which kinase is responsible in natural systems per se, as that question also requires consideration of which phosphatases might be present and would entail a smaller training set. Instead, we asked a more prudent question: Which kinases could target a particular site? To answer this question, we selected a dataset from a publication by Ishihama and coworkers (Sugiyama et al. 2019) because of its breadth in the number of kinases assayed, the number of sites reported, and the use of an in vitro assay. From this dataset, we looked at 68 kinases that target serine or threonine residues and that had over 500 reported sites (Supplementary Fig. S9.1). For each of these kinases, we trained models with and without known phosphorylation sites. Overall, we saw that the inclusion of known phosphorylation sites improved kinase-specific prediction, with some kinases more affected than others (Fig. 4, Supplementary Section S9.2). To understand why some kinase models were influenced more by the inclusion of known sites, we looked at three examples: NEK4, which had the largest difference in AUC between the two models; IKKε, which was near the median difference in AUC; and PKA Cα, which had the smallest difference in AUC. For these models, the number of nearby phosphorylation sites greatly influenced the performance of the labeled model (Fig. 4B).

Figure 4. Phosphorylation (S,T) prediction using kinase-specific models. (A) Difference in AUC with and without known site locations. (B) Frequencies of nearby sites in the dataset. (C) AUC and AUPRC curves.

3.5 Cross PTM models

Having shown that encoding known PTM locations of a given type as a separate amino acid can improve prediction accuracy for that PTM, we wondered if including information about a different PTM can improve prediction accuracy. To explore this crosstalk, we chose O-glycosylation with GlcNAc and phosphorylation at serine and threonine, which generally occur in the same subcellular compartments and at the same amino acid residues (serine and threonine) (Hart et al. 2011, van der Laarse et al. 2018, Xu et al. 2023). Moreover, both had large datasets, which we suspected was important for detecting an impact on prediction accuracy (Ma et al. 2021). We chose to look at O-GlcNAc prediction using phosphorylation sites, and not the converse, because the phosphorylation dataset was much larger. For this task, we trained two separate models with known PTM locations: one in which known phosphorylation sites were encoded as a separate amino acid (but not known O-GlcNAc sites) and another in which known O-GlcNAc sites were encoded as a separate amino acid (but not phosphorylation sites). In the all-organism datasets, both phosphorylation sites and O-GlcNAc sites improved prediction, though O-GlcNAc sites did so to a much greater extent (Fig. 5A and C). There was an improvement, but not a significant one, in the human-only datasets (Supplementary Section S10). We found that phosphorylation sites were present near the prediction residue, but not at the level at which O-GlcNAc sites occurred (Fig. 5B), which offers an explanation for O-GlcNAc sites showing a larger improvement. Thus, known PTM locations are able to improve the prediction of a different PTM.

Figure 5. O-GlcNAc prediction using known O-GlcNAc or phosphorylation sites. (A) Performance assessments with and without labels. (B) Frequencies of nearby sites in the dataset. (C) AUC and AUPRC curves.

3.6 Accessibility

To make this tool accessible to researchers, we incorporated our best models into a free web tool that is accessible at https://sitetack.net. Users can input multiple sequences of interest and select a model to use for PTM prediction. For sequences with known PTM locations, there are instructions for each model on how to encode that information (e.g. a phosphoserine is encoded as “@”). Results are displayed on the page for a single input, and multiple inputs can be downloaded as a CSV file. The website includes models for every PTM from this work, trained with and without known PTM locations.

Our method of encoding known PTMs in prediction models has broad applicability beyond what is reported herein, extending to newly identified PTMs, different organisms, and context-specific models. To support this ever-developing field, we created a TYOM (train your own model) script for researchers to generate their own models for any PTM dataset. We envision these scripts as being useful for researchers who have compiled sites for a given enzyme, rare PTM, or temporal dataset that was not evaluated here, allowing them to train custom prediction models. As a demonstration of biological relevance, we trained a model to predict phosphorylation changes upon physical exercise using the dataset from Parker and coworkers (Blazev et al. 2022), and we used our method to predict differential phosphorylation in different contexts (Supplementary Section S12). Using these context-specific models, we were able to predict the differential phosphorylation of troponin-C (TNNC2), an important protein for skeletal muscle contraction, at position 92, which is indeed a known phosphorylation site. This functionality is available at https://github.com/clair-gutierrez/sitetack, along with step-by-step instructions.

4 Discussion

This work demonstrates that encoding known PTM locations as separate amino acids is a relatively simple way to improve the performance of deep learning PTM prediction models for many PTMs. From a biochemical perspective, this finding is not surprising, given that PTMs alter the chemical, structural, and functional landscape of proteins (Walsh et al. 2005, Ryšlavá et al. 2013). We found that extra data were especially useful in prediction tasks with large datasets like phosphorylation, which is intelligible from a technical standpoint because a certain level of coverage is necessary for PTMs to be near each other. This reliance on the size of datasets is also a limitation, as smaller datasets have less PTM coverage and tend to provide less improvement unless there is significant clustering of PTM sites, as in the kinase-specific datasets. Due to limitations in current structural datasets for proteins with PTMs, we did not include structural inputs in Sitetack. In future iterations, we hope to explore how including known PTMs could improve protein structure-based models.

We found that kinase-specific models generally benefited from the inclusion of known phosphorylation locations more so than did most other models. This benefit might be due to the sensitivity of protein kinases to amino acid sequence (Johnson et al. 2023) and that the modification installs a compact dianionic functional group. We anticipate that future work on the development of kinase- or kinase family-specific models would benefit greatly by including known phosphorylation sites. In addition, context-specific information could be learned with models trained on phosphorylation levels in different contexts (e.g. cell types, hypoxia, or stress states).

We also showed that phosphorylation sites can improve the prediction of O-glycosylation via O-GlcNAc. These models did not perform especially well for O-GlcNAc prediction but were still significantly better than models that omitted phosphorylation data. In future work, we foresee probing which PTMs affect the prediction of other biologically relevant PTMs, such as ubiquitination. This crosstalk underscores the importance of PTM selection in prediction tasks, as the choice of which PTMs to include can affect the resulting prediction.

Including PTMs in training datasets encoded as unique amino acids has broad utility beyond PTM prediction. PTMs are known to affect the structure, function, and activity of proteins, and their inclusion could improve other models ranging from structure prediction to de novo protein design. In addition, many current PTM models lack temporal information because many modifications, such as phosphorylation, N-glycosylation, and O-glycosylation, are physiologically reversible. Tools that reveal which modifications are likely in different contexts would provide richer, situational information about proteins of interest. We conclude that the encoding of PTM locations can improve the reliability of deep learning PTM prediction models and is poised for beneficial application to a variety of prediction tasks.

Supplementary Material

btae602_Supplementary_Data

Acknowledgements

We thank Professors Forest M. White and Matthew D. Shoulders, Dr Volga Kojasoy, and Oscar J. Molina for helpful discussions and manuscript review. This work would not have been possible without the MIT Engaging computational cluster and the staff that maintains it. Figure 1 and Supplementary Fig. S3.1 were created in part with BioRender.

Contributor Information

Clair S Gutierrez, Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02143, United States.

Alia A Kassim, Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, United States.

Benjamin D Gutierrez, Independent Researcher.

Ronald T Raines, Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, United States; Broad Institute of MIT and Harvard, Cambridge, MA 02143, United States; Koch Institute for Integrated Cancer Research at MIT, Cambridge, MA 02139, United States.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

This work was supported by the National Institutes of Health [R01 CA073808 and R35 GM148220]. C.S.G. was supported by a National Defense Science and Engineering Graduate Fellowship sponsored by the Air Force Research Laboratory. A.A.K. was supported by the Massachusetts Institute of Technology Undergraduate Research Opportunities Program and its Paul E. Gray (1954) Fund.

Data availability

Sitetack is available as a web tool at https://sitetack.net for protein PTM prediction using the CNN models reported herein. The source code, representative datasets, instructions for local use, and select models can be found at https://github.com/clair-gutierrez/sitetack.

References

  1. Barbour H, Nkwe NS, Estavoyer B. et al. An inventory of crosstalk between ubiquitination and other post-translational modifications in orchestrating cellular processes. iScience 2023;26:106276. [DOI] [PMC free article] [PubMed] [Google Scholar]
  2. Bateman A, Martin MJ, Orchard S. et al. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res 2023;51:D523–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Blazev R, Carl CS, Ng Y-K. et al. Phosphoproteomics of three exercise modalities identifies canonical signaling and C18ORF25 as an AMPK substrate regulating skeletal muscle function. Cell Metab 2022;34:1561–77.e9. [DOI] [PubMed] [Google Scholar]
  4. Chang CC, Tung CH, Chen CW. et al. SUMOgo: prediction of sumoylation sites on lysines by motif screening models and the effects of various post-translational modifications. Sci Rep 2018;8:15512–0. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Crooks GE, Hon G, Chandonia JM. et al. WebLogo: a sequence logo generator. Genome Res 2004;14:1188–90. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Ebert T, Tran N, Schurgers L. et al. Ageing—oxidative stress, PTMs and disease. Mol Aspects Med 2022;86:101099. [DOI] [PubMed] [Google Scholar]
  7. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  8. Hart GW, Slawson C, Ramirez-Correa G. et al. Cross talk between O-GlcNAcylation and phosphorylation: roles in signaling, transcription, and chronic disease. Annu Rev Biochem 2011;80:825–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Hermann J, Schurgers L, Jankowski V.. Identification and characterization of post-translational modifications: clinical implications. Mol Aspects Med 2022;86:101066. [DOI] [PubMed] [Google Scholar]
  10. Hou R, Wu J, Xu L. et al. Computational prediction of protein arginine methylation based on composition−transition−distribution features. ACS Omega 2020;5:27470–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  11. Hu F, Li W, Li Y. et al. O-GlcNAcPRED-DL: prediction of protein O-GlcNAcylation sites based on an ensemble model of deep learning. J Proteome Res 2024;23:95–106. [DOI] [PubMed] [Google Scholar]
  12. Huang H, Arighi CN, Ross KE. et al. iPTMnet: an integrated resource for protein post-translational modification network discovery. Nucleic Acids Res 2018;46:D542–50. [DOI] [PMC free article] [PubMed] [Google Scholar]
  13. Huang J, Wu M, Zhang Y. et al. OGP: a repository of experimentally characterized O-glycoproteins to facilitate studies on O-glycosylation. Genomics Proteomics Bioinf 2021;19:611–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  14. Johnson JL, Yaron TM, Huntsman EM. et al. An atlas of substrate specificities for the human serine/threonine kinome. Nature 2023;613:759–66. [DOI] [PMC free article] [PubMed] [Google Scholar]
  15. Karve TM, Cheema AK.. Small changes huge impact: the role of protein posttranslational modifications in cellular homeostasis and disease. J Amino Acids 2011;2011:1–13. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. Kingma DP, Ba JL. Adam: a method for stochastic optimization. arXiv 2017;1412.6980v9, preprint: not peer reviewed.
  17. Li W, Godzik A.. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. [DOI] [PubMed] [Google Scholar]
  18. Lloyd SP. Least squares quantization in PCM. IEEE Trans Inform Theory 1982;28:129–37. [Google Scholar]
  19. Lu KP, Liou YC, Zhou XZ.. Pinning down proline-directed phosphorylation signaling. Trends Cell Biol 2002;12:164–72. [DOI] [PubMed] [Google Scholar]
  20. Luo F, Wang M, Liu Y. et al. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics 2019;35:2766–73. [DOI] [PMC free article] [PubMed] [Google Scholar]
  21. Ma J, Li Y, Hou C. et al. O-GlcNAcAtlas: a database of experimentally identified O-GlcNAc sites and proteins. Glycobiology 2021;31:719–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  22. Pakhrin SC, Aoki-Kinoshita KF, Caragea D. et al. DeepNGlyPred: a deep neural network-based approach for human N-linked glycosylation site prediction. Molecules 2021;26:7314. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Pakhrin SC, Pokharel S, Aoki-Kinoshita KF. et al. LMNglyPred: prediction of human N-linked glycosylation sites using embeddings from a pre-trained protein language model. Glycobiology 2023;33:411–22. [DOI] [PubMed] [Google Scholar]
  24. Pan S, Chen R.. Pathological implication of protein post-translational modifications in cancer. Mol Aspects Med 2022;86:101097. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Pitti T, Chen CT, Lin HN. et al. N-GlyDE: a two-stage N-linked glycosylation site prediction incorporating gapped dipeptides and pattern-based encoding. Sci Rep 2019;9:15975–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Ramazi S, Zahiri J.. Post-translational modifications in proteins: resources, tools and prediction methods. Database 2021;2021:baab012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  27. Ryšlavá H, Doubnerová V, Kavan D. et al. Effect of posttranslational modifications on enzyme function and assembly. J Proteomics 2013;92:80–109. [DOI] [PubMed] [Google Scholar]
  28. Schaffert LN, Carter WG.. Do post-translational modifications influence protein aggregation in neurodegenerative diseases: a systematic review. Brain Sci 2020;10:232. [DOI] [PMC free article] [PubMed] [Google Scholar]
  29. Shoulders MD, Raines RT.. Collagen structure and stability. Annu Rev Biochem 2009;78:929–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  30. Sugiyama N, Imamura H, Ishihama Y.. Large-scale discovery of substrates of the human kinome. Sci Rep 2019;9:12. [DOI] [PMC free article] [PubMed] [Google Scholar]
  31. Sun S, Hu Y, Ao M. et al. N-GlycositeAtlas: a database resource for mass spectrometry-based human N-linked glycoprotein and glycosylation site mapping. Clin Proteom 2019;16:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  32. Sundararajan M, Taly A, Yan Q. Axiomatic attribution for deep networks. In: Proceedings of the 34th International Conference on Machine Learning. Sydney, Australia: PMLR, 2017.
  33. van der Laarse SAM, Leney AC, Heck AJR.. Crosstalk between phosphorylation and O-GlcNAcylation: friend or foe. FEBS J 2018;285:3152–67. [DOI] [PubMed] [Google Scholar]
  34. van der Maaten L, Hinton G.. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605. [Google Scholar]
  35. Walsh CT, Garneau-Tsodikova S, Gatto GJ. Jr. Protein posttranslational modifications: the chemistry of proteome diversifications. Angew Chem Int Ed Engl 2005;44:7342–72. [DOI] [PubMed] [Google Scholar]
  36. Wang D, Liang Y, Xu D.. Capsule network for protein post-translational modification site prediction. Bioinformatics 2019;35:2386–94. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Wang D, Liu D, Yuchi J. et al. MusiteDeep: a deep-learning based webserver for protein post-translational modification site prediction and visualization. Nucleic Acids Res 2020;48:W140–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  38. Wang D, Zeng S, Xu C. et al. MusiteDeep: a deep-learning framework for general and kinase-specific phosphorylation site prediction. Bioinformatics 2017;33:3909–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  39. Wu M, Yang Y, Wang H. et al. A deep learning method to more accurately recall known lysine acetylation sites. BMC Bioinformatics 2019;20:1–11. [DOI] [PMC free article] [PubMed] [Google Scholar]
  40. Xu S, Suttapitugsakul S, Tong M. et al. Systematic analysis of the impact of phosphorylation and O-GlcNAcylation on protein subcellular localization. Cell Rep 2023;42:112796. [DOI] [PMC free article] [PubMed] [Google Scholar]
  41. Yan Y, Jiang JY, Fu M. et al. MIND-S is a deep-learning prediction model for elucidating protein post-translational modifications in human diseases. Cell Rep Methods 2023;3:100430. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Yang H, Wang M, Liu X. et al. PhosIDN: an integrated deep neural network for improving protein phosphorylation site prediction by combining sequence and protein–protein interaction information. Bioinformatics 2021;37:4668–76. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Zhang H, Cao X, Tang M. et al. A subcellular map of the human kinome. Elife 2021;10:34943. [DOI] [PMC free article] [PubMed] [Google Scholar]


