Abstract
Accurate identification of phosphorylation sites is fundamental for advancing the study of protein phosphorylation, which plays a critical role in elucidating protein function and facilitating rational drug design. While mass spectrometry-based experimental techniques are considered an effective method for phosphorylation site identification, their widespread application is often hindered by high costs, limited throughput, and the requirement for specialized equipment. Computational prediction methods have become a popular alternative strategy for providing candidate phosphorylation sites by inferring their presence from the entire protein sequence. However, many current methods struggle to extract sufficient contextual features, limiting their predictive reliability in complex biological settings. To address these issues, we present a novel protein large language model (pLLM)-based model, called ProtPSP, for phosphorylation site prediction. By leveraging a pretrained pLLM, ProtPSP effectively captures complex sequence contexts that conventional models trained on limited phosphorylation data may miss. Comprehensive benchmarking on curated data sets demonstrates that ProtPSP achieves reliable phosphorylation site prediction for both serine/threonine and tyrosine sites, outperforming other commonly used methods across multiple evaluation metrics. Ablation studies substantiate the critical contribution of both pLLM-driven features and the fusion model architecture to overall performance improvements. Moreover, case studies demonstrate that ProtPSP identifies all true phosphorylation sites in the examined proteins, underscoring its significant potential as a complementary approach to mass spectrometry-based techniques in biomedical research and drug discovery.
1. Introduction
Protein phosphorylation, as one of the most extensively studied post-translational modifications (PTMs), plays a crucial role in life activities. By dynamically regulating protein function, localization, and interactions, it is widely involved in various biological processes such as metabolism, signaling, gene expression, and cell cycle progression. Numerous studies have demonstrated that dysregulated phosphorylation is intimately associated with the onset and progression of various diseases. Consequently, unraveling the molecular underpinnings of protein phosphorylation has emerged as a central theme in biomedical research, providing critical insights into the development of innovative approaches to disease prevention, diagnosis, and targeted therapy.
Accurate identification of phosphorylation sites within protein sequences is a fundamental prerequisite for comprehensive studies of protein phosphorylation, which can be achieved through either experimental or computational approaches. Mass spectrometry-based techniques are widely employed experimental approaches for the reliable identification of phosphorylation sites, and significant progress has been made in this domain, as evidenced by numerous recent studies. For example, Lyu et al. developed a pseudotargeted mass spectrometry method based on parallel reaction monitoring, which enables sensitive identification and quantification of phosphopeptides and yields more sites than conventional experimental approaches. However, these mass spectrometry-based methods are often constrained by factors such as high operational costs, limited throughput, and reliance on specialized instrumentation. In contrast, computational methods offer an effective alternative by leveraging protein sequence information to predict phosphorylation sites. These approaches typically involve the development of predictive models trained on manually annotated data sets, under the assumption that the specificity of phosphorylation events is associated with the structural context surrounding potential modification sites. Due to their scalability, high throughput, and lack of reliance on specialized instruments, computational strategies serve as valuable complements to experimental methods for identifying protein phosphorylation sites.
Recently, several computational methods have been developed for the prediction of phosphorylation sites, which can be roughly divided into machine learning and deep learning methods. Machine learning approaches typically involve the design of effective handcrafted features, followed by the construction of a classifier, to predict protein phosphorylation sites. For example, Musite encodes sequence features including k-nearest neighbor scores, protein disorder scores, and amino acid frequency matrices. An ensemble of multiple support vector machines (SVMs) is then employed to predict phosphorylation sites. Despite these advancements, machine learning approaches remain constrained by their reliance on domain-specific knowledge for manual feature engineering, resulting in suboptimal prediction performance given the current incompleteness of biological understanding. Deep learning methods employing data-driven strategies can automatically extract features relevant to the specificity of phosphorylation events from protein sequences and have demonstrated superior predictive performance compared to traditional machine learning approaches. For example, Luo et al. proposed DeepPhos, which utilizes densely connected convolutional neural network (CNN) blocks to capture diverse representations of sequences, with the final phosphorylation site prediction achieved through intragroup and intergroup crosstalk layers. Guo et al. introduced a bidirectional long short-term memory (BiLSTM)-based model, DeepPSP, which, for the first time, incorporated the complete protein sequence as global information to improve the performance of phosphorylation site prediction. Building upon these foundational works, more recent models such as LMPhosSite and AttenPhos have further optimized the input features and model architecture to enhance phosphorylation site prediction performance.
Although current methods have exhibited commendable performance in phosphorylation site prediction, their limited capacity to model long-range dependencies and other critical sequence-level features still limits their prediction accuracy in complex biological scenarios.
The emergence of protein large language models (pLLMs) presents new opportunities to further enhance the performance of phosphorylation site prediction. These models, such as ProtT5, are built upon highly scalable Transformer-based architectures and are pretrained on extensive data sets comprising approximately 2.5 billion protein sequences sourced from the UniRef50 database. Through self-supervised training, the pretrained pLLMs can capture complex sequence dependencies and generate context-aware representations that significantly outperform traditional sequence encoding approaches. Nonetheless, the application of pLLMs is currently constrained by insufficient task-specific optimization. There is a pressing need to develop a pLLM-based model with a tailored architecture to enhance predictive reliability in phosphorylation site identification.
In this study, we introduce ProtPSP, a novel pLLM-based framework for phosphorylation site prediction. By integrating the pretrained ProtT5 model into the architecture, the proposed ProtPSP is able to extract informative sequence features related to the specific phosphorylation event. To further enhance the quality of the pLLM-driven features, ProtPSP incorporates a squeeze-and-excitation (SE-Net) block, a dual-branch fusion module composed of bidirectional LSTM and Transformer layers, and a Bahdanau attention mechanism. Quantitative evaluations on our curated data sets demonstrate that the proposed ProtPSP framework achieves superior phosphorylation site prediction performance across sensitivity (SN), specificity (SP), F1-Score, and Matthews correlation coefficient (MCC), outperforming existing methods such as Musite, DeepPhos, DeepPSP, LMPhosSite, and AttenPhos in most cases. Ablation studies are conducted to assess the contributions of both the incorporation of pLLMs and the designed model architecture. The results indicate that each component significantly enhances the phosphorylation site prediction performance, demonstrating the effectiveness of the proposed ProtPSP. Additionally, case studies involving three recently published protein sequences are carried out to further evaluate the practical utility of various models. ProtPSP is the only method to consistently achieve optimal sensitivity. These findings suggest that ProtPSP has a strong potential to become a widely adopted tool in biomedical research.
2. Results and Discussion
2.1. ProtPSP Outperforms Other Methods on Benchmark Data Set
To demonstrate the superior performance of the proposed ProtPSP model in phosphorylation site prediction, we compare it with five widely used methods: Musite, DeepPhos, DeepPSP, LMPhosSite, and AttenPhos, as illustrated in Figure 1. All evaluations are conducted on our curated benchmark data set. Detailed descriptions and implementation specifics for each baseline method are provided in Material S1. Among the evaluated approaches, Musite, representative of traditional machine learning-based methods, exhibits the weakest predictive performance, achieving an SN of 0.7100, SP of 0.5590, F1-Score of 0.6601, and MCC of 0.2721 for S/T sites and an SN of 0.6640, SP of 0.5312, F1-Score of 0.6227, and MCC of 0.1971 for Y sites. These results highlight the limitations of traditional machine learning methods in capturing the complex sequence patterns relevant to phosphorylation site prediction. In contrast, recent deep learning-based models, such as AttenPhos, DeepPhos, DeepPSP, LMPhosSite, and ProtPSP, consistently outperform Musite across all evaluation metrics, demonstrating the significant advantage conferred by deep learning architectures in modeling complex sequence relationships and extracting discriminative features relevant to the specific phosphorylation events. Notably, the proposed ProtPSP method exhibits superior performance compared to previous deep learning methods in most cases. For S/T sites, the ProtPSP model attains an SN of 0.7210, SP of 0.7505, F1-Score of 0.7318, and MCC of 0.4717. For Y sites, the ProtPSP model produces an SN of 0.7236, SP of 0.6589, F1-Score of 0.7009, and MCC of 0.3834. For both site types, ProtPSP yields the highest F1-Score and MCC, highlighting its more reliable predictive performance in phosphorylation site prediction.
Figure 1. Comparative results of SN, SP, F1-Score, and MCC for Musite, AttenPhos, DeepPhos, DeepPSP, LMPhosSite, and ProtPSP in phosphorylation site prediction. (A) S/T sites; (B) Y sites.
Additionally, we plotted the ROC curves for all models on both S/T and Y sites, as shown in Figure S1. The proposed ProtPSP attained the highest AUC values, 0.814 for S/T sites and 0.765 for Y sites, respectively, surpassing all other evaluated approaches. These results further substantiate the enhanced discriminative power and generalization capability of ProtPSP in phosphorylation site prediction.
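For reference, the four threshold-based metrics reported above follow from the confusion-matrix counts. The sketch below uses the standard definitions; the function name `binary_metrics` and the example counts are hypothetical, chosen only so that SN and SP equal the values reported for ProtPSP on S/T sites under an assumed balanced set of 10,000 positive and 10,000 negative sites.

```python
import math


def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Return SN, SP, F1-Score, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)               # sensitivity: recall on positive sites
    sp = tn / (tn + fp)               # specificity: recall on negative sites
    f1 = 2 * tp / (2 * tp + fp + fn)  # harmonic mean of precision and recall
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "F1": f1, "MCC": mcc}


# Hypothetical counts (assumption): a balanced set with SN = 0.7210, SP = 0.7505.
example = binary_metrics(tp=7210, fn=2790, tn=7505, fp=2495)
print({name: round(value, 4) for name, value in example.items()})
```

With these assumed balanced counts, the formulas give an F1-Score of about 0.7318 and an MCC of about 0.4717. The actual class balance used for evaluation is not stated in this section, so the counts should be read as illustrative only.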
2.2. Critical Role of pLLMs in Enhancing Phosphorylation Site Prediction
To evaluate the contribution of the pLLM, we conduct ablation experiments to systematically assess its impact on phosphorylation site prediction. Specifically, three distinct variants of the ProtPSP framework are compared on the testing data set: ProtPSP without the global input, ProtPSP without the pLLM component, and the full ProtPSP model, as shown in Figure 2A. For S/T sites, the ProtPSP model lacking the global input exhibits notably reduced performance compared to the full ProtPSP architecture, with an F1-Score of 0.7037, MCC of 0.4338, and AUC of 0.7926. These results underscore the importance of global information in enhancing phosphorylation site prediction, which aligns with findings reported in previous studies. Furthermore, the ProtPSP variant without the pLLM component achieves an F1-Score of 0.7214, MCC of 0.4393, and AUC of 0.7983. The observed decline in predictive performance further demonstrates the importance of integrating pLLMs for improving the accuracy of phosphorylation site prediction. Similarly, for Y sites, the variants without the global input and without the pLLM component decrease to F1-Scores of 0.6806 and 0.6946, MCCs of 0.3612 and 0.3542, and AUCs of 0.7498 and 0.7461, respectively. To further highlight the contribution of pLLM features to phosphorylation site prediction, SHAP analysis is performed to calculate the contribution of each feature, as shown in Figure S2. The SHAP beeswarm plot shows that eight of the ten most influential features are derived from the pLLM, including all of the top five. These results suggest the critical contribution of pLLMs in advancing the accuracy of phosphorylation site prediction.
Figure 2. Ablation study results assessing the impact of pLLM on the testing set. (A) Performance metrics including SN, SP, F1-Score, MCC, and AUC for three ProtPSP model variants: the full ProtPSP model, the ProtPSP model without global inputting, and the ProtPSP model without the pLLM component, evaluated on both S/T and Y phosphorylation sites. UMAP visualizations of feature representations for the three ProtPSP variants on the (B) S/T and (C) Y sites.
In addition, we analyze the features extracted from the three distinct variants of the ProtPSP model. Here, uniform manifold approximation and projection (UMAP) is used to project the high-dimensional features extracted from the three variants into a 2-dimensional embedding space, as shown in Figure 2B,C. The resulting scatter plots illustrate that the full ProtPSP model yields a markedly clearer separation between phosphorylation and nonphosphorylation sites compared to variants lacking either the global inputting or the pLLM component. For a quantitative assessment, we evaluate the clustering quality of the learned feature representations using the Silhouette Coefficient (SC), the Calinski–Harabasz Index (CHI), and the Davies–Bouldin Index (DBI), with results summarized in Table S1. The full ProtPSP model achieves an SC of 0.089, a CHI of 3511.602, and a DBI of 3.069, reflecting superior clustering performance across these metrics. These results further underscore the efficacy of integrating both global contextual information and pLLMs in substantially improving the accuracy of phosphorylation site prediction.
2.3. Impact of Designed Model Architecture on Phosphorylation Site Prediction
Previous studies have frequently employed BiLSTM and Transformer architectures to extract context-aware features from protein sequences for phosphorylation site prediction. In this study, we design a BiLSTM-Transformer fusion block, aiming to combine the complementary advantages of sequential modeling and global attention mechanisms to further enhance prediction performance. To quantitatively evaluate the contribution of this fusion strategy, we conduct an ablation study that compares the full fusion model with two baseline variants: a BiLSTM-only variant and a Transformer-only variant. As shown in Figure 3A, both variants demonstrate comparable predictive performance. Specifically, for S/T sites, the BiLSTM-only variant attains an F1-Score of 0.7111, an MCC of 0.4523, and an AUC of 0.8055, and for Y sites, it achieves an F1-Score of 0.6621, an MCC of 0.3743, and an AUC of 0.7586. The Transformer-only variant yields an F1-Score of 0.7117, an MCC of 0.4450, and an AUC of 0.7997 for S/T sites and an F1-Score of 0.6859, an MCC of 0.3742, and an AUC of 0.7577 for Y sites. Both variants demonstrate lower predictive performance compared to the full ProtPSP model. By leveraging the ability of BiLSTM to capture sequential dependencies and the capability of Transformer to model complex contextual relationships, the fusion approach enables more comprehensive feature extraction, thereby achieving superior performance in phosphorylation site prediction.
Figure 3. Ablation study results assessing the impact of BiLSTM–Transformer fusion on the testing set. (A) Performance metrics including SN, SP, F1-Score, MCC, and AUC for three ProtPSP model variants: the full BiLSTM–Transformer fusion model, BiLSTM-only, and Transformer-only implementations, evaluated on S/T and Y phosphorylation sites. UMAP visualizations of feature representations for the three ProtPSP variants on (B) S/T and (C) Y sites.
Furthermore, we visualize the two-dimensional embedding vectors by applying UMAP to the features generated by the three distinct ProtPSP model variants, as illustrated in Figure 3B,C. Notably, the BiLSTM–Transformer fusion block achieves better separation of the data points compared to both the BiLSTM-only and Transformer-only variants, as evidenced by an SC of 0.089, a CHI of 3511.602, and a DBI of 3.069 in Table S2. These metrics collectively indicate a superior clustering performance for the fusion model. The BiLSTM architecture, owing to its gated design, excels at capturing local temporal dependencies but may suffer from information degradation when modeling long-range relationships. In contrast, Transformers, equipped with multihead attention, are adept at modeling long-range dependencies by capturing interactions between arbitrary positions, yet they may be less effective in detecting local temporal patterns. Thus, the integration of BiLSTM and Transformer architectures allows the model to synergistically leverage the strengths of both approaches, thereby facilitating multiscale sequence modeling and ultimately improving the accuracy of phosphorylation site prediction.
2.4. Case Studies Demonstrate the Reliability of ProtPSP in Practical Applications
To demonstrate the reliability of the proposed ProtPSP in accurately identifying true phosphorylation sites, we conduct case studies on three recently published proteins, P04179 (June 18, 2025), P32754 (June 18, 2025), and A0A1W2PQ27 (June 18, 2025). Here, several widely adopted phosphorylation site prediction methods, including Musite, DeepPhos, DeepPSP, LMPhosSite, and AttenPhos, are included for comparison. To ensure a fair comparison, we consider sites with a confidence score greater than 50% as positive. Table 1 provides a detailed comparison of the phosphorylation site positions predicted by each algorithm for the selected proteins. Notably, ProtPSP is the only method to accurately identify all true phosphorylation sites in both P04179 and P32754, demonstrating its superior sensitivity. In the case of A0A1W2PQ27, ProtPSP, DeepPhos, DeepPSP, and AttenPhos all detect every true phosphorylation site, but ProtPSP distinguishes itself by yielding the fewest false positive predictions among them (8, compared with 9, 10, and 12, respectively).
Table 1. Prediction Results of Six Different Methods.
| protein ID | position of true site | model | position of predicted site |
|---|---|---|---|
| A0A1W2PQ27 | 3, 4, 5, 14, 32 | ProtPSP | 3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 61, 65 |
| | | DeepPhos | 3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 59, 61, 65 |
| | | LMPhosSite | 3, 4, 5, 24, 32, 35, 38, 40, 55, 61, 65 |
| | | DeepPSP | 3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 59, 60, 61, 65 |
| | | AttenPhos | 3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 60, 61, 64, 65, 76, 77 |
| | | Musite | 24, 32, 35, 38, 40, 165, 55, 65, 76, 116, 194 |
| P04179 | 3, 19, 58, 79, 106, 127, 200 | ProtPSP | 3, 9, 10, 19, 22, 33, 35, 58, 65, 69, 79, 99, 106, 127, 139, 200, 217 |
| | | DeepPhos | 9, 10, 19, 22, 58, 69, 127 |
| | | LMPhosSite | 19, 22, 27, 33, 35, 58, 69, 79, 99, 103, 106, 127, 136, 189, 190, 200, 217 |
| | | DeepPSP | 9, 10, 19, 22, 58, 127 |
| | | AttenPhos | 3, 9, 10, 19, 22, 127, 136, 217 |
| | | Musite | 9, 19, 27, 33, 35, 58, 69, 106, 127, 217 |
| P32754 | 4, 105, 211, 215, 219, 221, 222, 223, 226, 232, 235, 250, 293, 295, 296, 366 | ProtPSP | 2, 3, 4, 5, 36, 47, 52, 54, 59, 105, 138, 139, 142, 143, 145, 152, 160, 200, 211, 215, 219, 221, 222, 223, 226, 232, 235, 250, 255, 290, 293, 294, 295, 296, 326, 331, 337, 345, 366, 382, 386 |
| | | DeepPhos | 2, 3, 4, 5, 36, 47, 105, 139, 221, 223, 226, 232, 235, 250, 255, 258, 290, 293, 294, 295, 296, 326, 331, 366, 382, 386 |
| | | LMPhosSite | 2, 3, 4, 5, 38, 47, 139, 152, 160, 211, 215, 221, 222, 223, 226, 232, 235, 250, 255, 258, 290, 293, 294, 295, 296, 331, 326, 366, 382, 386 |
| | | DeepPSP | 2, 3, 4, 5, 21, 36, 47, 52, 54, 59, 105, 152, 221, 222, 223, 226, 232, 235, 255, 258, 290, 293, 294, 295, 296, 326, 331, 366, 382, 386 |
| | | AttenPhos | 2, 3, 4, 5, 105, 139, 232, 235, 250, 255, 258, 290, 293, 294, 295, 296, 305, 326, 331, 386 |
| | | Musite | 2, 4, 5, 47, 52, 72, 105, 221, 250, 255, 258, 290, 296, 382, 386 |
These findings collectively underscore the robust performance and practical applicability of ProtPSP in phosphorylation site prediction. The enhanced accuracy and reduced false negative rate highlight its potential utility as a valuable tool for large-scale phosphoproteomic analyses and for guiding experimental validation in protein PTM research.
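The per-method tallies behind the case-study comparison can be recomputed from the site positions listed in the table above. The following sketch does this for A0A1W2PQ27 with plain set operations; the helper name `tally` and the variable names are illustrative only.

```python
# Site positions for A0A1W2PQ27, copied from the case-study table.
TRUE_SITES = {3, 4, 5, 14, 32}
PREDICTED = {
    "ProtPSP": {3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 61, 65},
    "DeepPhos": {3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 59, 61, 65},
    "LMPhosSite": {3, 4, 5, 24, 32, 35, 38, 40, 55, 61, 65},
    "DeepPSP": {3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 59, 60, 61, 65},
    "AttenPhos": {3, 4, 5, 14, 19, 24, 32, 35, 38, 40, 55, 60, 61, 64, 65, 76, 77},
    "Musite": {24, 32, 35, 38, 40, 165, 55, 65, 76, 116, 194},
}


def tally(true_sites: set, predicted: set) -> dict:
    """Count recovered true sites (TP), missed true sites (FN), and extra calls (FP)."""
    return {
        "TP": len(true_sites & predicted),
        "FN": len(true_sites - predicted),
        "FP": len(predicted - true_sites),
    }


COUNTS = {method: tally(TRUE_SITES, sites) for method, sites in PREDICTED.items()}
for method, counts in COUNTS.items():
    print(f"{method:<11} {counts}")
```

Note that predictions outside the annotated true sites are counted here as false positives only relative to the current annotation; some may correspond to sites that have not yet been validated experimentally.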
3. Conclusions
Accurate identification of protein phosphorylation sites is a critical first step in the investigation of protein phosphorylation mechanisms. Mass spectrometry-based proteomics remains the gold standard for the precise localization of phosphorylation sites, owing to its high sensitivity and ability to generate site-specific information. In recent years, advancements in mass spectrometry technologies, such as tandem mass spectrometry coupled to enrichment strategies, have significantly improved the detection and characterization of phosphopeptides. However, the widespread adoption of mass spectrometry-based approaches is still constrained by operational costs, limited throughput, and the need for specialized instrumentation. To address these challenges, we introduce ProtPSP, a novel phosphorylation site prediction method incorporating a large-scale pLLM. Comprehensive evaluations on benchmark data sets reveal that ProtPSP delivers consistently superior predictive performance across multiple metrics, outperforming widely adopted methods. This enhanced performance suggests ProtPSP as a robust tool for high-throughput identification of protein phosphorylation sites, positioning it as a valuable computational complement to mass spectrometry-based approaches.
The benefits of the proposed ProtPSP method can be summarized as follows. (1) Unlike previous methods that primarily utilize sequence-based features derived from the primary protein sequence, ProtPSP employs the pretrained pLLM, ProtT5, allowing for the extraction of informative structural features relevant to specific phosphorylation events, thereby improving predictive accuracy. (2) ProtPSP utilizes a carefully designed dual-branch architecture that integrates BiLSTM and Transformer networks, effectively combining the strengths of sequential representation learning and global attention mechanisms to further improve prediction accuracy. (3) ProtPSP does not rely on phosphorylation-specific biological prior assumptions, thereby offering greater flexibility and generalizability. As a result, ProtPSP can be readily extended to other PTM site prediction tasks.
However, current pLLMs may exhibit hallucination, generating predictions that, while seemingly plausible, are not supported by biological evidence. This challenge is particularly pertinent when considering integration with mass spectrometry, which remains the authoritative approach for phosphorylation site identification due to its direct empirical validation. Future research could focus on incorporating additional biological priors or deploying uncertainty estimation techniques to further enhance the reliability of phosphorylation site prediction. The utility of computational predictions, such as those produced by ProtPSP, can be greatly enhanced when they are used to guide or prioritize MS-based experimental workflows, thereby improving phosphopeptide detection efficiency and facilitating the targeted validation of predicted sites.
In summary, by harnessing the representational capacity of pretrained protein language models, the ProtPSP framework serves as an effective computational complement to mass spectrometry-based approaches for phosphorylation site identification, thereby offering substantial potential to accelerate advancements in proteomics, structural biology, and high-throughput drug discovery pipelines.
4. Materials and Methods
4.1. Data Set Construction
In this study, we curate a comprehensive data set from four databases, Swiss-Prot, dbPTM, PhosphoELM, and PhosphoSitePlus, for the development and evaluation of the proposed ProtPSP model. The data set construction process comprises three main stages: data collection, quality control, and redundancy reduction. Detailed methodologies for each stage are outlined in Material S2. Subsequently, the protein sequences are randomly partitioned into training and testing sets at a ratio of 9:1. Potential phosphorylation sites, specifically serine (S), threonine (T), and tyrosine (Y) residues, are extracted from each sequence. For site annotation, residues experimentally validated as phosphorylation sites are assigned as positive samples, while all remaining S, T, and Y residues are considered to be negative samples. Previous studies have demonstrated that S and T residues can be phosphorylated by the same specific kinase. Accordingly, we construct separate site prediction models for S/T sites and Y sites. The distributions of positive and negative sites in the curated training and testing sets are summarized in Table 2 and Figure S3.
Table 2. Data Distribution of Positive and Negative Sites in the Training and Testing Sets.
| data set | residues | number of proteins | positive | negative |
|---|---|---|---|---|
| training set | S/T | 13437 | 297590 | 822389 |
| | Y | 13213 | 44755 | 163696 |
| testing set | S/T | 1493 | 32989 | 88163 |
| | Y | 1477 | 5113 | 18246 |
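The site annotation rule described in this subsection can be illustrated with a short sketch: every S, T, and Y residue is a candidate, experimentally validated positions are labeled positive, and the rest are labeled negative, with S and T grouped as in the S/T rows of the table above. The function name `label_candidates`, the toy sequence, and the validated positions are hypothetical.

```python
def label_candidates(sequence: str, validated_positions: set) -> dict:
    """Group S/T and Y candidates; positions are 1-based, label 1 means validated."""
    groups = {"S/T": [], "Y": []}
    for position, residue in enumerate(sequence, start=1):
        if residue in "ST":
            key = "S/T"
        elif residue == "Y":
            key = "Y"
        else:
            continue  # only S, T, and Y residues are candidate sites
        label = 1 if position in validated_positions else 0
        groups[key].append((position, residue, label))
    return groups


# Toy sequence and annotation (assumption), not taken from the curated data set.
toy = label_candidates("MSTAYKS", validated_positions={2, 5})
print(toy)
```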
4.2. Workflow of the ProtPSP Framework
The comprehensive workflow of the proposed ProtPSP framework for phosphorylation site prediction is depicted in Figure 4A. It comprises three primary stages: (1) sequence encoding, in which protein sequences are transformed into numerical vector representations; (2) generation of feature representation, where relevant features associated with potential phosphorylation sites are derived from the encoded representations; and (3) phosphorylation site prediction, where phosphorylation sites are identified based on the learned representation. Detailed descriptions of each stage are provided below.
Figure 4. Overview of the proposed ProtPSP for prediction of phosphorylation sites. (A) The ProtPSP model simultaneously processes both the global sequence $s_g$ and the local sequence $s_l$ as inputs. For the global sequence, a trainable embedding layer is first applied, followed by the addition of positional embeddings to obtain the global sequence representation $e_g$. For the local sequence, features are extracted using the pretrained pLLM model, ProtT5, and further refined via a CNN block to generate the local sequence representation $e_l$. Both $e_g$ and $e_l$ are subsequently passed through an SE-Net block, a BiLSTM-Transformer fusion module, and a Bahdanau attention mechanism, yielding the feature representations $c_g$ and $c_l$, respectively. These representations are then concatenated and input to a linear layer with Softmax activation to produce the final prediction $y$. (B) Architecture of the CNN block comprising three stacked convolutional layers. (C) Design of the SE-Net block, featuring grouped convolution followed by a squeeze-and-excitation operation. (D–F) The BiLSTM-Transformer fusion module contains two parallel branches: (E) a BiLSTM branch and (F) a Transformer branch. A dynamic fusion gate adaptively balances the contributions of both branches, while Bahdanau attention further highlights important feature representations. (G) Implementation details of the site prediction process.
4.2.1. Sequence Encoding
Following previous work, the global sequence $s_g$ and local sequence $s_l$ are extracted from the full protein sequence and subsequently converted into numerical vector representations. Here, the global sequence refers to the entire protein sequence containing the potential phosphorylation site, while the local sequence, with a fixed length of 21 residues, comprises the phosphorylation site itself flanked by ten upstream and ten downstream amino acids. Comparative analyses using window lengths of 17, 21, and 33 are conducted, and as summarized in Table S3, a window size of 21 residues yields the best performance.
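A minimal sketch of the local window extraction is given below. The text does not state how sites closer than ten residues to a terminus are handled, so padding with a placeholder character `"X"` is an assumption here, as are the function and parameter names.

```python
def local_window(sequence: str, site: int, flank: int = 10, pad: str = "X") -> str:
    """Return the (2 * flank + 1)-residue window centred on a 1-based site position.

    Assumption: windows that run past either terminus are padded with `pad`.
    """
    center = site - 1  # convert to a 0-based index
    left = sequence[max(0, center - flank):center]
    right = sequence[center + 1:center + 1 + flank]
    return (
        pad * (flank - len(left))
        + left
        + sequence[center]
        + right
        + pad * (flank - len(right))
    )


# A site at the N-terminus receives ten padding characters on its left.
print(local_window("S" + "A" * 30, site=1))
```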
For the global sequence $s_g$, considering the data distribution and the limitations of computational resources, it is standardized to a fixed length $N = 500$ by truncating sequences that exceed $N$ residues and padding shorter sequences with zeros. Each of the 20 amino acids is mapped to a unique integer in the range [1, 20], resulting in the encoded sequence $x_g = [x_1, x_2, \cdots, x_N]$, where $x_i \in \{1, 2, \cdots, 20\}$ for residue positions and 0 marks padded positions. The integer-encoded sequence is then input to a trainable embedding layer, which transforms each residue index into a dense vector representation. To incorporate positional information and encode the relative order of residues within the global sequence, positional embeddings are added to the output of the amino acid embedding layer. Formally, the encoding process for $s_g$ can be represented as

$$e_g = \mathrm{Embedding}(x_g; \delta) + \mathrm{PosEmbedding}(\mathrm{Pos}; \pi) \tag{1}$$

where $\delta$ and $\pi$ are trainable parameters within the embedding layers and $\mathrm{Pos} = \{1, 2, \cdots, N\}$ represents the positional indices. Further details are provided in Material S3. The resulting global sequence representation is denoted as $e_g \in \mathbb{R}^{500 \times 16}$.
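The integer encoding step for the global sequence can be sketched as follows. The specific residue-to-index assignment (alphabetical order of one-letter codes), the choice to keep the first $N$ residues when truncating, and the mapping of nonstandard residues to 0 are assumptions, since the text specifies only the index range, the fixed length, and zero padding.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 standard residues
INDEX = {residue: code for code, residue in enumerate(AMINO_ACIDS, start=1)}


def encode_global(sequence: str, length: int = 500) -> list:
    """Integer-encode a protein sequence, truncating or zero-padding to `length`."""
    # Assumption: truncation keeps the first `length` residues;
    # residues outside the 20 standard amino acids are mapped to 0.
    codes = [INDEX.get(residue, 0) for residue in sequence[:length]]
    return codes + [0] * (length - len(codes))


encoded = encode_global("ACDY")
print(encoded[:6], len(encoded))
```

The resulting integer vector is what a trainable embedding layer would consume; the embedding and positional embedding themselves are learned and are not reproduced here.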
For the local sequence $s_l$, the pretrained pLLM is leveraged to extract its complex sequence characteristics and evolutionary information. In this work, we utilize ProtT5, as it has demonstrated superior representational power across numerous public benchmarks, including tasks such as secondary structure prediction, contact prediction, and protein function annotation. The detailed architecture of ProtT5 is depicted in Figure S4. ProtT5 provides a high-dimensional context-aware embedding for each residue in the local sequence, resulting in an initial local sequence representation. To further enhance and condense this representation, a trainable CNN block including three stacked CNN layers is adopted, as shown in Figure 4B. The calculation is defined as follows:

$$e_l = \mathrm{CNN}(\mathrm{ProtT5}(s_l; \rho); \varepsilon) \tag{2}$$

where $\rho$ and $\varepsilon$ are the parameters within the ProtT5 model and the CNN block, respectively, and $\rho$ is kept frozen during training. This process yields a compact local sequence embedding $e_l \in \mathbb{R}^{21 \times 16}$.
Finally, the sequence encoding stage yields two outputs: the encoded global sequence $e_g \in \mathbb{R}^{500 \times 16}$ and the encoded local sequence $e_l \in \mathbb{R}^{21 \times 16}$.
4.2.2. Generation of Feature Representation
To effectively extract informative features from the encoded global sequence $e_g$ and local sequence $e_l$, each is processed through an identical architecture consisting of an SE-Net block, a BiLSTM–Transformer fusion block, and a Bahdanau attention mechanism, facilitating enhanced and comprehensive feature representation learning.
First, the encoded global sequence $e_g$ and local sequence $e_l$ are individually fed into the SE-Net block to facilitate cross-dimensional information interaction and adaptively recalibrate channel-wise feature responses, as shown in Figure 4C. The computational procedure of this block is defined as follows:

$$x_g = \mathrm{SENet}(e_g; \theta), \quad x_l = \mathrm{SENet}(e_l; \theta) \tag{3}$$

where $\theta$ denotes the trainable parameters within the SE-Net block. The SE-Net block consists of a grouped CNN followed by the squeeze-and-excitation operation. Comprehensive details regarding this computation are provided in Materials S4 and S5. The SE-Net block produces the refined global feature representation $x_g \in \mathbb{R}^{500 \times K_1}$ and local feature representation $x_l \in \mathbb{R}^{21 \times K_1}$, where $K_1 = 256$ denotes the predetermined number of CNN kernels utilized within the SE-Net block.
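To make the channel recalibration concrete, the sketch below implements a generic squeeze-and-excitation step on a length-by-channel feature map in plain Python. It omits the grouped CNN that precedes this step in ProtPSP, and the bias-free two-layer bottleneck with ReLU and sigmoid is the standard SE formulation assumed here rather than the exact configuration given in Materials S4 and S5.

```python
import math


def squeeze_excite(x: list, w1: list, w2: list) -> list:
    """Generic squeeze-and-excitation over an L x C feature map.

    x: L rows of C channel values; w1: R x C reduction weights;
    w2: C x R expansion weights (biases omitted for brevity).
    """
    length, channels = len(x), len(x[0])
    # Squeeze: average each channel over all sequence positions.
    z = [sum(row[c] for row in x) / length for c in range(channels)]
    # Excitation: bottleneck with ReLU, then a sigmoid gate per channel.
    hidden = [max(0.0, sum(w * v for w, v in zip(w_row, z))) for w_row in w1]
    scale = [
        1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(w_row, hidden))))
        for w_row in w2
    ]
    # Recalibrate: rescale every position channel-wise.
    return [[value * s for value, s in zip(row, scale)] for row in x]


# With all-zero weights every gate equals sigmoid(0) = 0.5.
demo = squeeze_excite([[1.0, 2.0], [3.0, 4.0]], w1=[[0.0, 0.0]], w2=[[0.0], [0.0]])
print(demo)
```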
Then, the outputs from the SE-Net block, x g and x l, serve as input to the designed BiLSTM–Transformer fusion block, which further captures bidirectional long-range sequence dependencies essential for phosphorylation site identification, as shown in Figure D. The computational procedure is defined as follows:
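$$h_{\mathrm{lstm}} = \mathrm{BiLSTM}_{\vartheta}(x), \qquad x \in \{x_g, x_l\} \tag{4}$$
$$h_{\mathrm{tr}} = \mathrm{TransformerEncoder}_{\varphi}(x), \qquad x \in \{x_g, x_l\} \tag{5}$$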
where ϑ and φ denote the trainable parameters of the BiLSTM and the Transformer encoder layer, respectively. The BiLSTM hidden size and the Transformer model dimension are both configured as 256, as shown in Figure E,F. Computational details for these layers are provided in Materials S6 and S7. Then, a dual-branch fusion operation integrates $h_{\mathrm{lstm}}$ and $h_{\mathrm{tr}}$ as follows:
| 6 |
where ⊕ represents feature concatenation and α ∈ [0,1] is a trainable scaling parameter that dynamically balances the contributions of the BiLSTM and Transformer feature representations. The block outputs two fused feature representations: the global feature representation $h^{g}_{\mathrm{fusion}} \in \mathbb{R}^{500 \times 512}$ and the local feature representation $h^{l}_{\mathrm{fusion}} \in \mathbb{R}^{21 \times 512}$.
Finally, Bahdanau attention is applied to $h^{g}_{\mathrm{fusion}}$ and $h^{l}_{\mathrm{fusion}}$ to determine the influence of different sequence positions on the target site. The calculation is defined as
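One plausible reading of the dual-branch fusion is a weighted concatenation in which α scales the BiLSTM branch and 1 − α scales the Transformer branch. The NumPy sketch below follows that reading; the exact functional form, the sigmoid used to keep α inside the unit interval, and the random stand-in branch outputs are all assumptions rather than the authors' stated implementation.

```python
import numpy as np

def fuse(h_lstm, h_tr, alpha_logit):
    """Concatenate two branches along the feature axis with complementary weights."""
    alpha = 1.0 / (1.0 + np.exp(-alpha_logit))  # assumed way to constrain alpha
    fused = np.concatenate([alpha * h_lstm, (1.0 - alpha) * h_tr], axis=-1)
    return fused, alpha

rng = np.random.default_rng(0)
GLOBAL_LEN, BRANCH_DIM = 500, 256  # dimensions from the text

h_lstm = rng.normal(size=(GLOBAL_LEN, BRANCH_DIM))  # stand-in BiLSTM output
h_tr = rng.normal(size=(GLOBAL_LEN, BRANCH_DIM))    # stand-in Transformer output

h_fusion_g, alpha = fuse(h_lstm, h_tr, alpha_logit=0.0)
print(h_fusion_g.shape)  # (500, 512)
```

Concatenating two 256-dimensional branches reproduces the 512-dimensional fused representation reported for both the global and local paths.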
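$$c_g = \mathrm{Attention}_{\tau}\big(h^{g}_{\mathrm{fusion}}\big), \qquad c_l = \mathrm{Attention}_{\tau}\big(h^{l}_{\mathrm{fusion}}\big) \tag{7}$$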
where τ denotes the trainable parameters of the Bahdanau attention operation. Details of this calculation are provided in Material S8. The final outputs of the feature representation generation stage are $c_g \in \mathbb{R}^{1 \times 512}$ and $c_l \in \mathbb{R}^{1 \times 512}$.
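The reduction from a per-position representation to a single 1 × 512 context vector can be sketched with additive (Bahdanau-style) scoring used as a pooling step. This is a simplified illustration: scoring each position without an explicit query vector, the attention width of 64, and the random parameters are assumptions, and the authors' formulation is the one in Material S8.

```python
import numpy as np

def additive_attention_pool(h, w, b, v):
    """Pool h (length, dim) into one context vector using additive attention scores."""
    scores = np.tanh(h @ w + b) @ v        # one score per position: (length,)
    scores = scores - scores.max()         # stabilize the softmax
    weights = np.exp(scores)
    weights = weights / weights.sum()      # attention weights sum to 1
    context = weights @ h                  # weighted sum over positions: (dim,)
    return context[np.newaxis, :], weights

rng = np.random.default_rng(0)
SEQ_LEN, DIM = 21, 512  # local branch dimensions from the text
ATT_DIM = 64            # hypothetical attention width

h_fusion_l = rng.normal(size=(SEQ_LEN, DIM))  # stand-in fused local representation
w = rng.normal(scale=0.05, size=(DIM, ATT_DIM))
b = np.zeros(ATT_DIM)
v = rng.normal(scale=0.05, size=ATT_DIM)

c_l, weights = additive_attention_pool(h_fusion_l, w, b, v)
print(c_l.shape)  # (1, 512)
```

The attention weights give one value per residue position, which is what allows the influence of each position on the target site to be inspected.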
4.2.3. Phosphorylation Site Prediction
The outputs of the feature representation generation stage, $c_g$ and $c_l$, are then concatenated and fed into a linear layer followed by a Softmax(·) activation function to produce the final probability distribution for the candidate phosphorylation site, as shown in Figure G. The calculation is defined as
$$\hat{y} = \mathrm{Softmax}\big(\mathrm{Linear}_{\omega}(c_g \oplus c_l)\big) \tag{8}$$
where ω denotes the trainable parameters of the linear layer and Softmax(·) is the activation function that converts the logits into class probabilities. Further details regarding this computation are provided in Material S9. The resulting probability vector ŷ constitutes the final output of the ProtPSP model for each candidate phosphorylation site.
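The prediction step amounts to concatenation, an affine map, and a softmax. A minimal NumPy sketch is given below, assuming two output classes (phosphorylation and nonphosphorylation site) and using random stand-ins for the context vectors and the linear-layer weights.

```python
import numpy as np

def predict_site(c_g, c_l, w, b):
    """Concatenate global and local context vectors and return class probabilities."""
    features = np.concatenate([c_g, c_l], axis=-1)          # (1, 1024)
    logits = features @ w + b                               # (1, n_classes)
    logits = logits - logits.max(axis=-1, keepdims=True)    # stabilize the softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N_CLASSES = 2  # assumed binary output

c_g = rng.normal(size=(1, 512))  # stand-in global context vector
c_l = rng.normal(size=(1, 512))  # stand-in local context vector
w = rng.normal(scale=0.02, size=(1024, N_CLASSES))
b = np.zeros(N_CLASSES)

y_hat = predict_site(c_g, c_l, w, b)
print(y_hat.shape)  # (1, 2)
```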
5. Implementation
5.1. Model Training and Implementation
The proposed ProtPSP model is optimized using the standard cross-entropy loss function, wherein the predicted probability distribution is evaluated against the corresponding ground truth labels as follows:
$$\mathcal{L} = -\frac{1}{M}\sum_{i=1}^{M}\sum_{p}\mathbb{1}[y_i = p]\,\log \hat{y}_i^{\,p} \tag{9}$$
where M denotes the total number of training samples, $y_i$ represents the true label of the i-th training sample, and $\hat{y}_i^{\,p}$ indicates the predicted probability that the i-th training sample belongs to class p.
The ProtPSP architecture incorporates grouped convolution, SE-Net, Transformer, and BiLSTM modules. Specifically, the grouped convolution uses 4 groups. The Transformer is a single-layer architecture with a model dimension of 256, a feed-forward network dimension of 512, and 2 attention heads. The BiLSTM is also single-layer, with 128 hidden units (per direction, so that the concatenated bidirectional output has dimension 256).
The ProtPSP model is optimized using the Adam optimizer with a learning rate of 1e-3 and default momentum values. Model parameters are initialized by sampling from a normal distribution N(0, 0.02). Training is conducted for 20 epochs using a minibatch size of 64. A validation set comprising a randomly selected 20% of the training data is used for model selection. During training, a learning rate decay strategy is adopted in which the learning rate is multiplied by 0.8 every 2 epochs, and the model checkpoint with the lowest validation loss is saved. To mitigate class imbalance, each training batch and the validation set are constructed to maintain a 1:1 ratio of positive to negative sites. Further details can be found in Figure S5.
The total number of trainable parameters in ProtPSP is approximately 4.73M, with the parameter count for each individual block detailed in Table S4. All models are implemented in TensorFlow 2.x. Training and evaluation are performed on a workstation equipped with an NVIDIA GeForce RTX 2080 GPU.
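The training recipe can be made concrete with a small, framework-free Python sketch of three of its ingredients: the step learning rate decay, the mean cross-entropy loss, and 1:1 balanced batch construction. The function names are hypothetical, epochs are assumed to be counted from 0 with the first decay applied at epoch 2, and the sketch does not reproduce the authors' TensorFlow code.

```python
import math
import random

BASE_LR, DECAY, STEP = 1e-3, 0.8, 2  # values from the text

def learning_rate(epoch):
    """Step decay: multiply the base rate by 0.8 every 2 epochs (0-based epochs)."""
    return BASE_LR * DECAY ** (epoch // STEP)

def cross_entropy(labels, probs, eps=1e-12):
    """Mean cross-entropy; labels are class indices, probs are per-class probability lists."""
    total = sum(math.log(max(p[y], eps)) for y, p in zip(labels, probs))
    return -total / len(labels)

def balanced_batch(positives, negatives, batch_size=64, rng=random):
    """Draw a shuffled batch with equal numbers of positive and negative sites."""
    half = batch_size // 2
    batch = [(x, 1) for x in rng.sample(positives, half)]
    batch += [(x, 0) for x in rng.sample(negatives, half)]
    rng.shuffle(batch)
    return batch

# Usage with toy identifiers standing in for candidate sites.
schedule = [learning_rate(epoch) for epoch in range(20)]
batch = balanced_batch(list(range(100)), list(range(100, 1000)), rng=random.Random(0))
loss = cross_entropy([1, 0], [[0.5, 0.5], [0.5, 0.5]])
print(schedule[0], schedule[2], len(batch))
```

Under these assumptions the learning rate over 20 epochs decays nine times, ending near 1.3e-4.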
5.2. Model Evaluation
To comprehensively evaluate the predictive performance of the proposed ProtPSP model for phosphorylation site identification, four widely adopted metrics are used: sensitivity (SN), specificity (SP), F1-score, and the Matthews correlation coefficient (MCC). They are calculated as follows:
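$$\mathrm{SN} = \frac{TP}{TP + FN} \tag{10}$$
$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{11}$$
$$\mathrm{F1} = \frac{2\,TP}{2\,TP + FP + FN} \tag{12}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{13}$$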
where TP and TN denote the numbers of correctly predicted phosphorylation and nonphosphorylation sites, respectively, while FP and FN denote the numbers of sites incorrectly predicted as phosphorylation and nonphosphorylation sites, respectively. Furthermore, receiver operating characteristic (ROC) curves are plotted to visually inspect the trade-off between the true positive rate and the false positive rate across decision thresholds. The area under the ROC curve (AUC) is then calculated to quantitatively measure the overall discriminative power of the ProtPSP model. The detailed calculation principles and process for the AUC metric are provided in Material S10. Additionally, the formulas for the clustering metrics (SC, DBI, and CHI) used in the UMAP visualization are also available in Material S10.
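For readers who want to check reported values, the four threshold-based metrics and the AUC can be computed with a short, dependency-free Python sketch. It uses the standard textbook definitions; the function names are hypothetical, the AUC is computed through its rank interpretation (the probability that a randomly chosen positive scores above a randomly chosen negative, with ties counted as one half), and the counts in the usage lines are made-up illustrative numbers, not results from the paper.

```python
import math

def confusion_metrics(tp, tn, fp, fn):
    """Return SN, SP, F1, and MCC from confusion-matrix counts."""
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    f1 = 2 * tp / (2 * tp + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"SN": sn, "SP": sp, "F1": f1, "MCC": mcc}

def roc_auc(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative counts and scores only.
metrics = confusion_metrics(tp=80, tn=90, fp=10, fn=20)
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(metrics, auc)
```

The pairwise AUC is quadratic in the number of samples, which is acceptable for a sanity check; a sorting-based implementation would be preferable for large test sets.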
Acknowledgments
This work was supported by the National Natural Science Foundation of China (22404024).
The source code of the ProtPSP model, together with the test data set, is available at https://github.com/Coutht/ProtPSP.
The Supporting Information is available free of charge at https://pubs.acs.org/doi/10.1021/acsomega.5c10634.
Table S1: metrics of the learned features in the pLLM ablation experiments; Table S2: metrics of the learned features in the model architecture ablation experiments; Table S3: model performance using different window lengths; Table S4: parameter counts for each individual block; Figure S1: ROC curves for all models on both S/T and Y sites; Figure S2: SHAP values (top 30) of pLLM features and other sequence features; Figure S3: distribution of positive and negative samples across proteins of different lengths; Figure S4: detailed architecture of ProtT5; Figure S5: process of dividing the validation set and balancing the training set; Material S1: details of the five baseline models; Material S2: data collection and processing; Material S3: details of the embedding module; Material S4: implementation of grouped convolution; Material S5: details of SE-Net; Material S6: gating mechanisms of BiLSTM; Material S7: details of the Transformer; Material S8: computation of Bahdanau attention; Material S9: implementation of the site prediction block; and Material S10: performance evaluation (PDF)
B.F.G. wrote the original draft. B.F.G. and L.G. contributed to the conception and design of the study. B.F.G., Z.X., and S.Y.L. conducted the experiments, data acquisition, and data curation, with support from T.K.Y.L. Data analysis and interpretation were performed by B.F.G., L.G., and T.K.Y.L. Manuscript review and editing were performed by L.G., T.K.Y.L., Y.S., and K.Z.C. All authors have given approval to the final version of the manuscript.
The authors declare no competing financial interest.
This article was published ASAP on December 16, 2025. The Table of Contents/Abstract graphic and Figure 2 have been replaced due to a production error. The corrected version was reposted on December 30, 2025.
References
- Forrest A. R., Taylor D. F., Fink J. L., Gongora M. M., Flegg C., Teasdale R. D., et al. PhosphoregDB: the tissue and sub-cellular distribution of mammalian protein kinases and phosphatases. BMC Bioinformatics. 2006;7(1):82. doi: 10.1186/1471-2105-7-82.
- Karve T. M., Cheema A. K. Small changes huge impact: the role of protein posttranslational modifications in cellular homeostasis and disease. J. Amino Acids. 2011;2011(1):207691. doi: 10.4061/2011/207691.
- Li T., Li F., Zhang X. Prediction of kinase-specific phosphorylation sites with sequence features by a log-odds ratio approach. Proteins: Structure, Function, and Bioinformatics. 2008;70(2):404–414. doi: 10.1002/prot.21563.
- Trost B., Kusalik A. Computational prediction of eukaryotic phosphorylation sites. Bioinformatics. 2011;27(21):2927–2935. doi: 10.1093/bioinformatics/btr525.
- Ramazi S., Zahiri J. Post-translational modifications in proteins: resources, tools and prediction methods. Database. 2021;2021:baab012. doi: 10.1093/database/baab012.
- Shu L., Du C., Zuo Y. Abnormal phosphorylation of protein tyrosine in neurodegenerative diseases. Journal of Neuropathology & Experimental Neurology. 2023;82(10):826–835. doi: 10.1093/jnen/nlad066.
- Liu Y. B., Wang Q., Song Y. L., Song X. M., Fan Y. C., Kong L., et al. Abnormal phosphorylation/dephosphorylation and Ca2+ dysfunction in heart failure. Heart Failure Reviews. 2024;29(4):751–768. doi: 10.1007/s10741-024-10395-w.
- Mu J., Zhang Z., Jiang C., Geng H., Duan J. Role of Tau Protein Hyperphosphorylation in Diabetic Retinal Neurodegeneration. Journal of Ophthalmology. 2025;2025(1):3278794. doi: 10.1155/joph/3278794.
- Zhai G., Yang L., Luo Q., Wu K., Zhao Y., Wang F. Serum phosphopeptide profiling for colorectal cancer diagnosis using liquid chromatography–mass spectrometry. Rapid Commun. Mass Spectrom. 2022;36(15):e9316. doi: 10.1002/rcm.9316.
- Girod M., Arquier D., Helms A., Juetten K., Brodbelt J. S., Lemoine J., et al. Characterization of Phosphorylated Peptides by Electron-Activated and Ultraviolet Dissociation Mass Spectrometry: A Comparative Study with Collision-Induced Dissociation. J. Am. Soc. Mass Spectrom. 2024;35(5):1040–1054. doi: 10.1021/jasms.4c00048.
- Potel C. M., Lemeer S., Heck A. J. Phosphopeptide fragmentation and site localization by mass spectrometry: an update. Analytical Chemistry. 2019;91(1):126–141. doi: 10.1021/acs.analchem.8b04746.
- Alverdi V., Di Pancrazio F., Lippe G., Pucillo C., Casetta B., Esposito G. Determination of protein phosphorylation sites by mass spectrometry: a novel electrospray-based method. Rapid Commun. Mass Spectrom. 2005;19(22):3343–3348. doi: 10.1002/rcm.2198.
- Couto N., Davlyatova L., Evans C. A., Wright P. C. Application of the broadband collision-induced dissociation (bbCID) mass spectrometry approach for protein glycosylation and phosphorylation analysis. Rapid Communications in Mass Spectrometry. 2018;32(2):75–85. doi: 10.1002/rcm.8016.
- Lyu J., Wang Y., Mao J., Yao Y., Wang S., Zheng Y., et al. Pseudotargeted MS method for the sensitive analysis of protein phosphorylation in protein complexes. Anal. Chem. 2018;90(10):6214–6221. doi: 10.1021/acs.analchem.8b00749.
- Wei L., Xing P., Tang J., Zou Q. PhosPred-RF: a novel sequence-based predictor for phosphorylation sites using sequential information only. IEEE Transactions on Nanobioscience. 2017;16(4):240–247. doi: 10.1109/TNB.2017.2661756.
- Dou Y., Yao B., Zhang C. PhosphoSVM: prediction of phosphorylation sites by integrating various protein sequence attributes with a support vector machine. Amino Acids. 2014;46(6):1459–1469. doi: 10.1007/s00726-014-1711-5.
- Gao J., Thelen J. J., Dunker A. K., Xu D. Musite, a tool for global prediction of general and kinase-specific phosphorylation sites. Molecular & Cellular Proteomics. 2010;9(12):2586–2600. doi: 10.1074/mcp.M110.001388.
- Luo F., Wang M., Liu Y., Zhao X. M., Li A. DeepPhos: prediction of protein phosphorylation sites with deep learning. Bioinformatics. 2019;35(16):2766–2773. doi: 10.1093/bioinformatics/bty1051.
- Guo L., Wang Y., Xu X., Cheng K. K., Long Y., Xu J., et al. DeepPSP: a global–local information-based deep neural network for the prediction of protein phosphorylation sites. Journal of Proteome Research. 2021;20(1):346–356. doi: 10.1021/acs.jproteome.0c00431.
- Pakhrin S. C., Pokharel S., Pratyush P., Chaudhari M., Ismail H. D., Kc D. B. LMPhosSite: a deep learning-based approach for general protein phosphorylation site prediction using embeddings from the local window sequence and pretrained protein language model. J. Proteome Res. 2023;22(8):2548–2557. doi: 10.1021/acs.jproteome.2c00667.
- Song T., Yang Q., Qu P., Qiao L., Wang X. Attenphos: general phosphorylation site prediction model based on attention mechanism. International Journal of Molecular Sciences. 2024;25(3):1526. doi: 10.3390/ijms25031526.
- Soylu N. N., Sefer E. DeepPTM: protein post-translational modification prediction from protein sequences by combining deep protein language model with vision transformers. Current Bioinformatics. 2024;19(9):810–824. doi: 10.2174/0115748936283134240109054157.
- Soylu N. N., Sefer E. Bert2ome: Prediction of 2′-O-methylation modifications from rna sequence by transformer architecture based on bert. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2023;20(3):2177–2189. doi: 10.1109/TCBB.2023.3237769.
- Cetin S., Sefer E. A Graphlet-based Explanation Generator for Graph Neural Networks Over Biological Datasets. Current Bioinformatics. 2025;20(9):840–851. doi: 10.2174/0115748936355418250114104026.
- Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Wang Y., Jones L., et al. Prottrans: Toward understanding the language of life through self-supervised learning. IEEE Trans. Pattern Anal. Mach. Intell. 2021;44(10):7112–7127. doi: 10.1109/TPAMI.2021.3095381.
- Vaswani A., Shazeer N., Parmar N., Uszkoreit J., Jones L., Gomez A. N., et al. Attention is all you need. arXiv. 2017. doi: 10.48550/arXiv.1706.03762.
- Suzek B. E., Wang Y., Huang H., McGarvey P. B., Wu C. H., UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926–932. doi: 10.1093/bioinformatics/btu739.
- Hu J., Shen L., Sun G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018; pp 7132–7141.
- Bahdanau D., Cho K., Bengio Y. Neural machine translation by jointly learning to align and translate. arXiv. 2014. doi: 10.48550/arXiv.1409.0473.
- McInnes L., Healy J., Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv. 2018. doi: 10.48550/arXiv.1802.03426.
- Wang X., Zhang Z., Zhang C., Meng X., Shi X., Qu P. Transphos: A deep-learning model for general phosphorylation site prediction based on transformer-encoder architecture. International Journal of Molecular Sciences. 2022;23(8):4263. doi: 10.3390/ijms23084263.
- UniProt Consortium. Activities at the universal protein resource (UniProt). Nucleic Acids Res. 2014;42(D1):D191–D198. doi: 10.1093/nar/gkt1140.
- Chung C. R., Tang Y., Chiu Y. P., Li S., Hsieh W. K., Yao L., et al. dbPTM 2025 update: comprehensive integration of PTMs and proteomic data for advanced insights into cancer research. Nucleic Acids Research. 2025;53(D1):D377–D386. doi: 10.1093/nar/gkae1005.
- Diella F., Cameron S., Gemünd C., Linding R., Via A., Kuster B., et al. Phospho.ELM: a database of experimentally verified phosphorylation sites in eukaryotic proteins. BMC Bioinf. 2004;5(1):79. doi: 10.1186/1471-2105-5-79.
- Hornbeck P. V., Kornhauser J. M., Latham V., Murray B., Nandhikonda V., Nord A., et al. 15 years of PhosphoSitePlus®: integrating post-translationally modified sites, disease variants and isoforms. Nucleic Acids Research. 2019;47(D1):D433–D441. doi: 10.1093/nar/gky1159.
- Shi Y. Serine/threonine phosphatases: mechanism through structure. Cell. 2009;139(3):468–484. doi: 10.1016/j.cell.2009.10.006.