Abstract
Neuropeptides are essential signaling molecules produced in the nervous system that regulate diverse physiological processes and are closely implicated in the pathogenesis of neurodegenerative and neuropsychiatric disorders. Investigating neuropeptides contributes to a better understanding of their regulatory mechanisms and offers new insights into therapeutic strategies for related diseases. Therefore, accurate identification of neuropeptides is crucial for advancing biomedical research and drug development. Due to the high cost of experimental validation, various artificial intelligence methods have been developed for rapid neuropeptide identification. However, existing approaches often suffer from high computational resource consumption, slow processing speed, and poor deployability. Moreover, a user-friendly web server for practical application is still lacking. To this end, we propose MSKDNP, a neuropeptide prediction model based on a multi-stage knowledge distillation framework. With only 1.2% of the parameters, MSKDNP attains performance comparable to a fully fine-tuned protein language model while achieving state-of-the-art results in neuropeptide recognition. Moreover, MSKDNP provides favorable interpretability, facilitating biological understanding. A freely accessible web server is available at https://awi.cuhk.edu.cn/~biosequence/MSKDNP/index.php.
Keywords: multi-stage knowledge distillation, neuropeptide, protein language model, bioinformatics
Introduction
Neuropeptides are mostly linear polypeptides typically composed of fewer than 100 amino acids and synthesized as larger precursors that undergo proteolytic cleavage and a series of post-translational modifications (Fig. 1A). Acting as critical first messengers in both the central and peripheral nervous systems, neuropeptides differ from classical neurotransmitters such as glutamate and gamma-aminobutyric acid, which mediate excitatory and inhibitory synaptic transmission, respectively. Instead, neuropeptides engage in diverse signaling mechanisms, functioning in endocrine, autocrine, paracrine, or neurotransmitter modes [1]. Within synaptic transmission, they belong to a superfamily of neuromodulators, released from presynaptic terminals in response to action potentials to influence downstream neurons [2]. Beyond their roles in neural signaling, certain neuropeptides also regulate a wide range of physiological processes, including social behavior (e.g. oxytocin and vasopressin), immune responses (e.g. bradykinin, substance P and resistin), pain perception (e.g. substance P and oxytocin), and memory and learning (e.g. vasopressin and ghrelin) [1, 3] (Fig. 1B). Owing to their extensive involvement in physiological regulation, neuropeptides have emerged as promising therapeutic targets for a variety of nervous system disorders. In recent years, the number of identified neuropeptides has increased significantly, highlighting their growing importance in biomedical research.
Figure 1.
Overview of neuropeptide biosynthesis and functional diversity. (A) The precursor neuropeptide is synthesized in the endoplasmic reticulum (ER) and transported to the Golgi apparatus via small vesicles for further processing. Within secretory vesicles, it undergoes maturation and is subsequently released upon stimulation through vesicle fusion with the cell membrane. (B) Representative neuropeptides and their associated physiological functions.
Traditional experimental techniques, such as mass spectrometry and high-performance liquid chromatography, provide high accuracy in neuropeptide identification [4, 5], but they are often costly, time-consuming, and labor-intensive. With the advancement of artificial intelligence (AI), an increasing number of machine learning and deep learning approaches have been applied to neuropeptide identification tasks, achieving promising results. AI may expedite identification and reduce costs, though traditional techniques such as mass spectrometry and assays are still required for confirmation. Agrawal et al. proposed NeuroPIpred [6], which utilized a support vector machine combined with amino acid composition (AAC), dipeptide composition, split composition, and binary profiles for neuropeptide classification. However, their dataset was limited to insect neuropeptides, restricting its generalizability. To address this limitation, Bin et al. constructed a more comprehensive dataset based on the NeuroPep database and developed the PredNeuroP model by integrating nine physicochemical features with ensemble learning techniques [7]. Subsequently, to enhance model interpretability, Hasan et al. introduced NeuroPred-FRL [8], an interpretable framework combining 11 physicochemical descriptors with XGBoost. Later, Jiang et al. developed NeuroPpred-Fuse [9], another interpretable model that integrates six conventional protein features with feature selection and stacking strategies. As the complexity of neuropeptide classification tasks increased, the performance of traditional machine learning models became insufficient. To overcome this bottleneck, Wang et al. proposed NeuroPred-PLM, which used Evolutionary Scale Modeling (ESM), a 12-layer transformer-based protein language model (PLM), for feature representation, coupled with a convolutional neural network for neuropeptide recognition [10]. 
The integration of protein language modeling significantly enhanced classification performance, enabling NeuroPred-PLM to achieve higher accuracy compared to previous traditional models.
Despite the significant progress in neuropeptide classification, several challenges remain unresolved. First, although existing models have demonstrated good predictive accuracy, there remains room for further improvement. Second, although deep learning frameworks based on protein language models have enhanced classification performance, their high memory requirements, substantial computational costs, and slow inference speed hinder practical deployment. Finally, there is currently no publicly available web server for easy model access. Local deployment typically demands considerable expertise in AI and high computational resources, posing a barrier for experimental researchers. Therefore, developing an efficient and user-friendly web service could substantially improve model accessibility and promote broader application in biomedical research.
To address these challenges, a knowledge distillation-based framework, MSKDNP (Multi-Stage Knowledge Distillation for Neuropeptide Prediction), is proposed. Leveraging a multi-stage distillation strategy, the model achieves classification accuracy comparable to that of a fully fine-tuned large-scale protein language model while utilizing only 1.2% of its parameters. MSKDNP also outperforms all existing methods in overall performance. Owing to its lightweight architecture, MSKDNP exhibits excellent memory efficiency and fast inference speed. These advantages enable the development of a stable web server, substantially lowering the technical barrier and enhancing accessibility for broader application. Finally, MSKDNP demonstrates strong interpretability, further supporting its potential for practical deployment in neuropeptide classification tasks.
Material and methods
Benchmark dataset
To ensure fair comparison with existing models, the benchmark dataset curated by Wang et al. was adopted [10]. This dataset is derived from the NeuroPep 2.0 database and is considered the most comprehensive and widely used resource for neuropeptide classification. The original dataset comprises 11 282 experimentally validated neuropeptide sequences. After removing redundant sequences with more than 90% identity using the CD-HIT tool [11], the dataset comprises 4463 unique neuropeptides. To construct a balanced dataset, 4463 non-neuropeptide sequences with a similar length distribution were directly adopted from the dataset curated by Wang et al. [10], in which negative samples were selected from UniProt after excluding entries containing neuropeptide-related keywords to reduce potential label noise. The entire dataset was partitioned into training and test sets at a 9:1 ratio to evaluate model robustness and generalization capability. The distributions of sequence length, AAC, and net charge in the dataset are shown in Supplementary Fig. S1. This conventional random split is hereafter referred to as the random split.
In addition to this conventional partitioning, an alternative and more challenging data split, termed the cluster split, was introduced to mitigate potential data leakage. Specifically, a hierarchical clustering strategy based on Levenshtein distance was employed to group sequences according to their similarity [12]. Peptides within the same cluster were assigned to either the training or test set as a whole, thereby reducing sequence redundancy across sets. Approximately 90% of the clusters (covering 90% of total sequences) were allocated to the training set, while the remaining 10% were reserved for testing. This cluster-aware partitioning strategy ensures lower sequence-level similarity between training and test sets, providing a more rigorous and fair evaluation of model generalization to novel peptide sequences.
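The cluster-aware partitioning described above can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: the paper uses hierarchical clustering on Levenshtein distances [12], whereas here a greedy single-pass clustering stands in, and the `max_dist` threshold and the smallest-clusters-to-test heuristic are assumptions made for the example.

```python
from collections import defaultdict

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two sequences."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def greedy_cluster(seqs, max_dist=5):
    """Assign each sequence to the first cluster whose representative is
    within max_dist edits; otherwise start a new cluster."""
    reps, clusters = [], defaultdict(list)
    for s in seqs:
        for k, r in enumerate(reps):
            if levenshtein(s, r) <= max_dist:
                clusters[k].append(s)
                break
        else:
            reps.append(s)
            clusters[len(reps) - 1].append(s)
    return list(clusters.values())

def cluster_split(seqs, test_frac=0.1, max_dist=5):
    """Whole clusters go to train or test, so near-duplicate peptides
    never straddle the split."""
    clusters = greedy_cluster(seqs, max_dist)
    train, test = [], []
    n_test_target = int(len(seqs) * test_frac)
    for c in sorted(clusters, key=len):
        (test if len(test) < n_test_target else train).extend(c)
    return train, test
```

The key property, regardless of the clustering algorithm used, is that similar sequences always land on the same side of the split.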
The overall framework of MSKDNP architecture
The distillation framework is a model compression technique, first introduced by Hinton et al. [13], that transfers knowledge from a high-performance, large, complex model (the so-called teacher model) to a small model with fewer parameters (the student model). In conventional distillation, only the final pseudo-label knowledge is distilled. In MSKDNP, we further introduce feature knowledge distillation to improve the performance of the student model. The architecture of the MSKDNP model is depicted in Fig. 2. Initially, input sequences are encoded by a Tokenizer and fed into a 33-layer transformer encoder in the teacher model, which is trained to represent neuropeptide features as 1280-dimensional vectors. These vectors are processed by an MLP to yield the teacher's logits. The same Tokenizer is then shared with the student model, which uses a 6-layer transformer encoder to generate 320-dimensional feature vectors. These features are linearly mapped to a 1280-dimensional space to align with the teacher's representations and further refined through feature knowledge distillation. An additional MLP converts the mapped features into the student's logits, with pseudo-label knowledge distillation employed to capture the teacher's two-dimensional neuropeptide expression pattern. This architecture allows the student model to efficiently approximate the teacher model's performance with significantly fewer parameters.
Figure 2.
Overall architecture of MSKDNP. The framework consists of a high-capacity teacher model and a lightweight student model, both utilizing a shared tokenizer. The teacher model contains 33 Transformer layers, while the student model comprises 6 layers. Multi-stage knowledge distillation is employed to transfer both feature-level and pseudo-label knowledge from the teacher to the student. Feature knowledge distillation is performed via mean-pooled embeddings followed by dimensional alignment. Pseudo-label distillation is conducted using the teacher’s classification logits as soft targets. The student model is trained to replicate both the feature representations and classification behavior of the teacher, enabling efficient and interpretable neuropeptide prediction with reduced model complexity.
Sequence embedding
In order to convert protein sequences into numerical feature representations, the peptide sequence is first passed through a tokenizer that maps each amino acid $a_i$ to a numeric index according to a predefined vocabulary. This process produces a discrete token sequence:

$$T = (t_1, t_2, \dots, t_L), \qquad t_i = \mathrm{Tokenizer}(a_i) \tag{1}$$
Position embeddings are then introduced to capture the sequence order. For the $i$-th amino acid, the position embedding is defined as a learnable parameter:

$$p_i \in \mathbb{R}^{d} \tag{2}$$
Hence, each amino acid $a_i$ in the peptide sequence is encoded as:

$$x_i = e_{t_i} + p_i \tag{3}$$

where $e_{t_i}$ denotes the token embedding of $t_i$.
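The embedding steps of Eqs. (1)-(3) can be sketched as follows. This is an illustrative assumption-laden sketch: the 20-letter vocabulary omits the special and non-standard tokens that ESM2's real tokenizer includes, and random matrices stand in for trained embedding weights.

```python
import numpy as np

# Simplified vocabulary: the 20 standard residues only (ESM2's real
# tokenizer also has special/rare-residue tokens).
AA_VOCAB = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def embed_sequence(seq, d_model=320, seed=0):
    """Token index lookup (Eq. 1) plus a learnable position embedding
    (Eq. 2), summed per residue (Eq. 3). Random matrices stand in for
    trained weights."""
    rng = np.random.default_rng(seed)
    tok_emb = rng.normal(size=(len(AA_VOCAB), d_model))  # E: vocab x d
    pos_emb = rng.normal(size=(len(seq), d_model))       # P: one row per position
    tokens = np.array([AA_VOCAB[a] for a in seq])        # Eq. (1)
    return tok_emb[tokens] + pos_emb                     # Eq. (3): x_i = e_{t_i} + p_i
```

Because the position term differs per row, identical residues at different positions receive distinct vectors, which is exactly what lets the Transformer see sequence order.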
Transfer learning of protein language models with full-parameter fine-tuning
Numerous peptide recognition studies have employed protein language models for feature extraction and designed dedicated network architectures [14, 15]. Although these approaches have achieved good classification performance, they typically rely on the default pretrained Transformer parameters, treating the protein language model solely as a feature extractor. However, large language models built purely on the Transformer architecture have been shown to achieve excellent reasoning ability on their own. A structurally similar Transformer-based protein language model should therefore also achieve excellent performance in neuropeptide recognition on its own. In other words, by fine-tuning all parameters of the protein language model, excellent performance on a specific task can be achieved without any additional neural networks. Building on this idea, the teacher model is trained through full-parameter fine-tuning of the protein language model ESM [16]. The ESM series is selected for its capability to automatically encode rare amino acids found in neuropeptides [17].
Specifically, the teacher model is based on the ESM2 650 M architecture and is adapted for neuropeptide classification via transfer learning [18]. As described previously, peptide sequences are converted into an embedding $X = (x_1, \dots, x_L)$. The underlying Transformer architecture comprises multiple layers of self-attention and feed-forward networks [19]. For each Transformer layer $l$ (with $1 \le l \le 33$ in the teacher model), the input $H^{(l-1)}$ (with $H^{(0)} = X$) is projected into queries, keys, and values:
$$Q = H^{(l-1)} W_Q \tag{4}$$

$$K = H^{(l-1)} W_K \tag{5}$$

$$V = H^{(l-1)} W_V \tag{6}$$
where $W_Q$, $W_K$, and $W_V$ are learnable matrices. The self-attention mechanism is computed as:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{7}$$
followed by a residual connection and layer normalization:

$$H^{(l)} = \mathrm{LayerNorm}\!\left(H^{(l-1)} + \mathrm{Attention}(Q, K, V)\right) \tag{8}$$
This Transformer formulation effectively captures protein contextual information and semantic rules from large-scale data via the self-attention mechanism, thereby accelerating neuropeptide sequence pattern recognition. The teacher model's parameters are initialized with weights from an ESM2 model pretrained on the UR50D dataset and are fine-tuned for the neuropeptide classification task through full-parameter fine-tuning.
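Equations (4)-(8) amount to a scaled dot-product attention layer wrapped in a residual connection and layer normalization. A minimal NumPy sketch follows; it is single-headed and omits the feed-forward sublayer and trained weights, so it illustrates the math rather than reproduces ESM2:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention_layer(H, Wq, Wk, Wv):
    """One single-head layer following Eqs. (4)-(8): linear projections,
    scaled dot-product attention, then residual + LayerNorm."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv                 # Eqs. (4)-(6)
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # Eq. (7), pre-softmax
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    A = A / A.sum(axis=-1, keepdims=True)            # row-wise softmax
    return layer_norm(H + A @ V)                     # Eq. (8)
```

Stacking 33 such layers (with multi-head attention and feed-forward sublayers) yields the teacher encoder; the student uses the same recipe with 6 layers and a smaller width.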
In contrast, the student model utilizes transformer layers based on the ESM2 8 M architecture and employs the same Tokenizer layer as the teacher model. Despite its smaller size, the student’s forward propagation closely mirrors that of the teacher, ensuring consistency in feature extraction and representation. This architectural congruence is critical for effective knowledge distillation.
Multi-stage knowledge distillation
Conventional knowledge distillation leverages pseudo-labels generated by a teacher model in addition to ground-truth labels. Let $z^t$ and $z^s$ be the teacher and student logits, respectively, and $T$ the temperature scaling factor. The softened output distributions are defined as:

$$p = \mathrm{softmax}(z^t / T), \qquad q = \mathrm{softmax}(z^s / T) \tag{9}$$
The soft distillation loss $\mathcal{L}_{\mathrm{soft}}$ is computed via the Kullback-Leibler divergence:

$$\mathcal{L}_{\mathrm{soft}} = T^2 \, \mathrm{KL}(p \,\|\, q) = T^2 \sum_{c} p_c \log \frac{p_c}{q_c} \tag{10}$$
Simultaneously, the student learns from hard labels using the standard cross-entropy loss:

$$\mathcal{L}_{\mathrm{CE}} = -\sum_{c} y_c \log \hat{q}_c \tag{11}$$
where $y_c$ is the one-hot ground-truth label for class $c$, and $\hat{q}_c$ is the student's predicted probability for class $c$. The total distillation loss is then given by:

$$\mathcal{L}_{\mathrm{KD}} = \alpha\,\mathcal{L}_{\mathrm{CE}} + (1 - \alpha)\,\mathcal{L}_{\mathrm{soft}} \tag{12}$$
where $\alpha$ is a hyperparameter balancing the contributions of the hard labels and the teacher's pseudo-labels. This framework enables the student model to learn the teacher's classification patterns.
Yet, the conventional knowledge distillation may be insufficient to fully capture the teacher model’s comprehensive knowledge. Therefore, feature knowledge distillation is incorporated to guide the student model in mimicking the high-dimensional representations learned by the teacher model. In this distillation framework, the student’s transformer-encoded features are mean-pooled and then passed through a linear mapping layer to match the teacher’s feature dimensionality. These mapped features are trained with an MSE loss to learn the teacher’s high-dimensional neuropeptide expressions, thereby further enhancing the student’s recognition capabilities.
The final outputs of the teacher and student Transformers are denoted as $H^t \in \mathbb{R}^{L \times d_t}$ and $H^s \in \mathbb{R}^{L \times d_s}$, respectively, where $L$ is the maximum sequence length, and $d_t$ and $d_s$ are the teacher and student model dimensions. After mean pooling, the teacher's feature is

$$f^t = \frac{1}{L} \sum_{i=1}^{L} H^t_i \in \mathbb{R}^{d_t},$$

while the student's feature is

$$f^s = \frac{1}{L} \sum_{i=1}^{L} H^s_i \in \mathbb{R}^{d_s}.$$

To match the teacher's dimensionality, $f^s$ is further transformed by a linear mapping $W_{\mathrm{map}} \in \mathbb{R}^{d_t \times d_s}$:

$$\hat{f}^s = W_{\mathrm{map}} f^s + b_{\mathrm{map}}.$$
The feature alignment loss $\mathcal{L}_{\mathrm{feat}}$ is defined as the mean-squared error between $f^t$ and $\hat{f}^s$:

$$\mathcal{L}_{\mathrm{feat}} = \frac{1}{N} \sum_{n=1}^{N} \left\| f^t_n - \hat{f}^s_n \right\|_2^2 \tag{13}$$

where $N$ is the number of training samples.
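For a single sequence pair, the feature alignment of Eq. (13) reduces to a mean-pool, a linear projection, and an MSE. The sketch below makes the dimension bookkeeping explicit; `W_map` and `b_map` are assumed names for the mapping parameters, and the shapes are illustrative (the paper uses $d_t = 1280$, $d_s = 320$):

```python
import numpy as np

def feature_alignment_loss(H_teacher, H_student, W_map, b_map):
    """Per-sequence term of Eq. (13): mean-pool both token-level feature
    maps, project the student's pooled vector into the teacher's dimension,
    and take the MSE (averaged over the batch in the full loss)."""
    f_t = H_teacher.mean(axis=0)      # (d_t,) teacher pooled feature
    f_s = H_student.mean(axis=0)      # (d_s,) student pooled feature
    f_s_hat = W_map @ f_s + b_map     # linear map: d_s -> d_t
    return float(np.mean((f_t - f_s_hat) ** 2))
```

The loss is zero exactly when the projected student feature reproduces the teacher feature, which is the alignment the distillation stage drives toward.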
In this multi-stage knowledge distillation approach, the total loss $\mathcal{L}_{\mathrm{total}}$ combines the pseudo-label distillation loss $\mathcal{L}_{\mathrm{KD}}$ and the feature alignment loss $\mathcal{L}_{\mathrm{feat}}$:

$$\mathcal{L}_{\mathrm{total}} = \mathcal{L}_{\mathrm{KD}} + \beta\,\mathcal{L}_{\mathrm{feat}} \tag{14}$$

where $\beta$ is a hyperparameter balancing the contributions of the two objectives. By jointly distilling pseudo-label outputs and high-dimensional features, this framework enables more comprehensive knowledge transfer from the teacher to the student model.
Performance evaluation and pipeline construction
This study utilized Accuracy, Precision, Recall, F1-score, and the Matthews Correlation Coefficient (MCC) as the metrics for assessing the performance of the model [20, 21]. They are defined as follows:

$$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{15}$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \tag{16}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN} \tag{17}$$

$$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{18}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{19}$$
where TP, TN, FP and FN represent the number of true positives, true negatives, false positives and false negatives, respectively [21]. Area Under the Curve (AUC) is defined as the area under the Receiver Operating Characteristic (ROC) curve, with values ranging from 0.5 to 1. In this study, both ROC and Precision-Recall (PR) curves, along with their respective AUCs, were used to provide a comprehensive evaluation of model performance.
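For reference, Eqs. (15)-(19) translate directly into code from the four confusion-matrix counts:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute Accuracy, Precision, Recall, F1, and MCC (Eqs. 15-19)
    from confusion-matrix counts. Assumes no denominator is zero."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    mcc = ((tp * tn - fp * fn)
           / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    return {"ACC": acc, "Precision": prec, "Recall": rec, "F1": f1, "MCC": mcc}
```

Unlike Accuracy, MCC uses all four counts symmetrically, which is why it is the most informative single number for balanced binary tasks such as this one.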
The teacher model was trained with full-parameter fine-tuning on an NVIDIA A100 40G GPU, and the student model was trained with multi-stage distillation on an NVIDIA T4 16G GPU. The final hyperparameters were determined using grid search combined with cross-validation, as summarized in Supplementary Table S1. The effects of different combinations of the loss weights $\alpha$ and $\beta$, as well as the temperature $T$, on model performance are illustrated in Supplementary Fig. S2, with F1-score used as the evaluation metric.
Result and discussion
Performance comparison with existing methods
Table 1 presents the test set performance of the teacher model, the distilled student model, and existing AI methods for neuropeptide prediction. A comparison between the teacher and student models indicates that the distillation process did not compromise performance. Specifically, on the random split, while maintaining the same recall as the teacher, the student model achieved an accuracy of 0.938, a precision of 0.947, an F1-score of 0.938, and an MCC of 0.876, showing slight improvements across all metrics. On the cluster split, the student model also exhibited similarly strong performance, achieving results nearly equivalent to those of the teacher model, thus demonstrating its robustness under more challenging, low-similarity evaluation conditions.
Table 1.
Benchmark performance comparison between MSKDNP and other existing methods
| Dataset | Method | ACC | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|
| Random split | PredNeuroP | 0.864 | 0.935 | 0.782 | 0.852 | 0.738 |
| | NeuroPred-FRL | 0.861 | 0.960 | 0.757 | 0.847 | 0.740 |
| | NeuroPpred-Fuse | 0.905 | 0.906 | 0.908 | 0.907 | 0.813 |
| | NeuroPred-PLM | 0.922 | 0.907 | 0.941 | 0.924 | 0.845 |
| | MSKDNP-Teacher (this study) | 0.935 | 0.941 | 0.928 | 0.934 | 0.869 |
| | MSKDNP-Student (this study) | 0.938 | 0.947 | 0.928 | 0.938 | 0.876 |
| Cluster split | PredNeuroP | 0.808 | 0.813 | 0.800 | 0.807 | 0.615 |
| | NeuroPred-FRL | 0.778 | 0.776 | 0.783 | 0.779 | 0.556 |
| | NeuroPpred-Fuse | 0.850 | 0.870 | 0.823 | 0.846 | 0.701 |
| | NeuroPred-PLM | 0.874 | 0.854 | 0.904 | 0.878 | 0.751 |
| | MSKDNP-Teacher (this study) | 0.903 | 0.926 | 0.874 | 0.900 | 0.806 |
| | MSKDNP-Student (this study) | 0.903 | 0.922 | 0.879 | 0.900 | 0.806 |
The best performances are marked in bold.
Evaluations were conducted on the random split to compare the proposed models with existing methods. Both MSKDNP models outperformed all previous approaches in terms of accuracy, F1, and MCC. The student model, with fewer parameters, achieved state-of-the-art performance in these metrics, surpassing the best existing model by 1.6% in accuracy, 1.4% in F1, and 0.031 in MCC. In terms of precision, the MSKDNP models ranked second only to NeuroPred-FRL but showed substantially better results in the other four metrics. Although the MSKDNP models showed slightly lower recall than NeuroPred-PLM, the student model achieved higher accuracy and precision, along with a superior F1-score and MCC, highlighting its advantage in overall predictive performance. When evaluated on the cluster split, a noticeable drop in overall accuracy was observed across all methods, confirming the increased difficulty of this partition. Despite the more challenging setting, both the teacher and student versions of MSKDNP consistently outperformed all competing approaches, achieving the best performance across all evaluation metrics.
These results confirm the effectiveness of the MSKDNP architecture for neuropeptide prediction, with the student model achieving state-of-the-art performance across multiple evaluation criteria.
Visualization of MSKDNP classification and distillation performance
To intuitively illustrate the discriminative capability of MSKDNP and the effectiveness of knowledge distillation in neuropeptide classification, t-distributed stochastic neighbor embedding (t-SNE) was employed to project the high-dimensional features of the teacher and student models on the training set into a two-dimensional space. Figure 3 illustrates the embedding distributions of the models throughout the training process. Panel A shows the teacher model before full-parameter fine-tuning, and panel B shows the same model after fine-tuning. Panels C, D, and E correspond to the student model before distillation, with pseudo-label distillation only, and with feature-based distillation only, respectively. Panel F presents the student model after multi-stage distillation. Orange dots represent neuropeptides, and blue dots represent non-neuropeptides. As shown in the plots, the teacher model prior to training exhibits overlapping distributions between neuropeptides and non-neuropeptides, indicating poor classification ability. After training, a distinct boundary between the two classes emerges, reflecting strong discriminative power. Similarly, the student model exhibits limited class separation before distillation. When only pseudo-label distillation is applied, a more structured distribution is observed compared to the undistilled version, but the overall feature space remains scattered with significant class overlap, suggesting limited capacity to capture the teacher model's feature distribution. In contrast, feature-based distillation enables the student model to better approximate the teacher's representation structure; however, some overlap between classes persists near the decision boundary.
The distributions become substantially more separable after full multi-stage distillation, demonstrating effective knowledge transfer of high-level neuropeptide feature representations.
Figure 3.
t-SNE-based 2D visualization of MSKDNP classification and distillation performance. (A) Teacher model before full-parameter fine-tuning. (B) Teacher model after full-parameter fine-tuning. (C) Student model before distillation training. (D) Student model distilled using only pseudo-label knowledge. (E) Student model distilled using only feature knowledge. (F) Student model after multi-stage distillation training. Orange dots represent neuropeptides and blue dots represent non-neuropeptides.
A similar visualization was conducted on the test set features (Supplementary Fig. S3). The student model, after distillation, exhibited a feature distribution pattern comparable to that of the teacher model, further confirming its generalization ability and effective knowledge transfer to unseen data.
Despite the overall separation observed in the final scatter plots of both teacher and student models, a subset of overlapping points remains. To investigate the underlying causes of this confusion, the average sequence similarity was computed between each neuropeptide and all non-neuropeptides, and vice versa. The results were visualized using a violin plot (Supplementary Fig. S4), with similarity scores normalized to highlight distributional differences. Overlapping points labeled as neuropeptides exhibited higher sequence similarity to non-neuropeptides, while those labeled as non-neuropeptides showed greater similarity to neuropeptides. This mutual sequence resemblance likely contributes to the observed classification ambiguity.
Ablation analysis
To further investigate the contribution of each component within the MSKDNP framework, a comprehensive ablation study was conducted. The ablation settings include four configurations. First, for the teacher model, full-parameter fine-tuning was replaced with a frozen protein language model, a strategy commonly used in previous studies for feature extraction. This configuration is denoted as Teacher_wo_FPFT. For the student model, three variants were evaluated: without any distillation (Student_wo_Distill), with only pseudo-label knowledge distillation (Student_wo_PLKD), and with only feature knowledge distillation (Student_wo_FKD). The corresponding results are summarized in Table 2.
Table 2.
Ablation study results. Comparison of the MSKDNP model and its variants, each with a key component removed: Teacher model without full-parameter fine-tuning (Teacher_wo_FPFT); student model without any distillation (Student_wo_Distill); student model without pseudo-label knowledge distillation (Student_wo_PLKD); and student model without feature knowledge distillation (Student_wo_FKD)
| Dataset | Method | ACC | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|
| Random split | Teacher_wo_FPFT | 0.837 | 0.779 | 0.939 | 0.852 | 0.688 |
| | Student_wo_Distill | 0.905 | 0.905 | 0.905 | 0.905 | 0.811 |
| | Student_wo_FKD | 0.911 | 0.899 | 0.926 | 0.912 | 0.822 |
| | Student_wo_PLKD | 0.922 | 0.931 | 0.912 | 0.922 | 0.845 |
| | MSKDNP-Teacher (this study) | 0.935 | 0.941 | 0.928 | 0.934 | 0.869 |
| | MSKDNP-Student (this study) | 0.938 | 0.947 | 0.928 | 0.938 | 0.876 |
| Cluster split | Teacher_wo_FPFT | 0.853 | 0.853 | 0.854 | 0.852 | 0.707 |
| | Student_wo_Distill | 0.854 | 0.858 | 0.854 | 0.854 | 0.712 |
| | Student_wo_FKD | 0.879 | 0.866 | 0.897 | 0.881 | 0.759 |
| | Student_wo_PLKD | 0.886 | 0.916 | 0.848 | 0.881 | 0.776 |
| | MSKDNP-Teacher (this study) | 0.903 | 0.926 | 0.874 | 0.900 | 0.806 |
| | MSKDNP-Student (this study) | 0.903 | 0.922 | 0.879 | 0.900 | 0.806 |
The best performances are marked in bold.
On the random split, the results reveal that removing FPFT from the teacher model leads to the most substantial performance drop, with accuracy decreasing by nearly 10% compared to the fully fine-tuned model. A similar trend is observed on the cluster split, where accuracy drops by approximately 5%, further confirming the critical role of FPFT in maintaining model performance. Although protein language models are pretrained on large-scale protein corpora and capable of capturing contextual information, the absence of task-specific fine-tuning substantially restrains their ability to recognize neuropeptide patterns. This also explains why previous studies often required additional, complicated downstream models to fully exploit the task-related information in the high-dimensional features extracted by pretrained protein language models. In contrast, task-specific full-parameter fine-tuning significantly enhances the model's capacity to identify sequence-specific features, thereby improving overall performance.
Regarding the student model, the absence of any distillation mechanism results in an accuracy of 0.905. Incorporating pseudo-label knowledge distillation alone improves accuracy to 0.911, while feature knowledge distillation alone further increases it to 0.922. A comparable pattern is observed on the cluster split, where the exclusion of either distillation component similarly leads to noticeable drops in performance, further confirming that both single-distillation strategies, while individually beneficial, are insufficient on their own. Omitting either branch of the multi-stage knowledge distillation framework consistently results in suboptimal outcomes. This can be attributed to the limited capacity of the student model, which makes it challenging to capture the complex sequence patterns of neuropeptides. Pseudo-label distillation allows the student to mimic the teacher's decisions but cannot convey the underlying high-level knowledge of neuropeptide representations, thus offering only marginal gains. In contrast, feature knowledge distillation facilitates the student's learning of high-dimensional feature representations, improving its abstraction capability; nevertheless, it remains insufficient to fully reproduce the teacher model's decision rationale. Thus, the multi-stage knowledge distillation framework integrates feature knowledge distillation and pseudo-label distillation, allowing the student model to better approximate, and in some cases surpass, the teacher model's performance.
In addition to the ablation results, ROC and PR curves were plotted (Fig. 4) to provide a more intuitive illustration of MSKDNP's discriminative ability and positive-class detection performance across different decision thresholds. Both the teacher and student models consistently outperformed all ablation variants in these curve-based evaluations. Notably, on the random split, the student model achieved a ROC AUC identical to the teacher model's, indicating that the overall discriminative ability was effectively transferred. In the PR curve, the student model exhibited a slightly lower AUC than the teacher model. While this is the typical efficiency-performance trade-off expected when constructing a student model, it is noteworthy that this AUC is the only metric where a minor performance gap was observed between the two models. On the cluster split, both the teacher and student models again demonstrated the best overall performance compared to other baselines, with the student model achieving AUC values in both the ROC and PR curves that closely matched those of the teacher model, further validating the effectiveness of the distillation strategy under more challenging conditions.
Figure 4.
Performance comparison of MSKDNP and ablated variants using ROC and PR curves. (A) ROC and (B) PR curves for the random split, and (C) ROC and (D) PR curves for the cluster split, from ablation experiments evaluating MSKDNP. Comparison of the MSKDNP model and its ablated variants, each omitting a key component.
Efficiency analysis
Benefiting from the distillation-based architecture, the MSKDNP student model demonstrates significantly improved memory efficiency and inference speed compared to the teacher model. To further quantify its computational advantages, a performance evaluation was conducted, including comparisons among NeuroPred-PLM, the MSKDNP teacher model, and the student model. As NeuroPred-PLM also adopts a protein language model, it serves as a relevant baseline for comparison. All experiments were performed under consistent hardware conditions, with GPU testing conducted on NVIDIA T4 and CPU testing on Intel Xeon @ 2.20 GHz, using a fixed batch size of 8.
The results summarized in Table 3 show that the MSKDNP student model outperforms all other models in terms of computational efficiency. With only 8 M parameters, it accounts for merely 9.8% of the size of NeuroPred-PLM and just 1.2% of the teacher model. In GPU-based inference, the student model achieves a roughly 5× speed-up over NeuroPred-PLM and a nearly 20× speed-up over the teacher model. This advantage is further amplified in CPU-based inference, where the student model runs roughly 5.5× faster than NeuroPred-PLM and 117× faster than the teacher model.
Table 3.
Efficiency analysis results.
| Method | Number of parameters | Time (GPU) | Time (CPU) |
|---|---|---|---|
| NeuroPred-PLM | 81 M | 18.01 batch/s | 1.27 batch/s |
| MSKDNP-Teacher | 650 M | 4.38 batch/s | 0.06 batch/s |
| MSKDNP-Student | 8 M | 87.05 batch/s | 7.03 batch/s |
Bold indicates the model with the fewest parameters or the best efficiency.
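As a sanity check, the speed-up factors and parameter ratios quoted above follow directly from the throughputs in Table 3; this minimal sketch reproduces the arithmetic:

```python
# Throughputs from Table 3, in batches per second.
gpu = {"NeuroPred-PLM": 18.01, "MSKDNP-Teacher": 4.38, "MSKDNP-Student": 87.05}
cpu = {"NeuroPred-PLM": 1.27,  "MSKDNP-Teacher": 0.06, "MSKDNP-Student": 7.03}
params = {"NeuroPred-PLM": 81e6, "MSKDNP-Teacher": 650e6, "MSKDNP-Student": 8e6}

def speedup(throughputs, baseline):
    """Student throughput relative to a baseline model."""
    return throughputs["MSKDNP-Student"] / throughputs[baseline]

gpu_vs_plm = speedup(gpu, "NeuroPred-PLM")        # ~4.8x, i.e. ~5x
gpu_vs_teacher = speedup(gpu, "MSKDNP-Teacher")   # ~19.9x, i.e. nearly 20x
cpu_vs_plm = speedup(cpu, "NeuroPred-PLM")        # ~5.5x
cpu_vs_teacher = speedup(cpu, "MSKDNP-Teacher")   # ~117x
param_pct = 100 * params["MSKDNP-Student"] / params["MSKDNP-Teacher"]  # ~1.2%
```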
This computational performance demonstrates that the student model efficiently leverages features distilled into the small protein language model to achieve rapid classification with substantially lower memory consumption. In addition, its strong inference performance on CPU suggests that it can be deployed for real-world use as a web server, increasing the model's accessibility to researchers.
Interpretability analysis
The student model also exhibits advantages in interpretability. In contrast to approaches that pair protein language model feature extraction with complex downstream architectures, the student model adopts a streamlined design with the Transformer as its sole core component, offering structural simplicity and interpretability without compromising performance. Although attention-based interpretability has been widely explored, the additional downstream modules in many models often diminish the clarity of the attention signals. By avoiding such architectural complexity, the student model preserves the inherent interpretability of the attention mechanism.
To further investigate the student model’s understanding of neuropeptide sequences, attention maps were visualized (Supplementary Fig. S5). Consistent patterns were observed across the six Transformer layers, where attention distributions typically focused on 8–15 residues adjacent to a given amino acid. In several layers, distinct attention hotspots appeared within specific regions, suggesting potential relevance to functional or structural fragments. To test this hypothesis, motif analysis was performed using the MEME suite (Supplementary Fig. S6) [23]. The results revealed a substantial number of conserved motifs spanning 8–15 residues, many of which coincide with the hotspots on the attention maps. These motifs are generally associated with G protein-coupled receptor (GPCR) binding or secretory protein functions. For instance, the motif “HSDGTFTSDY,” frequently highlighted by the model, is also present in biologically active peptides such as glucagon and GIP, both of which interact with GPCRs [24, 25]. These findings indicate that the local context captured by the attention mechanism may reflect biologically meaningful functional segments. To validate the functional relevance of the identified motifs and assess the sensitivity of MSKDNP to these regions, a perturbation-based analysis was conducted. A total of 50 neuropeptides containing the discovered motifs were selected from the test set. For each sequence, the motif regions were randomly mutated to disrupt their original composition, and the changes in MSKDNP’s prediction probabilities were analyzed. The results are shown in Supplementary Fig. S7. A consistent decrease in predicted probability was observed across all perturbed peptides, indicating that these motif regions play a critical role in MSKDNP’s predictions and may be essential to neuropeptide function.
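The perturbation protocol, randomly mutating motif residues and measuring the resulting drop in predicted probability, can be sketched as follows. Here `predict_prob` stands in for the MSKDNP classifier and is an assumed interface, not the released API:

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def perturb_region(seq, start, end, rng):
    """Randomly substitute every residue in seq[start:end], keeping length fixed."""
    residues = list(seq)
    for i in range(start, end):
        # Exclude the original residue so the position is guaranteed to change.
        residues[i] = rng.choice(AMINO_ACIDS.replace(residues[i], ""))
    return "".join(residues)

def probability_drop(predict_prob, seq, motif, seed=0):
    """Drop in predicted neuropeptide probability after disrupting one motif.

    predict_prob -- hypothetical stand-in for the MSKDNP classifier,
                    mapping a sequence to a probability in [0, 1]
    """
    start = seq.find(motif)
    if start < 0:
        raise ValueError("motif not found in sequence")
    mutated = perturb_region(seq, start, start + len(motif), random.Random(seed))
    return predict_prob(seq) - predict_prob(mutated)
```

A consistently positive drop across many motif-bearing peptides is the signal reported in Supplementary Fig. S7.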
Although motif analysis provides supporting evidence, it cannot fully confirm whether the attention mechanism accurately identifies functionally critical regions. To further validate this, structures of neuropeptide–receptor complexes were examined. Shown in Fig. 5 are two experimentally determined neuropeptide-receptor complexes, viz. NPY-Y1R (PDB: 7VGX) [26] and 26RFa-QRFPR (PDB: 8WZ2) [27], as well as an AlphaFold3-predicted OXT-OTR complex [28, 29]. Structures corresponding to the regions highlighted by the student model’s attention mechanism are also depicted. As shown in the attention maps, the hotspots correspond to residues located at the binding interfaces of the complexes. The crystal structures of the NPY-Y1R and 26RFa-QRFPR complexes revealed that attention maxima coincided precisely with intermolecular contact residues, while the computationally derived OXT-OTR model (with high confidence metrics: pLDDT >90, ipTM >0.80) exhibited attention distributions corresponding to the predicted hydrogen-bonding networks at the peptide-receptor interface, as illustrated by the AlphaFold-generated structure in Supplementary Fig. S8. This consistent correlation between attention hotspots and binding interfaces across both experimental and in silico structural models substantiates that MSKDNP can indeed recognize biologically relevant residues.
Figure 5.
Structural validation of model attention predictions with neuropeptide-receptor complexes. Complexes with zoomed views and attention maps for (A) Neuropeptide Y and Y1 Receptor, (B) 26RFa and QRFPR, and (C) Oxytocin and Oxytocin Receptor. All structures were visualized using VMD [22].
Case study on negative samples
To assess the robustness of MSKDNP in predicting negative samples under extreme conditions, two additional experiments were conducted. The first aimed to evaluate model performance on long out-of-distribution negative peptides. Specifically, 75 experimentally validated non-neuropeptides were randomly selected from UniProt, each confirmed to lack neuropeptide activity, with an average sequence length of 142 amino acids. The second experiment examined the model’s predictive ability on synthetic negative samples generated at varying positive-to-negative ratios (10:90, 20:80, and 30:70). For each ratio, 100 peptides were constructed by embedding short fragments derived from positive neuropeptides into longer negative sequences at defined proportions, while deliberately avoiding known motif regions to ensure the resulting sequences remained functionally negative.
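The construction of the synthetic negatives can be sketched as below. The helper is an illustrative assumption about the procedure (a positive-derived fragment embedded into a negative backbone at a target proportion), not the authors' exact generation script, and it omits the motif-avoidance check described above:

```python
import random

def make_synthetic_negative(negative_seq, positive_fragment, pos_ratio, seed=0):
    """Embed a positive-derived fragment into a negative backbone so that the
    fragment accounts for roughly pos_ratio of the final sequence length."""
    rng = random.Random(seed)
    total_len = int(round(len(positive_fragment) / pos_ratio))
    backbone_len = total_len - len(positive_fragment)
    if backbone_len > len(negative_seq):
        raise ValueError("negative sequence too short for requested ratio")
    backbone = negative_seq[:backbone_len]
    insert_at = rng.randrange(len(backbone) + 1)  # random insertion point
    return backbone[:insert_at] + positive_fragment + backbone[insert_at:]
```

For example, a 10-residue fragment at `pos_ratio=0.10` yields a 100-residue sequence matching the 10:90 setting.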
In the first experiment, all 75 long non-neuropeptides were correctly predicted as negative, indicating that MSKDNP maintains reliable performance on out-of-distribution sequences with extended lengths. The selected peptides are listed in Supplementary Table S2. In the second experiment, the accuracy for the newly generated negative samples was 96%, 91%, and 88% for the 10:90, 20:80, and 30:70 ratios, respectively. The corresponding results are summarized in Supplementary Table S3, with sequences available in the Data Availability section. While accuracy decreased with increasing positive fragment proportions, the model maintained strong performance across all settings, demonstrating resilience to partial positive contamination in negative samples.
Web interface
As described in the Efficiency Analysis section, the memory and inference efficiency of the MSKDNP student model enables its deployment as a web server. To facilitate easy access for other researchers, a user-friendly web server has been developed. This platform is the first neuropeptide classification web service that integrates a protein language model, requires no local deployment, and supports one-click usage. The web server is available at https://awi.cuhk.edu.cn/∼biosequence/MSKDNP/index.php. The user interface is shown in Fig. 6. Prediction can be initiated by clicking the “Start Prediction” button on the homepage, directing users to a submission page that supports both direct input and FASTA-format file upload of peptide sequences. Upon submission, the prediction process is automatically executed, and the results are displayed on a separate results page. The output is divided into two sections: the “Prediction Summary” provides an overview of the number of predicted neuropeptide and non-neuropeptide fragments, while the “Prediction Details” presents sequence-level predictions with corresponding probabilities. Additionally, the platform offers peptide features extracted by the student model’s Transformer architecture, including the original 320-dimensional representations and the linearly mapped 1280-dimensional features, which are available for download to support further neuropeptide-related analyses.
Figure 6.
Demonstration of the MSKDNP web interface with the example sequences.
To ensure stable performance and fair resource allocation, the web server currently supports up to 100 input sequences per submission, with a maximum length of 100 amino acids per sequence. Inputs can be provided in multi-FASTA format or through direct text entry. Invalid inputs automatically trigger an error message and return the user to the submission page. For larger-scale or long-sequence predictions, the source code is available on GitHub for local deployment without these limitations.
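The input constraints described above can be captured by a small validation routine; this is an illustrative sketch of the checks, not the server's actual implementation:

```python
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
MAX_SEQS, MAX_LEN = 100, 100  # server limits per submission

def parse_and_validate(text):
    """Parse multi-FASTA or direct text input and enforce the server limits."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if any(ln.startswith(">") for ln in lines):   # multi-FASTA input
        seqs, buf = [], []
        for ln in lines:
            if ln.startswith(">"):
                if buf:
                    seqs.append("".join(buf))
                buf = []
            else:
                buf.append(ln.upper())
        if buf:
            seqs.append("".join(buf))
    else:                                         # direct entry, one sequence per line
        seqs = [ln.upper() for ln in lines]
    if not 1 <= len(seqs) <= MAX_SEQS:
        raise ValueError(f"between 1 and {MAX_SEQS} sequences are allowed")
    for s in seqs:
        if not 1 <= len(s) <= MAX_LEN:
            raise ValueError(f"each sequence must contain 1-{MAX_LEN} amino acids")
        if set(s) - VALID_AA:
            raise ValueError("sequence contains invalid amino acid characters")
    return seqs
```

A raised `ValueError` corresponds to the error message returned to the submission page.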
Conclusion
A multi-stage knowledge distillation framework, MSKDNP, was proposed for neuropeptide classification, offering high efficiency, low resource consumption, and improved interpretability. Predictive accuracy was improved through the integration of feature-level and pseudo-label knowledge distillation, enabling the student model to match or surpass the teacher model’s performance. Unlike conventional single-stage distillation approaches, MSKDNP introduces a sequential combination of pseudo-label and feature-level distillation. This structured design allows the student model to progressively acquire both semantic and representational knowledge, enhancing the overall learning process. With only 1.2% of the teacher model’s parameters, the MSKDNP student model significantly reduces memory usage and accelerates inference, overcoming common limitations of protein language models. Furthermore, the interpretability analysis confirmed that the attention mechanism identifies neuropeptide motifs associated with the peptide-receptor binding interface. This result suggests that the model successfully learned biologically meaningful sequence patterns. Finally, a publicly accessible and user-friendly web server is provided to avoid the need for local deployment, further enhancing the model’s accessibility. In summary, MSKDNP enables more accurate, efficient, interpretable, and accessible neuropeptide classification for biomedical research.
Key Points
A novel multi-stage distillation approach is developed to transfer neuropeptide knowledge from a high-capacity teacher to a lightweight student model with over 98% parameter reduction.
Enables high-efficiency, low-resource neuropeptide prediction with minimal computational cost.
Demonstrates strong interpretability, with attention regions aligning with experimentally validated and computationally predicted functional sites.
A publicly accessible web server incorporating protein language model knowledge is provided for broad research use.
Supplementary Material
Acknowledgements
The authors sincerely thank the Kobilka Institute of Innovative Drug Discovery, The Chinese University of Hong Kong (Shenzhen), and National Yang Ming Chiao Tung University for financially supporting this research.
Contributor Information
Peilin Xie, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Jiahui Guan, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China; School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Zhihao Zhao, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China.
Yulan Liu, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China.
Zhang Cheng, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Xuxin He, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Xingchen Liu, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China.
Yun Tang, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, No. 75, Boai Street, Hsinchu 300, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, No. 75, Boai Street, Hsinchu 300, Taiwan.
Zhenglong Sun, School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Tzong-Yi Lee, Institute of Bioinformatics and Systems Biology, National Yang Ming Chiao Tung University, No. 75, Boai Street, Hsinchu 300, Taiwan; Center for Intelligent Drug Systems and Smart Bio-devices (IDS2B), National Yang Ming Chiao Tung University, No. 75, Boai Street, Hsinchu 300, Taiwan.
Lantian Yao, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Ying-Chih Chiang, Kobilka Institute of Innovative Drug Discovery, School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen 518172, China; School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China; School of Medicine, The Chinese University of Hong Kong, Shenzhen, 2001 Longxiang Boulevard, Longgang District, Shenzhen, 518172, China.
Author contributions statement
P.L.X. and Y.-C.C. conceived the idea. P.L.X. and J.H.G. implemented the framework. J.H.G. and Z.C. collected the data. P.L.X., Z.H.Z., X.X.H., and X.C.L. analyzed the results. P.L.X., Y.L.L., and Y.T. wrote the paper. Y.-C.C., L.T.Y., T.-Y.L., and Z.L.S. reviewed the paper. Y.-C.C., L.T.Y., and T.-Y.L. supervised the research project.
Conflict of interest: No competing interest is declared.
Funding
This work was supported by Shenzhen Science and Technology Innovation Commission (JCYJ20230807114206014), Guangdong Province Basic and Applied Research Fund (2025A1515011753) and the Kobilka Institute of Innovative Drug Discovery, The Chinese University of Hong Kong, Shenzhen, China. This work was also financially supported by the Center for Intelligent Drug Systems and Smart Biodevices (IDS2B) from The Featured Areas Research Center Program within the framework of the Higher Education Sprout Project and Yushan Young Fellow Program (113C51N055) by the Ministry of Education (MOE) and National Science and Technology Council (NSTC 113-2321-B-A49-025-, 113-2634-F-039-001, 113-2221-E-A49-160-MY3, and 112-2740-B-400-005) in Taiwan and The National Health Research Institutes (NHRI-EX114-11320BI) in Taiwan.
Data availability
Data and code are available at https://github.com/Cpillar/MSKDNP
References
- 1. Eiden LE, Hernández VS, Jiang SZ. et al. Neuropeptides and small-molecule amine transmitters: cooperative signaling in the nervous system. Cell Mol Life Sci 2022;79:492. 10.1007/s00018-022-04451-7
- 2. Romanov RA, Harkany T. Grabbing neuropeptide signals in the brain. Science 2023;382:764–5. 10.1126/science.adl1788
- 3. Burbach JPH. Neuropeptides from concept to online database www.neuropeptides.nl. Eur J Pharmacol 2010;626:27–48. 10.1016/j.ejphar.2009.10.015
- 4. Boonen K, Landuyt B, Baggerman G. et al. Peptidomics: the integrated approach of MS, hyphenated techniques and bioinformatics for neuropeptide analysis. J Sep Sci 2008;31:427–45. 10.1002/jssc.200700450
- 5. Secher A, Kelstrup CD, Conde-Frieboes KW. et al. Analytic framework for peptidomics applied to large-scale neuropeptide identification. Nat Commun 2016;7:11436. 10.1038/ncomms11436
- 6. Agrawal P, Kumar S, Singh A. et al. NeuroPIpred: a tool to predict, design and scan insect neuropeptides. Sci Rep 2019;9:5129. 10.1038/s41598-019-41538-x
- 7. Bin Y, Zhang W, Tang W. et al. Prediction of neuropeptides from sequence information using ensemble classifier and hybrid features. J Proteome Res 2020;19:3732–40. 10.1021/acs.jproteome.0c00276
- 8. Hasan MM, Alam MA, Shoombuatong W. et al. NeuroPred-FRL: an interpretable prediction model for identifying neuropeptide using feature representation learning. Brief Bioinform 2021;22:1–12.
- 9. Jiang M, Zhao B, Luo S. et al. NeuroPpred-Fuse: an interpretable stacking model for prediction of neuropeptides by fusing sequence information and feature selection methods. Brief Bioinform 2021;22:1–11.
- 10. Wang L, Huang C, Wang M. et al. NeuroPred-PLM: an interpretable and robust model for neuropeptide prediction by protein language model. Brief Bioinform 2023;24:1–9.
- 11. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2.
- 12. Myronov A, Mazzocco G, Król P. et al. BERTrand—peptide:TCR binding prediction using bidirectional encoder representations from transformers augmented with random TCR pairing. Bioinformatics 2023;39:btad468.
- 13. Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- 14. Jin J, Yingying Y, Wang R. et al. iDNA-ABF: multi-scale deep biological language learning model for the interpretable prediction of DNA methylations. Genome Biol 2022;23:219. 10.1186/s13059-022-02780-1
- 15. Yao L, Xie P, Guan J. et al. ACP-CapsPred: an explainable computational framework for identification and functional prediction of anticancer peptides based on capsule network. Brief Bioinform 2024;25:bbae460.
- 16. Christophe C, Kanithi PK, Munjal P. et al. Med42—evaluating fine-tuning strategies for medical LLMs: full-parameter vs. parameter-efficient approaches. arXiv preprint arXiv:2404.14779, 2024.
- 17. Lin Z, Akin H, Rao R. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 2023;379:1123–30. 10.1126/science.ade2574
- 18. Weiss K, Khoshgoftaar TM, Wang DD. A survey of transfer learning. J Big Data 2016;3:1–40.
- 19. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
- 20. Le N-Q-K, Ou Y-Y. Incorporating efficient radial basis function networks and significant amino acid pairs for predicting GTP binding sites in transport proteins. BMC Bioinf 2016;17:183–92.
- 21. Le NQK, Li W, Cao Y. Sequence-based prediction model of protein crystallization propensity using machine learning and two-level feature selection. Brief Bioinform 2023;24:bbad319.
- 22. Humphrey W, Dalke A, Schulten K. VMD: visual molecular dynamics. J Mol Graph 1996;14:33–8.
- 23. Bailey TL, Johnson J, Grant CE. et al. The MEME suite. Nucleic Acids Res 2015;43:W39–49. 10.1093/nar/gkv416
- 24. Jiang G, Zhang BB. Glucagon and regulation of glucose metabolism. Am J Physiol Endocrinol Metab 2003;284:E671–8. 10.1152/ajpendo.00492.2002
- 25. Qi Y, Zhao W, Zhao Y. et al. Chromosome-level genome assembly of Phrynocephalus forsythii using third-generation DNA sequencing and Hi-C analysis. DNA Res 2023;30:dsad003.
- 26. Park C, Kim J, Ko S-B. et al. Structural basis of neuropeptide Y signaling through Y1 receptor. Nat Commun 2022;13:853.
- 27. Jin S, Guo S, Xu Y. et al. Structural basis for recognition of 26RFa by the pyroglutamylated RFamide peptide receptor. Cell Discov 2024;10:58. 10.1038/s41421-024-00670-3
- 28. Jones DS, Flint AP. Nucleotide sequence of a full length cDNA clone encoding the oxytocin-neurophysin I precursor isolated from the ovine corpus luteum. Nucleic Acids Res 1989;17:7990. 10.1093/nar/17.19.7990
- 29. Abramson J, Adler J, Dunger J. et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature 2024;630:493–500. 10.1038/s41586-024-07487-w