Briefings in Bioinformatics. 2024 Jul 22;25(4):bbae348. doi: 10.1093/bib/bbae348

CELA-MFP: a contrast-enhanced and label-adaptive framework for multi-functional therapeutic peptides prediction

Yitian Fang 1,2, Mingshuang Luo 3,, Zhixiang Ren 4, Leyi Wei 5,6,, Dong-Qing Wei 7,8,
PMCID: PMC11262836  PMID: 39038935

Abstract

Functional peptides play crucial roles in various biological processes and hold significant potential in many fields such as drug discovery and biotechnology. Accurately predicting the functions of peptides is essential for understanding their diverse effects and designing peptide-based therapeutics. Here, we propose CELA-MFP, a deep learning framework that incorporates feature Contrastive Enhancement and Label Adaptation for predicting Multi-Functional therapeutic Peptides. CELA-MFP utilizes a protein language model (pLM) to extract features from peptide sequences, which are then fed into a Transformer decoder for function prediction, effectively modeling correlations between different functions. To enhance the representation of each peptide sequence, contrastive learning is employed during training. Experimental results demonstrate that CELA-MFP outperforms state-of-the-art methods on most evaluation metrics for two widely used datasets, MFBP and MFTP. The interpretability of CELA-MFP is demonstrated by visualizing attention patterns in pLM and Transformer decoder. Finally, a user-friendly online server for predicting multi-functional peptides is established as the implementation of the proposed CELA-MFP and can be freely accessed at http://dreamai.cmii.online/CELA-MFP.

Keywords: multi-functional therapeutic peptides prediction, contrastive learning, protein language model, Transformer decoder, deep learning

Introduction

Functional peptides, typically composed of 5–50 amino acids, are indispensable in numerous physiological processes, such as cell signaling, immune regulation, neurotransmission, metabolism, growth, and development [1, 2]. Due to their specificity, efficacy, and low toxicity, functional peptides have garnered significant interest in the field of pharmaceutical development [3–5]. Notably, a variety of peptides with broad functionalities have been discovered, including anti-microbial peptides, anti-cancer peptides, and anti-viral peptides [6, 7]. Moreover, an increasing number of peptides have been found to possess multiple functions. For example, certain anti-microbial peptides not only exhibit anti-microbial properties but also demonstrate cytotoxic effects on cancer cells [8]. Host defense peptides derived from frog skin exhibit a spectrum of therapeutic effects, including anti-viral, anti-cancer, and immunomodulatory activities. However, traditional wet-lab methods for determining peptide functions are time- and labor-intensive, significantly slowing down the development process. Efficient and accurate computational methods for exploring peptide functions are therefore needed. Sequence-based computational approaches provide an efficient means of predicting functional therapeutic peptides and have been used by many researchers as a primary screening method.

Numerous computational methods have been developed to predict peptides with single or multiple functions. Machine learning-based methods, such as AVPpred [9], AIPpred [10], THPep [11], PSBP-SVM [12], NeuroPpred-Fuse [13], iQSP [14], PredAPP [15], and BBPpred [16], utilize algorithms such as Support Vector Machines and Random Forests. These methods typically rely on feature engineering to extract biological properties of peptides, such as amino-acid frequency-based features, position-specific features, PSI-BLAST-profile features, and physicochemical features, as inputs. In contrast, deep learning-based methods, such as Deep-AntiFP [17], Deep-ABPpred [18], iUmami-SCM [19], PreAIP [20], DeepACP [21], ITP-Pred [22], and CACPP [23], utilize architectures such as Convolutional Neural Networks or Recurrent Neural Networks to enhance predictive performance. However, supervised deep learning methods struggle with the limited size of functional peptide datasets. Recently, large language models, particularly protein language models (pLMs), have emerged as powerful tools for predicting various peptide functionalities, such as anti-microbial peptides [24, 25], anti-fungal peptides [26], bitter peptides [27], and cell-penetrating peptides [28]. These models are trained on known peptide sequences with specific functionalities, mainly for predicting single-function peptides. Yet, as research progresses, more peptides are found to be multifunctional, challenging current methods.

Compared with the identification of peptides with a single function, identifying peptides with multiple functions is a multi-label classification task. Multi-label classification is more complex as it requires assigning multiple relevant labels to each sample, capturing potential relationships between labels, and addressing imbalanced data. Several studies have focused on discovering multi-functional peptides. MPMABP [29], MLBP [30], and PrMFTP [31] employ multi-label deep learning methods that combine different networks to extract features from peptides and assign functional labels, enabling the identification of multi-functional peptides. MLBP successfully predicts five classes of bioactive peptides. PrMFTP utilizes multi-head self-attention to extract features from peptide sequences, identifying multi-functional peptides across 21 functional classes, and addresses the challenge of imbalanced data through class-weighting optimization. ETFC [32] utilizes a multi-label focal dice loss to alleviate the limitations of imbalanced data. Additionally, iMFP-LG [33] utilizes a pLM and a graph attention mechanism to enhance the discovery of multi-functional peptides; this approach models the correlations between functional labels with a graph-based method. Although the aforementioned methods have achieved certain accomplishments in peptide prediction, there are still challenges that need further improvement: acquiring more robust features, mitigating the impact of data imbalance on peptide sequence function prediction, and improving the accuracy of prediction models.

To overcome the challenges mentioned above, this study introduces a Contrast-Enhanced and Label-Adaptive framework for predicting Multi-Functional therapeutic Peptides, named CELA-MFP. In CELA-MFP, we utilize a pLM to extract sequence features from peptides. To effectively capture the correlations between different label categories, we employ a Transformer decoder module, which can learn attention between different functions adaptively. Additionally, to enhance the representation of each individual peptide sequence, contrastive learning is employed during the training process of CELA-MFP. Experimental results demonstrate that CELA-MFP outperforms existing methods on the Multi-functional Bioactive Peptides (MFBP) [30] and Multi-functional Therapeutic Peptides (MFTP) [31] datasets. Furthermore, we demonstrate the interpretability of CELA-MFP from three perspectives: the t-SNE distribution of peptide representations extracted from the pLM, the relationship between discriminative amino acid fragments highlighted by the attention mechanism and functionally relevant amino acid fragments obtained through statistical analysis, and the associations between different functions captured by the Transformer decoder. Finally, we develop a web server for predicting multi-functional peptides, aiming to simplify its usage for researchers.

Materials and methods

Datasets

To evaluate the effectiveness of the proposed CELA-MFP, we use the same benchmark datasets as previous studies: Multi-functional Bioactive Peptides (MFBP) [30] and Multi-functional Therapeutic Peptides (MFTP) [31].

The MFBP dataset, collected by searching the keyword 'bioactive peptides' on Google Scholar in June 2020, includes 5986 peptides categorized into five functions: anti-cancer peptide (ACP), anti-diabetic peptide (ADP), anti-hypertensive peptide (AHP), anti-inflammatory peptide (AIP), and anti-microbial peptide (AMP). Sequences with over 90% similarity in each category were removed using CD-HIT [34] to eliminate redundancy and homology bias.

Similarly, the MFTP dataset was acquired by searching Google Scholar in July 2021 with the keyword 'therapeutic peptides'. The data were preprocessed through the following steps: (i) remove sequences that contain non-standard amino acids; (ii) remove peptides shorter than 5 amino acids or longer than 50 amino acids; and (iii) remove functional classes with fewer than 40 samples. A total of 9874 therapeutic peptides were obtained, covering 21 different functional attributes, including anti-angiogenic peptide, anti-bacterial peptide (ABP), anti-cancer peptide (ACP), anti-coronavirus peptide (ACVP), anti-diabetic peptide (ADP), anti-endotoxin peptide (AEP), anti-fungal peptide (AFP), anti-HIV peptide (AHIVP), anti-hypertensive peptide (AHP), anti-inflammatory peptide (AIP), anti-MRSA peptide (AMRSAP), anti-parasitic peptide (APP), anti-tubercular peptide (ATP), anti-viral peptide (AVP), blood–brain barrier peptide (BBP), biofilm-inhibitory peptide (BIP), cell-penetrating peptide (CPP), dipeptidyl peptidase IV peptide (DPPIP), quorum-sensing peptide (QSP), surface-binding peptide (SBP), and tumor homing peptide (THP).
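For illustration, the filtering steps (i)–(iii) above can be sketched as follows; the dictionary layout and function names are assumptions for this example, not the authors' original preprocessing script.

```python
# Minimal sketch of preprocessing steps (i)-(iii); assumes peptides are grouped
# in a dict keyed by functional class, which is an illustrative data layout.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")

def preprocess(records, min_len=5, max_len=50, min_class_size=40):
    """records: dict mapping a functional class label to a list of peptide sequences."""
    cleaned = {}
    for label, seqs in records.items():
        kept = [s for s in seqs
                if set(s) <= STANDARD_AA              # (i) standard amino acids only
                and min_len <= len(s) <= max_len]     # (ii) length between 5 and 50
        if len(kept) >= min_class_size:               # (iii) drop under-represented classes
            cleaned[label] = kept
    return cleaned
```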

Detailed information about the two datasets can be found in Tables S1 and S2. To ensure fairness and consistency, we use the same training and testing sets as iMFP-LG [33], which is the latest state-of-the-art method.

Overall framework of CELA-MFP

As shown in Fig. 1, the proposed CELA-MFP framework comprises three core modules: the peptide representation module, the Transformer decoder module, and the unsupervised contrastive learning module. Initially, peptide sequences are transformed by the peptide representation module into expressive and robust representations. These are input into the Transformer decoder module for function prediction, where the decoder leverages attention mechanisms to capture correlations among different functional labels. Simultaneously, during training, these representations are utilized in the unsupervised contrastive learning module to enhance discriminative capability, effectively reducing model bias toward dominant categories and curtailing overfitting. The model is optimized using both the contrastive learning loss from the unsupervised module and the classification loss from the Transformer decoder. After training, logits from the Transformer decoder are processed with a sigmoid function to compute scores. To assess the impact of the Transformer decoder compared to a basic linear classifier on the predictive performance of multi-functional peptides, we also explore an alternative model with a simple multi-layer linear classifier, as illustrated in Fig. S1. Here, peptide representations are fed into the linear classifier, with model training optimized using both classification and contrastive learning losses. The input and output shapes of each layer in the CELA-MFP framework are provided in Table S3.

Figure 1. The overall framework of CELA-MFP. The architecture of CELA-MFP contains three main modules: the peptide representation module, the Transformer decoder module, and the unsupervised contrastive learning module.

Peptide representation module

In this study, similar to previous studies [33, 35], we utilize a pre-trained pLM, Tasks Assessing Protein Embeddings (TAPE) [36], to extract expressive and robust representations from peptide sequences. TAPE is a language model specifically designed for modeling and learning representations of protein sequences. It leverages the Pfam database [37], which contains more than 31 million protein domains, and undergoes pretraining through a masked language model. As a result, TAPE can capture comprehensive relationships among amino acids. The TAPE model consists of 12 Transformer encoder layers, each with 12 attention heads. The attention mechanism of the TAPE model is illustrated in Fig. 1A and can be described as follows:

$$Q = XW^{Q},\qquad K = XW^{K},\qquad V = XW^{V} \tag{1}$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V \tag{2}$$

where $X$ refers to the embedding of a peptide sequence. The attention layer transforms $X$ into the query matrix $Q$, the key matrix $K$, and the value matrix $V$ by linear transformations, where $W^{Q}$, $W^{K}$, and $W^{V}$ are the weights of the attention layer and $d_k$ is the dimensionality of the key vectors. The multi-head attention can be represented as follows:

$$\mathrm{head}_i = \mathrm{Attention}\!\left(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}\right) \tag{3}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right)W^{O} \tag{4}$$

where $W_i^{Q}$, $W_i^{K}$, and $W_i^{V}$ are the learnable parameters in the $i$th attention head, and $h$ represents the number of attention heads. The outputs of all attention heads are concatenated and linearly transformed using $W^{O}$ to obtain the final result of the multi-head attention. We utilize the output from the pooling layer in TAPE as the representation for peptide sequences, which has a dimensionality of 768.
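As a usage illustration, the 768-dimensional pooled representation can be obtained with the publicly available tape_proteins package; the checkpoint name follows that package's documented usage, and the example sequence is only illustrative. During CELA-MFP training the pLM is fine-tuned rather than frozen; the no-grad context below is only for feature inspection.

```python
# Sketch: extracting a peptide representation with the TAPE BERT model
# (tape_proteins package). Hyperparameters and the example sequence are illustrative.
import torch
from tape import ProteinBertModel, TAPETokenizer

model = ProteinBertModel.from_pretrained('bert-base')
tokenizer = TAPETokenizer(vocab='iupac')

peptide = "GPFPILV"                                  # example sequence from the MFBP test set
token_ids = torch.tensor([tokenizer.encode(peptide)])
with torch.no_grad():
    sequence_output, pooled_output = model(token_ids)
print(sequence_output.shape)  # (1, seq_len + 2, 768): per-residue embeddings
print(pooled_output.shape)    # (1, 768): sequence-level representation
```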

Transformer decoder module

Once we have obtained the peptide representation, we leverage label embeddings as queries, denoted as $Q_0 \in \mathbb{R}^{K \times d}$, to conduct cross-attention and derive category-related features from the peptide representations, where $K$ represents the number of categories and $d$ is the feature dimensionality. The Transformer architecture in this paper follows the standard design, including three main modules: a self-attention module, a cross-attention module, and a position-wise feed-forward network (FFN). At Transformer decoder layer $i$, the query $Q_i$ is updated based on the output $Q_{i-1}$ of the previous layer. We can describe this process as follows:

$$\tilde{Q}_i = \mathrm{MultiHead}\!\left(Q_{i-1}, Q_{i-1}, Q_{i-1}\right) \tag{5}$$

$$\hat{Q}_i = \mathrm{MultiHead}\!\left(\tilde{Q}_i, F, F\right) \tag{6}$$

$$Q_i = \mathrm{FFN}\!\left(\hat{Q}_i\right) \tag{7}$$

where $\tilde{Q}_i$ and $\hat{Q}_i$ represent two intermediate variables, and $F$ denotes the peptide representation features. The functions MultiHead(query, key, value) and FFN(x) are consistent with the definitions in the standard Transformer decoder [38]. Since the model does not require autoregressive prediction, attention masks are not used here, enabling parallel decoding for $K$ categories in each layer.

The MultiHead function is employed to implement both the self-attention and cross-attention modules. Nevertheless, there is a difference regarding the origin of the key and value inputs. In the self-attention module, all three inputs (query, key, and value) originate from the label embeddings. To provide a more intuitive understanding of the cross-attention process, we can describe it as follows: each label embedding, denoted as $q_k$, examines the features $F$ to determine where to focus its attention. It selectively combines the relevant features, resulting in an enhanced category-related representation for each label embedding. This process allows each label embedding to integrate contextualized information from the peptide sequence via cross-attention.

Considering a total of $L$ layers, we generate query feature vectors $Q_L \in \mathbb{R}^{K \times d}$ for the $K$ classes at the final layer. Here, we regard each label prediction as an individual binary classification task. For each class $k$, we utilize a linear projection layer followed by a sigmoid function for mapping the features of that class, denoted as $Q_{L,k}$:

$$p_k = \sigma\!\left(W_k^{\top} Q_{L,k} + b_k\right) \tag{8}$$

where $W_k$ and $b_k$ are the parameters of the linear layer, and $p_k$ represents the prediction probability for the $k$th category.
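The label-adaptive decoding of Eqs (5)–(8) can be sketched with standard PyTorch modules: $K$ learnable label embeddings act as queries over the peptide features, and each refined query is scored by a per-class linear projection with a sigmoid. The dimensions follow the hyperparameters reported in the implementation details (768-d features, 8 heads, one decoder block); class and variable names are illustrative, not the authors' implementation.

```python
# Sketch of a label-adaptive Transformer decoder head with per-class linear scoring.
import torch
import torch.nn as nn

class LabelDecoderHead(nn.Module):
    def __init__(self, num_labels, d_model=768, nhead=8, dim_ff=1024, num_layers=1):
        super().__init__()
        self.label_emb = nn.Parameter(torch.randn(num_labels, d_model))  # K label queries
        layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=dim_ff,
                                           batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=num_layers)
        # one binary classifier per label (Eq. 8): W_k and b_k
        self.W = nn.Parameter(torch.randn(num_labels, d_model) * 0.02)
        self.b = nn.Parameter(torch.zeros(num_labels))

    def forward(self, peptide_features):
        # peptide_features: (B, L, d_model) token-level output of the pLM
        B = peptide_features.size(0)
        queries = self.label_emb.unsqueeze(0).expand(B, -1, -1)           # (B, K, d)
        refined = self.decoder(tgt=queries, memory=peptide_features)      # self-attn, cross-attn, FFN
        logits = (refined * self.W).sum(dim=-1) + self.b                  # per-class linear head
        return torch.sigmoid(logits)                                      # (B, K) probabilities

head = LabelDecoderHead(num_labels=21)            # e.g. 21 functional classes for MFTP
scores = head(torch.randn(4, 52, 768))
print(scores.shape)                               # (4, 21)
```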

Unsupervised contrastive learning

Unsupervised contrastive learning optimizes models by leveraging sample similarities, drawing samples from the same category closer together in the embedding space and pushing apart those from different categories. This method offers increased resilience to noisy labels and a wider margin between categories, thereby improving model generalization by better capturing the data's inherent structure and patterns.

In commonly used multi-functional peptide prediction datasets, a notable class sample imbalance exists, with some classes overrepresented and others underrepresented. This imbalance in training data can bias the model toward classes with more samples. To mitigate overfitting and bias, we integrate a contrastive learning module that focuses on each individual sample, enhancing the model's overall generalization across the dataset.

We adopt an unsupervised contrastive learning method similar to that of a previous study [39]. Consider a set of samples $\{x_i\}$, where $i$ represents the index of a sample. For each sample $x_i$, we introduce a positive sample $x_i^{+}$ and negative samples $x_i^{-}$. In this study, we generate the positive sample for each sample by applying random Gaussian noise with a different distribution, while all other samples are considered negative samples. The loss function for unsupervised contrastive learning can be expressed as follows:

$$\mathcal{L}_{CL} = -\sum_{i}\log\frac{\exp\!\left(\mathrm{sim}\!\left(x_i, x_i^{+}\right)/\tau\right)}{\exp\!\left(\mathrm{sim}\!\left(x_i, x_i^{+}\right)/\tau\right) + \sum_{x_i^{-}}\exp\!\left(\mathrm{sim}\!\left(x_i, x_i^{-}\right)/\tau\right)} \tag{9}$$

where $\mathrm{sim}(a, b)$ represents the similarity measure between samples $a$ and $b$, and $\tau$ is the temperature parameter that controls the scale of the similarity measurement.
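A minimal sketch of this objective, assuming SimCLR-style cosine normalization and in-batch negatives: each representation is paired with a Gaussian-noise-perturbed copy as its positive, and all other samples in the batch act as negatives. The noise scale and temperature below are illustrative choices, not the authors' settings.

```python
# Sketch of the unsupervised contrastive objective (Eq. 9), NT-Xent style.
import torch
import torch.nn.functional as F

def contrastive_loss(reps, tau=0.1, noise_std=0.01):
    # reps: (B, d) peptide representations from the pLM
    pos = reps + noise_std * torch.randn_like(reps)         # noise-perturbed positive views
    z = F.normalize(torch.cat([reps, pos], dim=0), dim=1)    # (2B, d), cosine similarity space
    sim = z @ z.t() / tau                                     # pairwise similarities / temperature
    B = reps.size(0)
    mask = torch.eye(2 * B, dtype=torch.bool, device=reps.device)
    sim.masked_fill_(mask, float('-inf'))                    # exclude self-pairs
    targets = torch.cat([torch.arange(B, 2 * B), torch.arange(0, B)]).to(reps.device)
    return F.cross_entropy(sim, targets)                      # positive pair vs. all negatives

loss = contrastive_loss(torch.randn(32, 768))
```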

Multi-label classification loss function

In the prediction of multi-functional peptides, most methods typically utilize Binary Cross-Entropy Loss (BCELoss) for model training. The BCELoss can be defined as follows:

$$\mathcal{L}_{BCE} = -\frac{1}{N}\sum_{n=1}^{N}\left[y_n\log p_n + \left(1 - y_n\right)\log\!\left(1 - p_n\right)\right] \tag{10}$$

where $N$ is the number of training samples in a batch, $y_n$ is the actual label of the $n$th sample, $p_n$ is the predicted output of the model for that sample, and $\log$ denotes the natural logarithm.

To better tackle the issue of sample imbalance, we employ a multi-label focal dice loss (MLFDLoss), which is also used in ETFC [32]; our experiments show that it works well in this setting. The MLFDLoss can be formulated as follows:

$$\hat{p}^{\,\mathrm{fore}}_{nm} = \left(1 - p_{nm}\right)^{\gamma_{\mathrm{fore}}} p_{nm} \tag{11}$$

$$\hat{p}^{\,\mathrm{back}}_{nm} = p_{nm}^{\,\gamma_{\mathrm{back}}}\left(1 - p_{nm}\right) \tag{12}$$

where $\hat{p}^{\,\mathrm{fore}}_{nm}$ and $\hat{p}^{\,\mathrm{back}}_{nm}$, respectively, denote the focal-weighted foreground and background probabilities of the $m$th label for the $n$th sample. The predicted probability $p_{nm}$ is defined as the foreground probability and $1 - p_{nm}$ is defined as the background probability for the $m$th label of the $n$th sample $x_n$: $p_{nm}$ represents the probability of the peptide sequence having the $m$th label, and $1 - p_{nm}$ represents the probability of the peptide sequence not having the $m$th label. $\gamma_{\mathrm{fore}}$ and $\gamma_{\mathrm{back}}$ represent the tunable focal factors for the foreground and background probabilities, respectively.

$$\mathcal{L}_{\mathrm{fore}} = \frac{1}{M}\sum_{m=1}^{M}\left(1 - \frac{2\sum_{n=1}^{N}\hat{p}^{\,\mathrm{fore}}_{nm}\, y_{nm}}{\sum_{n=1}^{N}\hat{p}^{\,\mathrm{fore}}_{nm} + \sum_{n=1}^{N} y_{nm}}\right) \tag{13}$$

$$\mathcal{L}_{\mathrm{back}} = \frac{1}{M}\sum_{m=1}^{M}\left(1 - \frac{2\sum_{n=1}^{N}\hat{p}^{\,\mathrm{back}}_{nm}\left(1 - y_{nm}\right)}{\sum_{n=1}^{N}\hat{p}^{\,\mathrm{back}}_{nm} + \sum_{n=1}^{N}\left(1 - y_{nm}\right)}\right) \tag{14}$$

$$\mathcal{L}_{\mathrm{MLFD}} = \mathcal{L}_{\mathrm{fore}} + \lambda\,\mathcal{L}_{\mathrm{back}} \tag{15}$$

where $y_{nm} \in \{0, 1\}$ is the ground-truth indicator for the $m$th label of the $n$th sample, and $\lambda$ is a balancing factor used to equalize the foreground loss $\mathcal{L}_{\mathrm{fore}}$ and the background loss $\mathcal{L}_{\mathrm{back}}$.

In this study, we use BCELoss or MLFDLoss as the loss function for CELA-MFP and CELA-MFP-classifier. We perform experiments to assess and compare the performance of BCELoss and MLFDLoss.
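A minimal sketch of a multi-label focal dice loss consistent with Eqs (11)–(15) is given below; the smoothing constant and default focal factors are assumptions for numerical stability and illustration, and the exact implementation shared with ETFC may differ in detail.

```python
# Sketch of a multi-label focal dice loss: focal factors down-weight easy
# foreground/background probabilities before a dice-style overlap term is
# computed per label. eps and the default factors are illustrative assumptions.
import torch

def mlfd_loss(probs, targets, gamma_fore=2.0, gamma_back=2.0, lam=1.0, eps=1e-6):
    # probs, targets: (N, M) predicted probabilities and 0/1 labels
    fore = (1.0 - probs) ** gamma_fore * probs            # Eq. (11)
    back = probs ** gamma_back * (1.0 - probs)            # Eq. (12)
    loss_fore = 1.0 - (2.0 * (fore * targets).sum(0) + eps) / (
        fore.sum(0) + targets.sum(0) + eps)               # per-label foreground dice, Eq. (13)
    loss_back = 1.0 - (2.0 * (back * (1 - targets)).sum(0) + eps) / (
        back.sum(0) + (1 - targets).sum(0) + eps)         # per-label background dice, Eq. (14)
    return (loss_fore + lam * loss_back).mean()           # Eq. (15), averaged over labels

loss = mlfd_loss(torch.rand(32, 21), torch.randint(0, 2, (32, 21)).float())
```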

Evaluation metrics

To assess the effectiveness of our proposed CELA-MFP, we employ five widely used metrics. These metrics are defined as follows:

$$\mathrm{Precision} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Y_i \cap Y_i^{*}\rvert}{\lvert Y_i^{*}\rvert} \tag{16}$$

$$\mathrm{Coverage} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Y_i \cap Y_i^{*}\rvert}{\lvert Y_i\rvert} \tag{17}$$

$$\mathrm{Accuracy} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Y_i \cap Y_i^{*}\rvert}{\lvert Y_i \cup Y_i^{*}\rvert} \tag{18}$$

$$\mathrm{Absolute\ true} = \frac{1}{N}\sum_{i=1}^{N}\Delta\!\left(Y_i, Y_i^{*}\right) \tag{19}$$

$$\mathrm{Absolute\ false} = \frac{1}{N}\sum_{i=1}^{N}\frac{\lvert Y_i \cup Y_i^{*}\rvert - \lvert Y_i \cap Y_i^{*}\rvert}{M} \tag{20}$$

where $N$ represents the total number of peptide sequences in the dataset, $M$ denotes the number of labels, $\cap$ and $\cup$ are the intersection and union operations in set theory, $\lvert\cdot\rvert$ indicates the operation of calculating the number of elements, $Y_i$ represents the true label subset of the $i$th peptide sample, $Y_i^{*}$ represents the predicted label subset based on the classifier for the $i$th sample, and

$$\Delta\!\left(Y_i, Y_i^{*}\right) = \begin{cases} 1, & \text{if } Y_i = Y_i^{*} \\ 0, & \text{otherwise} \end{cases} \tag{21}$$

Among these metrics, higher values for Precision, Coverage, Accuracy, and Absolute true indicate better model performance, while a lower value for Absolute false also signifies improved model performance.
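The five metrics can be computed directly from binary prediction and ground-truth matrices, as in the sketch below; the guard against empty label sets is a practical assumption rather than part of the formal definitions.

```python
# Sketch of the five multi-label metrics (Eqs 16-21) over (N, M) binary matrices.
import numpy as np

def multilabel_metrics(y_true, y_pred):
    N, M = y_true.shape
    inter = np.logical_and(y_true, y_pred).sum(axis=1)      # |Y_i ∩ Y_i*|
    union = np.logical_or(y_true, y_pred).sum(axis=1)       # |Y_i ∪ Y_i*|
    pred_sz = np.maximum(y_pred.sum(axis=1), 1)             # guard against empty predictions
    true_sz = np.maximum(y_true.sum(axis=1), 1)
    union_sz = np.maximum(union, 1)
    return {
        "Precision":      np.mean(inter / pred_sz),
        "Coverage":       np.mean(inter / true_sz),
        "Accuracy":       np.mean(inter / union_sz),
        "Absolute true":  np.mean((y_true == y_pred).all(axis=1)),   # Δ of Eq. (21)
        "Absolute false": np.mean((union - inter) / M),
    }

metrics = multilabel_metrics(np.random.randint(0, 2, (100, 21)),
                             np.random.randint(0, 2, (100, 21)))
```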

Implementation details

The proposed CELA-MFP is constructed with PyTorch in this study. All experiments are conducted on a computational server equipped with an NVIDIA Tesla V100 (32 GB) GPU. Similar to the previous study [33], the proposed model is trained for 100 epochs with a batch size of 32 and optimized using the AdamW optimizer [40]. The CELA-MFP model is fine-tuned with a learning rate of 5e−5 for the pretrained language model, while a learning rate of 5e−4 is used for optimizing the Transformer decoder. In the Transformer decoder module, a single-block Transformer decoder is employed to model the correlations between functions. The dimensions of the learned embeddings, hidden states, and key-value vectors are set to 768, 1024, and 64, respectively. Additionally, the model employs 8 attention heads in the multi-head attention mechanism to thoroughly investigate the relationships between functions. To mitigate the impact of randomness introduced by the random initialization of deep learning frameworks and to ensure consistent comparison with methods described in previous studies [29–33], our models undergo 10 repeated training sessions. The prediction outcomes from each run are then averaged to derive the final prediction for the testing samples. This approach enhances the robustness of the model by addressing the variability introduced during initialization and establishes a more reliable baseline for performance evaluation in line with established methodologies (Tables S4–S19). The training runtime for MFBP and MFTP is shown in Table S20.

In this research, to ensure that the values of the contrastive learning loss and the multi-label classification loss are on the same order of magnitude, the weight $\lambda_{CL}$ of the contrastive learning loss was set to 0.001. This adjustment allows the model to better balance feature representation learning and classification prediction. In this study, $\mathcal{L}_{\mathrm{dec}}$ is the loss from the Transformer decoder module in CELA-MFP, and $\mathcal{L}_{\mathrm{cls}}$ is the loss from the linear classifier module in CELA-MFP-classifier. As mentioned above, we use BCELoss or MLFDLoss as $\mathcal{L}_{\mathrm{dec}}$ and $\mathcal{L}_{\mathrm{cls}}$. $\mathcal{L}_{\mathrm{CELA\text{-}MFP}}$ is employed for optimizing CELA-MFP, and when using a linear classifier for multi-functional peptide prediction, $\mathcal{L}_{\mathrm{CELA\text{-}MFP\text{-}classifier}}$ is applied for CELA-MFP-classifier. $\mathcal{L}_{\mathrm{CELA\text{-}MFP}}$ and $\mathcal{L}_{\mathrm{CELA\text{-}MFP\text{-}classifier}}$ are defined as follows:

$$\mathcal{L}_{\mathrm{CELA\text{-}MFP}} = \mathcal{L}_{\mathrm{dec}} + \lambda_{CL}\,\mathcal{L}_{CL} \tag{22}$$

$$\mathcal{L}_{\mathrm{CELA\text{-}MFP\text{-}classifier}} = \mathcal{L}_{\mathrm{cls}} + \lambda_{CL}\,\mathcal{L}_{CL} \tag{23}$$
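Putting the pieces together, one training step under Eq. (22) can be sketched as follows, reusing the earlier sketches (`plm`, `head`, `mlfd_loss`, `contrastive_loss`) and assuming a standard DataLoader that yields token ids and multi-hot label tensors; the learning rates and contrastive weight follow the values reported above, but the surrounding training loop is illustrative, not the authors' code.

```python
# Sketch of joint optimization: two AdamW parameter groups (5e-5 for the pLM,
# 5e-4 for the decoder head) and the total loss of Eq. (22) with lambda_CL = 0.001.
# 'plm', 'head', 'mlfd_loss', 'contrastive_loss', and 'train_loader' are the
# placeholder names used in the sketches above.
import torch

lambda_cl = 0.001
optimizer = torch.optim.AdamW([
    {"params": plm.parameters(),  "lr": 5e-5},   # fine-tune the protein language model
    {"params": head.parameters(), "lr": 5e-4},   # optimize the Transformer decoder head
])

for token_ids, labels in train_loader:           # labels: (B, K) multi-hot floats
    seq_out, pooled = plm(token_ids)
    probs = head(seq_out)                        # decoder scores, shape (B, K)
    loss = mlfd_loss(probs, labels) + lambda_cl * contrastive_loss(pooled)  # Eq. (22)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```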

Results

Model ablation study

To evaluate the efficacy of CELA-MFP, we compared the model employing a Transformer decoder architecture (CELA-MFP, shown in Fig. 1) with the one utilizing a linear classifier architecture (CELA-MFP-classifier, shown in Fig. S1). CELA-MFP utilizes a pLM for peptide representation extraction and employs a Transformer decoder for score prediction. It incorporates an unsupervised contrastive learning module to enhance features and is optimized using MLFDLoss during training. In contrast, CELA-MFP-classifier follows a similar approach but uses a linear classifier for score prediction.

In order to assess the efficacy of each module, we conducted independent ablation studies on both CELA-MFP and CELA-MFP-classifier. The previous study [33] has shown a significant decrease in performance when the pre-trained pLM is not utilized. Therefore, all experiments in this study were carried out with the inclusion of the pretrained pLM. We designed the following ablation variants for experimentation:

  • BCELoss: A variant that uses BCELoss to optimize the model.

  • MLFDLoss: A variant that uses MLFDLoss to optimize the model.

  • CL + BCELoss: A variant that uses the unsupervised contrastive learning module to enhance features and BCELoss to optimize the model.

  • CL + MLFDLoss: A variant that uses the unsupervised contrastive learning module to enhance features and MLFDLoss to optimize the model.

As depicted in Fig. 2A, CELA-MFP achieves the best performance on the MFBP dataset when incorporating both the contrastive learning module and MLFDLoss simultaneously, demonstrating a Precision of 0.804, a Coverage of 0.813, an Accuracy of 0.799, an Absolute true of 0.781, and an Absolute false of 0.078. The removal of the contrastive learning module or the substitution of MLFDLoss with BCELoss results in a decline in model performance. Similarly, CELA-MFP-classifier achieves its best results across most metrics on the MFBP dataset when both the contrastive learning module and MLFDLoss are employed simultaneously, with a Precision of 0.791, a Coverage of 0.800, an Accuracy of 0.790, and an Absolute true of 0.778. Eliminating the contrastive learning module leads to a reduction in Precision, Coverage, Accuracy, and Absolute true, while replacing MLFDLoss with BCELoss also diminishes the model's performance. Similar results are observed on the MFTP dataset, as indicated in Fig. 2B. Detailed experimental results are available in Tables S21–S24. These findings demonstrate the effectiveness of the contrastive learning module and MLFDLoss within CELA-MFP and CELA-MFP-classifier. When comparing the predictive performance of CELA-MFP and CELA-MFP-classifier, as shown in Fig. 2A–D and Tables S25–S26, CELA-MFP performs better on most indicators.

Figure 2. Comparison of performance between different methods for identifying multi-functional peptides. (A and B) The impact of contrastive learning and MLFDLoss on the MFBP and MFTP datasets based on CELA-MFP and CELA-MFP-classifier. (C and D) The performance of all variants on the MFBP and MFTP datasets.

We conducted an ablation study on the MFTP dataset to evaluate the effect of the contrastive learning weight, testing different values of $\lambda_{CL}$ over 100 training epochs. At each epoch, we systematically evaluated the model's performance on four key metrics: Precision, Coverage, Accuracy, and Absolute true, as shown in Fig. S2. The model demonstrated optimal performance at a $\lambda_{CL}$ value of 0.001.

To explore the effectiveness of MLFDLoss in addressing data imbalance, we conducted a comparative analysis of prediction performance using various methods on the MFTP dataset. In addition to BCELoss, we employed other approaches specifically designed to tackle data imbalance: (i) Data ReSampling, where classes with fewer samples were oversampled during training; (ii) DSCLoss, a dice coefficient-based loss function tailored for data imbalance [41]; and (iii) BinaryDiceLoss, a variation of the Dice coefficient [42]. These approaches have been utilized in prior studies to mitigate data imbalance issues. To ensure a fair comparison, the same model architecture is used across all experimental setups. The performance comparisons of different methods, as shown in Fig. S3, indicate that MLFDLoss consistently outperforms the other methods on metrics such as Precision, Accuracy, and Absolute true, demonstrating its superior efficacy in managing data imbalance.

Moreover, we evaluated the performance of each variant on the test set to verify whether different variants of the framework exhibit overfitting. We presented the performance of four variant models (CL + MLFDLoss, CL + BCELoss, MLFDLoss, and BCELoss) on the MFBP and MFTP datasets, specifically examining changes in Precision, Coverage, Accuracy, and Absolute true on the test set over 100 training epochs. According to Fig. S4, we can observe that with an increasing number of training epochs, the three variant methods—CL + MLFDLoss, CL + BCELoss, and MLFDLoss—did not show signs of overfitting across the four main metrics: Precision, Coverage, Accuracy, and Absolute true on MFBP and MFTP. In contrast, BCELoss exhibited slight overfitting in terms of Precision and Accuracy, and severe overfitting in terms of Absolute true on MFTP. These experimental results fully demonstrate the superior generalization ability of our method and highlight the effectiveness of the proposed contrastive learning module and MLFDLoss in preventing model overfitting.

Comparison with existing methods

We compared the proposed CELA-MFP with various existing methods, including four machine learning-based methods (CLR [43], RAKEL [44], RBRL [45], and MLDF [46]) and five deep learning-based methods (MPMABP [29], MLBP [30], PrMFTP [31], ETFC [32], and iMFP-LG [33]). Based on the performance evaluation conducted on the MFBP dataset (Fig. 3A, Table S27) and the MFTP dataset (Fig. 3B, Table S28), CELA-MFP exhibits superior performance compared with state-of-the-art methods across a majority of evaluation metrics. Specifically, on the MFBP dataset, CELA-MFP achieves a Precision of 0.804, a Coverage of 0.813, and an Accuracy of 0.799, surpassing iMFP-LG by 0.7%, 1.0%, and 0.3%, respectively. On the MFTP dataset, CELA-MFP achieves a Precision of 0.739, a Coverage of 0.754, and an Accuracy of 0.700, surpassing iMFP-LG by 0.9%, 2.4%, and 1.1%, respectively. We utilize the pretrained language model TAPEbert and the enhanced capabilities of contrastive learning to develop robust and expressive representations for all categories. The pretrained language model, trained on extensive datasets, provides a favorable initial state, enhancing the model's generalization ability. Additionally, contrastive learning improves feature distinguishability by metric optimization in the feature space. We also integrate multi-label focal dice loss to refine our approach. Simultaneously, we utilize a learnable decoder that captures and learns the correlations and dependencies between different categories through self-attention and cross-attention mechanisms, as well as learnable category embeddings. These results indicate the competitiveness of our method in the task of predicting multi-label functional peptides, with notable improvements observed in key metrics.

Figure 3. Comparison of CELA-MFP with other methods on the MFBP (A) and MFTP (B) datasets.

Interpretable analysis for CELA-MFP

The original intent of CELA-MFP is to enhance the representation of peptide sequences through an unsupervised contrastive learning module and to model the relationships between different functions with a Transformer decoder, thereby improving the performance of multi-functional peptide prediction. Hence, the objective of this section is to elucidate the decision-making process for predicting multifunctional peptides by visualizing the distribution of peptide representations, sequence motifs, and the interplay among diverse functionalities.

Visualization of peptide sequence representations

In this section, we show the t-SNE [47] visualization of the distribution of peptide representations for the testing samples in MFBP and MFTP, showcasing both the initial feature vectors and the latent feature vectors learned by CELA-MFP. The initial feature vectors were generated by pretrained Tasks Assessing Protein Embeddings (TAPE) with a dimensionality of 768. Similarly, the latent feature vectors learned by the fine-tuned pLM in CELA-MFP also have a dimensionality of 768, and t-SNE [47] is utilized to project the high-dimensional feature vectors into the 2D space. Subparts A–D in Figure 4, respectively, illustrate the distribution of samples encoded by the initial and latent feature vectors for MFBP and MFTP.
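A minimal sketch of this projection with scikit-learn is given below; the perplexity and random seed are illustrative choices, as the authors' exact t-SNE settings are not reported.

```python
# Sketch of the t-SNE projection used for Fig. 4: 768-d peptide representations
# are reduced to 2-D coordinates suitable for scatter plotting.
import numpy as np
from sklearn.manifold import TSNE

features = np.random.randn(500, 768)   # placeholder for pLM peptide representations
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
print(coords.shape)                     # (500, 2)
```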

Figure 4. t-SNE visualization of the distribution of peptide representations. Peptide representations are obtained from the pretrained and fine-tuned pLM in CELA-MFP on the MFBP dataset (A, B) and the MFTP dataset (C, D), respectively.

As shown in Fig. 4A and C, the pretrained pLM can extract discriminative representations for peptide sequences. In contrast, Fig. 4B and D show that the pLM fine-tuned through CELA-MFP training effectively clusters the feature representations of peptides with similar functions and enhances the differentiation between sequences with distinct functionalities. This suggests that CELA-MFP can extract more expressive feature vectors. Interestingly, peptide clusters exhibiting multiple functions occupy the intermediate domain between clusters characterized by singular functions. In the MFBP distribution, peptide clusters with ADP_AHP functionality are situated between clusters with ADP and AHP functionalities. Clusters with ADP_AIP functionality are also close to clusters with ADP and AIP functionalities (Fig. 4B). Similar observations are made in the MFTP distribution, where clusters with ABP_ACP functionality are positioned between clusters with ABP and ACP functionalities (Fig. 4D). These results indicate that the latent representations learned by CELA-MFP significantly enhance the discriminative capabilities of peptides.

Visualization of peptide sequence motifs

As shown in Fig. 5, to explore the relationship between the discriminative amino acid segments emphasized by the attention mechanism in CELA-MFP and the function-related amino acid segments obtained through statistical analysis, we visualized nine cases. These sequences are extracted from the ACP, ADP, and AHP test data of the MFBP dataset. Each test peptide sequence is input into the pLM, and the attention weight matrices from the 12 attention heads in each of the 12 layers are obtained. For every attention matrix, the weight of each block represents the total attention received by the amino acid at the corresponding position in the peptide sequence. Darker colors in the attention matrix blocks indicate a more significant contribution of the corresponding amino acid to distinguishing peptide functional categories. Additionally, Figure 5 presents ACP, ADP, and AHP motifs obtained through STREME [48], allowing us to evaluate the correspondence between the confidence distribution of amino acids learned via the attention mechanism and the motifs. From Fig. 5, we observe that individual amino acids or segments with higher attention weights correspond to those with higher frequencies in the motifs. The parameters for STREME are detailed in Table S29. Furthermore, we utilized the attention visualization tool bertviz [49] to show the details of the attention patterns (Fig. S5). The attention density distributions of these sequences correspond to the heatmap representations in the attention maps shown in Fig. 5. These results indicate the capability of CELA-MFP to identify functional regions within peptide sequences.
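A minimal sketch of how such per-residue attention profiles can be derived is shown below, assuming the attention matrices of each encoder layer have already been exported from the pLM (how they are exported depends on the implementation); column sums give the total attention each position receives, averaged over layers and heads.

```python
# Sketch: aggregating encoder attention into a per-residue profile for heatmaps.
import torch

def residue_attention_profile(attentions):
    # attentions: list of (num_heads, L, L) tensors, one per encoder layer
    stacked = torch.stack(attentions)        # (num_layers, num_heads, L, L)
    per_position = stacked.sum(dim=-2)       # total attention received by each position
    return per_position.mean(dim=(0, 1))     # average over layers and heads -> (L,)

attn = [torch.softmax(torch.randn(12, 9, 9), dim=-1) for _ in range(12)]
profile = residue_attention_profile(attn)    # higher values ~ darker heatmap cells
```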

Figure 5. Visualization of attention maps for several test samples on MFBP. (A–C) show three cases of ACP, (D–F) show three cases of ADP, and (G–I) show three cases of AHP, where the attention patterns captured by the pLM match the motifs discovered by STREME. The head index refers to one of the attention matrices of the Transformer-based encoder layers.

Visualization of attention relationships among multiple functionalities

To observe the relationships between different functionalities, we visualized the attention weight matrices from the eight attention heads in the self-attention layer of the Transformer decoder. The inter-class relationship distribution is represented by the attention matrix $A \in \mathbb{R}^{K \times K}$, where each element $A_{uv}$ signifies the relationship between functionality classes $u$ and $v$. As shown in Fig. 6A–C, in the attention matrices for three different samples, darker colors indicate stronger correlations between distinct functionalities. For instance, for the sample sequence 'GPFPILV' with functionalities ADP and AHP, the highest attention weight appears at the intersection of ADP and AHP in its attention heatmap. This observation suggests a strong correlation between ADP and AHP, aligning with the sample's actual functional labels. These results demonstrate that CELA-MFP has the capability to learn relationships between different functional labels.

Figure 6. Visualization of attention relationships between multiple functions. (A–C) show three multi-functional peptide cases where the relationships learned by the Transformer decoder align with their true functional labels. The head index refers to one of the attention matrices of the Transformer-based decoder layers.

The CELA-MFP web server

To help researchers predict multi-functional peptides, we have developed a user-friendly web server for the CELA-MFP model, which can be accessed at http://dreamai.cmii.online/CELA-MFP. This web server offers an intuitive and user-friendly interface, allowing users to predict multi-functional peptides by entering FASTA-formatted peptide sequences or uploading FASTA files containing peptide sequences (Fig. 7). This provides users with a convenient tool to delve deeper into the understanding of the functionality and characteristics of peptide sequences, offering robust support for research and applications in related fields.

Figure 7. User interface of the CELA-MFP web service. (A) The homepage, which introduces the functions and features of CELA-MFP. (B) The prediction page, where users enter peptide sequences, select a suitable prediction model, and submit to obtain the corresponding prediction results.

Discussion

Peptides play crucial roles in biological processes such as cell signaling, enzyme regulation, and gene expression. In drug research and development, peptides have garnered significant attention. Studies on peptide structure and function have led to the discovery of many biologically active peptides, such as anti-bacterial, anti-viral, and anti-cancer peptides. As research progresses, more and more peptides are discovered to have multifunctional properties. For example, the peptide aurein 1.2 from Litoria aurea has shown broad-spectrum antibacterial and anticancer effects without toxicity [50–52]. The pleurocidin-family peptides pleurocidin 03 and pleurocidin 07 from Atlantic halibut exhibit strong antibacterial and anticancer activities [53, 54]. Traditional experimental methods are time-consuming and labor-intensive, hindering large-scale peptide screening. Computational methods offer a solution by enabling rapid screening and activity prediction of peptide candidates, aiding drug development.

This study proposes a new multi-functional peptide prediction framework, named CELA-MFP, which incorporates feature contrastive enhancement and label-adaptive capabilities. Contrastive learning improves peptide sequence representations, reducing overfitting and enhancing focus on different categories. A Transformer decoder models relationships between functional labels, adapting to datasets with varying label categories via self-attention. Comparative results on MFBP and MFTP datasets demonstrate the superior performance of CELA-MFP compared with existing methods. The framework's interpretability is shown through attention pattern visualizations, highlighting amino acid influences and modeling functionality associations. To support the exploration of multifunctional peptides, a web server has been established.

Although the CELA-MFP model achieves competitive results on both MFBP and MFTP, it still has certain limitations. On the one hand, the peptide features utilized are quite limited; this study only employs sequence context information and does not incorporate additional feature information. On the other hand, the multi-functional peptide dataset used in this study still suffers from issues such as uneven distribution of sample sizes and incomplete categorization. In the future, we plan to expand the dataset size by collecting more datasets of functional peptides. Additionally, we aim to integrate structural information, physicochemical properties, and posttranslational modification information, to obtain better representation features of peptides.

Key Points

  • We present CELA-MFP, an intelligent computational model designed for the identification of multi-functional peptides.

  • CELA-MFP initiates the process by extracting features from peptides using a pLM. Subsequently, it employs contrastive learning to enhance the representation of peptide sequences and utilizes a Transformer decoder to model relationships between different functionalities, thereby improving the performance of multi-functional peptide prediction.

  • CELA-MFP has exhibited superior performance in two datasets when compared with state-of-the-art methods.

  • The interpretability of CELA-MFP is demonstrated by visualizing attention patterns in pLM and Transformer decoder.

  • A user-friendly web server has been developed to facilitate rapid prediction of multi-functional peptides.

Supplementary Material

Supplementary_files0703new_bbae348

Contributor Information

Yitian Fang, State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China; Peng Cheng Laboratory, 2 Xingke 1st Street, Nanshan District, Shenzhen 518055, China.

Mingshuang Luo, Peng Cheng Laboratory, 2 Xingke 1st Street, Nanshan District, Shenzhen 518055, China.

Zhixiang Ren, Peng Cheng Laboratory, 2 Xingke 1st Street, Nanshan District, Shenzhen 518055, China.

Leyi Wei, Centre for Artificial Intelligence Driven Drug Discovery, Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China; School of Informatics, Xiamen University, 422 Siming South Road, Xiamen 361005, China.

Dong-Qing Wei, State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic and Developmental Sciences, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai 200240, China; Peng Cheng Laboratory, 2 Xingke 1st Street, Nanshan District, Shenzhen 518055, China.

 

Conflict of interest: The authors declare no competing interests.

Funding

This work was supported by the Intergovernmental International Scientific and Technological Innovation and Cooperation Program of The National Key R&D Program (2023YFE0199200), the National Natural Science Foundation of China (32070662 and 32030063), the Joint Research Funds for Medical and Engineering and Scientific Research at Shanghai Jiao Tong University (YG2021ZD02), the Internal Research Grants of Macao Polytechnic University (RP/CAI02/2023), and the Science and Technology Development Fund (0177/2023/RIA3). The computations were partially performed at the Peng Cheng Laboratory and the Center for High-Performance Computing, Shanghai Jiao Tong University.

Data availability

The code is available on https://github.com/WeiLab-Biology/CELA-MFP.

References

  • 1. Basith S, Manavalan B, Hwan Shin T. et al. Machine intelligence in peptide therapeutics: a next-generation tool for rapid disease screening. Med Res Rev 2020;40:1276–314. 10.1002/med.21658.
  • 2. Sánchez A, Vázquez A. Bioactive peptides: a review. Food Quality and Safety 2017;1:29–46. 10.1093/fqs/fyx006.
  • 3. Fosgerau K, Hoffmann T. Peptide therapeutics: current status and future directions. Drug Discov Today 2015;20:122–8. 10.1016/j.drudis.2014.10.003.
  • 4. Muttenthaler M, King GF, Adams DJ. et al. Trends in peptide drug discovery. Nat Rev Drug Discov 2021;20:309–25. 10.1038/s41573-020-00135-8.
  • 5. Haggag YA, Donia AA, Osman MA. et al. Peptides as drug candidates: limitations and recent development perspectives. Biom J 2018;8:1. 10.26717/BJSTR.2018.08.001694.
  • 6. Dziuba J, Iwaniak A, Minkiewicz P. Computer-aided characteristics of proteins as potential precursors of bioactive peptides. Polimery 2003;48:50–3. 10.14314/polimery.2003.050.
  • 7. Usmani SS, Bedi G, Samuel JS. et al. THPdb: database of FDA-approved peptide and protein therapeutics. PloS One 2017;12:e0181748. 10.1371/journal.pone.0181748.
  • 8. Hoskin DW, Ramamoorthy A. Studies on anticancer activities of antimicrobial peptides. Biochim Biophys Acta BBA-Biomembranes 2008;1778:357–75. 10.1016/j.bbamem.2007.11.008.
  • 9. Thakur N, Qureshi A, Kumar M. AVPpred: collection and prediction of highly effective antiviral peptides. Nucleic Acids Res 2012;40:W199–204. 10.1093/nar/gks450.
  • 10. Manavalan B, Shin TH, Kim MO. et al. AIPpred: sequence-based prediction of anti-inflammatory peptides using random forest. Front Pharmacol 2018;9:276. 10.3389/fphar.2018.00276.
  • 11. Shoombuatong W, Schaduangrat N, Pratiwi R. et al. THPep: a machine learning-based approach for predicting tumor homing peptides. Comput Biol Chem 2019;80:441–51. 10.1016/j.compbiolchem.2019.05.008.
  • 12. Meng C, Hu Y, Zhang Y. et al. PSBP-SVM: a machine learning-based computational identifier for predicting polystyrene binding peptides. Front Bioeng Biotechnol 2020;8:245. 10.3389/fbioe.2020.00245.
  • 13. Jiang M, Zhao B, Luo S. et al. NeuroPpred-Fuse: an interpretable stacking model for prediction of neuropeptides by fusing sequence information and feature selection methods. Brief Bioinform 2021;22:bbab310. 10.1093/bib/bbab310.
  • 14. Charoenkwan P, Schaduangrat N, Nantasenamat C. et al. iQSP: a sequence-based tool for the prediction and analysis of quorum sensing peptides via Chou's 5-steps rule and informative physicochemical properties. Int J Mol Sci 2020;21:75. 10.3390/ijms21072629.
  • 15. Zhang W, Xia E, Dai R. et al. PredAPP: predicting anti-parasitic peptides with undersampling and ensemble approaches. Interdiscip Sci 2022;14:258–68. 10.1007/s12539-021-00484-x.
  • 16. Dai R, Zhang W, Tang W. et al. BBPpred: sequence-based prediction of blood-brain barrier peptides with feature representation learning and logistic regression. J Chem Inf Model 2021;61:525–34. 10.1021/acs.jcim.0c01115.
  • 17. Ahmad A, Akbar S, Khan S. et al. Deep-AntiFP: prediction of antifungal peptides using distanct multi-informative features incorporating with deep neural networks. Chemom Intel Lab Syst 2021;208:104214. 10.1016/j.chemolab.2020.104214.
  • 18. Sharma R, Shrivastava S, Kumar Singh S. et al. Deep-ABPpred: identifying antibacterial peptides in protein sequences using bidirectional LSTM with word2vec. Brief Bioinform 2021;22:bbab065. 10.1093/bib/bbab065.
  • 19. Charoenkwan P, Yana J, Nantasenamat C. et al. iUmami-SCM: a novel sequence-based predictor for prediction and analysis of umami peptides using a scoring card method with propensity scores of dipeptides. J Chem Inf Model 2020;60:6666–78. 10.1021/acs.jcim.0c00707.
  • 20. Khatun M, Hasan MM, Kurata H. PreAIP: computational prediction of anti-inflammatory peptides by integrating multiple complementary features. Front Genet 2019;10:129. 10.3389/fgene.2019.00129.
  • 21. Yu L, Jing R, Liu F. et al. DeepACP: a novel computational approach for accurate identification of anticancer peptides by deep learning algorithm. Mol Ther Nucleic Acids 2020;22:862–70. 10.1016/j.omtn.2020.10.005.
  • 22. Cai L, Wang L, Fu X. et al. ITP-Pred: an interpretable method for predicting therapeutic peptides with fused features low-dimension representation. Brief Bioinform 2021;22:bbaa367. 10.1093/bib/bbaa367.
  • 23. Yang X, Jin J, Wang R. et al. CACPP: a contrastive learning-based Siamese network to identify anticancer peptides based on sequence only. J Chem Inf Model 2023;64:2807–16. 10.1021/acs.jcim.3c00297.
  • 24. Xing W, Zhang J, Li C. et al. iAMP-Attenpred: a novel antimicrobial peptide predictor based on BERT feature extraction method and CNN-BiLSTM-attention combination model. Brief Bioinform 2024;25:bbad443. 10.1093/bib/bbad443.
  • 25. Lee H, Lee S, Lee I. et al. AMP-BERT: prediction of antimicrobial peptide function based on a BERT model. Protein Sci 2023;32:e4529. 10.1002/pro.4529.
  • 26. Fang Y, Xu F, Wei L. et al. AFP-MFL: accurate identification of antifungal peptides using multi-view feature learning. Brief Bioinform 2023;24:bbac606. 10.1093/bib/bbac606.
  • 27. Charoenkwan P, Nantasenamat C, Hasan MM. et al. BERT4Bitter: a bidirectional encoder representations from transformers (BERT)-based model for improving the prediction of bitter peptides. Bioinformatics 2021;37:2556–62. 10.1093/bioinformatics/btab133.
  • 28. Zhang X, Wei L, Ye X. et al. SiameseCPP: a sequence-based Siamese network to predict cell-penetrating peptides by contrastive learning. Brief Bioinform 2023;24:bbac545. 10.1093/bib/bbac545.
  • 29. Li Y, Li X, Liu Y. et al. MPMABP: a CNN and Bi-LSTM-based method for predicting multi-activities of bioactive peptides. Pharmaceuticals 2022;15:707. 10.3390/ph15060707.
  • 30. Tang W, Dai R, Yan W. et al. Identifying multi-functional bioactive peptide functions using multi-label deep learning. Brief Bioinform 2022;23:bbab414. 10.1093/bib/bbab414.
  • 31. Yan W, Tang W, Wang L. et al. PrMFTP: multi-functional therapeutic peptides prediction based on multi-head self-attention mechanism and class weight optimization. PLoS Comput Biol 2022;18:e1010511. 10.1371/journal.pcbi.1010511.
  • 32. Fan H, Yan W, Wang L. et al. Deep learning-based multi-functional therapeutic peptides prediction with a multi-label focal dice loss function. Bioinformatics 2023;39:btad334. 10.1093/bioinformatics/btad334.
  • 33. Luo J, Zhao K, Chen J. et al. Discovery of novel multi-functional peptides by using protein language models and graph-based deep learning. bioRxiv 2023.
  • 34. Fu L, Niu B, Zhu Z. et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 2012;28:3150–2. 10.1093/bioinformatics/bts565.
  • 35. Zhang Y, Zhang Y, Xiong Y. et al. T4SEfinder: a bioinformatics tool for genome-scale prediction of bacterial type IV secreted effectors using pre-trained protein language model. Brief Bioinform 2022;23:bbab420. 10.1093/bib/bbab420.
  • 36. Rao R, Bhattacharya N, Thomas N. et al. Evaluating protein transfer learning with TAPE. Adv Neural Inf Process Syst 2019;32:9689–701.
  • 37. El-Gebali S, Mistry J, Bateman A. et al. The Pfam protein families database in 2019. Nucleic Acids Res 2019;47:D427–32. 10.1093/nar/gky995.
  • 38. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30.
  • 39. Chen T, Kornblith S, Norouzi M. et al. A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning. PMLR, 2020, pp. 1597–607.
  • 40. Loshchilov I, Hutter F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • 41. Li X, Sun X, Meng Y. et al. Dice loss for data-imbalanced NLP tasks. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 465–76.
  • 42. Jadon S. A survey of loss functions for semantic segmentation. In: 2020 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB). IEEE, 2020, pp. 1–7.
  • 43. Fürnkranz J, Hüllermeier E, Loza Mencía E. et al. Multilabel classification via calibrated label ranking. Mach Learn 2008;73:133–53. 10.1007/s10994-008-5064-8.
  • 44. Tsoumakas G, Vlahavas I. Random k-labelsets: an ensemble method for multilabel classification. In: European Conference on Machine Learning. Berlin, Heidelberg: Springer, 2007, pp. 406–17.
  • 45. Wu G, Zheng R, Tian Y. et al. Joint ranking SVM and binary relevance with robust low-rank learning for multi-label classification. Neural Netw 2020;122:24–39. 10.1016/j.neunet.2019.10.002.
  • 46. Yang L, Wu X-Z, Jiang Y. et al. Multi-label learning with deep forest. In: ECAI 2020. IOS Press, 2020, pp. 1634–41.
  • 47. Van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
  • 48. Bailey TL. STREME: accurate and versatile sequence motif discovery. Bioinformatics 2021;37:2834–40. 10.1093/bioinformatics/btab203.
  • 49. Vig J. BertViz: a tool for visualizing multihead self-attention in the BERT model. In: ICLR Workshop: Debugging Machine Learning Models, 2019.
  • 50. Rozek T, Wegener KL, Bowie JH. et al. The antibiotic and anticancer active aurein peptides from the Australian bell frogs Litoria aurea and Litoria raniformis: the solution structure of aurein 1.2. Eur J Biochem 2000;267:5330–41. 10.1046/j.1432-1327.2000.01536.x.
  • 51. Dennison SR, Harris F, Phoenix DA. The interactions of aurein 1.2 with cancer cell membranes. Biophys Chem 2007;127:78–83. 10.1016/j.bpc.2006.12.009.
  • 52. Giacometti A, Cirioni O, Riva A. et al. In vitro activity of aurein 1.2 alone and in combination with antibiotics against gram-positive nosocomial cocci. Antimicrob Agents Chemother 2007;51:1494–6. 10.1128/AAC.00666-06.
  • 53. Patrzykat A, Gallant JW, Seo J-K. et al. Novel antimicrobial peptides derived from flatfish genes. Antimicrob Agents Chemother 2003;47:2464–70. 10.1128/AAC.47.8.2464-2470.2003.
  • 54. Hilchie AL, Doucette CD, Pinto DM. et al. Pleurocidin-family cationic antimicrobial peptides are cytolytic for breast carcinoma cells and prevent growth of tumor xenografts. Breast Cancer Res 2011;13:1–16. 10.1186/bcr3043.
