Bioinformatics. 2023 Mar 10;39(3):btad108. doi: 10.1093/bioinformatics/btad108

ExamPle: explainable deep learning framework for the prediction of plant small secreted peptides

Zhongshen Li 1,2,#, Junru Jin 3,4,#, Yu Wang 5,6, Wentao Long 7,8, Yuanhao Ding 9, Haiyan Hu 10, Leyi Wei 11,12,
Editor: Pier Luigi Martelli
PMCID: PMC10027287  PMID: 36897030

Abstract

Motivation

Plant Small Secreted Peptides (SSPs) play an important role in plant growth, development, and plant–microbe interactions. Therefore, the identification of SSPs is essential for revealing their functional mechanisms. Over the last few decades, machine learning-based methods have been developed, accelerating the discovery of SSPs to some extent. However, existing methods highly depend on handcrafted feature engineering, which can overlook latent feature representations and limit predictive performance.

Results

Here, we propose ExamPle, a novel deep learning model using a Siamese network and multi-view representation for the explainable prediction of plant SSPs. Benchmarking comparison results show that ExamPle performs significantly better than existing methods in the prediction of plant SSPs. Our model also shows excellent feature extraction ability. Importantly, by utilizing in silico mutagenesis experiments, ExamPle can discover sequential characteristics and identify the contribution of each amino acid to the predictions. The key novel principle learned by our model is that the head region of the peptide and some specific sequential patterns are strongly associated with the SSPs' functions. Thus, ExamPle is expected to be a useful tool for predicting plant SSPs and designing effective plant SSPs.

Availability and implementation

Our codes and datasets are available at https://github.com/Johnsunnn/ExamPle.

1 Introduction

Plant Small Secreted Peptides (SSPs) are crucial intercellular messenger molecules that regulate a multitude of processes (Matsubayashi 2014). SSPs are usually encoded within preproteins consisting of 100–250 amino acids. These preproteins are then digested into shorter bioactive peptides (Lease and Walker 2006; Breiden and Simon 2016; de Bang et al. 2017) that exhibit physiological effects at extremely low doses, often in the nanomolar range (Murphy et al. 2012).

SSPs have become a significant class of regulatory molecules involved in plant growth, development, and plant–microbe interactions (Czyzewicz et al. 2013; Nakaminami et al. 2018). During plant growth and development, most SSPs act as signaling molecules that participate in cell-to-cell communication by binding membrane receptors and coordinating responses with plant hormones (Murphy et al. 2012; Fukuda and Ohashi-Ito 2019). In terms of meristem maintenance, CLE14 and CLE40 expression has been reported to play a part in regulating meristematic activity and cell number in the root meristematic zone of Arabidopsis thaliana (De Smet et al. 2008; Meng and Feldman 2010). Additionally, SSPs are involved in a variety of biotic stress responses in diverse plant species. For instance, an SSP named SYSTEMIN, discovered in Solanum lycopersicum, was the first wound response signaling peptide (Pearce et al. 1991; Constabel et al. 1998). When herbivores or pathogens attack plants, SYSTEMIN can activate several defense signals and pathways by interacting with SYSTEMIN RECEPTOR 1 (SYR1), including the stimulation of PROTEASE INHIBITOR production and the enhancement of ethylene and jasmonic acid biosynthesis (Kandoth et al. 2007; Wang et al. 2018). Moreover, SSPs are crucial for other physiological processes in plants, such as embryogenesis, reproductive development, pathogen interaction, and increasing fertilizer-use effectiveness (Matsubayashi 2014; Breiden and Simon 2016). Given their diverse effects on plants, SSPs are of interest as potential strategies to improve plant performance.

In traditional biological experiments, researchers use protein mass spectrometry (MS) to discover SSPs. For example, a novel 15-aa SSP named CEP1, encoded by AT1G47485, was identified in Arabidopsis thaliana (A. thaliana) by liquid chromatography–mass spectrometry analysis (Ohyama et al. 2008). However, identifying plant SSPs on a large scale through biological experiments is difficult because such experiments require considerable time and manpower. In recent years, more and more computational methods have been developed to solve various biological problems (Mrozek 2020; Li et al. 2021a,b, 2022). To address the issue mentioned above, researchers have developed computational methods for predicting plant SSPs. To predict SSPs containing an N-terminal signal sequence (NSS), Teufel et al. (2022) proposed SignalP 6.0, a model based on protein language models for the prediction of signal peptides. Since the N-terminal signal sequence usually exists in SSPs, SignalP 6.0 can be used for predicting SSPs as well. The potential problem is that it might produce false positives, as the NSS also often occurs in several types of membrane proteins (Uhlén et al. 2015).

The method mentioned above has a number of shortcomings, and there is a lack of publicly available tools specifically for predicting plant SSPs. Importantly, these methods do not consider the analysis of the sequential patterns of plant SSPs, which is significant for downstream peptide tasks. To address these problems, we propose ExamPle, a novel explainable contrastive hybrid-view dual Siamese framework for plant SSP prediction and sequential pattern discovery. Notably, with a hybrid view of peptide sequences and secondary structures, we use the Siamese network architecture and contrastive learning to automatically learn context-sensitive feature representations, sufficiently capturing discriminative information to represent plant SSPs. Experimental results show that our model achieves excellent performance for plant SSP prediction and is significantly better than existing machine learning methods. A comparison between handcrafted feature encodings and our model using dimension reduction methods demonstrates the strong learning ability of our model. Furthermore, by quantitatively evaluating amino acid contributions, we demonstrate that ExamPle is capable of discovering important sequential characteristics that impact the predictive performance and identifying the contribution of each amino acid, which is basically consistent with the statistical results. In addition, by accumulating functional differences between a variant and the original sequence, we reveal that the head region of the peptide and some specific sequential patterns are strongly associated with SSP function. Overall, ExamPle not only provides excellent prediction ability for plant SSPs but also gives useful insights into effective and stable plant SSP design.

2 Materials and methods

2.1 Dataset

Plant SSPs dataset: We collect positive samples of the plant SSPs dataset from MtSSPdb (Boschiero et al. 2020). MtSSPdb is a comprehensive database that contains plant SSPs discovered in the model legume Medicago truncatula. As for negative samples, we identify two non-secreted peptide families, CYSTM and DVL, from previous studies (Butenko et al. 2009; Xu et al. 2018); afterwards, we search for non-SSP sequences and download them from the Pfam database, a large collection of protein families (Mistry et al. 2021). After that, we use CD-HIT (Li and Godzik 2006) to reduce the sequence similarity within the dataset to 20%. In addition, datasets with balanced positive and negative samples generally benefit the training of deep learning models, so we take the same number of positive and negative samples from the two sources. Following the above procedure, we obtain the plant SSPs dataset, including 1184 plant secreted peptides and 1184 plant non-secreted peptides. The dataset is then divided into training and testing sets in proportions of 80% and 20%, respectively.
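The balanced-dataset construction and 80/20 split described above can be sketched as follows; the FASTA file names, the random seed, and the use of scikit-learn's train_test_split are illustrative assumptions rather than the authors' released pipeline.

```python
# Minimal sketch of the balanced dataset construction and 80/20 split.
# File names are hypothetical; sequences are assumed to be redundancy-reduced already.
import random
from sklearn.model_selection import train_test_split

def read_fasta(path):
    """Return a list of (id, sequence) tuples from a FASTA file."""
    records, header, seq = [], None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header:
                    records.append((header, "".join(seq)))
                header, seq = line[1:], []
            elif line:
                seq.append(line)
        if header:
            records.append((header, "".join(seq)))
    return records

# Hypothetical inputs: positives from MtSSPdb, negatives (CYSTM/DVL families) from Pfam.
pos = read_fasta("mtsspdb_ssp.fasta")
neg = read_fasta("pfam_non_ssp.fasta")

random.seed(0)
neg = random.sample(neg, len(pos))          # balance positives and negatives

X = [s for _, s in pos] + [s for _, s in neg]
y = [1] * len(pos) + [0] * len(neg)

# 80% training / 20% testing, stratified by label
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```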

Peptide secondary structure: In this study, we use the PHAT web interface to generate peptide secondary structures. PHAT, proposed by Jiang et al. (2023), is a deep learning framework for predicting peptide secondary structures. It combines a powerful pre-trained protein language model with a novel hypergraph multi-head attention network: the former transfers semantic knowledge from large-scale proteins to peptides and learns high-latent, long-term features of peptide residues, while the latter encodes peptide residues with multi-semantic secondary structural information and captures contextual features from consecutive regions using multi-level attention mechanisms. PHAT also provides an interactive, code-free, non-programmatic web interface. We use this interface to obtain the predicted 3-state secondary structure of each peptide, and the 3-state secondary structure information is then added to the plant SSPs dataset.

2.2 Framework overview

Here, we propose a new predictive framework based on a dual Transformer and Bi-GRU architecture with a Siamese network, which fuses peptide sequence and secondary structure information to predict plant SSPs. Our framework architecture is illustrated in Fig. 1, including B(i) contrastive learning with a Transformer-based dual Siamese network and B(ii) feature fusion and classification. First, in B(i), we input the plant SSP sequences and their secondary structures into the dual Transformer-based Siamese network for feature embedding. Specifically, each Transformer-based Siamese network includes a Transformer encoder, a Bi-GRU (Bidirectional Gated Recurrent Unit), a fully connected neural network (FNN), and a contrastive module. The two Transformer models do not share weights. The Transformer encoder generates context-sensitive feature representations of SSP sequences and secondary structures. Afterward, the Bi-GRU captures feature representations with long-distance information. Subsequently, features are embedded using an FNN to generate the unified representation, after which we can obtain the distance between two SSP feature representations. In the contrastive module, the contrastive loss is calculated; it is used to decrease the distance between representation vectors with the same label and increase the distance between representation vectors with different labels. Second, in B(ii), we fuse the sequence and secondary structure features learned in B(i). We then use an FNN as the discriminator. Given the prediction score produced by the discriminator, the model determines whether a given peptide belongs to the plant SSPs: it is predicted as a plant SSP if the prediction score is higher than 0.5.
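A minimal PyTorch sketch of this architecture is given below, assuming illustrative layer and vocabulary sizes; positional encoding, padding masks, and the contrastive module are omitted for brevity, so this is an outline of the described design rather than the authors' implementation.

```python
# Sketch of one Siamese branch (Transformer encoder -> Bi-GRU -> FNN) and the
# fusion + discriminator head. All sizes are illustrative assumptions.
import torch
import torch.nn as nn

class SiameseBranch(nn.Module):
    def __init__(self, vocab_size=25, d_model=128, n_layers=2, gru_hidden=64, out_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model, padding_idx=0)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        self.bigru = nn.GRU(d_model, gru_hidden, batch_first=True, bidirectional=True)
        self.fnn = nn.Linear(2 * gru_hidden, out_dim)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        x = self.encoder(self.embed(tokens))      # context-sensitive embeddings
        x, _ = self.bigru(x)                      # long-distance features
        return self.fnn(x.mean(dim=1))            # pooled, unified representation

class ExamPleLike(nn.Module):
    """Two branches (sequence and secondary structure) whose outputs are fused."""
    def __init__(self):
        super().__init__()
        self.seq_branch = SiameseBranch(vocab_size=25)   # 20 amino acids + special tokens
        self.ss_branch = SiameseBranch(vocab_size=5)     # 3-state structure + special tokens
        self.discriminator = nn.Sequential(
            nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, 1), nn.Sigmoid())

    def forward(self, seq_tokens, ss_tokens):
        fused = torch.cat([self.seq_branch(seq_tokens),
                           self.ss_branch(ss_tokens)], dim=-1)
        return self.discriminator(fused)          # score > 0.5 -> predicted plant SSP
```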

Figure 1. The overview of the ExamPle framework. (A) How we pre-process and construct the plant SSPs dataset and how we obtain the peptide secondary structure information. (B(i)) We feed the peptide sequences and peptide secondary structures into the dual Transformer-based Siamese network to get their corresponding representations. (B(ii)) After acquiring the corresponding representations, ExamPle automatically fuses the two features and then feeds the fused features to the discriminator for classification. (C) For the peptide sequences and secondary structures, we separately use a Siamese network composed of a Transformer, a Bi-GRU, and an FNN to extract latent features. Meanwhile, we use the contrastive module to optimize the parameters of the network.

2.2.1 Transformer

The Transformer was proposed by Vaswani et al. (2017); it was initially applied to machine translation tasks and has become the state-of-the-art model in a large number of Natural Language Processing (NLP) tasks. Compared with earlier methods based on protein energy characteristics (Mrozek et al. 2007, 2009; Malysiak-Mrozek and Mrozek 2011), AlphaFold achieved great success by building on the Transformer (Jumper et al. 2021). Therefore, in this work, we adopt the Transformer as our model's backbone and use its original version. The Transformer is composed of an encoding module and a decoding module. The encoding module comprises a series of encoders used to extract the features of the inputs, and the decoding module consists of a series of decoders used for generating representations. Here, we only extract features using the Transformer encoder. The Transformer encoder layer is illustrated in Fig. 1.

To encode positional information, a positional vector computed with sine and cosine functions is added to each input embedding. In other words, the output $X_{i,\mathrm{embed}}$ of the embedding layer is

$X_{i,\mathrm{embed}} = \mathrm{Embedding}(X_i) + \mathrm{PosEmbedding}(X_i)$  (1)

Afterward, $X_{i,\mathrm{embed}}$ is fed into a Multi-headed Self-Attention (MSA) block and a Multilayer Perceptron (MLP) block. Layer normalization (LN) is applied before each block, and residual connections are applied after each block. The output of the first encoder layer, $X_{i,\mathrm{output}}^{1}$, is

$X_{i,\mathrm{output}}^{1} = \mathrm{MSA}(\mathrm{LN}(X_{i,\mathrm{embed}}^{1})) + X_{i,\mathrm{embed}}^{1}$  (2)
$X_{i,\mathrm{output}}^{1} = \mathrm{MLP}(\mathrm{LN}(X_{i,\mathrm{output}}^{1})) + X_{i,\mathrm{output}}^{1}$  (3)

Then, $X_{i,\mathrm{output}}^{1}$ is fed into the second encoder layer, generating the output denoted as $X_{i,\mathrm{output}}^{2}$, and the same procedure applies to the following encoder layers. Supposing there are $n$ encoder layers in the Transformer encoding module, the output of the last layer is $X_{i,\mathrm{output}}^{n}$. After the above operations, we obtain a context-dependent embedded representation. The Transformer uses the multi-head attention mechanism, which captures not only the local information within a fixed window of the sequence but also the global information of the sequence, which is why the representation is context dependent. With the ability to learn context-sensitive embedding representations, this approach not only resolves the word ambiguity issue in NLP but also significantly improves token representation in peptide sequences. In addition, the subsequent GRU module also benefits from this context-sensitive embedding representation.
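The following sketch illustrates Equations (1)–(3): a sinusoidal positional encoding added to the input embedding, followed by one pre-LayerNorm encoder layer with multi-head self-attention and MLP blocks and residual connections. Dimensions are illustrative assumptions.

```python
# Sketch of the positional encoding (Eq. 1) and one pre-LN encoder layer (Eqs. 2-3).
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    """Sine/cosine positional encodings, as in the original Transformer (d_model even)."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angle = pos / torch.pow(torch.tensor(10000.0), i / d_model)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

class PreLNEncoderLayer(nn.Module):
    def __init__(self, d_model=128, n_heads=4, d_ff=256):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.msa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                 nn.Linear(d_ff, d_model))

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        h = self.ln1(x)
        a, _ = self.msa(h, h, h)          # Eq. (2): MSA(LN(x)) + x
        x = x + a
        return x + self.mlp(self.ln2(x))  # Eq. (3): MLP(LN(x)) + x
```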

2.2.2 Gated recurrent unit

The gated recurrent unit (GRU) was proposed by Cho et al. (2014). It aims to make each recurrent unit adaptively capture dependencies at different time scales. The GRU architecture is simplified compared with Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997) while remaining similarly effective. Like the LSTM unit, the GRU has gating units that modulate the flow of information inside the unit, but it has no separate memory cell. The input gate, forget gate, and output gate (the three essential parts of an LSTM cell) become two gates in the GRU: an update gate ($z_t$) and a reset gate ($r_t$). The GRU combines the unit state and output into one state $h_t$. For each element (time step $t$) in the input sequence, each layer computes as in Equations (4)–(7), where $W_z$, $W_r$, and $W$ are the weights of the GRU; $h_t$ is the hidden state at time step $t$; $h_{t-1}$ is the hidden state of the layer at time $t-1$ or the initial hidden state at time 0; $\sigma$ is the sigmoid function; and $\odot$ is the Hadamard product. In a multilayer Bi-GRU, the input $x_t^{(l)}$ of the $l$-th layer ($l \ge 2$) is the hidden state $h_t^{(l-1)}$ of the previous layer. After passing through $K$ layers, we obtain the final flattened hidden-state output $h^K = [h_1^K, h_2^K, h_3^K, \ldots, h_L^K]$ of layer $K$.

$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$  (4)
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$  (5)
$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$  (6)
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$  (7)
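A single GRU time step implementing Equations (4)–(7) can be sketched as below; in practice the same computation is provided by torch.nn.GRU, and the bias-free weight shapes here are illustrative.

```python
# Sketch of one GRU time step following Equations (4)-(7).
import torch

def gru_step(x_t, h_prev, W_z, W_r, W):
    """x_t: (d_in,), h_prev: (d_h,), each weight matrix: (d_h, d_h + d_in)."""
    concat = torch.cat([h_prev, x_t])                          # [h_{t-1}, x_t]
    z_t = torch.sigmoid(W_z @ concat)                          # Eq. (4): update gate
    r_t = torch.sigmoid(W_r @ concat)                          # Eq. (5): reset gate
    h_tilde = torch.tanh(W @ torch.cat([r_t * h_prev, x_t]))   # Eq. (6): candidate state
    return (1 - z_t) * h_prev + z_t * h_tilde                  # Eq. (7): new hidden state
```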

2.2.3 Siamese network and contrastive module

The Siamese network can be used to quantify the similarity of two sequences. For example, aiming to find matching and non-matching pairs of images, Melekhov et al. (2016) obtained feature vectors with convolutional neural networks that were learned from labeled examples of matching and non-matching image pairs by using a contrastive loss function in a Siamese network architecture. Li and Zhang (2019) proposed a Siamese network called SiamVGG that combined a convolutional neural network backbone with a cross-correlation operator, taking advantage of the features from exemplar images for more accurate object tracking. Specifically, given two input vectors $v_1$ and $v_2$, they are fed into the same modules, and the network maps the inputs to the same feature space. Afterwards, the network computes the similarity of the two inputs using a distance function, denoted as $D(v_1, v_2)$. Here, we use the Euclidean distance:

$D(v_1, v_2) = \lVert v_1 - v_2 \rVert_2$  (8)

After obtaining the hidden-state output $h^K$, we use an FNN to map it to the same feature space. By doing so, we can map the input vector of a sequence $X_i$ to a representation vector $v_i$:

$v_i = \mathrm{MLP}(\mathrm{BiGRU}(\mathrm{TransformerEncoder}(X_i)))$  (9)

To decrease the distance between representation vectors of peptide sequences with the same label and increase the distance between those with different labels, we use the contrastive loss (Hadsell et al. 2006) $L_1$ (Equation 10) as the loss function, where $Y = 0$ if sequences $X_1$ and $X_2$ have the same label, $Y = 1$ if they are different, and $m > 0$ is a margin. For a dissimilar pair, if the distance between $v_1$ and $v_2$ is larger than $m$, the pair does not contribute to the loss and is not optimized; otherwise, the pair is pushed further apart. That is, the margin defines a radius around $v$, and only dissimilar pairs whose distance falls inside this range contribute to the loss function.

$L_1(v_1, v_2, Y) = \frac{1}{2}(1 - Y)\,D(v_1, v_2)^2 + \frac{1}{2}\,Y\,\max(0,\, m - D(v_1, v_2))^2$  (10)
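Equations (8) and (10) can be sketched as follows; the margin value is an illustrative assumption.

```python
# Sketch of the Euclidean distance (Eq. 8) and the contrastive loss (Eq. 10).
import torch

def contrastive_loss(v1, v2, Y, margin=1.0):
    """Y = 0 for same-label pairs, Y = 1 for different-label pairs (batched tensors)."""
    d = torch.norm(v1 - v2, p=2, dim=-1)                    # Eq. (8)
    same = 0.5 * (1 - Y) * d.pow(2)                         # pull same-label pairs together
    diff = 0.5 * Y * torch.clamp(margin - d, min=0).pow(2)  # push different-label pairs apart up to the margin
    return (same + diff).mean()                             # Eq. (10), averaged over the batch
```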

In this way, we can use a gradient descent algorithm to optimize the parameters of the network and generate a unified representation for each input pair. We also need a discriminator $D(v)$ to map the representation vector to the class outputs. Consequently, the cross-entropy loss is used as one of the loss functions in our model:

$L_2(v, Y) = -Y\log D(v) - (1 - Y)\log(1 - D(v))$  (11)

Notably, if the FNN and the discriminator are connected directly, their parameters are updated jointly, which can result in catastrophic forgetting (McCloskey and Cohen 1989). To avoid this, we freeze the parameters of the previous layers and only update the parameters of the discriminator when training the model. Consequently, the comprehensive loss function of our model is given by Equation (12), where $Y_1$ and $Y_2$ are the labels of $v_1$ and $v_2$, with $Y_1 = 1$ if $v_1$ is positive and $Y_1 = 0$ otherwise.

$L(v_1, v_2, Y, Y_1, Y_2) = L_1(v_1, v_2, Y) + L_2(v_1, Y_1) + L_2(v_2, Y_2)$  (12)
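A sketch of the combined objective in Equation (12), reusing the contrastive_loss function sketched above, together with the freezing step described in the text; the module handles and optimizer settings are hypothetical assumptions.

```python
# Sketch of the comprehensive loss (Eq. 12) and of updating only the discriminator.
import torch
import torch.nn as nn

bce = nn.BCELoss()

def total_loss(v1, v2, Y, Y1, Y2, discriminator, margin=1.0):
    """L = L1(v1, v2, Y) + L2(v1, Y1) + L2(v2, Y2); discriminator ends in a sigmoid."""
    l1 = contrastive_loss(v1, v2, Y, margin)               # Eq. (10), defined in the earlier sketch
    l2a = bce(discriminator(v1).squeeze(-1), Y1.float())   # Eq. (11) for v1
    l2b = bce(discriminator(v2).squeeze(-1), Y2.float())   # Eq. (11) for v2
    return l1 + l2a + l2b                                  # Eq. (12)

# Freezing the representation layers so only the discriminator is updated
# (`encoder` and `discriminator` are hypothetical module handles):
# for p in encoder.parameters():
#     p.requires_grad = False
# optimizer = torch.optim.Adam(discriminator.parameters(), lr=1e-3)
```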

2.3 Performance measures

Following previous works, we use the following classification metrics to evaluate our model: accuracy (ACC), specificity (SP), sensitivity (SE), and the Matthews correlation coefficient (MCC). The formulas of these metrics are as follows:

$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$  (13)
$\mathrm{SP} = \frac{TN}{TN + FP}$  (14)
$\mathrm{SE} = \frac{TP}{TP + FN}$  (15)
$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FN)(TN + FP)}}$  (16)

In the formulas, TP is the number of true-positive samples, FP is the number of false-positive samples, TN is the number of true-negative samples, and FN is the number of false-negative samples. Accuracy is the ratio of correctly classified samples to the total number of samples and is the most intuitive performance measurement. Specificity measures the proportion of negative samples classified correctly, and sensitivity measures the proportion of positive samples classified correctly. MCC takes both positives and negatives into account, so it is generally regarded as a balanced metric for measuring the overall performance of a model.
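The four metrics in Equations (13)–(16) can be computed directly from the confusion-matrix counts, as in this small sketch.

```python
# Sketch of Equations (13)-(16) from confusion-matrix counts.
import math

def classification_metrics(tp, tn, fp, fn):
    acc = (tp + tn) / (tp + tn + fp + fn)                        # Eq. (13)
    sp = tn / (tn + fp)                                          # Eq. (14)
    se = tp / (tp + fn)                                          # Eq. (15)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0          # Eq. (16)
    return {"ACC": acc, "SP": sp, "SE": se, "MCC": mcc}
```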

3 Results

3.1 Performance comparison of ExamPle with other methods on benchmark datasets

To evaluate the performance of ExamPle, we adopted several classic comparative models, including Naive Bayes (Rish 2001), Logistic Regression (Hosmer et al. 2013), SVM (Support Vector Machine) (Cortes and Vapnik 1995), Random Forest (Breiman 2001), DNN (Deep Neural Network) (Montavon et al. 2018), and a standard Transformer, given the lack of methods that directly predict SSPs in previous work. For a fair comparison, we used the training set for model training and evaluated all models on the independent test set under the same settings. For better reproducibility of the comparative experiment, the architectures of the DNN and the Transformer are implemented by the DeepBIO server (Wang et al. 2022). As illustrated in Fig. 2, ExamPle outperforms all of the comparative models in terms of the ACC, SE, SP, and MCC metrics. For instance, the ACC of ExamPle is 3.65% higher than that of the Transformer and 11.99% higher than that of Random Forest, and its MCC is 10.91% higher than that of the DNN and 56.35% higher than that of the SVM. In conclusion, these results demonstrate that ExamPle predicts plant SSPs more effectively than the compared machine learning and deep learning approaches.

Figure 2. Performance comparison between ExamPle, several classic machine learning models, and deep learning models. (A–D) respectively represent the ACC, SE, SP, and MCC values of our proposed ExamPle and other models, including Naive Bayes, Logistic Regression, SVM, Random Forest, DNN, and Transformer.

3.2 The feature representation of ExamPle is better than handcrafted feature encodings

To verify the effectiveness of the feature representation learned by ExamPle, we compared the performance of different handcrafted feature methods with our model. With the help of iLearnPlus (Chen et al. 2018, 2020, 2021), we chose five feature encoding methods: Amino Acid Composition (AAC) (Spackman et al. 1958), CTD Composition (CTDC), CTD Distribution (CTDD), CTD Transition (CTDT) (Dubchak et al. 1999), and Conjoint Triad (CTriad) (Shen et al. 2007). Detailed information on these handcrafted feature encodings can be found in Supplementary Table S1. We then separately fed the five handcrafted feature encodings into the classification layer of our model for training and testing.

As shown in Fig. 3, the CTriad encoding method reaches 94.67% ACC, while the CTDC encoding method reaches 78.86%. Nevertheless, the representation embedded by ExamPle is superior to all of the handcrafted feature encoding methods. In particular, for MCC, our model is 37.57% higher than CTDC and 24.28% higher than CTDT. To further validate the performance of ExamPle, we visualized the feature representations produced by the five handcrafted feature encoding methods as well as our model using dimension reduction. As illustrated in Fig. 3D–I, the representation encoded by our model separates the classes more clearly than any of the five handcrafted feature representations. In summary, with excellent adaptability and robustness, our model learns a more powerful feature representation than the handcrafted feature encoding methods.

Figure 3. Performance comparison between the feature representation of ExamPle and five handcrafted feature encoding methods. (A–C) respectively represent the ACC, SP, and MCC values of the feature representation of ExamPle and the handcrafted feature encoding methods, including CTDC, AAC, CTDT, CTDD, and CTriad. (D–I) Feature visualization results of ExamPle and the five handcrafted feature encoding methods.

3.3 The visualization of feature representations during the training process

It is rarely clear how a deep learning model works during the training process. For a more transparent understanding, we reduced our model's latent space to a 2D space using principal component analysis (PCA) (Abdi and Williams 2010) and t-distributed Stochastic Neighbor Embedding (t-SNE) (Van der Maaten and Hinton 2008) to verify the efficiency of ExamPle. As shown in Fig. 4, at the beginning of the training procedure, negative and positive samples are mixed together. As the number of training iterations increases, the samples are gradually divided into two groups, indicating that our model learns sufficient information through back-propagation optimization.

Figure 4. Visualization of samples in the test dataset by PCA and t-SNE. Epoch[x] in the title of each subfigure denotes the corresponding training epoch of the model; 0 and 1 respectively represent the non-SSPs and the SSPs. (A–C) PCA visualizations at different epochs; (D–F) t-SNE visualizations at different epochs.

3.4 Exploring the sequential characteristic of plant SSPs by using ExamPle

Given the excellent performance of our neural network architecture, we aim to further explore the sequential characteristics of plant SSPs from a computational perspective. To this end, we applied the in silico mutagenesis (ISM) interpretation method to every peptide in our dataset, mutating each reference amino acid of a peptide into the nineteen alternative amino acids and evaluating the change in the predicted probability. Moreover, we computed the positional amino acid probability distributions of the SSPs dataset and the non-SSPs dataset (Fig. 5A and B) as a reference for our analysis.
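The ISM procedure described above can be sketched as follows; predict_score is a hypothetical wrapper around the trained model that returns the SSP prediction score for a sequence.

```python
# Sketch of in silico mutagenesis (ISM): mutate every position to the 19 alternative
# amino acids and record the change in the model's prediction score.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def ism_matrix(sequence, predict_score):
    """Return a dict: (position, mutant_aa) -> score change relative to the wild type."""
    wild_type_score = predict_score(sequence)
    deltas = {}
    for pos, ref_aa in enumerate(sequence):
        for alt_aa in AMINO_ACIDS:
            if alt_aa == ref_aa:
                continue
            mutant = sequence[:pos] + alt_aa + sequence[pos + 1:]
            deltas[(pos, alt_aa)] = predict_score(mutant) - wild_type_score
    return deltas
```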

Figure 5. Results of exploring the sequential characteristics of plant SSPs using ExamPle. (A) and (B) respectively represent the probability distribution statistics of the plant SSPs and the plant non-SSPs. (C) Heat map showing the result of the ISM experiment: a positive value represents a higher prediction score for the sequence when a position is replaced with a certain amino acid, and a negative value represents a lower prediction score. (D) Comparison between the contribution scores and amino acid composition (AAC) of each amino acid on SSPs and non-SSPs. The contribution scores are the sums of each amino acid's scores on each peptide, obtained from the ISM experiment.

Here, we select one 59-residue sequence as the sample to generate the heat map in Fig. 5C, in which a positive value represents a higher prediction score for the sequence when a position is replaced with a certain amino acid, and a negative value represents a lower prediction score. This heat map reveals that the amino acids Q (glutamine) and P (proline) have a significant inhibitory effect on the prediction score. Specifically, at positions 10–16, when the amino acid is mutated to Q or P, the prediction score is significantly reduced. The likely reason is that Q and P are frequently present in the non-SSPs and occur less frequently in the SSPs, as illustrated in Fig. 5A and B; thus, when the model recognizes Q or P in a sequence, it tends to consider the sequence a non-SSP. In particular, C (cysteine) is frequently present in the tail of the non-SSPs, and observations from Fig. 5C imply that the amino acid C significantly reduces the predicted score in the tail region. In addition, the amino acids I (isoleucine), L (leucine), and F (phenylalanine) have a stimulative effect on the prediction score at most positions of the sequence. The likely reason is that, as shown in Fig. 5A, I, L, and F appear frequently in SSP sequences and rarely in non-SSP sequences; when the model recognizes I, L, or F in a sequence, it tends to consider the sequence an SSP.

Then, to make the results more statistically meaningful, we computed the contribution of each amino acid over the whole dataset. For each peptide of the SSPs and non-SSPs, we calculated amino acid scores by accumulating each position's score from the ISM results. In detail, summing the scores along each row in Fig. 5C yields twenty values, each representing the contribution score of one of the twenty amino acids in this 59-residue sequence. Given the amino acid contribution scores of each peptide, we estimated the comprehensive amino acid contribution scores on the SSPs and non-SSPs, respectively, by adding them up. The contributions of the different amino acids in the SSPs dataset and the non-SSPs dataset are shown in Fig. 5D. In addition, we used the Amino Acid Composition (AAC) as a statistical reference. As can be seen in Fig. 5D, for the vast majority of amino acids, the relative contribution of each amino acid to the prediction score in the SSPs or non-SSPs is consistent with the relative AAC of that amino acid in the SSPs or non-SSPs. In short, for a particular amino acid, if it has a higher AAC in the SSPs dataset than in the non-SSPs, its contribution to the prediction score is likely to be higher in the SSPs than in the non-SSPs. Interestingly, the AAC of amino acid G (glycine) in the SSPs is smaller than that in the non-SSPs, yet our model considers that G contributes more to the prediction score in the SSPs, showing that our model can mine features that cannot be captured by traditional statistical methods. In summary, these experiments indicate that our model can mine the sequential characteristics of SSPs.

3.5 The interpretable analysis on region importance by using ExamPle

We next seek to understand the connection between the functional impact and the sequence region. To this end, regardless of the positive or negative effect of a position, we used the absolute values obtained from the ISM experiment as importance scores to evaluate how the contextual region affects the plant SSPs' function. To unify all the sequences to one fixed length, we split each sequence into 50 parts; each part is referred to as a position ratio. In Fig. 6A, we present a complete correlation map between amino acid and position ratio over the whole dataset: the deeper the color, the higher the importance score. For example, amino acid M (methionine) has a greater functional impact in the central region, while amino acid C (cysteine) has a greater functional impact in the tail region, both of which are meaningful considerations for peptide design. Next, to illustrate more intuitively which region is more important, we divided each peptide in the dataset into 25 parts and accumulated the amino acid scores to obtain a comprehensive score for each position. In Fig. 6B, there is a clear increasing trend in importance scores in the head region and a slow decline in the central region. Between position ratios 1 and 3, the median of the importance scores is twice as high as that of the position ratios beyond 9.
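A sketch of this region-importance computation is given below: absolute ISM deltas are accumulated per position and binned into a fixed number of position-ratio bins so that peptides of different lengths become comparable. The bin count and the reuse of the ism_matrix helper from the earlier sketch are assumptions.

```python
# Sketch of region-importance scores binned by position ratio.
import numpy as np

def position_ratio_scores(sequence, deltas, n_bins=25):
    """deltas: the (position, mutant_aa) -> score-change dict from ism_matrix()."""
    length = len(sequence)
    per_position = np.zeros(length)
    for (pos, _aa), d in deltas.items():
        per_position[pos] += abs(d)                 # importance, regardless of sign
    bins = np.zeros(n_bins)
    for pos, score in enumerate(per_position):
        bins[min(int(pos / length * n_bins), n_bins - 1)] += score
    return bins
```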

Figure 6. Interpretable analysis of region importance using ExamPle. (A) A complete correlation map between amino acid and position ratio over the whole dataset, using the absolute values obtained from the ISM experiment. (B) A boxplot of the importance scores by position ratio. (C) The result of the clipping experiment. Forward clipping cuts the sequence from the head region and conserves the tail region; backward clipping cuts the sequence from the tail region and conserves the head region. (D) Comparison between the motif found by the conventional tool STREME and the motif extracted from our model.

To better verify the phenomenon in Fig. 6B, we conducted a clipping experiment in which we clipped each sequence from the forward direction and the backward direction, respectively, and fed the clipped sequence into the model to investigate whether the clipped sequence conserves the original function. As can be seen in Fig. 6C, the backward clipping performance is very robust, and the accuracy remains above 0.9 until the clipping ratio increases to 0.5. However, the metrics for forward clipping decline rapidly once the clipping operation starts, and the ACC over the whole dataset falls below 0.5 when the clipping ratio exceeds 0.7. Interestingly, SP, which measures the accuracy on negative cases, remains at the same level at the beginning of the clipping, which demonstrates that negative sequences do not rely on the head region and may instead be related to the central region.
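The clipping operation itself can be sketched as a simple helper; the direction labels follow the definitions in the text.

```python
# Sketch of the clipping experiment: remove the head (forward) or tail (backward)
# of a peptide at a given ratio before re-scoring it with the trained model.
def clip_sequence(sequence, ratio, direction="forward"):
    """Remove `ratio` of the residues from the head (forward) or tail (backward)."""
    n_cut = int(len(sequence) * ratio)
    if direction == "forward":
        return sequence[n_cut:]                    # keep the tail region
    return sequence[:len(sequence) - n_cut]        # keep the head region
```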

Having used importance scores to identify which region is important for peptide function, it is natural to explore the motif of this region from a computational perspective. In Fig. 6D, we discovered one basic motif based on the importance scores and compared it with the motif found by a conventional tool, STREME (Bailey 2021). As can be seen, our learned motif (highlighted with a gray window) closely resembles STREME's motif. We then used TOMTOM (Bailey et al. 2009) to quantitatively calculate the motif similarity, measured by the P-value; a lower P-value indicates a higher degree of motif consistency. Figure 6D demonstrates that the motif we select is highly similar to STREME's motif, indicating that the conserved sequential pattern can be discovered by our model. Taken together, these experimental results suggest that ExamPle has the potential to be a new tool for sequential pattern discovery and may pave the way for stable and effective plant SSP design.

4 Conclusion

In this study, we present ExamPle, a novel explainable contrastive hybrid-view dual Siamese framework for predicting plant SSPs and discovering their sequential patterns. The experimental results demonstrate that the Siamese framework with secondary structure information can learn discriminative feature information and effectively enhance the performance of our model for plant SSP prediction. Moreover, by adopting dimension reduction methods, the visualization of the latent space clearly shows that our model learns stronger representations than traditional handcrafted feature engineering. Using the ISM interpretation method, ExamPle can discover sequence characteristics and identify the contribution of each amino acid, which are mostly consistent with the statistical results. In addition, by accumulating functional differences between variant and original sequences, the key novel principle learned by our model is that the head region of peptides and some specific sequential patterns are strongly associated with SSP function. Thus, our well-trained model ExamPle can not only provide predictions for plant SSPs but also pave the way for effective and stable plant SSP design.

Supplementary Material

btad108_Supplementary_Data

Contributor Information

Zhongshen Li, School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Junru Jin, School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Yu Wang, School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Wentao Long, School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Yuanhao Ding, Hainan Key Laboratory for Sustainable Utilization of Tropical Bioresource, College of Tropical Crops, Hainan University, Haikou 570228, China.

Haiyan Hu, Hainan Key Laboratory for Sustainable Utilization of Tropical Bioresource, College of Tropical Crops, Hainan University, Haikou 570228, China.

Leyi Wei, School of Software, Shandong University, Jinan 250101, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest: The authors declare no conflict of interest.

Funding

This work was supported by the Natural Science Foundation of China [62071278].

Data availability

The authors declare that the datasets and related codes are available at https://github.com/Johnsunnn/ExamPle.

References

  1. Abdi H, Williams LJ.. Principal component analysis. Wiley Interdiscip Rev Comput Stat 2010;2:433–59. [Google Scholar]
  2. Bailey TL, Boden M, Buske FA. et al. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Res 2009;37(Suppl_2):W202–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  3. Bailey TL. STREME: accurate and versatile sequence motif discovery. Bioinformatics 2021;37:2834–40. [DOI] [PMC free article] [PubMed] [Google Scholar]
  4. Boschiero C, Dai X, Lundquist PK. et al. MtSSPdb: the Medicago truncatula small secreted peptide database. Plant Physiol 2020;183:399–413. [DOI] [PMC free article] [PubMed] [Google Scholar]
  5. Breiden M, Simon R.. Q&A: how does peptide signaling direct plant development? BMC Biol 2016;14:1–7. [DOI] [PMC free article] [PubMed] [Google Scholar]
  6. Breiman L. Random forests. Mach Learn 2001;45:5–32. [Google Scholar]
  7. Butenko MA, Vie AK, Brembu T. et al. Plant peptides in signalling: looking for new partners. Trends Plant Sci 2009;14:255–63. [DOI] [PubMed] [Google Scholar]
  8. Chen Z, Zhao P, Li C. et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res 2021;49:e60. [DOI] [PMC free article] [PubMed] [Google Scholar]
  9. Chen Z, Zhao P, Li F. et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34:2499–502. [DOI] [PMC free article] [PubMed] [Google Scholar]
  10. Chen Z, Zhao P, Li F. et al. iLearn: an integrated platform and meta-learner for feature engineering, machine-learning analysis and modeling of DNA, RNA and protein sequence data. Brief Bioinform 2020;21:1047–57. [DOI] [PubMed] [Google Scholar]
  11. Cho K, van Merriënboer B, Gulcehre C. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, Doha, Qatar, EMNLP. pp.1724–1734.
  12. Constabel CP, Yip L, Ryan CA.. Prosystemin from potato, black nightshade, and bell pepper: primary structure and biological activity of predicted systemin polypeptides. Plant Mol Biol 1998;36:55–62. [DOI] [PubMed] [Google Scholar]
  13. Cortes C, Vapnik V.. Support-vector networks. Mach Learn 1995;20:273–97. [Google Scholar]
  14. Czyzewicz N, Yue K, Beeckman T. et al. Message in a bottle: small signalling peptide outputs during growth and development. J Exp Bot 2013;64:5281–96. [DOI] [PubMed] [Google Scholar]
  15. de Bang TC, Lundquist PK, Dai X. et al. Genome-wide identification of Medicago peptides involved in macronutrient responses and nodulation. Plant Physiol 2017;175:1669–89. [DOI] [PMC free article] [PubMed] [Google Scholar]
  16. De Smet I, Vassileva V, De Rybel B. et al. Receptor-like kinase ACR4 restricts formative cell divisions in the Arabidopsis root. Science 2008;322:594–7. [DOI] [PubMed] [Google Scholar]
  17. Dubchak I, Muchnik I, Mayor C. et al. Recognition of a protein fold in the context of the SCOP classification. Proteins 1999;35:401–7. [PubMed] [Google Scholar]
  18. Fukuda H, Ohashi-Ito K.. Vascular tissue development in plants. Curr Top Dev Biol 2019;131:141–60. [DOI] [PubMed] [Google Scholar]
  19. Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06). New York, USA. IEEE, 2006, 1735–42.
  20. Hochreiter S, Schmidhuber J.. Long short-term memory. Neural Comput 1997;9:1735–80. [DOI] [PubMed] [Google Scholar]
  21. Hosmer DW Jr, Lemeshow S, Sturdivant RX.. Applied Logistic Regression. New York, USA: John Wiley & Sons, 2013. [Google Scholar]
  22. Jiang Y, Wang R, Feng J. et al. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Adv Sci 2023;10:2206151. [DOI] [PMC free article] [PubMed] [Google Scholar]
  23. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  24. Kandoth PK, Ranf S, Pancholi SS. et al. Tomato MAPKs LeMPK1, LeMPK2, and LeMPK3 function in the systemin-mediated defense response against herbivorous insects. Proc Natl Acad Sci USA 2007;104:12205–10. [DOI] [PMC free article] [PubMed] [Google Scholar]
  25. Lease KA, Walker JC.. The Arabidopsis unannotated secreted peptide database, a resource for plant peptidomics. Plant Physiol 2006;142:831–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  26. Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 2006;22:1658–9. https://doi.org/10.1093/bioinformatics/btl158. [DOI] [PubMed] [Google Scholar]
  27. Li Y, Qiao G, Gao X. et al. Supervised graph co-contrastive learning for drug-target interaction prediction. Bioinformatics 2022;10. [DOI] [PubMed] [Google Scholar]
  28. Li Y, Qiao G, Wang K. et al. Drug–target interaction predication via multi-channel graph neural networks. Brief Bioinform 2021. a;23:bbab346. [DOI] [PubMed] [Google Scholar]
  29. Li Y, Wang K, Wang G.. Evaluating disease similarity based on gene network reconstruction and representation. Bioinformatics 2021b. [DOI] [PubMed] [Google Scholar]
  30. Li Y, Zhang X. SiamVGG: Visual tracking using deeper siamese networks. arXiv, arXiv:1902.02804, 2019, preprint: not peer reviewed.
  31. Malysiak-Mrozek B, Mrozek D.. An improved method for protein similarity searching by alignment of fuzzy energy signatures. Int J Comput Intell Syst 2011;4:75–88. [Google Scholar]
  32. Matsubayashi Y. Posttranslationally modified small-peptide signals in plants. Annu Rev Plant Biol 2014;65:385–413. [DOI] [PubMed] [Google Scholar]
  33. McCloskey M, Cohen NJ.. Catastrophic interference in connectionist networks: the sequential learning problem. In: Bower GH (ed.) Psychology of Learning and Motivation. Amsterdam, Netherlands: Elsevier, 1989, 109–65. [Google Scholar]
  34. Melekhov I, Kannala J, Rahtu E. Siamese network features for image matching. In: 2016 23rd International Conference on Pattern Recognition (ICPR). IEEE, 2016, 378–83.
  35. Meng L, Feldman LJ.. CLE14/CLE20 peptides may interact with CLAVATA2/CORYNE receptor-like kinases to irreversibly inhibit cell division in the root meristem of Arabidopsis. Planta 2010;232:1061–74. [DOI] [PMC free article] [PubMed] [Google Scholar]
  36. Mistry J, Chuguransky S, Williams L. et al. Pfam: the protein families database in 2021. Nucleic Acids Res 2021;49:D412–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  37. Montavon G, Samek W, Müller K-R.. Methods for interpreting and understanding deep neural networks. Digit Signal Process 2018;73:1–15. [Google Scholar]
  38. Mrozek D, Malysiak B, Kozielski S. An optimal alignment of proteins energy characteristics with crisp and fuzzy similarity awards. In: 2007 IEEE International Fuzzy Systems Conference. IEEE, 2007, 1–6.
  39. Mrozek D, Malysiak-Mrozek B, Kozielski S. Alignment of protein structure energy patterns represented as sequences of Fuzzy Numbers. In: NAFIPS 2009-2009 Annual Meeting of the North American Fuzzy Information Processing Society. IEEE, 2009, 1–6.
  40. Mrozek D. A review of cloud computing technologies for comprehensive microRNA analyses. Computat Biol Chem 2020;88:107365. [DOI] [PubMed] [Google Scholar]
  41. Murphy E, Smith S, De Smet I.. Small signaling peptides in Arabidopsis development: how cells communicate over a short distance. Plant Cell 2012;24:3198–217. [DOI] [PMC free article] [PubMed] [Google Scholar]
  42. Nakaminami K, Okamoto M, Higuchi-Takeuchi M. et al. AtPep3 is a hormone-like peptide that plays a role in the salinity stress tolerance of plants. Proc Natl Acad Sci USA 2018;115:5810–5. [DOI] [PMC free article] [PubMed] [Google Scholar]
  43. Ohyama K, Ogawa M, Matsubayashi Y.. Identification of a biologically active, small, secreted peptide in Arabidopsis by in silico gene screening, followed by LC-MS-based structure analysis. Plant J 2008;55:152–60. [DOI] [PubMed] [Google Scholar]
  44. Pearce G, Strydom D, Johnson S. et al. A polypeptide from tomato leaves induces wound-inducible proteinase inhibitor proteins. Science 1991;253:895–7. [DOI] [PubMed] [Google Scholar]
  45. Rish I. An empirical study of the naive Bayes classifier. In: IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence. 2001, 41–6.
  46. Shen J, Zhang J, Luo X. et al. Predicting protein–protein interactions based only on sequences information. Proc Natl Acad Sci USA 2007;104:4337–41. [DOI] [PMC free article] [PubMed] [Google Scholar]
  47. Spackman DH, Stein WH, Moore S.. Automatic recording apparatus for use in chromatography of amino acids. Anal Chem 1958;30:1190–206. [PubMed] [Google Scholar]
  48. Teufel F, Almagro Armenteros JJ, Johansen AR. et al. SignalP 6.0 predicts all five types of signal peptides using protein language models. Nat Biotechnol 2022;40:1023–5. https://doi.org/10.1038/s41587-021-01156-3. [DOI] [PMC free article] [PubMed] [Google Scholar]
  49. Uhlén M, Fagerberg L, Hallström BM. et al. Tissue-based map of the human proteome. Science 2015;347:1260419. [DOI] [PubMed] [Google Scholar]
  50. Van der Maaten L, Hinton G.. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–2605. [Google Scholar]
  51. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. Adv Neural Inf Process Syst 2017;30. [Google Scholar]
  52. Wang L, Einig E, Almeida-Trapp M. et al. The systemin receptor SYR1 enhances resistance of tomato against herbivorous insects. Nat Plants 2018;4:152–6. [DOI] [PubMed] [Google Scholar]
  53. Wang R, Jiang Y, Jin J. et al. DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation, and visualization analysis. bioRxiv. 2022.2009. 2029.509859, 2022. [DOI] [PMC free article] [PubMed]
  54. Xu Y, Yu Z, Zhang D. et al. CYSTM, a novel non-secreted cysteine-rich peptide family, involved in environmental stresses in Arabidopsis thaliana. Plant Cell Physiol 2018;59:423–38. [DOI] [PubMed] [Google Scholar]


