Bioinformatics. 2024 Feb 1;40(2):btae057. doi: 10.1093/bioinformatics/btae057

StructuralDPPIV: a novel deep learning model based on atom structure for predicting dipeptidyl peptidase-IV inhibitory peptides

Ding Wang 1, Junru Jin 2, Zhongshen Li 3, Yu Wang 4, Mushuang Fan 5, Sirui Liang 6, Ran Su 7, Leyi Wei 8,9
Editor: Alfonso Valencia
PMCID: PMC10904144  PMID: 38305458

Abstract

Motivation

Diabetes is a chronic metabolic disorder that has been a major cause of blindness, kidney failure, heart attacks, stroke, and lower limb amputation across the world. To alleviate the impact of diabetes, researchers have developed the next generation of anti-diabetic drugs, known as dipeptidyl peptidase IV inhibitory peptides (DPP-IV-IPs). However, the discovery of these promising drugs has been restricted due to the lack of effective peptide-mining tools.

Results

Here, we present StructuralDPPIV, a deep learning model designed for DPP-IV-IP identification, which takes advantage of both molecular graph features of amino acids and sequence information. Experimental results on the independent test dataset and two wet-experiment datasets show that our model outperforms other state-of-the-art methods. Moreover, to better study what StructuralDPPIV learns, we used CAM technology and perturbation experiments to analyze our model, which yielded interpretable insights into the reasoning behind its predictions.

Availability and implementation

The project code is available at https://github.com/WeiLab-BioChem/Structural-DPP-IV.

1 Introduction

Type 2 diabetes is the most common form of diabetes worldwide (Chatterjee et al. 2017), accounting for 90%–95% of all cases of diabetes. It may not be diagnosed until several years after onset, once complications have developed. Until recently, this form of diabetes was seen only in adults, but it is now also occurring with increasing frequency in children (Copeland et al. 2013). Dipeptidyl peptidase IV (DPP-IV, E.C.3.4.14.5) is a cell-surface aminopeptidase that was originally characterized as a T-cell differentiation antigen (Kikkawa et al. 2006). This enzyme is involved in immune regulation, signal transduction, and apoptosis (Golightly et al. 2012). DPP-IV has been classified as a serine-type protease that cleaves N-terminal dipeptides from peptides carrying a proline or alanine residue at the second position (Casrouge et al. 2018). DPP-IV inhibitory peptides (DPP-IV-IPs) have been widely considered promising anti-type 2 diabetes drugs (De et al. 2019). Treatment with inhibitory peptides causes fewer adverse reactions in patients than chemically synthesized small-molecule drugs (Barnett 2006, Jarvis et al. 2013). Therefore, it is crucial to develop new DPP-IV inhibitory drugs and functional foods.

Traditional laboratory-based techniques for identifying DPP-IV-IPs are precise but resource-intensive, expensive, and time-consuming (Nongonierma and FitzGerald 2019, Wang et al. 2021). Enzymatic hydrolytic screening is still the main approach used, which hampers the efficiency of DPP-IV-IP screening. With the improvement and supplementation of peptide datasets, several machine learning-based predictors have been developed for DPP-IV-IPs (Charoenkwan et al. 2020a, Guan et al. 2022, Phasit et al. 2022). Despite these advancements, these models still exhibit some limitations, and further improvements are possible. iDPPIV-SCM, the first ML-based DPP-IV-IP predictor, uses the scoring card method (SCM); it achieved accuracies of 0.819 and 0.797 on the cross-validation and independent datasets, respectively, but lacks informative features from various facets, is not yet accurate enough for real-world applications, and lacks biochemical interpretability (Charoenkwan et al. 2020b). StackDPPIV combines machine learning methods and utilizes more feature encodings from multiple perspectives. It achieved an accuracy of 0.891, but depends on feature engineering to extract sequence and structure information (Phasit et al. 2022). BERT-DPPIV (Guan et al. 2022) is the current state-of-the-art predictor with satisfactory accuracy. However, it relies solely on an NLP method, depends on a time-consuming pretraining procedure, and lacks physicochemical information.

In the present study, we propose a sequence- and structure-based machine learning predictor, StructuralDPPIV, for DPP-IV-IP identification. The predictor uses a joint representation extracted from both an NLP method (TextCNN) and amino acid structure. To utilize information from amino acid structure, StructuralDPPIV uses the SMILES representation as a medium to extract physicochemical features, which makes the model interpretable at the atomic level. Furthermore, on an independent test dataset, as well as on the Pentapeptide and Tripeptide wet-experiment datasets (Guan et al. 2022), StructuralDPPIV consistently achieves an accuracy above 90%. Notably, it outperforms other established methods, affirming the robustness and efficacy of our model. To delve deeper into the model's underlying mechanisms, we used CAM (Class Activation Mapping) (Zhou et al. 2016) and perturbation analyses to investigate the relationship between predicted peptide sequences, amino acids, and their positions. The resulting interpretable findings align with biological knowledge and statistical scrutiny, thereby offering valuable insights and recommendations for peptide design.

2 Materials and methods

2.1 Dataset

We used the dataset of Charoenkwan et al. (Phasit et al. 2022) in this study for model development and assessment. The dataset contains 665 positive and 665 negative samples and is divided into a benchmark set for training and a test set for independent testing. The benchmark set contains 532 DPP-IV-IPs and 532 negative samples, while the independent test set includes 113 DPP-IV-IPs and the same number of non-DPP-IV-IPs.

The Pentapeptide and Tripeptide wet-experiment datasets are derived from the biological experiments in BERT-DPPIV (Guan et al. 2022). The Pentapeptide dataset consists of peptides that contain proline or alanine and repeat the unit VP or IPI. The Tripeptide dataset consists of four classes, i.e. WPX, WAX, WRX, and WVX, where X denotes any of the 20 standard human amino acids. All wet-experiment results and labels are given in the BERT-DPPIV paper.

2.2 StructuralDPPIV architecture

Figure 1 shows the StructuralDPPIV workflow and model composition. Our model consists of three parts: (B) a TextCNN encoding module, (C) a structural encoding module, and (D) a classification module. Given a peptide sequence, we use the text convolutional neural network (TextCNN) module and the structural encoding module to encode representations at the sequence level and at the atom-structure level, respectively. Obtaining a joint representation for a peptide thus involves two sub-representations, derived from the TextCNN module and the structural module. To fuse the information contained in these sub-representations, we apply element-wise multiplication to the two vectors. The resulting joint representation captures the combined features of the peptide as seen through both the textual and the structural information. We then use this joint representation to predict whether the peptide is expected to be a DPP-IV-IP. This methodology integrates multiple sources of information to enhance the accuracy of predictions regarding peptide properties.

Figure 1.

Overview of the StructuralDPPIV framework. (A) Peptide sequence. One sequence is taken from the dataset. (B) TextCNN encoding module. This module encodes the peptide from sequence information using the TextCNN model. (C) Structural encoding module. This module encodes the peptide from the perspective of amino acid structural information. (D) Classification module. After obtaining two sub-representations of the peptide from the TextCNN and structural modules, we fuse the two vectors by element-wise multiplication to obtain the joint representation, which is then used to predict whether the peptide is expected to be a DPP-IV-IP. (E) Prediction module. In this work, we verify the model's validity using two wet biological experiments. (F) Explainable analysis. We utilize two explainability methods, i.e. CAM technology and perturbation experiments, to analyze the knowledge learned by our model.

2.2.1 SMILES representation of peptides

Effective representation of molecular structure is a crucial step in the present study. SMILES (simplified molecular-input line-entry system) is a specification in the form of a line notation for describing the structure of chemical species using short ASCII strings (Weininger 1988). It has been widely used in many fields of bioinformatics, e.g. drug–target interaction and binding-affinity prediction (Karimi et al. 2019, Yang et al. 2021, Zeng et al. 2021) and peptide toxicity prediction (Wei et al. 2021). A recent study showed that SMILES can be used to encode DNA for the prediction of rice 6mA sites with deep learning methods, achieving good performance (Liu et al. 2022).
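For illustration, the residue-to-SMILES mapping can be expressed as a simple lookup table (a minimal sketch; only three of the 20 entries are shown, and the exact SMILES strings used in StructuralDPPIV may differ):

```python
# Per-residue SMILES lookup (a minimal sketch; the full 20-entry table
# for the standard amino acids is analogous).
AA_SMILES = {
    "A": "C[C@@H](N)C(=O)O",    # L-alanine
    "G": "NCC(=O)O",            # glycine
    "P": "OC(=O)[C@@H]1CCCN1",  # L-proline
}

def peptide_to_residue_smiles(seq: str) -> list:
    """Map each one-letter residue code to its amino acid SMILES string."""
    return [AA_SMILES[aa] for aa in seq]
```

In the full model, each per-residue SMILES string would then be parsed into a molecule object (e.g. with RDKit) before feature extraction.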

2.2.2 Structural encoding module

The structural encoding module encodes the peptide's SMILES representation to obtain physicochemical information; embedding the biochemical information contained in SMILES into the features is a crucial step. A peptide sequence S with amino acids a_1, a_2, …, a_n is encoded into a 3D tensor T_S of shape (n × m × p), where n is the maximum sequence length across the training and independent test datasets [90 for the dataset from Charoenkwan et al. (2020a)], m is the maximum number of non-hydrogen atoms in one amino acid (15 for the human amino acids), and p is the number of features extracted per atom (21 for our model). We write T_S as T_S(M_1, M_2, …, M_n), where M_i is the feature-representation matrix of amino acid a_i, and M_i as M_i(v_1, v_2, …, v_m), where v_j(f_1, f_2, …, f_p) is the feature vector of one atom. The meanings of these features are shown in Table 1.

Table 1.

The 21 atom features encoded by the structural feature-extraction module.

Atom feature Contents Possible value
0, 1, 2, 3 One-hot encoding: C, N, O, or S 0 or 1
4, 5, 6 One-hot encoding: degree 0 or 1
7, 8, 9, 10 One-hot encoding: number of Hs on the atom 0 or 1
11, 12, 13, 14 One-hot encoding: number of implicit Hs on the atom 0 or 1
15 One-hot encoding: aromatic or not 0 or 1
16, 17 One-hot encoding: in ring or not 0 or 1
18 Hybridization type Int, 1–7
19 Gasteiger charges (neighbors) Float
20 Gasteiger charges (self) Float

For each atom t, features 0–19 are multiplied by the atom's adjacency matrix, so the value encoded in each entry includes information about neighboring nodes. Feature 20 in Table 1 is appended to the multiplication result as information from the atom itself rather than from its neighbors. Given a peptide sequence, e.g. "ACDEFGHIKLMNPQRSTVWY," the structural encoding module first encodes each character, i.e. each amino acid, into a SMILES sequence representation, then generates a molecule object from the acquired SMILES with RDKit.
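The neighbor-aggregation step described above can be sketched with NumPy on toy inputs (the adjacency matrix, feature values, and function name here are illustrative, not taken from the actual implementation):

```python
import numpy as np

def aggregate_atom_features(adj, feats, self_charge):
    """Aggregate neighbor features via the adjacency matrix (features 0-19),
    then append each atom's own Gasteiger charge (feature 20)."""
    # adj: (m, m) adjacency matrix; feats: (m, k) per-atom features;
    # self_charge: (m,) per-atom charges that are NOT aggregated.
    aggregated = adj @ feats  # each row now sums its neighbors' features
    return np.hstack([aggregated, self_charge[:, None]])  # shape (m, k + 1)
```

With the paper's settings, feats would have k = 20 columns and the result 21, matching Table 1.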

We chose these features based on their biochemical relevance. The atomic features mentioned above comprehensively encapsulate the biochemical characteristics of each individual atom. Since the final vector representation for each atom is a result of the features of neighboring atoms multiplied by the adjacency matrix, this approach effectively conveys physicochemical information for each atom. Through our experimental tests, we found that the inclusion of atomic properties such as atomic number and second-degree neighbor features did not lead to an improvement in model performance.

For encoding features like atomic types, we used one-hot encoding. This encoding method is advantageous for non-ordinal categorical data as it helps avoid bias and can provide some degree of regularization. Our experiments showed that one-hot encoding was effective in improving model performance when encoding atomic types, whereas for other features, it either slightly outperformed direct encoding or showed minimal differences.

A graph convolutional network (GCN) generalizes convolution to graph-structured data and can capture complex relationships between nodes in a graph; it has been widely used in many bioinformatics tasks (Li et al. 2020, Chu et al. 2021, Ryu et al. 2020). Compared with direct connections between neural network layers, the residual blocks in ResNet effectively prevent gradients from exploding or vanishing in deep networks through shortcut connections (He et al. 2016a,b). In StructuralDPPIV, we utilize residual blocks for further information extraction. The workflow of the structural encoding module is as follows:

T_conv = Conv(T_S) (1)
T_res1 = T_conv + Conv(ReLU(Norm(Conv(ReLU(Norm(T_conv)))))) (2)
v_S = Conv(T_res1) + Conv(ReLU(Norm(Conv(ReLU(Norm(MP(T_res1))))))) (3)

where T_conv, T_res1, and v_S are the outputs of the convolutional layer and of the first and second residual blocks, respectively, Norm denotes batch normalization, and MP denotes the max-pooling operation. In the structural encoding module, we first feed the 3D tensor T_S encoded from the peptide sequence into the GCN layer, then feed the output into two residual blocks to extract more effective and distinguishable features.
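A residual block of the form used in Eq. (2) can be sketched in PyTorch as follows (the channel count and kernel size are illustrative assumptions, not the paper's actual hyperparameters):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Residual block of Eq. (2): T + Conv(ReLU(Norm(Conv(ReLU(Norm(T)))))).
    Channel count and kernel size are illustrative assumptions."""

    def __init__(self, channels: int):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(channels)
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm2 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, t):
        h = self.conv1(torch.relu(self.norm1(t)))
        h = self.conv2(torch.relu(self.norm2(h)))
        return t + h  # shortcut connection keeps gradients flowing
```

The shortcut addition is what distinguishes the block from a plain stack of convolutions and is the reason deep variants remain trainable.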

2.2.3 TextCNN encoding module

The text convolutional neural network (TextCNN) applies convolutional neural networks (CNNs) to sentence-level classification tasks. It applies filters of various sizes to a sentence, capturing diverse latent features. The resulting features are then pooled to generate a fixed-length representation, which is fed into a linear layer for prediction. TextCNN achieves satisfactory accuracy on multiple benchmarks (Kim 2014). In recent years, TextCNN has been widely used in many bioinformatics fields, such as cancer pathology report classification (Alawad et al. 2019, Savova et al. 2019, Alawad et al. 2020) and bioactive peptide discovery (He et al. 2022). In our model, we use TextCNN to provide a simple and intuitive NLP representation of peptide sequences. The following equations illustrate how this module works:

v_emb = Embed(a) (4)
v_conved = Convs(v_emb) (5)
v_T = MP(v_conved) (6)

where a refers to an amino acid, Embed to the embedding layer, Convs to the convolutional layers, and MP to the max-pooling operation. The TextCNN encoding module processes each amino acid into an embedding vector v_emb, and a peptide sequence with m amino acids into a tensor M(v_1, v_2, …, v_m). In this article, we use a sequence representation with non-static channels, i.e. the vector embedding of each amino acid can be updated during training by backpropagation. Convolutional filters with weights w are then applied to M, yielding a new embedding v_conved, which is used for max-over-time pooling. The final output of the linear layers behind the max-pooling module, denoted v_T, is fused (by element-wise multiplication) with v_S for classification.
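A minimal PyTorch sketch of the TextCNN encoder of Eqs. (4)–(6); the vocabulary size, embedding dimension, and filter sizes are illustrative assumptions, not the paper's actual settings:

```python
import torch
import torch.nn as nn

class TextCNNEncoder(nn.Module):
    """TextCNN encoder: Embed (Eq. 4), multi-size convolutions (Eq. 5),
    max-over-time pooling (Eq. 6). Hyperparameters are illustrative."""

    def __init__(self, vocab=21, emb_dim=64, n_filters=32, sizes=(2, 3, 4)):
        super().__init__()
        self.embed = nn.Embedding(vocab, emb_dim)            # Eq. (4)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k) for k in sizes  # Eq. (5)
        )

    def forward(self, tokens):                  # tokens: (batch, seq_len)
        x = self.embed(tokens).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # Eq. (6): max-over-time pooling per filter size, then concatenate.
        pooled = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(pooled, dim=1)         # (batch, n_filters * len(sizes))
```

Because the embedding layer is a trainable parameter, the channels are "non-static" in the sense described above: backpropagation updates the per-amino-acid vectors.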

2.2.4 Joint representation

The final stage of model processing is element-wise multiplication of the output vectors from the structural and TextCNN modules. For example, multiplying the structural embedding vector v_S = (1, 4, 2, 8) by the text embedding vector v_T = (5, 7, 1, 4) gives the final representation vector (5, 28, 2, 32). Element-wise multiplication introduces a non-linear interaction between the elements of the two feature vectors, which can capture complex relationships that simple addition may miss; this step is exactly the fusion mechanism mentioned above. The resulting vector is then used in the classification task:

v_final = v_S ⊙ v_T (7)
ŷ = Linear(ReLU(Dropout(Linear(v_final)))) (8)

where v_S and v_T are the intermediate representations derived from the structural and TextCNN encoding modules, v_final is the vector used for classification, ⊙ denotes the element-wise multiplication operation, and Linear, Dropout, and ReLU refer to the corresponding neural network layers.
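Eqs. (7) and (8) can be sketched in PyTorch as follows (the input dimension, hidden size, dropout rate, and output dimension are illustrative assumptions):

```python
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    """Eqs. (7)-(8): element-wise product of the two module outputs,
    then a small MLP head. All sizes are illustrative assumptions."""

    def __init__(self, dim=128, hidden=64, p_drop=0.5):
        super().__init__()
        # Layer order mirrors Eq. (8): Linear -> Dropout -> ReLU -> Linear.
        self.head = nn.Sequential(
            nn.Linear(dim, hidden), nn.Dropout(p_drop), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, v_s, v_t):
        v_final = v_s * v_t        # Eq. (7): element-wise multiplication
        return self.head(v_final)  # Eq. (8): classification logits
```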

2.3 Evaluation metrics

To evaluate the prediction performance of our model, we utilize five metrics commonly used in binary classification tasks, namely ACC, Sn, Sp, MCC, and AUC (Charoenkwan et al. 2021a,b):

ACC = (TP + TN) / (TP + TN + FP + FN) (9)
Sn = TP / (TP + FN) (10)
Sp = TN / (TN + FP) (11)
MCC = (TP × TN − FP × FN) / √((TP + FP)(TN + FN)(TP + FN)(TN + FP)) (12)

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively. Sp and Sn stand for specificity and sensitivity; they reflect the model's ability to recognize non-DPP-IV inhibitory peptides and DPP-IV inhibitory peptides, respectively. Several previous works have used these metrics to evaluate model performance. Typically, higher values of ACC, Sn, Sp, and MCC indicate better model performance.
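Equations (9)–(12) translate directly into code; a minimal sketch:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute ACC, Sn, Sp, and MCC (Eqs. 9-12) from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sn = tp / (tp + fn)   # sensitivity: recall on the positive class
    sp = tn / (tn + fp)   # specificity: recall on the negative class
    denom = math.sqrt((tp + fp) * (tn + fn) * (tp + fn) * (tn + fp))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return acc, sn, sp, mcc
```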

2.4 Training process

Here we provide comprehensive information about the training process of StructuralDPPIV. Computations were executed on an NVIDIA RTX 3090 GPU with 24 GB of memory. Prior to training, each peptide record was preprocessed to derive intermediate structural and textual encodings, referred to here as the intermediate SE and TE tensors. The SE tensor was generated directly from physicochemical features, and the intermediate TE tensor was a one-hot dictionary tensor. These tensors were then persistently stored on disk as pickle (.pkl) files. This nonparametric preprocessing step yields a significant acceleration in both the subsequent training and testing phases.

The training hyperparameters used in this study are as follows: a batch size of 32, a learning rate of 0.000005, and a maximum of 150 epochs, beyond which the model's performance reaches a sufficiently stable state. To further enhance performance, we used the focal loss (Lin et al. 2017) and the Adam optimizer (Kingma and Ba 2014). The training process takes approximately 8 minutes.
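For reference, a NumPy sketch of the binary focal loss (Lin et al. 2017); γ = 2 and α = 0.25 are the commonly used defaults, not necessarily the values used in this study:

```python
import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25):
    """Binary focal loss: down-weights well-classified examples so training
    focuses on hard ones. probs = predicted P(positive), targets = 0/1."""
    p_t = np.where(targets == 1, probs, 1.0 - probs)  # prob of the true class
    return float(np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t)))
```

The (1 − p_t)^γ factor shrinks the contribution of confident, correct predictions, which is the property that makes focal loss useful under class imbalance.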

3 Results

3.1 StructuralDPPIV outperforms other state-of-the-art methods on the independent dataset

In this section, we compared StructuralDPPIV with previously reported DPP-IV-IP discriminant models, including Decision Tree (DT) (Dhall et al. 2021), iDPPIV-SCM (Charoenkwan et al. 2020a), Random Forest (RF) (Charoenkwan et al. 2021b, Dhall et al. 2021), Naïve Bayes (NB) (Charoenkwan et al. 2020b, Jia and He 2016), K-nearest neighbor (KNN) (Hasan et al. 2020, Liu and Chen 2020), fuzzy KNN (fKNN) (Min et al. 2013), SVM (Zou and Yin 2021), StackDPPIV (Phasit et al. 2022), and BERT-DPPIV (Guan et al. 2022). The experimental results indicated that StructuralDPPIV outperforms the other DPP-IV-IP predictors, improving accuracy and MCC by 1.58% and 3.18%, respectively, over the current state-of-the-art predictor, BERT-DPPIV. Further comparative results can be found in Table 2 and Fig. 2A. In summary, StructuralDPPIV excels in this comparison due to its deep learning architecture, which effectively captures complex patterns and relationships within peptide sequences. It demonstrates superior classification performance, particularly in terms of high sensitivity and MCC, highlighting its capability to address class imbalance and provide a balanced classification measure. In addition, StructuralDPPIV demonstrates a level of interpretability not present in previous prediction models. This success stems from its comprehensive and reasonable utilization of physicochemical and sequential information.

Table 2.

The comparison between StructuralDPPIV and other existing models.

Model ACC MCC Sn Sp AUC
StructuralDPPIV 0.9098 0.8218 0.9474 0.8722 0.9656
BERT-DPPIV 0.8940 0.7900 0.8720 0.9170 0.9600
StackDPPIV 0.8910 0.7840 0.8570 0.9250 0.9610
SVM 0.8650 0.7310 0.8740 0.8560 0.9390
iDPPIV-SCM 0.7970 0.5940 0.7890 0.8050 0.8470
KNN 0.8571 0.7163 0.8947 0.8195 N/A
fKNN 0.8609 0.7243 0.9023 0.8195 N/A
DT 0.8722 0.7444 0.8797 0.8647 N/A
NB 0.8083 0.6229 0.8797 0.7368 N/A
RF 0.8271 0.6589 0.8872 0.7669 N/A

Figure 2.

StructuralDPPIV outperforms other models. (A) Different model performances on independent test dataset with different metrics (AUC record of DT, NB, RF is not yet available). (B) The training process of StructuralDPPIV and its submodules. (C) ROC curve of StructuralDPPIV and its submodules. (D–F) Umap analysis of StructuralDPPIV and its submodules.

To further demonstrate the model's generalizability, we also conducted 10-fold cross-validation on the training dataset, as detailed in Supplementary Table S2. A similar cross-validation experiment was previously conducted for StackDPPIV (Phasit et al. 2022), which reported an average accuracy of 0.8890 across the 10 folds.

3.2 Ablation experiment identified the importance of submodules in StructuralDPPIV

To demonstrate the predictive mechanism of StructuralDPPIV, we compared the model using only the TextCNN encoding module or only the Structural encoding module with the original model. The experimental results in Fig. 2B and Table 3 indicated that the Structural encoding module played a critical role in the model's prediction process, with a significant improvement in all metrics. However, the combination of Structural and TextCNN encoding modules enhanced prediction robustness with a more stable training process. On the independent test dataset, the complete model exhibited significantly higher accuracy than that achieved by any single encoding module in isolation.

Table 3.

The ablation comparison between StructuralDPPIV and its single-module variants.

Model ACC MCC Sn Sp AUC
StructuralDPPIV 0.9098 0.8218 0.9474 0.8722 0.9656
TextCNN encoding only 0.8346 0.6869 0.9473 0.7218 0.9226
Structural encoding only 0.8909 0.7918 0.9699 0.8120 0.9629

Subsequently, we used UMAP (McInnes et al. 2018) to visualize the output features produced by StructuralDPPIV and its submodules; UMAP has been widely used for data visualization in different bioinformatics tasks (Liang et al. 2023). The UMAP results demonstrated that StructuralDPPIV distinguishes peptide samples significantly better than either the structural encoding or the TextCNN encoding module alone. As shown in Fig. 2D, most negative samples were effectively separated from positive samples in the representation space. In contrast, Fig. 2E and F show that a single module cannot project positive and negative examples to different locations in the intermediate representation space, with some negative and positive samples almost connected in the overlap region.

3.3 The performance of StructuralDPPIV on polypeptide screening

To further verify the reliability of our model, we applied StructuralDPPIV to two other datasets: the Pentapeptide and Tripeptide wet-experiment datasets. Figure 3A compares the performance of StructuralDPPIV with that of BERT-DPPIV. According to the wet experiments conducted by the same authors (Guan et al. 2022), 22 pentapeptides are qualified inhibitory peptides; BERT-DPPIV predicted 14 of the 22 pentapeptides to exhibit inhibitory activity against DPP-IV, whereas StructuralDPPIV successfully predicted 21 of the 22 to be DPP-IV-IPs, with only a 4.5% error rate. On the Tripeptide dataset, 72 of the 80 tripeptides were found to efficiently inhibit DPP-IV. Applying StructuralDPPIV to this dataset, we obtained an accuracy of 90%, comparable to BERT-DPPIV.

Figure 3.

Pentapeptide and tripeptide screening with StructuralDPPIV, previously conducted using the BERT-DPPIV model. (A) Comparison between StructuralDPPIV and BERT-DPPIV on the Pentapeptide and Tripeptide datasets. (B and C) Conversion matrices between selected amino acids according to wet experiments and StructuralDPPIV.

Furthermore, we investigated conversions between different amino acids, using wet experiments to test how inhibitory activity changes after one amino acid is mutated into another. To ensure sufficient samples, we selected the four amino acids that appear most frequently in the Tripeptide dataset. The results are presented in Fig. 3B, where we observed a significant increase in inhibitory activity when V (valine) is mutated into A (alanine), while conversion between V (valine) and P (proline) has little influence on inhibitory activity. We then used our model's prediction scores to calculate the same conversion matrix on the Tripeptide dataset, to verify whether our model could learn this regular pattern. Interestingly, as shown in Fig. 3C, the distribution of the conversion matrix calculated by our model was similar to that calculated from the wet experiments, demonstrating the excellent learning ability and detection power of our model.

3.4 Comprehensive analysis on StructuralDPPIV

To further analyze the workflow and interpretability of StructuralDPPIV, we conducted a series of experiments exploring the working mechanism of our model, including amino acid position analysis, atomic analysis, amino acid type analysis, and feature analysis experiments. The experimental results showed that our model has excellent information-capturing ability.

3.4.1 CAM analysis

CAM (Class Activation Mapping) analysis is a technique for producing visual explanations of decisions made by CNN-based models (Zhou et al. 2016). It enhances model transparency by using the gradients of a target concept flowing into the final convolutional layer to generate a coarse localization map that highlights the regions most significant for the prediction (Selvaraju et al. 2017). In our study, we utilized CAM to evaluate the importance of different amino acids and atoms in the decision-making process of our model. The CAM score maps over amino acid position and atom index are presented in Fig. 4A.
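The core CAM computation (Zhou et al. 2016), weighting the final-layer feature maps by the classifier weights of the target class, can be sketched as follows (shapes and the min–max normalization are illustrative assumptions):

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Plain CAM: weight each final-layer feature map by the classifier
    weight for the target class, sum over channels, then min-max normalize."""
    # feature_maps: (channels, length); class_weights: (channels,)
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # (length,)
    cam -= cam.min()
    return cam / cam.max() if cam.max() > 0 else cam
```

The resulting per-position scores are what Fig. 4 aggregates over sequences to rank positions and amino acid types.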

Figure 4.

Figure 4.

Position and atom importance analysis. (A–C) Class Activation Mapping (CAM) and inspection of amino acid position and amino acid type. (D–F) Perturbation inspection of position and amino acid type. (G) Examples of CAM analysis score map on three amino acids. Experimental results showed that StructuralDPPIV focuses on different R groups of amino acid, therefore making credible decisions. (H) Effect of different amino acid types on prediction.

For a more comprehensive analysis, we statistically calculated the CAM score of each amino acid at each position of each sequence in the dataset, focusing on position and amino acid type. As demonstrated in Fig. 4B, both positive and negative examples have relatively high importance scores before the 14th position, indicating the importance of the head region when designing DPP-IV-IPs. We also analyzed the importance score of each amino acid type by summing the CAM scores of each amino acid, as illustrated in Fig. 4C. Our results indicated that Y (tyrosine), F (phenylalanine), and E (glutamic acid) had the highest importance scores, which is useful for DPP-IV-IP design.

Moreover, CAM analysis was applied to amino acid molecular structure, and the importance scores over atom indices are presented in Fig. 4G (analyses of the other amino acids can be found in Supplementary Fig. S1). Our analysis of the 20 amino acids demonstrated that, for most amino acid molecules, StructuralDPPIV focuses mainly on the structure of the R group rather than on the amino and carboxyl groups. This is reasonable and consistent with expectations.

3.4.2 Perturbation analysis

Perturbation analysis replaces an amino acid in a polypeptide sequence with another amino acid and records the change in the model's prediction for the new sequence. Through this analysis, a perturbation score for each amino acid can be computed statistically at each position of each sequence.
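The perturbation procedure can be sketched as follows; `predict` is a stand-in for any scoring callable such as the trained model, and the per-position averaging scheme is an illustrative assumption:

```python
def perturbation_scores(sequence, predict, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """For each position, substitute every alternative amino acid and record
    the absolute change in the prediction score; return the mean per position.
    `predict` is any callable mapping a sequence to a score (a stand-in
    here for the trained model)."""
    base = predict(sequence)
    scores = []
    for i in range(len(sequence)):
        deltas = [abs(predict(sequence[:i] + aa + sequence[i + 1:]) - base)
                  for aa in alphabet if aa != sequence[i]]
        scores.append(sum(deltas) / len(deltas))  # mean effect at position i
    return scores
```

Dropping the `abs()` gives the signed variant discussed below, which distinguishes positive from negative mutation effects.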

Notably, perturbation analysis produced results on amino acid position and type that were comparable to those obtained with CAM analysis. Specifically, consistent with the CAM analysis of position importance, the perturbation experiment (Fig. 4E) gave greater attention to the head region, particularly the 4th and 13th positions. The amino acid type importance scores calculated by the perturbation experiment are presented in Fig. 4F. With a gray line dividing the two charts, the left parts overlap strongly between the perturbation experiment and the CAM analysis, sharing the amino acids Y (tyrosine), E (glutamic acid), W (tryptophan), R (arginine), and K (lysine). Although their order is not entirely consistent, there is an 83% overlap in the composition of the left part.

In the above perturbation analysis, we used absolute values to represent importance scores, so whether a mutation has a positive or negative influence on the prediction score remained unknown. To address this, we used signed (subtractive) values: positive values represent a positive effect, while negative values indicate a negative one. For statistical verification, we also introduced a fold metric on AAC (amino acid composition), dividing each amino acid's composition value in positive sequences by that in negative sequences. For this AAC fold metric, the neutral value is 1: AAC folds above 1 indicate amino acids with positive effects, while those below 1 indicate negative effects. The experimental results are shown in Fig. 4H. Most amino acid effects are consistent between the perturbation scores and the AAC fold, which demonstrates the data-mining ability of our model. Moreover, in the blue frame, we found that Y (tyrosine) and W (tryptophan) were not consistent between the two analyses, so StructuralDPPIV may be able to learn latent effects of amino acids that cannot be captured by general statistical methods.

3.4.3 Feature encoding analysis

In this study, we investigated the relative contribution of different features to the performance of StructuralDPPIV, both theoretically and experimentally. We conducted a set of experiments in which we systematically removed one of the nine feature encodings at a time and recorded model performance under otherwise identical conditions. Specifically, we compared the average accuracy of models trained without each feature encoding for >100 epochs. As shown in Fig. 5A, removing any one feature encoding decreased the performance of our model to varying degrees, demonstrating the importance of each feature in the model's prediction process.

Figure 5.

Feature encoding importance analysis. (A) Model performance in circumstances with specific feature encoding removed, compared with the original StructuralDPPIV model. (B) Permutation analysis of different features, which is roughly consistent with the experimental verification results.

To further examine which feature encodings play a more critical role in the prediction accuracy of our model, we used permutation importance analysis. This technique measures the decrease in a model's score when a single feature's values are randomly shuffled (Breiman 2001). As displayed in Fig. 5B, the hybridization feature had the greatest impact on our model's performance. However, this observation may stem from the fact that atom hybridization is encoded as a single integer feature rather than through a binary (0–1) one-hot encoding scheme, which potentially leads the model to prioritize it over other encoded features. Additionally, we observed that charge is also a vital feature encoding that significantly affects predictive outcomes. Presumably, charge information carries implications about the polarity of the molecules, rendering it a critical property for accurate predictions.
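Permutation importance (Breiman 2001) can be sketched as follows; the scoring callable, repeat count, and seed are illustrative assumptions:

```python
import numpy as np

def permutation_importance(model_score, X, y, feature, n_repeats=5, seed=0):
    """Mean drop in score when one feature column is shuffled.
    `model_score(X, y)` is any scoring callable (higher = better)."""
    rng = np.random.default_rng(seed)
    baseline = model_score(X, y)
    drops = []
    for _ in range(n_repeats):
        Xp = X.copy()
        rng.shuffle(Xp[:, feature])  # break the feature-target association
        drops.append(baseline - model_score(Xp, y))
    return float(np.mean(drops))
```

A feature the model relies on produces a large positive drop; a feature the model ignores produces a drop near zero.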

4 Discussions

The dataset used in this study contains peptides with lengths ranging from 2 to 90 residues. Given this large variance in sequence length, our model uses a padding mechanism to standardize tensor shapes across records. We hypothesize that uniform sequence lengths would yield improved performance for StructuralDPPIV. In parallel, we experimented with various padding strategies for data augmentation, aiming to mitigate the impact of diverse sequence lengths on prediction accuracy, but observed no significant improvement. Nevertheless, we believe that better padding or data augmentation methods could further enhance the model's prediction accuracy. A similar model architecture may also exhibit enhanced performance on other datasets containing peptide, DNA, or RNA sequences of uniform lengths, which presents a potential avenue for future research.
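The padding mechanism mentioned above can be sketched as follows. This is a minimal illustration assuming integer-encoded residues and a pad token of 0; the actual StructuralDPPIV implementation may use a different vocabulary and pad value.

```python
def pad_sequences(seqs, pad_token=0, max_len=None):
    """Right-pad (and truncate) integer-encoded sequences to a common length.

    Assumes residues are already mapped to integer ids; pad_token=0 is an
    illustrative choice, not necessarily the one used by StructuralDPPIV.
    """
    if max_len is None:
        max_len = max(len(s) for s in seqs)  # pad to the longest record
    return [
        list(s)[:max_len] + [pad_token] * (max_len - len(s))
        for s in seqs
    ]

# Peptides of length 3, 2, and 5 become a rectangular 3 x 5 batch.
peptides = [[3, 7, 12], [5, 9], [1, 2, 3, 4, 5]]
padded = pad_sequences(peptides)
```

Once padded, the batch can be stacked into a single tensor of shape (batch, max_len) and fed through the model.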

5 Conclusions

Recent studies have reported that DPP-IV inhibitory peptides play a vital role in diabetes research; they are widely used as novel antidiabetic agents that improve glycemic regulation in type 2 diabetics by inhibiting the degradation of incretin hormones. In this work, we presented StructuralDPPIV, a novel sequence- and structure-based DPP-IV-IP predictor that combines TextCNN with the SMILES representation of peptide sequences and achieves state-of-the-art accuracy on the independent test set. The model consists of two modules: a structural encoding module and a TextCNN encoding module. The structural encoding module encodes 21 selected physicochemical features of the amino acids, derived from the SMILES representation of the peptide sequence; GCN and ResNet layers then extract hidden information and transform the input tensor into a vector, which is element-wise multiplied with the encoding vector produced by the TextCNN module and used for classification. This combined model outperformed both the TextCNN-only and structural-only variants and achieved satisfactory results on additional tripeptide and pentapeptide datasets. Finally, we conducted interpretability experiments, including atom-level, feature, position, and amino acid type importance analyses. The results show that StructuralDPPIV is a promising and novel model for DPP-IV-IP prediction.
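The fusion step described above, where the structural branch's vector is element-wise multiplied with the TextCNN branch's vector before classification, can be illustrated schematically. The embedding size, random weights, and single linear head below are assumptions for demonstration; the real model learns these parameters end-to-end (in PyTorch).

```python
import numpy as np

def fuse_and_classify(structural_vec, textcnn_vec, w, b):
    """Element-wise (Hadamard) fusion of the two branch embeddings,
    followed by a linear head and sigmoid (illustrative, not the trained model)."""
    assert structural_vec.shape == textcnn_vec.shape
    fused = structural_vec * textcnn_vec  # element-wise multiplication
    logit = float(fused @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))  # probability of being a DPP-IV-IP

rng = np.random.default_rng(0)
d = 8  # embedding dimension, chosen arbitrarily for the sketch
s_vec = rng.normal(size=d)  # output of the structural (GCN + ResNet) branch
t_vec = rng.normal(size=d)  # output of the TextCNN branch
w, b = rng.normal(size=d), 0.0  # classifier weights (random here, learned in practice)

p = fuse_and_classify(s_vec, t_vec, w, b)
```

One property of this fusion worth noting is that the Hadamard product is symmetric in its two inputs, so neither branch is structurally privileged; the learned classifier weights determine how fused dimensions are used.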

Supplementary Material

btae057_Supplementary_Data

Contributor Information

Ding Wang, School of Software, Shandong University, Jinan 250101, China.

Junru Jin, School of Software, Shandong University, Jinan 250101, China.

Zhongshen Li, School of Software, Shandong University, Jinan 250101, China.

Yu Wang, School of Software, Shandong University, Jinan 250101, China.

Mushuang Fan, School of Software, Shandong University, Jinan 250101, China.

Sirui Liang, School of Software, Shandong University, Jinan 250101, China.

Ran Su, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China.

Leyi Wei, Faculty of Applied Sciences, Macao Polytechnic University, Macao 999078, China; Joint SDU-NTU Centre for Artificial Intelligence Research (C-FAIR), Shandong University, Jinan 250101, China.

Supplementary data

Supplementary data are available at Bioinformatics online.

Conflict of interest

None declared.

Funding

The work was supported by the National Natural Science Foundation of China [62071278, and 62222311], and Internal research grants of Macao Polytechnic University (RP/CAI-02/2023).

Data availability

All code used in data analysis and preparation of the manuscript, alongside a description of necessary steps for reproducing results, can be found in a GitHub repository accompanying this manuscript: https://github.com/WeiLab-BioChem/Structural-DPP-IV.

References

  1. Alawad M, Gao S, Qiu J et al. Deep transfer learning across cancer registries for information extraction from pathology reports. In: 2019 IEEE EMBS International Conference on Biomedical & Health Informatics (BHI), Chicago, IL, USA: IEEE, 2019, 1–4.
  2. Alawad M, Gao S, Qiu JX et al. Automatic extraction of cancer registry reportable information from free-text pathology reports using multitask convolutional neural networks. J Am Med Inform Assoc 2020;27:89–98.
  3. Barnett A. DPP-4 inhibitors and their potential role in the management of type 2 diabetes. Int J Clin Pract 2006;60:1454–70.
  4. Breiman L. Random forests. Mach Learn 2001;45:5–32.
  5. Casrouge A, Sauer AV, Barreira da Silva R et al. Lymphocytes are a major source of circulating soluble dipeptidyl peptidase 4. Clin Exp Immunol 2018;194:166–79.
  6. Charoenkwan P, Kanthawong S, Nantasenamat C et al. iDPPIV-SCM: a sequence-based predictor for identifying and analyzing dipeptidyl peptidase IV (DPP-IV) inhibitory peptides using a scoring card method. J Proteome Res 2020a;19:4125–36.
  7. Charoenkwan P, Nantasenamat C, Hasan MM et al. Meta-iPVP: a sequence-based meta-predictor for improving the prediction of phage virion proteins using effective feature representation. J Comput Aided Mol Des 2020b;34:1105–16.
  8. Charoenkwan P, Chiangjong W, Nantasenamat C et al. StackIL6: a stacking ensemble model for improving the prediction of IL-6 inducing peptides. Brief Bioinform 2021a;22:bbab172.
  9. Charoenkwan P, Kanthawong S, Nantasenamat C et al. iAMY-SCM: improved prediction and analysis of amyloid proteins using a scoring card method with propensity scores of dipeptides. Genomics 2021b;113:689–98.
  10. Chatterjee S, Khunti K, Davies MJ. Type 2 diabetes. Lancet 2017;389:2239–51.
  11. Chu Y, Wang X, Dai Q et al. MDA-GCNFTG: identifying miRNA-disease associations based on graph convolutional networks via graph sampling through the feature and topology graph. Brief Bioinform 2021;22:bbab165.
  12. Copeland KC, Silverstein J, Moore KR et al.; American Academy of Pediatrics. Management of newly diagnosed type 2 diabetes mellitus (T2DM) in children and adolescents. Pediatrics 2013;131:364–82.
  13. De S, Banerjee S, Kumar SKA et al. Critical role of dipeptidyl peptidase IV: a therapeutic target for diabetes and cancer. Mini Rev Med Chem 2019;19:88–97.
  14. Dhall A, Patiyal S, Sharma N et al. Computer-aided prediction and design of IL-6 inducing peptides: IL-6 plays a crucial role in COVID-19. Brief Bioinform 2021;22:936–45.
  15. Golightly LK, Drayna CC, McDermott MT. Comparative clinical pharmacokinetics of dipeptidyl peptidase-4 inhibitors. Clin Pharmacokinet 2012;51:501–14.
  16. Guan C, Luo J, Li S et al. Exploration of DPP-IV inhibitory peptide design rules assisted by the deep learning pipeline that identifies the restriction enzyme cutting site. ACS Omega 2023;8:39662–72.
  17. Hasan MM, Schaduangrat N, Basith S et al. HLPpred-Fuse: improved and robust prediction of hemolytic peptide and its activity by fusing multiple feature representation. Bioinformatics 2020;36:3350–6.
  18. He K, Zhang X, Ren S et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA: IEEE, 2016a, 770–8.
  19. He K, Zhang X, Ren S et al. Identity mappings in deep residual networks. In: European Conference on Computer Vision, Amsterdam, The Netherlands: Springer, 2016b, 630–45.
  20. He W, Jiang Y, Jin J et al. Accelerating bioactive peptide discovery via mutual information-based meta-learning. Brief Bioinform 2022;23:bbab499.
  21. Jarvis CI, Cabrera A, Charron D. Alogliptin: a new dipeptidyl peptidase-4 inhibitor for type 2 diabetes mellitus. Ann Pharmacother 2013;47:1532–9.
  22. Jia C, He W. EnhancerPred: a predictor for discovering enhancers based on the combination and selection of multiple features. Sci Rep 2016;6:38741.
  23. Karimi M, Wu D, Wang Z et al. DeepAffinity: interpretable deep learning of compound–protein affinity through unified recurrent and convolutional neural networks. Bioinformatics 2019;35:3329–38.
  24. Kikkawa F, Ino K, Kajiyama H et al. Role of immunohistochemical expression of aminopeptidases in ovarian carcinoma. In: Molecular Genetics, Gastrointestinal Carcinoma, and Ovarian Carcinoma. Elsevier Inc., 2006, 509–17.
  25. Kim Y. Convolutional neural networks for sentence classification. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 2014.
  26. Kingma DP, Ba J. Adam: a method for stochastic optimization. In: International Conference on Learning Representations (ICLR), San Diego, CA, USA, 2015.
  27. Li J, Zhang S, Liu T et al. Neural inductive matrix completion with graph convolutional networks for miRNA-disease association prediction. Bioinformatics 2020;36:2538–46.
  28. Liang S, Zhao Y, Jin J et al. Rm-LR: a long-range-based deep learning model for predicting multiple types of RNA modifications. Comput Biol Med 2023;164:107238.
  29. Lin T-Y, Goyal P, Girshick R et al. Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy: IEEE, 2017, 2980–8.
  30. Liu K, Chen W. iMRM: a platform for simultaneously identifying multiple kinds of RNA modifications. Bioinformatics 2020;36:3336–42.
  31. Liu M, Sun Z-L, Zeng Z et al. MGF6mARice: prediction of DNA N6-methyladenine sites in rice by exploiting molecular graph feature and residual block. Brief Bioinform 2022;23:bbac082.
  32. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. J Open Source Softw 2018;3:861.
  33. Min J-L, Xiao X, Chou K-C. iEzy-Drug: a web server for identifying the interaction between enzymes and drugs in cellular networking. Biomed Res Int 2013;2013:701317.
  34. Nongonierma AB, FitzGerald RJ. Features of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides from dietary proteins. J Food Biochem 2019;43:e12451.
  35. Phasit C, Jin J, Li Z et al. StackDPPIV: a novel computational approach for accurate prediction of dipeptidyl peptidase IV (DPP-IV) inhibitory peptides. Methods 2022;204:189–98.
  36. Ryu JY, Lee MY, Lee JH et al. DeepHIT: a deep learning framework for prediction of hERG-induced cardiotoxicity. Bioinformatics 2020;36:3049–55.
  37. Savova GK, Danciu I, Alamudun F et al. Use of natural language processing to extract clinical cancer phenotypes from electronic medical records. Cancer Res 2019;79:5463–70.
  38. Selvaraju RR, Cogswell M, Das A et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy: IEEE, 2017, 618–26.
  39. Wang Y, Mendez RL, Kwon JY et al. Functional discovery and production technology for natural bioactive peptides. Sheng Wu Gong Cheng Xue Bao (Chin J Biotechnol) 2021;37:2166–80.
  40. Wei L, Ye X, Xue Y et al. ATSE: a peptide toxicity predictor by exploiting structural and evolutionary information based on graph neural network and attention mechanism. Brief Bioinform 2021;22(5).
  41. Weininger D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J Chem Inf Comput Sci 1988;28:31–6.
  42. Yang Z, Zhong W, Zhao L et al. ML-DTI: mutual learning mechanism for interpretable drug–target interaction prediction. J Phys Chem Lett 2021;12:4247–61.
  43. Zeng Y, Chen X, Luo Y et al. Deep drug–target binding affinity prediction with multiple attention blocks. Brief Bioinform 2021;22:bbab117.
  44. Zhou B, Khosla A, Lapedriza A et al. Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA: IEEE, 2016, 2921–9.
  45. Zou H, Yin Z. Identifying dipeptidyl peptidase-IV inhibitory peptides based on correlation information of physicochemical properties. Int J Pept Res Ther 2021;27:2651–9.
