Abstract
Peptide drugs are emerging as highly efficient and selective therapeutic agents that open up new avenues for treating various diseases. However, their sensitivity to hydrolases and relatively short half-lives have severely hindered their development. In this study, a new-generation artificial intelligence-based system for accurate prediction of peptide half-life is proposed. The system predicts the half-life of both natural and modified peptides and covers two important species (human, mouse) and two organs (blood, intestine). To achieve this, enzymatic cleavage descriptors were integrated with traditional peptide descriptors to construct a better representation. Robust models were then established by systematically comparing traditional machine learning and transfer learning. The results indicated that enzymatic cleavage features could enhance model performance. The deep learning model integrating transfer learning significantly improved predictive accuracy, achieving the following R2 values on the test set: 0.84 for natural peptides and 0.90 for modified peptides in human blood, 0.984 for natural peptides and 0.93 for modified peptides in mouse blood, and 0.94 for modified peptides in mouse intestine. These models not only make up the above-mentioned system but also improve correlation by approximately 15% compared to related works. This study is expected to provide powerful solutions for peptide half-life evaluation and to boost peptide drug development.
Keywords: peptide drugs, half-life, machine learning, transfer learning, enzymatic cleavage features, drug design
Graphical Abstract
Introduction
Peptides are a class of compounds formed by linking multiple amino acids through peptide bonds, usually defined as entities containing 2–50 amino acid residues [1] or as polymers composed of 40 or fewer amino acids [2]. Compared to small molecules, peptides generally exhibit higher biological activity and selectivity, resulting in fewer side effects. In contrast to proteins, peptides are smaller, making them easier to design and synthesize [3, 4]. These characteristics make peptides promising therapeutics with important applications in various fields [5–11]. Currently, more than 100 peptide-based drugs have been approved for treating various common diseases [12–14]. However, due to low absorption and a high first-pass effect resulting from enzymatic and pH-mediated hydrolysis in the gastrointestinal tract and liver [15], most orally administered peptides have a bioavailability of less than 1%. As a result, approved peptide drugs are usually administered by injection. Considering the complexity of peptide metabolism in vivo, assessing the value of peptide drugs must take into account parameters such as half-life (T1/2) and clearance rate, which provide crucial information about the residence time and metabolic stability of peptide drugs.
To expedite the development of peptide drugs, researchers are actively searching for new strategies to prolong the residence time of peptide drugs in vivo and enhance their metabolic stability. Applied strategies including amino acid side-chain modification [16], terminal cyclization [17], sequence modification [18], encapsulation in nano-carriers, etc. [2, 19–21] have been adopted to extend the half-life. Although these conventional approaches have led to enhancements in peptide bioactivity, they are plagued by expensive experimental techniques, intricate procedures, and time-consuming processes [22]. In recent years, machine learning (ML) methods have garnered significant attention and have been widely used in the field of peptide drug discovery and development. For example, Sharma et al. [23] employed support vector machine (SVM) to develop models for predicting the half-life of peptides in the mouse intestine. The feature-based models achieved a good correlation of 0.70 and 0.98, respectively. Mathur et al. [24] collected 261 peptides whose half-lives were experimentally validated in mammalian blood for modeling, achieving a maximum correlation of 0.692 on 261 peptides and 0.743 on 163 natural peptides. Additionally, Cavaco et al. [25] developed a multivariate regression model to predict the half-life of peptides in serum and obtained squared correlation coefficients of 0.76 and 0.78 on external validation sets.
Despite the progress made so far, there are still pressing challenges in peptide half-life evaluation. First, significant half-life differences exist for the same peptide across species and organs. For example, the half-life of SUPR [25] in mouse blood is only 1.92 h, while that in human blood is as high as 161.7 h. Mouse obestatin [26] has a half-life of 42.2 min in mouse blood and 12.6 min in mouse liver. However, the currently reported models do not distinguish between peptides from different species and organs, which may lead to a large bias in half-life prediction. Second, natural peptides can have varying degrees of half-life extension after modification. Taking natural GLP1 [26] as an example, its half-life in human blood is 6.2 h, and after N-terminal acetylation it can be extended to more than 12 h. Modeling natural and modified peptides for different species and organs can provide evidence for changes in the half-life of natural peptides after modification. It is therefore important to establish new models for half-life evaluation under these varying conditions. Finally, the data size and model performance still need further improvement. Consequently, there is an urgent need to develop more comprehensive solutions for peptide half-life evaluation in practical applications.
In this study, to solve these problems, a new-generation artificial intelligence-based system for accurate prediction of peptide half-life is proposed, based on a systematic comparison of different molecular representations and advanced ML algorithms. To achieve this goal, the following aspects need to be considered. Firstly, it is essential to establish a specific predictive system and to analyze the available species, organs, and peptide types, based on the requirements and distribution of real-world data in the peptide drug development process. We collect as many experimental half-life data items as possible from public databases and the literature and then categorize all the data by species and organ to establish pretreated datasets. Secondly, an important reason for the poor prediction of peptide half-life is the lack of information on its in vivo influencing factors. Although the half-life of peptides is highly correlated with their in vivo metabolic processes, most current molecular descriptors primarily represent basic physicochemical properties, which are not enough to capture the complexity of these processes. Therefore, enzymatic cleavage features were introduced to achieve a fuller representation of peptides. Finally, the modeling methodology needs to be improved comprehensively, starting from traditional ML methods and progressively introducing deep neural network (DNN) algorithms as well as transfer learning to combine the above characterization strategies, with the expectation of obtaining highly effective models. The transfer learning strategy pre-trains on a large dataset and fine-tunes on small datasets for half-life prediction, which can compensate for the loss of accuracy caused by insufficient data. The whole workflow is shown in Fig. 1.
This study is expected to provide the first comprehensive system for half-life prediction of both natural and modified peptides across different species and organs, so as to help researchers design peptides with prior suggestions and save costs in the early stage of peptide drug development.
Figure 1.
Overview of the implementation process for the proposed methods. The process from top to bottom involves data collection and pre-processing, modeling using machine learning algorithms, and the application of transfer learning strategy.
Materials and methods
Data collection
In this study, a strict pipeline was customized for data collection and pre-processing. Initially, 3415 experimental half-life records for natural and modified peptides in different species and organs were collected from public databases and published literature. Specifically, 2200 records were collected from the PEPlife database [26], 112 from the PepTherDia database [27], 852 from the THPdb database [28], and 251 from the literature. Subsequently, we performed a statistical analysis of the collected data by species, organ, and peptide type. Finally, five datasets were obtained: the half-life of natural peptides in human blood (HBN), the half-life of modified peptides in human blood (HBM), the half-life of natural peptides in mouse blood (MBN), the half-life of modified peptides in mouse blood (MBM), and the half-life of modified peptides in mouse intestine (MIM).
To guarantee the quality and reliability of the data, the following pretreatments were carried out: (i) Peptides with exact half-life values were included, while peptides with empty or indeterminate half-life values were excluded. (ii) The SMILES of these peptides were checked one by one to ensure their correctness. For peptides lacking SMILES, the SMILES of natural peptides were generated using the RDKit package (http://www.rdkit.org/), while the SMILES of modified peptides were manually extracted from the publications. (iii) Redundant data, peptides with more than 50 amino acids, and peptides with a half-life exceeding 24 h were removed [24]. After these pretreatments, 970 high-quality records remained: 117, 187, 106, 182, and 378 peptides with their corresponding half-lives in the HBN, HBM, MBN, MBM, and MIM datasets, respectively.
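The exclusion rules in (i) and (iii) can be sketched as a small filter. This is a minimal illustration, not the authors' pipeline: the `Record` type, its field names, and the notion of redundancy used here (same sequence within the same dataset) are assumptions made for the example. For modified peptides, a structure-based key such as the SMILES string would be a more natural choice.

```python
from dataclasses import dataclass
from typing import List, Optional

MAX_RESIDUES = 50              # peptides with more than 50 amino acids are dropped
MAX_HALF_LIFE_MIN = 24 * 60    # half-lives exceeding 24 h are dropped


@dataclass(frozen=True)
class Record:
    """Hypothetical record layout; the field names are illustrative only."""
    sequence: str
    half_life_min: Optional[float]   # None marks an empty or indeterminate value
    dataset: str                     # e.g. "HBN", "HBM", "MBN", "MBM", "MIM"


def clean(records: List[Record]) -> List[Record]:
    """Apply the exclusion rules in order and keep the first copy of duplicates."""
    seen = set()
    kept = []
    for rec in records:
        if rec.half_life_min is None:               # rule (i): no exact value
            continue
        if len(rec.sequence) > MAX_RESIDUES:        # rule (iii): too long
            continue
        if rec.half_life_min > MAX_HALF_LIFE_MIN:   # rule (iii): beyond 24 h
            continue
        key = (rec.sequence, rec.dataset)           # assumed redundancy key
        if key in seen:                             # rule (iii): redundant entry
            continue
        seen.add(key)
        kept.append(rec)
    return kept
```

Whether a half-life of exactly 24 h is kept is a boundary choice; the sketch keeps it because the text says "exceeding".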
Molecular representation
The half-life of peptides is highly correlated with their in vivo metabolic processes. However, traditional small molecular descriptors and peptide descriptors only capture basic physicochemical properties, which are not enough to represent the complexities of peptides’ in vivo metabolic processes. Consequently, enzymatic cleavage features were introduced to realize the comprehensive representation of molecules for further model building.
Enzymatic cleavage descriptors
Enzymatic cleavage refers to the process in which proteases or peptidases degrade peptides or proteins by cleaving peptide bonds. Enzyme-catalyzed cleavage reactions are the primary pathways for peptide degradation, and the number and position of cleavage sites directly affect the stability and degradation rate of peptides. Enzymatic cleavage features provide information about potential cleavage sites, cleavage frequency, and specific enzyme action patterns within peptide sequences, which not only enrich the model’s input features but also introduce biological background knowledge, enhancing the model’s ability to recognize the degradation patterns of peptides in vivo. Additionally, enzymatic cleavage features help the model better capture the stability changes of peptides in different physiological environments, thereby improving the generalization ability of the model.
In this study, common peptidases in human blood, liver, and kidney were summarized based on Werle’s study [29]. Enzymatic cleavage descriptors were obtained by submitting peptide sequences to the PeptideCutter tool [30]. In this context, the total number of cleavage sites was defined as the total number of enzymes that can cleave the peptide out of the 37 enzymes. The nearest enzymatic cleavage site was defined as the shortest distance between two cleavage sites, and the enzymatic cleavage weight was defined as the number of cleavage sites each enzyme could cleave. Finally, a set of 39 enzymatic cleavage descriptors was customized.
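As a rough illustration of these definitions, the sketch below derives the three kinds of descriptors from per-enzyme cleavage positions. It assumes that the PeptideCutter output has already been parsed into a dictionary mapping enzyme names to 1-based cleavage positions; that dictionary layout, the function names, and the value 0 used when fewer than two sites exist are all assumptions. The `trypsin_sites` helper uses the common simplified trypsin rule (cleavage after K or R unless followed by P) only to produce example input. It is not a substitute for the tool.

```python
from typing import Dict, List


def trypsin_sites(sequence: str) -> List[int]:
    """Simplified trypsin rule: cleave after K or R unless the next residue is P.

    Returns 1-based positions of the residue on the N-terminal side of each cut.
    """
    return [
        i + 1
        for i in range(len(sequence) - 1)
        if sequence[i] in "KR" and sequence[i + 1] != "P"
    ]


def cleavage_descriptors(sites_by_enzyme: Dict[str, List[int]]) -> dict:
    """Compute the descriptors described in the text from per-enzyme sites."""
    # Enzymatic cleavage weight: number of sites each enzyme can cleave.
    weights = {enzyme: len(set(sites)) for enzyme, sites in sites_by_enzyme.items()}
    # "Total number of cleavage sites" as defined in the text: the number of
    # enzymes (out of those supplied) that can cleave the peptide at all.
    n_cleaving_enzymes = sum(1 for count in weights.values() if count > 0)
    # Nearest cleavage site: shortest distance between two cleavage sites.
    # Pooling sites across enzymes is an assumption of this sketch.
    pooled = sorted({s for sites in sites_by_enzyme.values() for s in sites})
    gaps = [b - a for a, b in zip(pooled, pooled[1:])]
    nearest = min(gaps) if gaps else 0   # 0 when fewer than two sites (assumed)
    return {
        "n_cleaving_enzymes": n_cleaving_enzymes,
        "nearest_site_distance": nearest,
        "weights": weights,
    }
```

With 37 enzymes supplied, the 37 weights plus the two scalar values give the 39 descriptors mentioned above.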
Small molecular descriptors
The modlAMP package [31] was used to calculate 10 descriptors related to the physicochemical properties of the peptide. After that, 489 small molecular descriptors were calculated based on SMILES using the PyBioMed package [32].
Peptide descriptors
A total of 363 peptide descriptors were calculated based on peptide sequences using the PyBioMed package [32].
Combination of different descriptors
All descriptors were categorized into three groups: the first group consisted of 499 small molecular descriptors (SM). The second group included a combination of small molecular descriptors and peptide descriptors, including 862 descriptors (PEP). The third group combined small molecular descriptors with peptide descriptors and enzymatic cleavage descriptors, including 901 descriptors (ENZ).
Pre-processing
To reduce noise and improve model efficiency, data pre-processing was performed: (i) descriptors with a variance of 0.05 or less were removed; (ii) for each pair of highly correlated descriptors (correlation > 0.99), one of the two was randomly deleted.
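The two filtering steps can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: descriptors are held as a dictionary of equal-length value lists, population variance is used, the correlation is Pearson's coefficient compared by absolute value, and a fixed seed makes the random choice reproducible. The source specifies none of these details.

```python
import math
import random
from statistics import pvariance
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def filter_descriptors(
    columns: Dict[str, List[float]],
    var_threshold: float = 0.05,
    corr_threshold: float = 0.99,
    seed: int = 0,
) -> List[str]:
    """Return the names of the descriptors that survive both filters."""
    rng = random.Random(seed)
    # Step (i): drop descriptors whose variance is at or below the threshold.
    kept = [name for name, values in columns.items()
            if pvariance(values) > var_threshold]
    # Step (ii): for each highly correlated pair, randomly drop one member.
    removed = set()
    for i, first in enumerate(kept):
        for second in kept[i + 1:]:
            if first in removed or second in removed:
                continue
            if abs(pearson(columns[first], columns[second])) > corr_threshold:
                removed.add(rng.choice([first, second]))
    return [name for name in kept if name not in removed]
```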
ML algorithm
To build half-life prediction models, three representative algorithms, namely Random Forest (RF), eXtreme Gradient Boosting (XGBoost), and support vector regression (SVR), were employed to develop regression models. All algorithms were implemented on the KNIME platform [33]. Both Leave-One-Out cross-validation (LOO) and 10-fold cross-validation (10-CV) were used to comprehensively evaluate the performance of the models. To obtain models with better performance, a two-step parameter tuning strategy was carried out: a coarse tuning based on ‘grid search’ was performed first, followed by manual fine-tuning. In addition, two feature selection methods were employed to optimize the models: Recursive Feature Elimination on RF combined with five-fold cross-validation (RFECV-RF), and SVR combined with the ant colony optimization algorithm (SVR-ACO) [34].
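The two validation schemes differ only in the number of folds, which the following sketch makes explicit. It is a generic index splitter written for illustration, not the KNIME implementation used in the study; the function name and the seeded shuffle are assumptions.

```python
import random
from typing import Iterator, List, Tuple


def k_fold_indices(
    n_samples: int, k: int, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for each of the k folds.

    Setting k equal to n_samples gives Leave-One-Out cross-validation.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of nearly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation
```

For example, `k_fold_indices(117, 10)` corresponds to 10-CV on a dataset the size of HBN, while `k_fold_indices(117, 117)` corresponds to LOO.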
A DNN was also used to construct models for half-life prediction. It consists of an input layer, three hidden layers with ReLU activation, and an output layer with a linear activation function. Each hidden layer is followed by a dropout layer to prevent overfitting. The hyperparameters of the model (neurons, learning rate, dropout rate, batch size, and epochs) were optimized by grid search, and 10-CV was performed to evaluate the model’s performance.
Transfer learning
Introducing enzymatic cleavage features could improve model performance, but the enhancement was relatively limited. Therefore, transfer learning was applied to further improve the prediction performance of the models. In this study, the AlphaPeptDeep [35] deep learning architecture was employed to implement transfer learning. This architecture was proposed in 2022 and is capable of learning and predicting retention time (RT), fragmentation spectra (MS2), collisional cross sections (CCS), and other properties based on peptide sequences. We replicated the AlphaRTModel from AlphaPeptDeep in order to transfer the retention time prediction model to the task of predicting the half-life of peptides. To adapt it to this task, we made the following modifications. First, the pre-training dataset was set to the ‘tryptic’ type dataset with identifier PXD019086 from the ProteomeXchange database [36], which contains 351 804 entries of proteomic data. The pre-training parameters were set as follows: epoch = 300, warm-up epoch = 30, lr = 1e-4. The trained model was saved for subsequent transfer learning. Next, the target datasets were set to the five peptide datasets, each of which was split into training and test sets at a ratio of 0.75:0.25. The fine-tuning parameter was set to epoch = 50 (Fig. 2).
Figure 2.
Workflow of the transfer learning strategy. Pre-training on a large dataset and fine-tuning on the five peptide datasets for half-life prediction. The pre-training parameters were epoch = 300, warm-up epoch = 30, lr = 1e-4. The fine-tuning parameter was epoch = 50.
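To make the role of the warm-up setting concrete, the sketch below shows one common way such a schedule is built from the three stated values (300 epochs, 30 warm-up epochs, learning rate 1e-4). The linear ramp and the cosine decay that follows it are assumptions chosen for illustration. The text gives only the three numbers and does not describe the shape of the schedule.

```python
import math


def learning_rate(
    epoch: int,
    base_lr: float = 1e-4,
    warmup_epochs: int = 30,
    total_epochs: int = 300,
) -> float:
    """Learning rate for a 0-indexed epoch.

    Linear warm-up to base_lr over the first warmup_epochs epochs, then
    cosine decay towards zero (the decay shape is an assumption of this sketch).
    """
    if epoch < warmup_epochs:
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

Starting with a small learning rate lets the optimizer settle before full-size updates are applied, which is the usual motivation for a warm-up phase.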
Model evaluation metrics
For these regression models, three commonly used parameters were applied to evaluate their quality: the coefficient of determination (R2), the mean absolute error (MAE), and the root mean square error (RMSE). Their formulas are as follows.
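$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{N}\left(y_i - \bar{y}\right)^2}$$

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|y_i - \hat{y}_i\right|$$

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

Here, $N$ is the number of samples, $y_i$ and $\hat{y}_i$ are the experimental and predicted values of the $i$th sample, and $\bar{y}$ is the mean experimental value of the $N$ samples.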
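The three metrics can be computed directly from their standard definitions, as in this short sketch (the function names are illustrative). Note that R2 defined this way becomes negative whenever a model predicts worse than the mean of the experimental values.

```python
import math
from typing import Sequence


def r2_score(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_true) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(
        sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)
    )
```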
Results and discussion
Analysis of peptide half-life data
In general, a model trained on peptides with a wide range of experimental half-life values can have a broad applicability domain. Before modeling, we conducted a detailed analysis of the data distribution and categorized stability into seven ranges (Table S1) [25]. As shown in Fig. 3, most of the peptides in the HBN dataset were either unstable (T ≤ 5 min) or undegradable (T > 720 min), while in the HBM and MBM datasets, the majority of peptides fell in the range of low to high stability (15 min < T ≤ 360 min). Peptides in the MBN dataset were concentrated in the unstable to low-stability range (T ≤ 60 min). In the MIM dataset, up to 99% of peptides were unstable. In summary, our datasets covered a broad range of half-lives. In addition, we conducted a statistical analysis of the number of amino acids, length, and charge distribution of peptides in the five datasets (Figs S1–S3). We also used heatmaps to evaluate the correlation between T1/2 and 10 variables, including length, charge, and pI (Figs S4 and S5). The results showed no significant correlation between these variables and half-life, suggesting that half-life cannot be readily estimated from these simple properties alone. Therefore, it is necessary to construct half-life predictive models for peptides in different species and organs.
Figure 3.
Half-life and stability distribution of peptides within the five datasets. (A) Half-life distribution for the MIM dataset. (B) Half-life distribution excluding the MIM dataset. (C) Stability distribution of peptides across five datasets. 1–7 represents unstable, very low stability, low stability, stable, high stability, very high stability, undegradable, respectively.
Half-life prediction in human blood
The HBN and HBM datasets were each randomly split into training and test sets at a ratio of 0.75:0.25, and a total of 18 regression models were constructed by combining different descriptors with different ML algorithms (Fig. S16). The modeling results can be found in Tables S2 and S3.
As can be seen from the tables, when using 10-CV on the HBN dataset, the best performance was achieved by the ENZ-XGBoost model, with R2 = 0.84 and correlation = 0.919 for CV and R2 = 0.79 and correlation = 0.891 for the test (Fig. 6A(a)). When using LOO, the SVR model constructed from the PEP descriptors performed the best, with R2 = 0.83 and a correlation coefficient of 0.912. Comparing different descriptors under the same algorithms, as shown in Fig. 4A–C, the models based on ENZ, which combines different levels of descriptors, obtained the best performance, although the enhancement was not significant compared to the models developed using the PEP descriptors.
Figure 6.
Combined plots of the best models’ results. Scatter plots show the performance of the best models developed using (A) machine learning and (B) transfer learning for the five datasets, with R2 as the evaluation metric: (a) HBN, (b) HBM, (c) MBN, (d) MBM, and (e) MIM datasets. (C) Scatter plot showing the performance of the optimal machine learning models in predicting half-life using leave-one-out cross-validation. (D) R2 comparison before and after fine-tuning by transfer learning. The heatmaps show the R2 (E) for CV and (F) for the test of models built using different algorithms for the five datasets.
Figure 4.
Comparison results of models built using different descriptors and algorithms. The histograms displayed the prediction results of (A) RF, (B) XGBoost, and (C) SVR models developed using three-group descriptors for the HBN dataset. The prediction results of (D) RF, (E) XGBoost, and (F) SVR models established using three-group descriptors for the HBM dataset. The left Y-axis displays MAE and RMSE, while the right Y-axis corresponds to R2 and correlation. SM: Small molecular descriptors, PEP: Combining small molecular descriptors with peptide descriptors, ENZ: Combining small molecular descriptors with peptide descriptors and enzymatic cleavage descriptors.
Predicting the half-life of modified peptides in human blood (HBM) through the combination of different descriptors and algorithms exhibited slightly weaker predictive performance compared to natural peptides. The R2 of the optimal model was only 0.60 and 0.56 for 10-CV and the test, respectively (Fig. 6A(b)), while an R2 of 0.60 was achieved using LOO (Fig. 6C). We found that whether 10-CV or LOO was performed, the model’s predictive results were quite similar. When comparing the effectiveness of descriptors, small molecular descriptors achieved the worst accuracy among the three molecular representations. However, the ENZ descriptors and PEP descriptors exhibited comparable performance among the models constructed based on three algorithms, indicating that they have similar capabilities in predicting the half-life of modified peptides in human blood (Fig. 4D–F).
Half-life prediction in mouse blood and intestine
The MBN, MBM, and MIM datasets were randomly divided into a training set containing 75% of the data and a test set containing 25% of the data, respectively. Tables S4, S5, and S6 showed detailed information about prediction results.
When predicting the half-life of natural peptides in mouse blood (MBN), the combination of ENZ descriptors and the XGBoost algorithm exhibited good performance, with R2 of 0.84 and 0.82 for 10-CV and the test, respectively (Fig. 6A(c)). This is also supported by the high correlations observed for the optimal model (0.919 for 10-CV and 0.914 for the test). Across the different algorithms and descriptors, the XGBoost models achieved the best performance compared to the RF and SVR models (Fig. S16). In particular, after integrating the enzymatic cleavage descriptors, XGBoost improved the R2 of the model by approximately 0.15 compared to the SM group and by around 0.1 compared to the PEP group, with an R2 as high as 0.8. The RF and SVR models, in contrast, improved the R2 by less than 0.1 (Fig. 5A–C). In addition, both 10-CV and LOO were suitable for evaluating these models, although 10-CV appeared to perform better in some cases, such as the XGBoost-ENZ combination.
Figure 5.
Comparison results of models developed by different descriptors and algorithms. The histograms showed the predictive performance of (A) RF, (B) XGBoost, and (C) SVR models established using three-group descriptors for the MBN dataset. The predictive performance of (D) RF, (E) XGBoost, and (F) SVR models constructed using three-group descriptors for the MBM dataset. The predictive performance of (G) RF, (H) XGBoost, and (I) SVR models built using three-group descriptors for the MIM dataset. The left Y-axis displays MAE and RMSE, while the right Y-axis corresponds to R2 and correlation. SM: small molecular descriptors, PEP: combining small molecular descriptors with peptide descriptors, ENZ: combining small molecular descriptors with peptide descriptors and enzymatic cleavage descriptors.
The tree-based RF algorithm outperformed all other algorithms when predicting the half-life of modified peptides in mouse blood (MBM), and the R2 of the optimal ENZ-RF model was 0.48 for both 10-CV and the test (Fig. 6A(d)). When using LOO, the R2 of the optimal model was 0.49 (Fig. 6C). Although the model can meet the prediction requirements, the R2 was still about 0.25 lower than that of the MBN model. Comparing the effectiveness of different descriptors, the ENZ descriptors performed best, while the small molecular descriptors performed poorly compared to the other descriptors (Fig. 5D–F).
The MIM models, which were constructed to predict the half-life of modified peptides in the mouse intestine, achieved reasonable performance with R2 > 0.6 for CV and the test. The R2 of the optimal model was 0.73 when using 10-CV (Fig. 6A(e)) and 0.78 when using LOO (Fig. 6C). When comparing the power of descriptors, the ENZ descriptors still performed the best among others though the difference is not significant (Fig. 5G–I).
Half-life prediction by transfer learning
Architecture and pre-training
Transfer learning involves applying knowledge, parameters, or models obtained from one task or domain to enhance the performance of another related task or domain. In this study, a knowledge-based transfer learning model was developed to apply the knowledge gained from the prediction model of peptide retention time to predict half-life. The main reason for employing transfer learning is the shared physicochemical properties between retention time and half-life, which establish a connection between the two tasks. Specifically, factors such as peptide sequence, length, and hydrophobicity simultaneously influence the retention time and half-life of peptides. Existing retention time prediction models, by learning complex relationships and patterns within amino acid sequences, successfully capture key properties of peptide molecules. Additionally, pre-training the model on a retention time dataset containing a large number of modification types enables a more comprehensive understanding of the complexity of biomolecules, enhancing sensitivity to changes in half-life.
To illustrate this correlation more intuitively, 5000 examples were randomly sampled from the pre-training dataset, and global alignment based on the Needleman–Wunsch algorithm [37] was applied to compare the sequence similarity of these samples with the five target datasets. As shown in Table S7, there was significant sequence similarity between the five datasets and the pre-training dataset. In particular, the similarity was as high as 100% for the MIM dataset, while the MBM and HBM datasets showed similarities of 94.5% and 94.1%, respectively. This indicated the potential of transfer learning to acquire knowledge from pre-trained models for half-life prediction.
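For readers unfamiliar with global alignment, a compact Needleman–Wunsch sketch is given below. The scoring scheme (match +1, mismatch −1, linear gap −1) and the identity-based similarity measure are assumptions made for illustration. The scoring parameters and the exact similarity definition behind Table S7 are not stated here.

```python
from typing import Tuple


def needleman_wunsch(
    a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1
) -> Tuple[int, str, str]:
    """Global alignment of a and b; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    # Dynamic-programming table of best scores for each pair of prefixes.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,       # gap in b
                score[i][j - 1] + gap,       # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = n, m
    out_a, out_b = [], []
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(a: str, b: str) -> float:
    """Identical aligned positions as a percentage of the alignment length."""
    _, aligned_a, aligned_b = needleman_wunsch(a, b)
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)
```

Amino acid alignments are often scored with a substitution matrix such as BLOSUM62 instead of a flat match/mismatch score. The flat scheme is used here only to keep the example short.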
Model performance
After training, the pre-trained model demonstrated quite good predictive performance, achieving an R2 of 0.972 on the training set and an R2 of 0.973 on the test set. After that, the pre-trained model was saved for subsequent transfer learning. It can be seen that the pre-trained models performed poorly on the five datasets before fine-tuning, and R2 was negative on all of them (Table S8). After fine-tuning, the pre-trained models significantly improved their performance on the peptide datasets (Fig. 6D). Specifically, transfer learning demonstrated excellent predictive performance on the MBN dataset, with R2 close to 1 on both the training set and test set. For the MIM dataset, the model performed well with R2 = 0.93 for the training set and R2 = 0.94 for the test set. The performance on the HBM (R2 = 0.92 and 0.90) and MBM (R2 = 0.87 and 0.93) datasets was relatively close to each other. Additionally, the model also achieved good performance on the HBN dataset, with R2 = 0.86 for the training set and R2 = 0.84 for the test set (Fig. 6B).
To evaluate the effectiveness of transfer learning in enhancing prediction performance with limited datasets, DNN models were constructed using the ENZ descriptors (Fig. S17). Simultaneously, we chose three ML algorithms that exhibited optimal performance when using ENZ descriptors and compared their predictive results with transfer learning (Table S9). The comparison results in Fig. 6E and F showed that transfer learning consistently outperformed other algorithms on all five datasets. For the HBM dataset, transfer learning improved R2 by approximately 0.3 compared to the three traditional ML algorithms. For the MBM dataset, transfer learning outperformed the DNN model, boosting R2 by around 0.4, while reducing MAE and RMSE by approximately 0.55. Furthermore, for the MBN and MIM datasets, transfer learning achieved an R2 of around 0.95. The above results indicated that transfer learning was effective in improving predictive performance with small datasets.
Impact of different learning strategies on model performance
When analyzing the ML predictions, we found that the R2 values of the HBM and MBM models were lower by approximately 0.2 and 0.3, respectively, compared to the other models. This discrepancy can be attributed to the diversity of modification types, which results in limited data for each specific modification. In the HBM and MBM datasets, dozens of different modifications were considered. In contrast, the MIM model performed well with only four modifications considered. On this basis, enzymatic cleavage descriptors were introduced to improve the models’ performance, yet the improvement was smaller than expected. This could be due to several factors. Firstly, peptide degradation is influenced by various biological and metabolic factors, which increases the complexity of prediction. Additionally, challenges in obtaining cleavage enzyme information for customizing cleavage descriptors further hindered model improvement. Moreover, inter-individual variations in age, gender, and physiological conditions complicate half-life prediction. However, transfer learning significantly outperformed the other methods by leveraging features and patterns learned from larger peptide sequence datasets, providing a robust tool for addressing issues related to data scarcity and domain disparities.
Analysis of structural information affecting peptide half-life
For five different datasets, RFECV-RF and SVR-ACO algorithms were employed to select the optimal feature subsets for subsequent modeling (Figs S6–S15). Here, we take the MIM dataset as a representative example due to its larger size and representativeness, offering richer information on peptide structure for a deeper understanding and interpretation of model results.
During the feature selection process, the built-in importance module of RF was applied to rank the features. The R2 reached its maximum value (0.722) when 31 descriptors were selected (Fig. 7A). When using the SVR-ACO algorithm, the pre-processed MIM dataset was fitted 10 times with 200 iterations each for feature selection, and the top 30 descriptors were selected to form the optimal feature subset (Fig. 7C). We found that the feature subset selected by RFECV-RF exhibited better predictive performance, with the optimal model achieving an R2 of 0.73 for both the CV and the test. Therefore, we decided to use this feature subset for model interpretation.
Figure 7.
Feature selection and model interpretation for the MIM dataset. (A) Results of feature selection using RFECV-RF, with R2 as the evaluation metric. (B) The top 20 descriptors and their importance calculated by the RF algorithm. (C) Predictive results and feature subset size after 10 feature selections by SVR-ACO. The right Y-axis displays Q2 and RMSECV. (D) SHAP plot of the SVR model constructed using the ENZ descriptors (showing the top 10 features). R, F, and G represent arginine, phenylalanine, and glycine, respectively.
To explore the effects of molecular properties and in vivo proteases on peptide half-life, the feature importance computed by RF (Table S10) was combined with SHAP values to explain the ENZ-SVR model (Fig. S18). Comparing Fig. 7B and D, six of the top 10 features are shared by both methods: R (arginine), S24, Trypsin, AliphaticInd, Smax34, and G (glycine). Among these features, all except G (glycine) have a negative effect on the predicted half-life, meaning that the half-life becomes shorter as their values increase. S24 and Smax24 are both related to the Energy State (E-State) of atoms and can promote peptide degradation due to their high reactivity. Trypsin is an enzyme capable of hydrolyzing peptides, and it accelerates peptide degradation as the number of cleavage sites increases. AliphaticInd reflects the relative content of aliphatic amino acids in a peptide and also has a negative contribution. R (arginine) carries a positive charge that causes peptide instability, while G (glycine) has a simple side chain that facilitates half-life prediction. Some of these conclusions overlap with previous studies [23]. Furthermore, the top 10 descriptors simultaneously included small molecular descriptors, peptide descriptors, and enzymatic cleavage descriptors, indicating their coordinated role in prediction.
Representative cases and external validation
The model’s strong performance not only enables prediction of unknown peptide half-lives but also provides multiple levels of evidence and guidance for improving half-life. First, our findings can explain changes in the half-life of natural peptides after modification. For example, ALP1 (RWCVYARVRGVRYRRCW) and ALP2 (RWCVYACVRGVCYRRCW) were studied as examples of peptides with different modification types. When the ‘C’ (cysteine) residues at positions 7 and 12 of the ALP2 sequence were replaced by ‘R’ (arginine), the half-life decreased from 24 to 4 h (Fig. 8A and C). Model interpretation revealed a negative influence of ‘R’ on half-life, with an increase in its content leading to a shorter half-life. Furthermore, when the ‘P’ (proline) at position 4 of EM-1 (YPFP-NH2) was replaced by ‘F’ (phenylalanine), the half-life decreased from 0.44 to 0.1 h (Fig. 8B and D), which is consistent with the model’s interpretation. Second, on a broader level, models for different species and organs can provide a more prospective assessment of the half-life of candidate peptides. For instance, the half-life of ghrelin was determined to be 27 min in mouse blood compared with 236 min in human blood. Similarly, CSA (LLLLPY) shows half-lives of 6 h in mouse blood and 2.8 h in human blood. On one hand, these results can inform the preclinical stages of animal testing before human trials. On the other hand, they lay the foundation for establishing scaling factors across species and organs, facilitating the development of new models.
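The link between arginine content and trypsin susceptibility in the ALP1/ALP2 case can be illustrated with a short sketch. It assumes the commonly used simplified trypsin rule (cleavage C-terminal to K or R, except when the next residue is P). The function name `count_tryptic_sites` is hypothetical, and this is not the enzymatic descriptor implementation used in this study.

```python
# Illustrative sketch only: counts arginine residues and putative trypsin
# cleavage sites with the simplified rule "cleave after K or R unless the
# next residue is P". This rule is an assumption, not the study's own code.

def count_tryptic_sites(sequence: str) -> int:
    """Count internal peptide bonds that match the simplified trypsin rule."""
    sites = 0
    # Walk over adjacent residue pairs; the C-terminal residue has no bond after it.
    for residue, next_residue in zip(sequence, sequence[1:]):
        if residue in "KR" and next_residue != "P":
            sites += 1
    return sites


# Sequences as given in the text.
peptides = {
    "ALP1": "RWCVYARVRGVRYRRCW",
    "ALP2": "RWCVYACVRGVCYRRCW",
}

summary = {
    name: {"R_count": seq.count("R"), "tryptic_sites": count_tryptic_sites(seq)}
    for name, seq in peptides.items()
}

for name, values in summary.items():
    print(name, values)
```

Under this simplified rule, ALP1 has six arginines and six putative cleavage sites, whereas ALP2 has four of each, which is in line with the shorter half-life reported for ALP1.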
Figure 8.
Cases and validation results obtained with our method. (A) Structures of naturally occurring ALP1 and ALP2. (B) Chemical structures of EM-1 and EM-2. (C) The half-lives of ALP1 and ALP2. (D) The half-lives of EM-1 and EM-2. (E) The actual and predicted values of three approved peptides. (F) Structures of trofinetide, motixafortide, and pentagastrin.
To further validate the accuracy of the models, trofinetide and motixafortide, two peptides newly approved in 2023, along with pentagastrin, which was not included in the training set, were selected for external validation. The half-lives of these peptides measured in human blood were 90, 120, and 10 min, respectively. The HBM model predicted 51, 75, and 22 min, respectively (Fig. 8E and F). At the minute scale, the predicted values were reasonably close to the measured values. In general, half-life predictions within specific ranges are widely accepted [38–40]. Therefore, fold errors (Folds) were used to assess model predictions, defined as fold = 1 + |Ypred − Ytrue|/Ytrue. Trofinetide and motixafortide fell within a two-fold error range (folds of about 1.43 and 1.38), while pentagastrin was within a three-fold error range (fold of 2.2), demonstrating that the model has good predictive performance.
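The fold-error calculation can be reproduced with a few lines of Python. The sketch below applies the definition given in the text to the three measured and predicted half-lives. The function and variable names are illustrative choices and do not come from the authors' code.

```python
# Minimal sketch of the fold-error metric defined in the text:
# fold = 1 + |Ypred - Ytrue| / Ytrue. Names are illustrative assumptions.

def fold_error(y_pred: float, y_true: float) -> float:
    """Return the fold error of a prediction relative to the measured value."""
    return 1.0 + abs(y_pred - y_true) / y_true


# (measured, predicted) half-lives in human blood, in minutes, as reported above.
cases = {
    "trofinetide": (90.0, 51.0),
    "motixafortide": (120.0, 75.0),
    "pentagastrin": (10.0, 22.0),
}

folds = {name: fold_error(pred, true) for name, (true, pred) in cases.items()}

for name, fold in folds.items():
    # Report the smallest whole-number fold range that contains the prediction.
    bracket = 2 if fold <= 2 else 3 if fold <= 3 else None
    print(f"{name}: fold = {fold:.2f}, within {bracket}-fold")
```

With this definition, a perfect prediction gives a fold of 1, and the metric treats over- and under-prediction of the same absolute size identically.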
At the early stages of drug discovery, researchers can input peptide data into our predictive models to obtain an initial estimate of peptide half-life in different species and organs, thus providing more accurate guidance for the development of peptide drugs. Furthermore, as experimental data accumulate, it may become possible to infer interspecies and interorgan coefficients. This would not only enhance the predictive accuracy of the model but also contribute to a deeper understanding of the fundamental differences in drug metabolism among species.
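As a rough illustration of why more data are needed before such coefficients can be inferred, the sketch below computes the human-to-mouse half-life ratio for the two peptides discussed earlier (ghrelin and CSA). Treating a simple per-peptide ratio as an "interspecies coefficient" is an assumption made here for illustration only. The study does not define such a coefficient.

```python
# Illustrative sketch: per-peptide human/mouse half-life ratios in blood.
# Using a plain ratio as an interspecies coefficient is an assumption.

MINUTES_PER_HOUR = 60.0

# Half-lives in minutes, taken from the values quoted in the text.
half_lives_min = {
    "ghrelin": {"human": 236.0, "mouse": 27.0},
    "CSA": {"human": 2.8 * MINUTES_PER_HOUR, "mouse": 6.0 * MINUTES_PER_HOUR},
}

ratios = {
    name: values["human"] / values["mouse"]
    for name, values in half_lives_min.items()
}

for name, ratio in ratios.items():
    print(f"{name}: human/mouse half-life ratio = {ratio:.2f}")
```

The two ratios point in opposite directions (roughly 8.7 for ghrelin and 0.47 for CSA), which suggests that a single global scaling factor would not describe both peptides and that any coefficient would likely need to depend on peptide properties.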
Comparison and future challenges
Predicting the half-life of peptides poses significant challenges. On one hand, limited training data make it difficult for models to adequately capture the complex information in peptide structures. On the other hand, traditional peptide descriptors struggle to describe the behavior of highly flexible peptides in complex biological processes. As mentioned earlier, some half-life prediction models have been reported, but their focus differs. The goal of this work is to provide a comprehensive solution for the in silico evaluation of peptide half-life. Moreover, we have made advances in both data and model performance. In terms of data size, previous models usually included only around 300 data points, whereas our study collected about 950 peptides across five subgroups. In terms of predictive performance, our models improved by about 15% in correlation (0.78 versus 0.94) by introducing enzymatic features and applying a transfer learning strategy. In addition, the model’s predictive ability was interpreted and validated through specific examples. This study provides a more prospective assessment of the half-life of candidate peptides and offers crucial references for their structural modification.
Although our models have achieved commendable performance, there is still room for improvement. First, the training dataset needs to be enlarged, which will improve prediction performance, especially for modified peptides, and will also broaden the application domain of the models. Second, further research is needed to explore practical application scenarios for half-lives in different species and organs during drug discovery, and to uncover the underlying relationships between them. The extrapolation of scaling factors across species and organs is an important subject that would allow a more accurate assessment of half-life. In addition, information about cleavage enzymes is relatively difficult to obtain, and some enzymatic cleavage information closely related to peptide hydrolysis cannot be obtained under existing conditions. Future studies could seek breakthroughs in obtaining such enzymatic information. Some models for predicting protein and peptide cleavage sites have already been reported [41, 42], and it may be useful to combine these models with half-life prediction.
Conclusion
In this study, we proposed and implemented a prospective artificial intelligence-based system that achieved accurate half-life prediction based on a systematic comparison between different molecular representations and advanced ML algorithms, bridging the evaluation of two important species (human, mouse) and two organs (blood, intestine). At the initial stage of the study, we constructed high-quality datasets for multiple species and organs, and then combined enzymatic descriptors with traditional peptide descriptors to achieve a better representation of peptides. At the methodological level, we progressively improved the prediction models, starting from traditional ML methods and gradually introducing DNNs as well as a transfer learning strategy to systematically build robust models with accurate performance. Results showed that introducing enzymatic cleavage features improved model performance, though only modestly, whereas applying transfer learning significantly enhanced the models’ predictive power. The optimal models achieved R2 values of 0.84, 0.90, 0.984, 0.93, and 0.94 on the test sets, an improvement of about 15% in correlation compared with related work. Furthermore, our findings were supported by specific examples and demonstrated significant potential for guiding structural modifications that improve peptide stability. We believe this work provides the first comprehensive system for half-life prediction across species and organs, helping researchers to evaluate peptide properties at the early drug discovery stage and advancing peptide drug development.
Key Points
Enzymatic cleavage features were first integrated for peptide half-life prediction.
A peptide half-life prediction system across species and organs was first established.
Transfer learning was successfully utilized and improved the model performance.
The best models achieved approximately 15% improvement compared to related studies.
Supplementary Material
Contributor Information
Xiaorong Tan, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Qianhui Liu, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Yanpeng Fang, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Sen Yang, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Fei Chen, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Jianmin Wang, The Interdisciplinary Graduate Program in Integrative Biotechnology and Translational Medicine, Yonsei University, 214, Veritas A Hall, Yonsei University, 85 Songdogwahak-ro, Incheon 21983, Republic of Korea.
Defang Ouyang, Institute of Chinese Medical Sciences (ICMS), State Key Laboratory of Quality Research in Chinese Medicine, University of Macau, Avenida da Universidade, Taipa, Macau 999078, China.
Jie Dong, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Wenbin Zeng, Xiangya School of Pharmaceutical Sciences, Central South University, No. 172 Tongzipo Road, Yuelu District, Changsha 410083, P.R. China.
Author contributions
Xiaorong Tan: Data Curation, Methodology, Investigation, Writing—original draft. Qianhui Liu: Formal analysis, Software, Validation. Yanpeng Fang: Visualization, Validation. Sen Yang: Data Curation, Software. Fei Chen: Resources, Supervision. Jianmin Wang: Software, Resources. Defang Ouyang: Resources, Supervision. Jie Dong: Conceptualization, Methodology, Writing—review and editing, Supervision, Project administration, Funding acquisition. Wenbin Zeng: Resources, Supervision, Funding acquisition, Writing—review and editing.
Conflict of interest
None declared.
Funding
This work was supported by the Central South University Innovation-Driven Research Program (2023CXQD004 to J. D.); the National Natural Science Foundation of China (82272067 to W. Z.).
Data availability
Additional data are made available in supplementary materials of this manuscript. The datasets will be shared on reasonable request to the corresponding author.
References
- 1. IUPAC-IUB Joint Commission on Biochemical Nomenclature (JCBN). Nomenclature and symbolism for amino acids and peptides. Recommendations 1983. Biochem J 1984;219:345–73. 10.1042/bj2190345.
- 2. Cooper BM, Iegre J, O'Donovan DH. et al. Peptides as a platform for targeted therapeutics for cancer: peptide-drug conjugates (PDCs). Chem Soc Rev 2021;50:1480–94. 10.1039/D0CS00556H.
- 3. Wei L, Ye X, Sakurai T. et al. ToxIBTL: prediction of peptide toxicity based on information bottleneck and transfer learning. Bioinformatics 2022;38:1514–24. 10.1093/bioinformatics/btac006.
- 4. Wei L, Ye X, Xue Y. et al. ATSE: a peptide toxicity predictor by exploiting structural and evolutionary information based on graph neural network and attention mechanism. Brief Bioinform 2021;22:bbab041. 10.1093/bib/bbab041.
- 5. Erzina D, Capecchi A, Javor S. et al. An immunomodulatory peptide dendrimer inspired from Glatiramer acetate. Angew Chem Int Ed Engl 2021;60:26403–8. 10.1002/anie.202113562.
- 6. George NL, Orlando BJ. Architecture of a complete Bce-type antimicrobial peptide resistance module. Nat Commun 2023;14:3896. 10.1038/s41467-023-39678-w.
- 7. Koo JH, Kim GR, Nam KH. et al. Unleashing cell-penetrating peptide applications for immunotherapy. Trends Mol Med 2022;28:482–96. 10.1016/j.molmed.2022.03.010.
- 8. Kumar R, Chaudhary K, Sharma M. et al. AHTPDB: a comprehensive platform for analysis and presentation of antihypertensive peptides. Nucleic Acids Res 2015;43:D956–62. 10.1093/nar/gku1141.
- 9. Tyagi A, Tuknait A, Anand P. et al. CancerPPD: a database of anticancer peptides and proteins. Nucleic Acids Res 2015;43:D837–43. 10.1093/nar/gku892.
- 10. Vilas Boas LCP, Campos ML, Berlanda RLA. et al. Antiviral peptides as promising therapeutic drugs. Cell Mol Life Sci 2019;76:3525–42. 10.1007/s00018-019-03138-w.
- 11. Zorko M, Jones S, Langel Ü. Cell-penetrating peptides in protein mimicry and cancer therapeutics. Adv Drug Deliv Rev 2022;180:114044. 10.1016/j.addr.2021.114044.
- 12. Madsen CT, Refsgaard JC, Teufel FG. et al. Combining mass spectrometry and machine learning to discover bioactive peptides. Nat Commun 2022;13:6235. 10.1038/s41467-022-34031-z.
- 13. Muttenthaler M, King GF, Adams DJ. et al. Trends in peptide drug discovery. Nat Rev Drug Discov 2021;20:309–25. 10.1038/s41573-020-00135-8.
- 14. Sharma K, Sharma KK, Sharma A. et al. Peptide-based drug discovery: current status and recent advances. Drug Discov Today 2023;28:103464. 10.1016/j.drudis.2022.103464.
- 15. Zhu Q, Chen Z, Paul PK. et al. Oral delivery of proteins and peptides: challenges, status quo and future perspectives. Acta Pharm Sin B 2021;11:2416–48. 10.1016/j.apsb.2021.04.001.
- 16. Swain JA, Walker SR, Calvert MB. et al. The tryptophan connection: cyclic peptide natural products linked via the tryptophan side chain. Nat Prod Rep 2022;39:410–43. 10.1039/D1NP00043H.
- 17. Zhang Y, Zhang Q, Wong CTT. et al. Chemoselective peptide cyclization and bicyclization directly on unprotected peptides. J Am Chem Soc 2019;141:12274–9. 10.1021/jacs.9b03623.
- 18. Fetse J, Kandel S, Mamani UF. et al. Recent advances in the development of therapeutic peptides. Trends Pharmacol Sci 2023;44:425–41. 10.1016/j.tips.2023.04.003.
- 19. Malhaire H, Gimel JC, Roger E. et al. How to design the surface of peptide-loaded nanoparticles for efficient oral bioavailability? Adv Drug Deliv Rev 2016;106:320–36. 10.1016/j.addr.2016.03.011.
- 20. Canalle LA, Löwik DW, van Hest JC. Polypeptide-polymer bioconjugates. Chem Soc Rev 2010;39:329–53. 10.1039/B807871H.
- 21. Wang Y, Cheetham AG, Angacian G. et al. Peptide-drug conjugates as effective prodrug strategies for targeted delivery. Adv Drug Deliv Rev 2017;110-111:112–26. 10.1016/j.addr.2016.06.015.
- 22. Lee MF, Poh CL. Strategies to improve the physicochemical properties of peptide-based drugs. Pharm Res 2023;40:617–32. 10.1007/s11095-023-03486-0.
- 23. Sharma A, Singla D, Rashid M. et al. Designing of peptides with desired half-life in intestine-like environment. BMC Bioinform 2014;15:282. 10.1186/1471-2105-15-282.
- 24. Mathur D, Singh S, Mehta A. et al. In silico approaches for predicting the half-life of natural and modified peptides in blood. PLoS One 2018;13:e0196829. 10.1371/journal.pone.0196829.
- 25. Cavaco M, Valle J, Flores I. et al. Estimating peptide half-life in serum from tunable, sequence-related physicochemical properties. Clin Transl Sci 2021;14:1349–58. 10.1111/cts.12985.
- 26. Mathur D, Prakash S, Anand P. et al. PEPlife: a repository of the half-life of peptides. Sci Rep 2016;6:36617. 10.1038/srep36617.
- 27. D'Aloisio V, Dognini P, Hutcheon GA. et al. PepTherDia: database and structural composition analysis of approved peptide therapeutics and diagnostics. Drug Discov Today 2021;26:1409–19. 10.1016/j.drudis.2021.02.019.
- 28. Usmani SS, Bedi G, Samuel JS. et al. THPdb: database of FDA-approved peptide and protein therapeutics. PLoS One 2017;12:e0181748. 10.1371/journal.pone.0181748.
- 29. Werle M, Bernkop-Schnürch A. Strategies to improve plasma half life time of peptide and protein drugs. Amino Acids 2006;30:351–67. 10.1007/s00726-005-0289-3.
- 30. Wilkins MR, Gasteiger E, Bairoch A. et al. Protein identification and analysis tools in the ExPASy server. Methods Mol Biol 1999;112:531–52.
- 31. Müller AT, Gabernet G, Hiss JA. et al. modlAMP: python for antimicrobial peptides. Bioinformatics 2017;33:2753–5. 10.1093/bioinformatics/btx285.
- 32. Dong J, Yao ZJ, Zhang L. et al. PyBioMed: a python library for various molecular representations of chemicals, proteins and DNAs and their interactions. J Cheminform 2018;10:16. 10.1186/s13321-018-0270-2.
- 33. Berthold MR, Cebron N, Dill F. et al. KNIME-the Konstanz information miner: version 2.0 and beyond. ACM SIGKDD Explorations Newsletter 2009;11:26–31. 10.1145/1656274.1656280.
- 34. Liu ZZ, Huang JW, Wang Y. et al. ECoFFeS: a software using evolutionary computation for feature selection in drug discovery. IEEE Access 2018;6:20950–63. 10.1109/ACCESS.2018.2821441.
- 35. Zeng W-F, Zhou X-X, Willems S. et al. AlphaPeptDeep: a modular deep learning framework to predict peptide properties for proteomics. Nat Commun 2022;13:7238. 10.1038/s41467-022-34904-3.
- 36. Deutsch EW, Bandeira N, Perez-Riverol Y. et al. The ProteomeXchange consortium at 10 years: 2023 update. Nucleic Acids Res 2023;51:D1539–D1548. 10.1093/nar/gkac1040.
- 37. Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970;48:443–53. 10.1016/0022-2836(70)90057-4.
- 38. Dong J, Wang NN, Yao ZJ. et al. ADMETlab: a platform for systematic ADMET evaluation based on a comprehensively collected ADMET database. J Cheminform 2018;10:29. 10.1186/s13321-018-0283-x.
- 39. Berellini G, Springer C, Waters NJ. et al. In silico prediction of volume of distribution in human using linear and nonlinear models on a 669 compound data set. J Med Chem 2009;52:4488–95. 10.1021/jm9004658.
- 40. Obach RS, Baxter JG, Liston TE. et al. The prediction of human pharmacokinetic parameters from preclinical and in vitro metabolism data. J Pharmacol Exp Ther 1997;283:46–58.
- 41. Maasch J, Torres MDT, Melo MCR. et al. Molecular de-extinction of ancient antimicrobial peptides enabled by machine learning. Cell Host Microbe 2023;31:1260–1274.e6. 10.1016/j.chom.2023.07.001.
- 42. Kroll A, Ranjan S, Engqvist MKM. et al. A general model to predict small molecule substrates of enzymes based on machine and deep learning. Nat Commun 2023;14:2787. 10.1038/s41467-023-38347-2.