Abstract
Cyclic peptides are promising drug candidates due to their ability to modulate intracellular protein–protein interactions, a property often inaccessible to small molecules. However, their typically poor membrane permeability limits therapeutic applicability. Accurate computational prediction of permeability can accelerate the identification of cell-permeable candidates, reducing reliance on time-consuming and costly experimental screening. Although deep learning has shown potential in predicting molecular properties, its application in permeability prediction remains underexplored. A systematic evaluation of these models is important to assess current capabilities and guide future development. In this study, we conduct a comprehensive benchmark of 13 machine learning models for predicting cyclic peptide membrane permeability. These models cover four types of molecular representations: fingerprints, SMILES strings, molecular graphs, and 2D images. We use experimentally measured PAMPA permeability data from the CycPeptMPDB database, comprising nearly 6000 cyclic peptides, and evaluate performance across three prediction tasks: regression, binary classification, and soft-label classification. Two data-splitting strategies, random split and scaffold split, are used to assess the generalizability of trained models. Our results show that model performance depends strongly on molecular representation and model architecture. Graph-based models, particularly the Directed Message Passing Neural Network (DMPNN), consistently achieve top performance across tasks. Regression generally outperforms classification. Scaffold-based splitting, although intended to more rigorously assess generalization, yields substantially lower model generalizability compared to random splitting. Comparing prediction errors with experimental variability highlights the practical value of current models while also indicating room for further improvement.
Supplementary Information
The online version contains supplementary material available at 10.1186/s13321-025-01083-4.
Keywords: Cyclic peptide, Membrane permeability prediction, Deep learning, Benchmark study
Scientific contributions
This study presents the first comprehensive benchmark of 13 machine learning models for predicting the membrane permeability of cyclic peptides. By evaluating multiple molecular representations and task formulations on a curated experimental dataset, the work reveals key performance trends, highlighting the superiority of graph-based models, the advantages of regression over classification, and the limitations of scaffold-based data splitting. These findings offer actionable insights for future model development and evaluation in cyclic peptide research.
Introduction
Cyclic peptides typically consist of 5–15 amino acid residues arranged in a ring structure, with molecular weights ranging from 500 to 2000 Daltons. They have emerged as promising scaffolds in drug discovery, capable of overcoming the intrinsic limitations of small molecule drugs (< 500 Daltons) in modulating protein–protein interactions (PPIs), particularly in cancer-related pathways [1–3]. However, cyclic peptides generally exhibit lower membrane permeability compared to small-molecule drugs, limiting their therapeutic application in targeting intracellular PPIs. Similar to small molecules, most cyclic peptides enter cells via passive diffusion, driven by the concentration gradient across the membrane [4]. The low membrane permeability of cyclic peptides is primarily due to their larger size and the presence of polar atoms in the peptide backbone, which collectively impose a high free energy barrier during membrane translocation. While the detailed mechanisms remain unclear, the “chameleon” property of peptides is believed to play a critical role in enhancing permeability. This property allows some peptides to undergo significant conformational changes, thereby reducing the exposure of polar atoms upon entering the membrane environment [5–7]. In fact, several strategies have been proposed to enhance membrane permeability by reducing the exposure of backbone NH groups, including N-methylation [8], amide-to-ester substitution [9], and the promotion of internal hydrogen bonding [10, 11].
Despite a few successful examples of cell-permeable cyclic peptides, the de novo design of such peptides remains challenging due to the lack of established design principles for enhancing permeability [4, 11, 12]. Assessing peptide permeability still relies heavily on experimental screening, which is time-consuming and costly. In silico prediction of cyclic peptide permeability could greatly expedite the development of cyclic peptide therapeutics. Conventional computational approaches typically rely on a quantitative structure–permeability relationship (QSPR) [13–15], which uses a small set of manually selected molecular descriptors believed to correlate with permeability, such as partition coefficient (logP), total polar surface area (TPSA), and number of hydrogen bond acceptors and donors. These handcrafted features are then used to train simple statistical or machine learning models. Although QSPR methods have been effective for small molecules, they perform less well for cyclic peptides, likely because they fail to capture conformational flexibility, a key factor in peptide membrane interactions. Alternatively, molecular dynamics (MD) simulations can sample the conformational space of cyclic peptides and estimate transfer free energies or diffusion coefficients at different membrane depths [16]. Although this method offers detailed physical insights, it is computationally expensive and often suffers from insufficient sampling in practice, leading to inaccurate predictions.
Recently, machine learning has shown promise in predicting molecular properties [17–31], where molecules can be represented through four primary approaches. The first approach involves the use of molecular descriptors or “fingerprints” derived from prior physicochemical knowledge, which are then fed into conventional machine learning models, such as Random Forest (RF) [32] and Support Vector Machine (SVM) [33]. The second approach represents molecules as strings, primarily using the Simplified Molecular Input Line Entry System (SMILES), enabling the application of natural language processing models, such as recurrent neural networks (RNNs). The third and increasingly popular approach treats molecules as graphs, where atoms serve as nodes and bonds as edges. When 3D structural information is available, pairwise atom distances can also be incorporated as edges [21]. Graph neural networks (GNNs) [22, 26, 27, 34] leverage this molecular representation effectively. Additionally, transformer-based methods can be adapted to handle molecular graphs [23–25], further expanding the versatility of graph-based molecular representations. Finally, a less commonly explored approach converts SMILES into 2D images, enabling the application of computer vision models, such as convolutional neural networks (CNNs) [28, 29].
Despite the increasing interest in AI-driven molecular property prediction, membrane permeability prediction has received limited attention in deep learning research. Corresponding data are rarely, if ever, used as benchmark datasets for model development. This gap is largely due to the scarcity and heterogeneity of experimental data for cyclic peptides, stemming from variations in assay conditions across different studies. The recent release of the Cyclic Peptide Membrane Permeability Database (CycPeptMPDB) [35], a curated database of 7334 cyclic peptides compiled from 47 studies, has sparked growing interest in this area. Notable efforts include a multimodal graph convolutional network-transformer model developed by Cao et al. [36] and a data augmentation approach by Yu et al. [37] aimed at enhancing model feature extraction. However, both studies compared their models only against a few relatively basic baselines. Given the broad availability of machine learning methods for molecular property prediction, a systematic benchmark is essential to evaluate their performance in membrane permeability prediction. This would provide a solid foundation for assessing and developing future models specifically designed for this task.
In this study, we comprehensively evaluated 13 machine learning and deep learning models for predicting the membrane permeability of cyclic peptides. These models span the four major representation strategies discussed above, enabling a holistic evaluation of their capabilities. Using a subset of nearly 6000 peptides from CycPeptMPDB, we assessed model performance by formulating permeability prediction as a regression task, binary classification, or soft-label classification.
By comparing the performance of models adopting different molecular representation approaches, our study shows that graph-based models, particularly the Directed Message Passing Neural Network (DMPNN) [26], exhibit superior performance across most metrics. Nonetheless, other methods, including relatively simple models such as RF and SVM, can achieve comparable performance. In terms of the Area Under the Receiver Operating Characteristic Curve (ROC-AUC), the regression approach generally outperforms classification approaches. To evaluate model robustness and generalizability, we also examined performance across two data-splitting approaches: random split and scaffold split. Contrary to common belief, models validated via the more rigorous scaffold split exhibited lower generalizability, likely due to reduced chemical diversity in the training data. Additionally, we explored whether incorporating auxiliary tasks, specifically predicting logP and TPSA, could improve feature extraction and model performance. Our study shows that introducing these auxiliary tasks, together or separately, provided limited or no benefit in terms of permeability prediction. Altogether, this benchmark study provides a comprehensive evaluation of state-of-the-art machine learning methods for cyclic peptide permeability prediction and offers practical guidance for future model development in this emerging area.
Methods
Dataset and Data Splitting
The cyclic peptide data used in this study were obtained from the CycPeptMPDB dataset, which contains 7334 peptides with sequence lengths ranging from 2 to 15 amino acids (Fig. 1a). Permeability values were compiled from 47 studies using various assays, including Parallel Artificial Membrane Permeability Assay (PAMPA) [38], Caco-2 cell monolayer assay (Caco2) [39], Madin-Darby canine kidney (MDCK) assay [40], and Ralph Russ canine kidney (RRCK) assay [41] (Fig. 1b). These values were reported on a logarithmic scale and clipped to the range of −10 to −4 in CycPeptMPDB. Peptides with permeability values of −6 or higher are generally considered cell-permeable [35].
Fig. 1.
Statistics of the CycPeptMPDB dataset. a Distribution of peptide lengths. b Distribution of assays used for permeability measurement. c Permeability value distribution of peptides measured via the PAMPA assay. Peptides with lengths 6, 7, and 10 were grouped, while peptides with lengths 8 and 9 were grouped separately
To narrow down the chemical space for model training, we focused on peptides with sequence lengths of 6, 7, or 10, as the data for other sequence lengths are too sparse. Permeability measurements from non-PAMPA assays were excluded to reduce experimental variability, resulting in a subset of 5758 peptides, including 68 peptides with multiple PAMPA measurements. These duplicate measurements were treated as independent samples but were consistently allocated to the training set during data splitting to avoid data leakage. This results in a total of 5826 samples in the subset. Alternatively, one could average the multiple PAMPA measurements and assign the mean value to each corresponding peptide. However, since only around 1% of peptides (68 of 5758) have duplicate measurements, the choice of strategy likely has a negligible impact on the overall benchmarking results.
Two data-splitting strategies were employed: random split and scaffold split. For the random split, the dataset was divided in an 8:1:1 ratio, resulting in 4674 samples for training, 576 for validation, and 576 for testing. This process was repeated 10 times with different random seeds. For the scaffold split, Murcko scaffolds were generated using the RDKit library [42], ignoring chirality differences. The scaffolds were sorted by sample frequency, assigning the most common scaffolds to the training set and the most diverse scaffolds to the test set. This split was performed within each sequence length in an 8:1:1 ratio and then merged, yielding 4721 samples for training, 554 for validation, and 551 for testing. The chemical similarity between the validation/test sets and the training set, under both random and scaffold splits, is shown in Fig. S1.
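The scaffold split described above can be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the code used in the study: it assumes the Murcko scaffold of each peptide has already been computed (for example with RDKit) and is passed in as a string, it assumes a greedy fill order (largest scaffold groups go to training first, then validation, then test), and the function names `scaffold_split` and `split_by_length` are hypothetical.

```python
from collections import defaultdict


def scaffold_split(scaffolds, frac_train=0.8, frac_valid=0.1):
    """Greedy scaffold split over precomputed scaffold strings.

    scaffolds[i] is the (assumed, precomputed) Murcko scaffold of sample i.
    Returns three lists of sample indices: train, valid, test.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Most frequent scaffolds first; ties broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    n = len(scaffolds)
    train_cut = frac_train * n
    valid_cut = (frac_train + frac_valid) * n
    train, valid, test = [], [], []
    for group in ordered:
        # A scaffold group is never divided between two sets.
        if len(train) + len(group) <= train_cut:
            train.extend(group)
        elif len(train) + len(valid) + len(group) <= valid_cut:
            valid.extend(group)
        else:
            test.extend(group)
    return train, valid, test


def split_by_length(scaffolds, lengths):
    """Apply the scaffold split within each sequence length, then merge."""
    by_length = defaultdict(list)
    for idx, length in enumerate(lengths):
        by_length[length].append(idx)

    train, valid, test = [], [], []
    for length in sorted(by_length):
        members = by_length[length]
        parts = scaffold_split([scaffolds[i] for i in members])
        for merged, part in zip((train, valid, test), parts):
            merged.extend(members[j] for j in part)
    return train, valid, test
```

Because whole scaffold groups are assigned to a single set, rare scaffolds accumulate in the validation and test sets, which is what makes this split harder than a random one.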
Additionally, the CycPeptMPDB database curated 201 peptides with sequence lengths of 8 and 9, with permeability measured by PAMPA. These peptides were used as an external test set to evaluate the model’s generalizability.
Label annotation
In this study, the prediction of permeability was formulated as three separate tasks: regression, binary classification, and soft-label classification. Sample annotations of each task were defined as follows:
Regression: The original permeability values were clipped to a restricted range and linearly normalized. This adjustment was made because experimental measurements become increasingly error-prone below −8, and predicting extremely low permeabilities provides limited practical value for drug discovery.
Binary classification: Samples with permeability values of −6 or higher were classified as positive (labeled as 1), while those with values below −6 were classified as negative (labeled as 0). This threshold aligns with previous studies and reflects a general consensus that peptides with permeability values of −6 or higher exhibit sufficient membrane permeability for intracellular activity [35–37].
Soft-label classification: To capture gradual differences in permeability and avoid the limitations of hard classification thresholds, we defined a soft label y as a function of the original permeability value P, using −6.5 and −5.5 as cutoffs.
The values −6.5 and −5.5 were selected empirically, as a 0.5-unit change in permeability corresponds to roughly a threefold difference in peptide concentration, rendering these cutoffs biologically meaningful.
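The binary and soft labels can be illustrated with a short Python sketch. The −6 threshold is the one given in the dataset description, and −6.5 and −5.5 are the soft-label cutoffs stated above. The shape of the soft label between the two cutoffs is an assumption made here for illustration (a linear ramp from 0 to 1); the function names are hypothetical as well.

```python
def clip(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def binary_label(log_perm, threshold=-6.0):
    """Return 1 for permeable peptides (value at or above the threshold), else 0."""
    return 1 if log_perm >= threshold else 0


def soft_label(log_perm, low=-6.5, high=-5.5):
    """Soft label in [0, 1].

    ASSUMPTION: linear interpolation between the two cutoffs;
    values at or below `low` map to 0 and values at or above `high` map to 1.
    """
    return clip((log_perm - low) / (high - low), 0.0, 1.0)


# A 0.5-unit shift on the log10 scale is roughly a threefold change.
fold_change = 10 ** 0.5
```

Under this assumed ramp, a peptide exactly at the −6 threshold receives a soft label of 0.5, so samples near the cutoff contribute a graded rather than an all-or-nothing training signal.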
Sample weight adjustment
As shown in Fig. 1c, in the classification tasks, there are more positive samples than negative samples among peptides with lengths 6, 7, and 10. To mitigate this mild class imbalance, we applied sample weighting in the cross-entropy loss function:
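$$\mathcal{L} = -\frac{1}{N}\sum_{n=1}^{N} w_n\left[y_n \log \hat{y}_n + (1 - y_n)\log\left(1 - \hat{y}_n\right)\right] \qquad (1)$$
where $w_n$ denotes the sample weight, $\hat{y}_n$ the prediction, and $y_n$ the label for sample n. N is the batch size.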
A uniform sample weight was assigned to all positive samples and another to all negative samples. These weights were computed from the training set to meet two criteria: (1) the total weight of positive samples equals that of negative samples, and (2) the sum of all sample weights equals the number of training samples. The validation and test sets inherited these class-specific weights from the training set.
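The two criteria above determine the class weights uniquely. If the training set has N samples, of which N_pos are positive and N_neg are negative, then each class must carry a total weight of N/2, giving w_pos = N / (2 N_pos) and w_neg = N / (2 N_neg). The following Python sketch computes these weights and applies them in a weighted binary cross-entropy. The function names are hypothetical, and averaging the loss over the batch size is an assumption of this sketch.

```python
import math


def class_weights(labels):
    """Return (w_pos, w_neg) so that both classes carry equal total weight
    and all sample weights sum to the number of samples."""
    n = len(labels)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("both classes must be present")
    return n / (2 * n_pos), n / (2 * n_neg)


def weighted_bce(preds, labels, weights, eps=1e-12):
    """Weighted binary cross-entropy, averaged over the batch (assumed)."""
    total = 0.0
    for p, y, w in zip(preds, labels, weights):
        p = min(max(p, eps), 1.0 - eps)  # guard against log(0)
        total += w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return -total / len(preds)
```

For example, with three positive samples and one negative sample, `class_weights([1, 1, 1, 0])` gives a weight of 2/3 to each positive and 2 to the negative, so each class contributes a total weight of 2.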
Model implementation
The 13 models evaluated in this study were implemented using different libraries, depending on their architecture and representation type. A brief description of each model is provided in the Supplementary Information.
Fingerprint-based models: SVM and RF were implemented using the scikit-learn library [43] with default parameters. The input features are molecular fingerprints obtained from CycPeptMPDB, comprising 210 descriptors: 208 calculated using RDKit and 2 principal components derived from them.
Graph-based models: Six GNN models were implemented using the DeepChem library, including AttentiveFP [31], DMPNN [26], GAT [34], GCN [27], MPNN [22], and PAGTN [30]. All models used their default hyperparameters, loss functions, and optimizers.
String-based models: Three RNN models, including vanilla RNN [44], GRU [45], and LSTM [46], were implemented using the PyTorch library, with tokenization provided by DeepChem. Their hyperparameters were optimized via grid search.
Image-based models: Two CNN models were evaluated. ChemCeption [28] was implemented using DeepChem with default settings, and ImageMol [29] was implemented using publicly available source code provided by the authors.
To support soft-label classification, we modified DeepChem’s loss function for classification tasks, as detailed in the Supplementary Information. Additionally, we modified the code of ImageMol to save predictions for generating performance metrics. These modifications did not affect model training or performance.
Model training
All deep learning models, except ImageMol, were pre-trained before training on the CycPeptMPDB dataset. For regression tasks, models were pre-trained on the Delaney dataset [47]. For classification tasks, models were pre-trained on the Blood-Brain Barrier Penetration dataset [48]. Both datasets are available in DeepChem. For ImageMol, we directly fine-tuned the pre-trained model shared by its authors. More details are provided in Table S1.
Each model was trained 10 times with different random seeds to account for initialization variability. For random split data, each random seed corresponds to a distinct split of data. In contrast, for scaffold split data, the split remained fixed across runs, with the random seed only affecting the initial model weights.
Results
Experimental data analysis
Experimental measurements of peptide membrane permeability are inherently noisy, introducing inconsistencies that complicate the development and evaluation of machine learning models. To better understand the reliability of these measurements, we systematically analyzed four types of variability in peptide permeability data:
Intra-record variability: In standard protocols, the permeability assay for each peptide was repeated two to three times, with the average taken as a single record in the CycPeptMPDB dataset. Intra-record variability represents variability among these repeated measurements and is generally irreducible, as the repetitions occur under nearly identical conditions.
Intra-report variability: In the report by Furukawa et al. [49], 21 peptides were recorded multiple times across different sub-libraries. Each record represents an average of repeated experiments and is considered a valid measurement in CycPeptMPDB. We define intra-report variability as the variability among these records, assuming that identical protocols and equipment were used, with only minor experimental variations, such as differences in buffer conditions.
Inter-report variability: Peptides measured in multiple studies using the same assay type can show discrepancies due to variations in equipment, protocols, or other experimental factors. This variability is defined as inter-report variability.
Inter-assay variability: Certain peptides are tested with different permeability assays, such as PAMPA, Caco-2, and RRCK. We define this variability as inter-assay variability, arising from inherent differences between assays.
The CycPeptMPDB dataset only records the average permeability value of two to three repeated experiments, without the value for each experiment. Thus, intra-record variability was estimated based on a report that provides such data [49]. Table 1 summarizes the statistics of the first three types of variability, focusing on the PAMPA assay, as the sample sizes for other assays are too small (Fig. 1b). Measurement variability statistics across assay types are also shown in Table 1. As expected, the intra-record variability is relatively small, establishing a lower bound for measurement error. The intra-report variability is slightly higher but remains smaller than the inter-report variability. Inter-report variability is significantly larger, nearly an order of magnitude higher than intra-report variability, which may pose a challenge for developing highly accurate permeability prediction models. The inter-assay variability is also high; however, this issue can be partially mitigated by excluding data from non-PAMPA assays, without significantly reducing the sample size.
Table 1.
List of measurement variabilities and corresponding values
| Variability type | Assay | Average variability | Peptides |
|---|---|---|---|
| Intra-record | PAMPA | | 688 |
| Intra-report | PAMPA | | 21 |
| Inter-report | PAMPA | | 19 |
| Inter-assay | PAMPA vs. Caco-2 | | 231 |
| Inter-assay | PAMPA vs. MDCK | | 17 |
| Inter-assay | PAMPA vs. RRCK | | 113 |
| Inter-assay | Caco-2 vs. RRCK | | 4 |
Performance comparison across models
We first evaluated 13 models by formulating permeability prediction as a regression task with random split data. Summary statistics are shown in Table 2. Performance metrics for two additional models, Multi_CycGT and MuCoCP, were sourced from their original reports but excluded from direct comparisons due to dataset differences. Although their code is publicly available, we were unable to train or fine-tune these models on our dataset due to missing instructions for MuCoCP and inconsistencies between the provided instructions and code for Multi_CycGT.
Table 2.
Performance of models in regression task on random-split test data for peptides of lengths 6, 7, and 10
| Represent molecule as | Method | MAE ↓ | RMSE ↓ | R² ↑ | Pearson r ↑ | ROC-AUC ↑ |
|---|---|---|---|---|---|---|
| Fingerprint | RF | 0.397 ± 0.014 | 0.556 ± 0.022 | 0.475 ± 0.038 | 0.692 ± 0.026 | 0.853 ± 0.011 |
| Fingerprint | SVM | 0.383 ± 0.012 | 0.566 ± 0.022 | 0.456 ± 0.040 | 0.683 ± 0.031 | 0.851 ± 0.012 |
| Graph | AttentiveFP | 0.367 ± 0.015 | 0.542 ± 0.023 | 0.502 ± 0.042 | 0.713 ± 0.028 | 0.860 ± 0.011 |
| Graph | DMPNN | 0.347* ± 0.014 | 0.512* ± 0.026 | 0.555* ± 0.034 | 0.750* ± 0.022 | 0.887* ± 0.011 |
| Graph | GAT | 0.385 ± 0.021 | 0.548 ± 0.028 | 0.489 ± 0.056 | 0.701 ± 0.041 | 0.854 ± 0.028 |
| Graph | GCN | 0.403 ± 0.022 | 0.562 ± 0.033 | 0.463 ± 0.055 | 0.686 ± 0.037 | 0.851 ± 0.016 |
| Graph | MPNN | 0.378 ± 0.010 | 0.536 ± 0.025 | 0.512 ± 0.043 | 0.718 ± 0.028 | 0.862 ± 0.013 |
| Graph | PAGTN | 0.416 ± 0.020 | 0.582 ± 0.024 | 0.424 ± 0.050 | 0.656 ± 0.036 | 0.833 ± 0.013 |
| String | RNN | 0.475 ± 0.017 | 0.686 ± 0.023 | 0.201 ± 0.048 | 0.473 ± 0.048 | 0.761 ± 0.021 |
| String | LSTM | 0.391 ± 0.033 | 0.590 ± 0.055 | 0.404 ± 0.118 | 0.664 ± 0.066 | 0.854 ± 0.022 |
| String | GRU | 0.378 ± 0.022 | 0.575 ± 0.030 | 0.437 ± 0.066 | 0.681 ± 0.036 | 0.859 ± 0.016 |
| Image | ChemCeption | 0.454 ± 0.019 | 0.640 ± 0.029 | 0.304 ± 0.054 | 0.565 ± 0.043 | 0.798 ± 0.018 |
| Image | ImageMol | 0.403 ± 0.018 | 0.561 ± 0.025 | 0.464 ± 0.045 | 0.688 ± 0.030 | 0.851 ± 0.018 |
| Mix | Multi_CycGT^a | 0.394 ± 0.011 | 0.518 ± 0.034 | 0.338 ± 0.090 | – | – |
| Mix | MuCoCP^a | – | 0.845 | 0.503 | – | – |
↑ and ↓ indicate that higher or lower values are better, respectively. The best value of each metric is indicated in bold, and * denotes a best value that is statistically superior to the second-best
^a Metric values were taken from original reports; these datasets differ but are all subsets of CycPeptMPDB
Most models achieved comparable performance, while the vanilla RNN showed the lowest accuracy. Unlike the more advanced string-based models, LSTM and GRU, the vanilla RNN suffers from vanishing gradients, limiting its ability to capture long-range dependencies. In contrast, LSTM and GRU incorporate gating mechanisms that improve memory retention and gradient flow.
As shown in Fig. 2, the prediction errors of these models are significantly lower than the inter-report variability of the PAMPA assay but higher than the intra-report MAE. This suggests that while these benchmark methods provide substantial practical value for permeability prediction, there is still considerable room for improvement.
Fig. 2.
Comparison of measurement variability of PAMPA (Table 1) and prediction errors across various methods for regression tasks on random-split test data for peptides with lengths 6, 7, and 10 (Table 2). Models with current accuracy may be suitable for coarse screening. For more demanding downstream applications, the model’s MAE should be comparable to intra-report variability, or ideally, match intra-record variability
Among these methods, DMPNN consistently achieved the highest performance across all evaluation metrics, statistically outperforming the second-best model in every metric. A distinctive characteristic of DMPNN is that it updates edge features during the message-passing phase, whereas other GNN-based benchmarks primarily update node features. This design aims to avoid “totters” [50], a phenomenon where a node sends back a message it received from a neighboring node in the previous graph layer. In molecular graphs, where each node (atom) typically has no more than four edges (covalent bonds), minimizing such loops could help improve information flow and representation. Since molecular properties often depend on interactions spanning multiple bonds, avoiding redundant message passing helps DMPNN capture long-range dependencies more effectively while preserving meaningful structural information. By ensuring that information flows efficiently without unnecessary cycles, this design likely contributes to DMPNN’s superior predictive performance.
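The idea of passing messages along directed edges without totters can be illustrated with a toy Python sketch. It is deliberately simplified and is not the DMPNN implementation benchmarked here: features are scalars, there are no learned weights or nonlinearities, and the function name `directed_message_pass` is hypothetical. It keeps one hidden state per directed edge v → w and, when updating that state, aggregates all edges entering v except the reverse edge w → v.

```python
def directed_message_pass(neighbors, node_feat, steps=2):
    """Toy edge-based message passing on an undirected graph.

    neighbors: dict mapping each node to a list of adjacent nodes.
    node_feat: dict mapping each node to a scalar feature.
    Returns a dict of hidden states keyed by directed edge (v, w).
    """
    # One hidden state per directed edge, initialised from the source node.
    init = {(v, w): node_feat[v] for v in neighbors for w in neighbors[v]}
    hidden = dict(init)
    for _ in range(steps):
        updated = {}
        for (v, w) in hidden:
            # Sum the states of edges u -> v, skipping u == w so that
            # w never receives its own message back (no totter).
            incoming = sum(hidden[(u, v)] for u in neighbors[v] if u != w)
            updated[(v, w)] = init[(v, w)] + incoming
        hidden = updated
    return hidden
```

On the three-atom chain 0–1–2 with features 1, 10, and 100, the state on edge 0 → 1 stays at 1 no matter how many steps are run: node 1's feature is never echoed back to it through node 0, whereas the state on edge 1 → 2 becomes 11 because it collects information from node 0.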
It is worth noting that in the report of Multi_CycGT [36], LSTM significantly underperforms the graph-based method across many metrics, particularly in R² and ROC-AUC. However, such a performance gap was not observed in our study, and the difference is unlikely to be attributed to variations in the datasets used in the two studies. Upon examining the published code of Multi_CycGT, we found that the embedding layer (torch.nn.Embedding) did not explicitly specify the padding token by setting the padding_idx argument. Padding is commonly used in string-based models to handle varying string lengths by filling shorter SMILES strings with a special token that is ignored during training. However, in Multi_CycGT, the model mistakenly treated these tokens as meaningful input, causing artificial signals from padded regions to distort model predictions. We tested this effect on our dataset and found that failing to specify padding_idx caused the MAE of the LSTM model to increase from 0.391 to 0.483 (data not shown), highlighting the critical importance of properly handling padding tokens in string-based models.
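The effect of an unhandled padding token can be illustrated with a toy example in plain Python. This is a sketch, not PyTorch code and not the Multi_CycGT implementation: the embedding tables, the token ids, and the helper names are all hypothetical. In PyTorch, passing `padding_idx` to `torch.nn.Embedding` keeps the embedding row of the padding token at zero and excludes it from gradient updates; the table `frozen_pad` below mimics that behaviour, while `trained_pad` mimics a padding row that has drifted to non-zero values during training.

```python
PAD = 0  # hypothetical id reserved for the padding token


def pad_to(token_ids, length):
    """Right-pad a token sequence with PAD up to the given length."""
    return token_ids + [PAD] * (length - len(token_ids))


def sum_pool(token_ids, table):
    """Sum the embedding vectors of all tokens in the sequence."""
    dim = len(table[PAD])
    return [sum(table[t][d] for t in token_ids) for d in range(dim)]


# Padding row treated as an ordinary, trainable embedding.
trained_pad = {PAD: [0.5, -0.5], 1: [1.0, 0.0], 2: [0.0, 1.0]}
# Padding row pinned to zero, as padding_idx would enforce.
frozen_pad = {**trained_pad, PAD: [0.0, 0.0]}

tokens = [1, 2]  # the same toy "SMILES" in both cases
short, long_ = pad_to(tokens, 2), pad_to(tokens, 4)
```

With `trained_pad`, the pooled representation of the same token sequence changes with the amount of padding, so the model sees a signal that depends only on batch layout. With `frozen_pad`, the representation is identical for both padded lengths.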
Regression vs classification
Besides the regression task, permeability prediction is often formulated as a binary classification task by setting a cutoff permeability value of −6 [36, 37]. Table S2 presents the performance metrics for this task. DMPNN demonstrates statistically significant superiority over other models in terms of accuracy (Acc), F1 score, and ROC-AUC, whereas its lead in precision (Pre) is not statistically significant.
Compared with regression, the ROC-AUC scores of binary-label classification decreased slightly for all models (Fig. 3), except for ImageMol. This decrease may be attributed to the strict cutoff threshold used for binarization, which can hinder model training. For instance, a small change in permeability across the cutoff flips the label entirely. To mitigate this issue, we also evaluated all deep learning models using soft labels, excluding ChemCeption and ImageMol due to the extensive code revisions required. Implementation details are provided in the “Methods” section. As shown in Table S3 and Fig. 3, soft labels improved ROC-AUC for all models except PAGTN. For example, the best ROC-AUC, achieved by DMPNN, increased from 0.872 with binary labels to 0.884 with soft labels, approaching that of the regression task (0.887).
Fig. 3.
Comparison of ROC-AUC scores across various methods for regression and classification (binary and soft labels) tasks on random-split test data for peptides with lengths 6, 7, and 10
Among all the performance metrics mentioned above, we place greater emphasis on ROC-AUC for two key reasons. First, it is practically useful when users need to adjust the threshold to increase true positives or reduce false negatives. Second, it facilitates comparison between regression and classification tasks. Based on ROC-AUC, permeability prediction generally performs best when formulated as a regression task, followed by soft-label classification, with binary-label classification yielding the lowest performance.
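To make the comparison between regression and classification concrete, the following Python sketch shows one way a ROC-AUC can be computed from regression outputs. It assumes that the true labels are obtained by thresholding the measured permeability at −6 and that the predicted permeability is used directly as the ranking score; this procedure and the function names are assumptions of the sketch rather than details confirmed by the study. The ROC-AUC is computed through its rank interpretation: the probability that a randomly chosen positive sample is scored higher than a randomly chosen negative one.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the fraction of positive/negative pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def regression_roc_auc(pred_log_perm, true_log_perm, threshold=-6.0):
    """Binarise the measured values at the threshold (assumed) and rank
    samples by the predicted permeability."""
    labels = [1 if t >= threshold else 0 for t in true_log_perm]
    return roc_auc(pred_log_perm, labels)
```

Because only the ranking of predictions matters, the same function applies unchanged to the probabilities produced by a binary or soft-label classifier, which is what allows the three task formulations to be compared on a single metric.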
Random split vs scaffold split
It is commonly recommended that model performance should be evaluated using scaffold split rather than random split, as scaffold split is believed to better assess a model’s ability to generalize to novel chemical spaces, which is an essential requirement for advancing drug discovery efforts [26, 51]. Scaffold split ensures that peptides sharing the same scaffold are grouped within either the training, validation, or test datasets. Grouping similar scaffolds together prevents information leakage between splits and ensures a more rigorous assessment of the model’s generalization to unseen chemical structures. Further details are provided in “Methods” section.
The performance of models trained on the scaffold split is shown in Tables S4 to S6. All models exhibit a significant performance drop compared to those trained on random split. This outcome is expected, as scaffold split reduces diversity in the training set while increasing diversity in the test set, making it a more rigorous evaluation method. A quantitative comparison of the chemical diversity in the training and test sets, split either randomly or by scaffold, is shown in Fig. S2. However, this stricter approach did not necessarily lead to better generalizability in chemical space. To examine this, well-trained models were evaluated on peptides with sequence lengths of 8 and 9 from CycPeptMPDB. Table 3 shows that models trained on random split data significantly outperformed those trained on scaffold split data. This discrepancy may be due to the greater chemical diversity present in the random split training set, which seems to enhance the model’s generalizability.
Table 3.
ROC-AUC Scores: models trained on peptide lengths 6, 7, and 10; tested on lengths 8 and 9
| Method | Random split: Regression | Random split: Classification (binary label) | Random split: Classification (soft label) | Scaffold split: Regression | Scaffold split: Classification (binary label) | Scaffold split: Classification (soft label) |
|---|---|---|---|---|---|---|
| RF | 0.450 ± 0.031 | 0.651 ± 0.041 | – | 0.638 ± 0.024 | 0.553 ± 0.034 | – |
| SVM | 0.655 ± 0.035 | 0.596 ± 0.018 | – | 0.557 | 0.539 ± 0.001 | – |
| AttentiveFP | 0.858 ± 0.081 | 0.862 ± 0.018 | 0.880 ± 0.048 | 0.641 ± 0.116 | 0.661 ± 0.054 | 0.644 ± 0.146 |
| DMPNN | 0.885 ± 0.046 | 0.848 ± 0.032 | 0.851 ± 0.070 | 0.723 ± 0.082 | 0.716 ± 0.079 | 0.720 ± 0.096 |
| GAT | 0.861 ± 0.058 | 0.839 ± 0.044 | 0.843 ± 0.083 | 0.695 ± 0.066 | 0.659 ± 0.096 | 0.695 ± 0.117 |
| GCN | 0.866 ± 0.064 | 0.844 ± 0.031 | 0.874 ± 0.048 | 0.655 ± 0.141 | 0.669 ± 0.104 | 0.694 ± 0.125 |
| MPNN | 0.862 ± 0.067 | 0.772 ± 0.055 | 0.826 ± 0.088 | 0.756 ± 0.080 | 0.734 ± 0.087 | 0.701 ± 0.143 |
| PAGTN | 0.858 ± 0.029 | 0.780 ± 0.062 | 0.847 ± 0.042 | 0.485 ± 0.181 | 0.682 ± 0.032 | 0.691 ± 0.037 |
| RNN | 0.601 ± 0.055 | 0.552 ± 0.080 | 0.590 ± 0.086 | 0.469 ± 0.047 | 0.525 ± 0.034 | 0.509 ± 0.028 |
| LSTM | 0.645 ± 0.150 | 0.568 ± 0.150 | 0.631 ± 0.066 | 0.505 ± 0.039 | 0.531 ± 0.100 | 0.501 ± 0.005 |
| GRU | 0.790 ± 0.093 | 0.731 ± 0.146 | 0.769 ± 0.108 | 0.682 ± 0.121 | 0.617 ± 0.138 | 0.635 ± 0.127 |
| ChemCeption | 0.484 ± 0.047 | 0.466 ± 0.051 | – | 0.436 ± 0.037 | 0.403 ± 0.043 | – |
| ImageMol | 0.839 ± 0.035 | 0.804 ± 0.019 | – | 0.732 ± 0.045 | 0.661 ± 0.037 | – |
The best metric value and those that are not statistically inferior to it are presented in bold
Auxiliary tasks
One common technique in deep learning to improve model performance is the use of auxiliary tasks [52], which provide additional guidance to help the model learn the main task more effectively. We explored this strategy by introducing two auxiliary tasks related to molecular solubility, a property strongly correlated with permeability [49]. High solubility increases a molecule’s concentration in extracellular environments but reduces its partition into lipid membranes. Conversely, low solubility favors membrane partition but reduces the solution concentration of the peptides. Consequently, membrane-permeable peptides typically exhibit moderate solubility.
We selected two widely used descriptors as solubility indicators: logP [53] and TPSA [54]. These descriptors were incorporated as auxiliary tasks during model training. A representative scatter plot of a DMPNN model trained with two auxiliary tasks is shown in Fig. 4. Although the model learns the auxiliary tasks well, with values exceeding 0.9, no significant improvement was observed in permeability prediction performance. More systematic experiments are summarized in Table S7, showing that neither logP nor TPSA, used individually or in combination, significantly improves permeability prediction. A possible explanation is that both logP and TPSA in CycPeptMPDB are calculated using fragment-based additive methods, which sum contributions from predefined chemical groups. These methods do not account for 3D structure or the burial of polar surfaces, which is especially relevant in cyclic peptides due to their conformational variability. As a result, these descriptors may fail to accurately capture the solubility-related features essential for membrane translocation.
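The idea of training with auxiliary tasks can be written as a weighted sum of per-task losses. The sketch below is a minimal illustration under stated assumptions: mean squared error for every task, a single shared weight `aux_weight` for the auxiliary terms, and toy normalized values. The function names and the weighting scheme are hypothetical; the loss actually used in the study is not described in this section.

```python
def mse(pred, target):
    """Mean squared error between two equally long sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(preds, targets, main_task="permeability", aux_weight=0.5):
    """Main-task loss plus a down-weighted sum of auxiliary-task losses.

    preds and targets are dicts keyed by task name (assumed layout).
    """
    main = mse(preds[main_task], targets[main_task])
    aux = sum(mse(preds[k], targets[k]) for k in preds if k != main_task)
    return main + aux_weight * aux


# Toy normalized predictions and targets for two molecules (hypothetical).
preds = {
    "permeability": [0.5, 0.2],
    "logP": [0.4, 0.6],
    "TPSA": [0.3, 0.7],
}
targets = {
    "permeability": [0.4, 0.4],
    "logP": [0.4, 0.5],
    "TPSA": [0.5, 0.7],
}

total = multitask_loss(preds, targets, aux_weight=0.5)
main_only = multitask_loss(preds, targets, aux_weight=0.0)
```

Setting `aux_weight` to zero recovers single-task training, which makes it straightforward to compare runs with and without the auxiliary terms.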
Fig. 4.
A representative scatter plot of a DMPNN model trained with two auxiliary tasks is shown for the random-split test data of peptides with lengths 6, 7, and 10. All three prediction tasks, permeability (a), LogP (b), and TPSA (c), were formulated as regression problems. To enable fair comparison across tasks, the ground truth values of all three properties were normalized to the range
Discussion
The low membrane permeability of cyclic peptides remains a significant barrier to their development as orally available therapeutics. Predictive modeling offers a potential solution for rapidly screening candidates with favorable permeability profiles. In this benchmark study, we present a comprehensive evaluation of 13 machine learning models for predicting cyclic peptide membrane permeability. Our evaluation spans four molecular representations, three task formulations, and two data-splitting strategies. While previous studies have focused on individual models, datasets, or feature engineering, our work systematically compares the performance of a broad range of approaches, offering a robust benchmark to guide future model development.
Our benchmark study suggests the following four preliminary observations from the CycPeptMPDB dataset:
Model architecture and molecular representation significantly affect performance. Graph-based models, especially DMPNN, consistently outperform fingerprint, string, and image-based models because they effectively capture both local chemical features and long-range dependencies. Although ImageMol has demonstrated strong performance on several benchmark datasets, it underperformed compared to multiple GNN-based models on the CycPeptMPDB dataset. Our results further indicate that, in this dataset, updating edge features during message passing may be more effective than the conventional approach of updating node features in GNN architectures.
Regression outperforms classification. Treating permeability as a continuous variable leads to higher accuracy, and soft-label classification serves as a useful compromise between regression and binary classification.
Models trained on a random split may generalize better than those trained on a scaffold split. Although scaffold splitting enables models to be validated and optimized on a chemically distinct validation set, which is intended to enhance generalizability, our results suggest that the diversity of the training data plays an even more crucial role in achieving generalization.
Auxiliary tasks may need to be related to the 3D or even 4D (dynamic) structure of molecules. Adding logP and TPSA as auxiliary tasks did not improve performance, likely because these descriptors fail to reflect conformational features important for permeability. Future models may benefit more from structure-aware auxiliary tasks.
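The contrast between binary and soft labels in the second observation can be sketched as follows. The soft-label scheme used in the study is defined outside this section, so the logistic mapping, the cutoff of -6.0 on the log permeability scale, and the `scale` parameter below are all assumptions chosen only to illustrate the idea.

```python
import math


def binary_label(log_perm, threshold=-6.0):
    """Hard label: 1 if the measured value reaches the assumed cutoff, else 0."""
    return 1 if log_perm >= threshold else 0


def soft_label(log_perm, threshold=-6.0, scale=0.5):
    """Soft label in (0, 1) from a logistic curve centered on the assumed cutoff.

    Values far above the cutoff approach 1, values far below approach 0,
    and values near the cutoff stay close to 0.5.
    """
    return 1.0 / (1.0 + math.exp(-(log_perm - threshold) / scale))


# Two peptides on either side of the cutoff (hypothetical values).
near_above, near_below = -5.9, -6.1
hard = (binary_label(near_above), binary_label(near_below))
soft = (soft_label(near_above), soft_label(near_below))
```

The two peptides receive opposite hard labels even though their measured values are almost identical, whereas their soft labels both stay near 0.5. This is one way to see why a soft-label formulation can sit between regression and binary classification.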
It is worth noting that hyperparameters for the benchmarked models were not exhaustively tuned due to time constraints. As such, the conclusions may differ if each model were evaluated under its optimal hyperparameter configuration.
A notable limitation of all the evaluated models is that they neglect 3D structural information, which plays a critical role in determining membrane permeability [5–7]. Although CycPeptMPDB provides 3D structures in chloroform, water, and vacuum environments, these were generated via simple energy minimization and do not reflect realistic peptide conformations. A curated dataset of high-quality 3D structures, derived from extensive sampling of conformational space, remains an unmet need. Moreover, conformational flexibility cannot be adequately represented by a single static structure; future models that incorporate multiple conformations may better capture the dynamic nature of peptide–membrane interactions, potentially improving prediction accuracy.
Acknowledgements
This work is partially supported by the Biomedical Research Council of the Agency for Science, Technology, and Research, Singapore. The computational work for this article was partially performed on the resources of the National Supercomputing Centre (NSCC), Singapore (https://www.nscc.sg). The authors also thank WuXi AppTec for their support of this project.
Abbreviations
- SMILES
Simplified Molecular-Input Line-Entry System
- PAMPA
Parallel Artificial Membrane Permeability Assay
- PPI
Protein–protein interaction
- QSPR
Quantitative structure–permeability relationship
- MD
Molecular dynamics
- RF
Random Forest
- SVM
Support Vector Machine
- RNN
Recurrent neural network
- GNN
Graph neural network
- CNN
Convolutional neural network
- CycPeptMPDB
Cyclic Peptide Membrane Permeability Database
- DMPNN
Directed Message Passing Neural Network
- ROC-AUC
Area under the receiver operating characteristic curve
- TPSA
Total polar surface area
- MDCK
Madin-Darby canine kidney
- RRCK
Ralph Russ canine kidney
- AttentiveFP
Attentive fingerprint network
- GAT
Graph Attention Network
- GCN
Graph Convolutional Network
- MPNN
Message-Passing Neural Network
- PAGTN
Path-Augmented Graph Transformer Network
- GRU
Gated Recurrent Unit
- LSTM
Long Short-Term Memory
Author contributions
W.L. and J.L. jointly conceptualized the study and proposed the research idea. W.L. implemented the computational framework, conducted the AI analysis, and prepared the initial draft. J.L. performed data acquisition, conducted chemical analysis, and contributed to manuscript revision. C.V. and H.K.L. provided overall supervision throughout the project and were involved in data analysis and manuscript editing. All authors reviewed and approved the final manuscript.
Funding
This research was supported by A*STAR Central Research Fund and A*STAR-WuXi AppTec Solitaire Grant.
Data availability
The processed data is available at https://github.com/Gobliu/BenchmarkCycPeptMP.
Declarations
Competing interests
The authors declare no competing interests.
Footnotes
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Wei Liu and Jianguo Li contributed equally to this work.
Contributor Information
Wei Liu, Email: liuwei@bii.a-star.edu.sg.
Hwee Kuan Lee, Email: leehk@bii.a-star.edu.sg.
References
- 1.Vinogradov AA, Yin Y, Suga H (2019) Macrocyclic peptides as drug candidates: recent progress and remaining challenges. J Am Chem Soc 141(10):4167–4181. 10.1021/jacs.8b13178
- 2.Buckton LK, Rahimi MN, McAlpine SR (2021) Cyclic peptides as drugs for intracellular targets: the next frontier in peptide therapeutic development. Chem Eur J 27(5):1487–1513. 10.1002/chem.201905385
- 3.Muttenthaler M, King GF, Adams DJ, Alewood PF (2021) Trends in peptide drug discovery. Nat Rev Drug Discov 20(4):309–325. 10.1038/s41573-020-00135-8
- 4.Dougherty PG, Sahni A, Pei D (2019) Understanding cell penetration of cyclic peptides. Chem Rev 119(17):10241–10287. 10.1021/acs.chemrev.9b00008
- 5.Ahlbach CL, Lexa KW, Bockus AT, Chen V, Crews P, Jacobson MP, Lokey RS (2015) Beyond cyclosporine A: conformation-dependent passive membrane permeabilities of cyclic peptide natural products. Future Med Chem 7(16):2121–2130. 10.4155/fmc.15.78
- 6.Wang CK, Swedberg JE, Harvey PJ, Kaas Q, Craik DJ (2018) Conformational flexibility is a determinant of permeability for cyclosporin. J Phys Chem B 122(8):2261–2276. 10.1021/acs.jpcb.7b12419
- 7.Li J, Kannan S, Aronica P, Brown CJ, Partridge AW, Verma CS (2022) Molecular descriptors suggest stapling as a strategy for optimizing membrane permeability of cyclic peptides. J Chem Phys 156(6):065101. 10.1063/5.0078025
- 8.Biron E, Chatterjee J, Ovadia O, Langenegger D, Brueggen J, Hoyer D, Schmid HA, Jelinek R, Gilon C, Hoffman A, Kessler H (2008) Improving oral bioavailability of peptides by multiple N-methylation: somatostatin analogues. Angew Chem Int Ed 47(14):2595–2599. 10.1002/anie.200705797
- 9.Hosono Y, Uchida S, Shinkai M, Townsend CE, Kelly CN, Naylor MR, Lee H-W, Kanamitsu K, Ishii M, Ueki R, Ueda T, Takeuchi K, Sugita M, Akiyama Y, Lokey SR, Morimoto J, Sando S (2023) Amide-to-ester substitution as a stable alternative to N-methylation for increasing membrane permeability in cyclic peptides. Nat Commun 14(1):1416. 10.1038/s41467-023-36978-z
- 10.Taechalertpaisarn J, Ono S, Okada O, Johnstone TC, Lokey RS (2022) A new amino acid for improving permeability and solubility in macrocyclic peptides through side chain-to-backbone hydrogen bonding. J Med Chem 65(6):5072–5084. 10.1021/acs.jmedchem.2c00010
- 11.Bhardwaj G, O’Connor J, Rettie S, Huang Y-H, Ramelot TA, Mulligan VK, Alpkilic GG, Palmer J, Bera AK, Bick MJ, Di Piazza M, Li X, Hosseinzadeh P, Craven TW, Tejero R, Lauko A, Choi R, Glynn C, Dong L, Griffin R, van Voorhis WC, Rodriguez J, Stewart L, Montelione GT, Craik D, Baker D (2022) Accurate de novo design of membrane-traversing macrocycles. Cell 185(19):3520–3532.e26. 10.1016/j.cell.2022.07.019
- 12.Rezai T, Yu B, Millhauser GL, Jacobson MP, Lokey RS (2006) Testing the conformational hypothesis of passive membrane permeability using synthetic cyclic peptide diastereomers. J Am Chem Soc 128(8):2510–2511. 10.1021/ja0563455
- 13.Refsgaard HHF, Jensen BF, Brockhoff PB, Padkjær SB, Guldbrandt M, Christensen MS (2005) In silico prediction of membrane permeability from calculated molecular parameters. J Med Chem 48(3):805–811. 10.1021/jm049661n
- 14.Oja M, Maran U (2015) Quantitative structure-permeability relationships at various pH values for acidic and basic drugs and drug-like compounds. SAR QSAR Environ Res 26(7–9):701–719. 10.1080/1062936X.2015.1085896
- 15.Lanevskij K, Didziapetris R (2019) Physicochemical QSAR analysis of passive permeability across Caco-2 monolayers. J Pharm Sci 108(1):78–86. 10.1016/j.xphs.2018.10.006
- 16.Sugita M, Sugiyama S, Fujie T, Yoshikawa Y, Yanagisawa K, Ohue M, Akiyama Y (2021) Large-scale membrane permeability prediction of cyclic peptides crossing a lipid bilayer based on enhanced sampling molecular dynamics simulations. J Chem Inf Model 61(7):3681–3695. 10.1021/acs.jcim.1c00380
- 17.Koutroumpa N-M, Tsoumanis A, Sarimveis H, Lynch I, Melagraki G, Afantitis A (2025) Prediction of blood-brain barrier and Caco-2 permeability through the Enalos Cloud Platform: combining contrastive learning and atom-attention message passing neural networks. J Cheminform 17(1):68. 10.1186/s13321-025-01007-2
- 18.Ulrich N, Voigt K, Kudria A, Böhme A, Ebert R-U (2025) Prediction of the water solubility by a graph convolutional-based neural network on a highly curated dataset. J Cheminform 17(1):55. 10.1186/s13321-025-01000-9
- 19.Naseem A, Alturise F, Alkhalifah T, Khan YD (2023) BBB-PEP-prediction: improved computational model for identification of blood-brain barrier peptides using blending position relative composition specific features and ensemble modeling. J Cheminform 15(1):110. 10.1186/s13321-023-00773-1
- 20.Bao Z, Tom G, Cheng A, Watchorn J, Aspuru-Guzik A, Allen C (2024) Towards the prediction of drug solubility in binary solvent mixtures at various temperatures using machine learning. J Cheminform 16(1):117. 10.1186/s13321-024-00911-3
- 21.Garcia Satorras V, Hoogeboom E, Welling M (2021) E(n) equivariant graph neural networks. arXiv preprint arXiv:2102.09844. Accessed 14 Nov 2024
- 22.Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum chemistry. arXiv:1704.01212
- 23.Fuchs FB, Worrall DE, Fischer V, Welling M (2020) SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks. arXiv:2006.10503
- 24.Liao Y-L, Smidt T (2023) Equiformer: equivariant graph attention transformer for 3D atomistic graphs. arXiv:2206.11990
- 25.Müller L, Galkin M, Morris C, Rampášek L (2024) Attending to graph transformers. arXiv:2302.04181
- 26.Yang K, Swanson K, Jin W, Coley C, Eiden P, Gao H, Guzman-Perez A, Hopper T, Kelley B, Mathea M, Palmer A, Settels V, Jaakkola T, Jensen K, Barzilay R (2019) Analyzing learned molecular representations for property prediction. J Chem Inf Model 59(8):3370–3388. 10.1021/acs.jcim.9b00237
- 27.Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks. arXiv:1609.02907
- 28.Goh GB, Siegel C, Vishnu A, Hodas NO, Baker N (2017) Chemception: a deep neural network with minimal chemistry knowledge matches the performance of expert-developed QSAR/QSPR Models. arXiv:1706.06689
- 29.Zeng X, Xiang H, Yu L, Wang J, Li K, Nussinov R, Cheng F (2022) Accurate prediction of molecular properties and drug targets using a self-supervised image representation learning framework. Nat Mach Intell 4(11):1004–1016. 10.1038/s42256-022-00557-6
- 30.Chen B, Barzilay R, Jaakkola T (2019) Path-augmented graph transformer network. arXiv:1905.12712
- 31.Xiong Z, Wang D, Liu X, Zhong F, Wan X, Li X, Li Z, Luo X, Chen K, Jiang H, Zheng M (2020) Pushing the boundaries of molecular representation for drug discovery with the graph attention mechanism. J Med Chem 63(16):8749–8760. 10.1021/acs.jmedchem.9b00959
- 32.Breiman L (2001) Random forests. Mach Learn 45(1):5–32
- 33.Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297
- 34.Veličković P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y (2018) Graph Attention Networks. arXiv:1710.10903
- 35.Li J, Yanagisawa K, Sugita M, Fujie T, Ohue M, Akiyama Y (2023) CycPeptMPDB: a comprehensive database of membrane permeability of cyclic peptides. J Chem Inf Model 63(7):2240–2250. 10.1021/acs.jcim.2c01573
- 36.Cao L, Xu Z, Shang T, Zhang C, Wu X, Wu Y, Zhai S, Zhan Z, Duan H (2024) Multi_CycGT: a deep learning-based multimodal model for predicting the membrane permeability of cyclic peptides. J Med Chem 67(3):1888–1899. 10.1021/acs.jmedchem.3c01611
- 37.Yu Y, Gu M, Guo H, Deng Y, Chen D, Wang J, Wang C, Liu X, Yan W, Huang J (2024) MuCoCP: a priori chemical knowledge-based multimodal contrastive learning pre-trained neural network for the prediction of cyclic peptide membrane penetration ability. Bioinformatics 40(8):btae473. 10.1093/bioinformatics/btae473
- 38.Ottaviani G, Martel S, Carrupt P-A (2006) Parallel artificial membrane permeability assay: a new membrane for the fast prediction of passive human skin permeability. J Med Chem 49:3948–54. 10.1021/jm060230+
- 39.Hidalgo IJ, Raub TJ, Borchardt RT (1989) Characterization of the human colon carcinoma cell line (Caco-2) as a model system for intestinal epithelial permeability. Gastroenterology 96(3):736–749
- 40.Dahlgren D, Lennernäs H (2019) Intestinal permeability and drug absorption: predictive experimental, computational and in vivo approaches. Pharmaceutics 11(8):411. 10.3390/pharmaceutics11080411
- 41.Di L, Whitney-Pickett C, Umland JP, Zhang H, Zhang X, Gebhard DF, Lai Y, Federico JJ, Davidson RE, Smith R, Reyner EL, Lee C, Feng B, Rotter C, Varma MV, Kempshall S, Fenner K, El-Kattan AF, Liston TE, Troutman MD (2011) Development of a new permeability assay using low-efflux MDCKII cells. J Pharm Sci 100(11):4974–4985. 10.1002/jps.22674
- 42.Landrum G, contributors: RDKit: open-source cheminformatics. https://www.rdkit.org
- 43.Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, Vanderplas J, Passos A, Cournapeau D, Brucher M, Perrot M, Duchesnay E (2011) Scikit-learn: machine learning in Python. J Mach Learn Res 12:2825–2830
- 44.Elman JL (1990) Finding structure in time. Cogn Sci 14(2):179–211
- 45.Ravanelli M, Brakel P, Omologo M, Bengio Y (2018) Light gated recurrent units for speech recognition. IEEE Trans Emerg Topics Comput Intell 2(2):92–102. 10.1109/tetci.2017.2762739
- 46.Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
- 47.Delaney JS (2004) ESOL: estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci 44(3):1000–1005. 10.1021/ci034243x
- 48.Martins IF, Teixeira AL, Pinheiro L, Falcao AO (2012) A Bayesian approach to in silico blood-brain barrier penetration modeling. J Chem Inf Model 52(6):1686–1697. 10.1021/ci300124c
- 49.Furukawa A, Townsend CE, Schwochert J, Pye CR, Bednarek MA, Lokey RS (2016) Passive membrane permeability in cyclic peptomer scaffolds is robust to extensive variation in side chain functionality and backbone geometry. J Med Chem 59(20):9503–9512. 10.1021/acs.jmedchem.6b01246
- 50.Mahé P, Ueda N, Akutsu T, Perret J-L, Vert J-P (2004) Extensions of marginalized graph kernels. In: Proceedings of the twenty-first international conference on machine learning. ICML ’04, p 70. Association for computing machinery, New York, NY, USA. 10.1145/1015330.1015446
- 51.Wu Z, Ramsundar B, Feinberg EN, Gomes J, Geniesse C, Pappu AS, Leswing K, Pande V (2018) MoleculeNet: a benchmark for molecular machine learning. Chem Sci 9:513–530. 10.1039/C7SC02664A
- 52.Liebel L, Körner M (2018) Auxiliary tasks in multi-task learning. arXiv:1805.06334
- 53.Wildman SA, Crippen GM (1999) Prediction of physicochemical parameters by atomic contributions. J Chem Inf Comput Sci 39(5):868–873. 10.1021/ci990307l
- 54.Ertl P, Rohde B, Selzer P (2000) Fast calculation of molecular polar surface area as a sum of fragment-based contributions and its application to the prediction of drug transport properties. J Med Chem 43(20):3714–3717. 10.1021/jm000942e




