Scientific Reports. 2026 Feb 12;16:6580. doi: 10.1038/s41598-026-39985-4

De novo generation and in silico screening of anti-diabetic peptide candidates via a deep learning–attention framework with physicochemical feature fusion

Zahra Rahmani Asl 1, Khosro Rezaee 1, Mojtaba Ansari 1, Hadi Zare-Zardini 1, Hossein Eslami 1
PMCID: PMC12909783  PMID: 41680447

Abstract

Therapeutic anti-diabetic peptides (ADPs) are an emerging class of clinically relevant biologics with the potential to manage glycaemic levels. However, the majority of existing computational pipelines are either narrowly focused or unverified. Here, we present a principled, end-to-end, provenance-aware framework for the de novo generation, filtering, and classification of ADPs. For design, candidate peptides are derived from three complementary strategies: guided modification of functional motifs, recombination of conserved bioactive fragments, and a hybrid generative engine. Candidates are then subjected to biochemical triage (net charge, hydrophobicity, Boman index), homology screening, and APD-style predictors/calculators. For classification, we use a blend of (1) interpretable biochemical descriptors (net charge, hydrophobicity, Boman index) and (2) sequence-derived representations learned by a CNN-with-attention backbone, to parse local motifs and longer-range context. Classifier heads are automatically tuned with an Optimized Tree-structured Parzen Estimator (OptimizedTPE). Training used 238 experimentally validated ADPs with homology-aware splits (Train/Val/Internal-Test positives: 167/24/47) and a curated negative pool at a 2:1 ratio; additionally, Train-only weak-label augmentation added 412 screened positives (and matched negatives) for robustness. We report an evaluation on an independent, external panel of 180 peptides, fully disjoint from the training data in both source and time. On this unseen set, the model achieves ≈ 98.75% accuracy (F1 ≈ 0.985, precision ≈ 0.99, recall ≈ 0.98, specificity ≈ 0.99, ROC AUC ≈ 0.99). This suggests high sensitivity to true ADPs while tightly controlling false positives under realistic class imbalance. Taken together, these results make the framework a candidate for a reproducible, biologically grounded, in silico screening layer for metabolic peptide therapeutics.

Keywords: Anti-diabetic peptides, De novo peptide design, CNN–attention models, Physicochemical filtering, OptimizedTPE hyperparameter tuning, External generalization

Subject terms: Biotechnology, Computational biology and bioinformatics, Drug discovery

Introduction

Diabetes mellitus (DM) represents a significant and escalating global health concern: an estimated 537 million people were living with the disease in 2021 [1–5], facing the risk of life-threatening cardiovascular, renal, and neurological complications. Beyond clinical morbidity, the disease imposes a substantial and growing burden on quality of life and health economics. In general, DM is classified into Type 1 diabetes (T1D) and Type 2 diabetes (T2D): the former results from autoimmune β-cell destruction, while the latter is most commonly associated with insulin resistance and consequent β-cell dysfunction [2,3]. Adverse side effects, cost, and long-term resistance to insulin analogues and oral hypoglycaemics remain key issues for current therapeutic regimens [6,7]; consequently, there is a significant need for well-characterized modalities with durable, clinical-grade efficacy.

Anti-diabetic peptides (ADPs) have emerged as a biologically informed class of candidate molecules that can be rationally designed to regulate glycaemic control with low off-target toxicity and high specificity, through mechanisms including the induction of insulin secretion, insulin sensitization, and the promotion of glucose uptake [8–12]. Well-studied examples include GLP-1 and DPP-IV inhibitors, which act to upregulate insulin secretion, inhibit glucagon, delay gastric emptying, and prolong GLP-1 receptor activation [13,14]. However, the in silico discovery of novel ADPs remains a complex and expensive bottleneck: despite recent progress in the clinical maturation of peptide therapeutics [15,16], the vast combinatorial space of peptide sequences, compounded by the practical costs and timelines of in vitro and in vivo validation, renders computational approaches to ADP discovery especially valuable.

The functional activity of a peptide is conditioned by the physicochemical and structural properties of its constituent amino acids. Secondary motifs (e.g., α-helices and β-sheets), which stabilize the 3D conformation of a peptide and facilitate receptor binding, are well studied [17,18]. Parameters such as net charge, hydrophobicity, molecular weight, and isoelectric point (pI) are also critical, as they govern in vivo bioavailability and degradation [19]. The encoding of amino-acid sequences and their utilization for in silico prediction constitute an active area of methodological research: common examples include one-hot vectors, sequence embeddings, and position-specific scoring matrices (PSSMs) that capture sequence context. Convolutional neural networks (CNNs) are frequently employed for local motifs, while recurrent or Transformer-based deep learning (DL) architectures capture long-range interactions. To be effective, ADP predictors must balance high-resolution sequence representations with a well-curated, interpretable set of biochemical descriptors to support both mechanism-aware design and transparent experimental down-selection.

This methodological evolution mirrors broader trends in computational biology, where deep generative models have revolutionized the de novo design of therapeutic peptides across disease modalities beyond diabetes. Pioneering frameworks built on Variational Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Recurrent Neural Networks (RNNs) have been successfully deployed to generate novel antimicrobial peptides (AMPs) and anticancer peptides with optimized bioactivity and stability [20–22]. While these general-purpose frameworks provide a robust theoretical foundation for exploring the peptide chemical space, their direct translation to specific metabolic disorders remains challenging because of the distinct structural constraints and receptor-selectivity profiles required for effective glycemic control. Nevertheless, these foundational studies have paved the way for specialized architectures tailored to the discovery of metabolic regulators.

Building upon these general advances, several recent studies have applied DL specifically to predict or directly design ADPs. Chen et al. [8] proposed a DL system that combined molecular fingerprints with CNNs and deep neural networks (DNNs) to design GPR40-targeting peptides. Chang et al. [23] and Casey et al. [24] used hybrid systems-biology and machine learning (ML) approaches to model T2D pathogenesis and generate low-complexity ADPs that are active in vivo. DL has also been applied to drug repurposing and multi-disease ADP prediction [25], with recent comprehensive surveys provided by Huang et al. [26] and Zhao et al. [27]. Related work has focused on predicting disease progression in gestational diabetes [28]. These works demonstrate feasibility but reveal a need for improved generalization, calibration, and interpretability that has thus far limited practical translation.

Other recent work has sought to improve both accuracy and interpretability. ADP-Fuse is a stacked ML system with 22 features and an accompanying web server [29], and iPADD integrates multiple molecular fingerprints and classifiers, reporting competitive performance [30]. Guan et al. [31] introduced BERT-DPPIV, a Transformer-based predictor for DPP-IV inhibitory peptides. Other groups have benchmarked different classifiers [32], compared CNN-based predictors [33], and proposed optimized multi-algorithm pipelines for the de novo design of multifunctional peptides [34,35]. Structural tools such as AlphaFold have also enabled three-dimensional peptide modelling [36]. However, reproducible pipelines that explicitly couple transparent, biologically motivated features with state-of-the-art sequence encoders, and that are stress-tested on fully independent datasets, remain relatively scarce.

Here we present a model for ADP prediction that explicitly integrates sequence encodings with an informative set of established biochemical features to improve discovery. We use a hybrid CNN–Transformer architecture in which multiple CNNs are stacked with attention blocks for both local feature extraction and long-range contextual modelling. Our network integrates consistently reported biochemical descriptors such as net charge, hydrophobicity, and Boman index in support of interpretability. To improve training, we implement a sequence-augmentation strategy which leverages amino-acid substitutions, motif recombination, and variational autoencoder (VAE)–based generation. The raw outputs of this generative process are subjected to a multi-stage refinement protocol termed ‘biological filtering.’ This phase employs a series of ‘descriptor gates’—defined as computational thresholds based on physicochemical properties such as molecular weight and isoelectric point—to systematically exclude sequences that are structurally valid but biologically non-viable.

To avoid optimistic bias, we use the generated sequences—classified here as ‘weak-label positives’ due to their high probabilistic similarity to bioactive peptides despite the absence of ground-truth experimental validation—only for augmentation and not in the validation or test sets, which comprise only experimentally verified peptides. Furthermore, we assess the ‘origin affinity’ of these candidates to quantify their retention of key pharmacophoric features from the source training distribution while adapting to the target geometric constraints. In this way, our hybrid framework enables transparent and biologically informed ADP prediction with broad applicability to peptide-based drug design.

Our contributions are summarized as follows. While the underlying components (CNNs, attention, VAEs, and Bayesian hyperparameter optimization) are established, our primary methodological contribution is their leakage-safe, end-to-end integration into a reproducible ADP discovery workflow. Specifically, we develop: (1) a multi-view CNN–self-attention classifier that fuses residue-level sequence representations with interpretable physicochemical descriptors; (2) a Train-only augmentation and screening protocol that combines VAE-based generation and motif engineering with strict deduplication, novelty/homology constraints, and descriptor-gated triage to prevent cross-split contamination; (3) a systematic OptimizedTPE procedure for tuning downstream classifier heads under a fixed training protocol; (4) a transparent, reproducible ADP-prediction pipeline (with code availability) intended for peptide drug discovery; and (5) a rigorous evaluation protocol featuring homology-aware splitting, an independent unseen experimental test set, and calibration analyses under realistic class imbalance.

The remainder of this paper is structured as follows. In Sect. 2, we provide details of our materials and methods. In Sect. 3, we present results, including ablation studies and external generalization on an independent experimental peptide set, followed by peptide analysis. In Sect. 4, we perform interpretability and biological-relevance analyses, and in Sect. 5 we discuss our results in light of related work, along with their implications and limitations. In Sect. 6, we summarize our work and outline directions for future research, including experimental validation.

Materials and methods

A deep-learning framework is introduced to help overcome the challenges of peptide complexity and improve the identification of anti-diabetic peptides (ADPs). The approach combines CNN-based sequence modeling with the encoding of biochemical properties (net charge, hydrophobicity, and Boman index) and a voting scheme of classifiers (SVM, KNN, Decision Tree, XGBoost) tuned with OptimizedTPE hyperparameter optimization. Blending handcrafted features and interpretable machine learning with deep learning in a voting scheme balances interpretability and accuracy and, as shown in Fig. 1, yields a scalable, robust pipeline for peptide-based drug discovery.
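The voting scheme described above can be sketched with scikit-learn. This is a minimal sketch, not the paper's implementation: `GradientBoostingClassifier` stands in for XGBoost to avoid a third-party dependency, the fused feature matrix is synthetic, and all hyperparameters are placeholders rather than tuned OptimizedTPE values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

def build_voting_head(random_state=0):
    """Soft-voting ensemble over the four classifier heads named in the
    text. GradientBoostingClassifier is a stand-in for XGBoost; the
    hyperparameters are illustrative, not the OptimizedTPE-tuned values."""
    return VotingClassifier(
        estimators=[
            ("svm", make_pipeline(StandardScaler(),
                                  SVC(probability=True, random_state=random_state))),
            ("knn", make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))),
            ("dt", DecisionTreeClassifier(max_depth=6, random_state=random_state)),
            ("xgb", GradientBoostingClassifier(random_state=random_state)),
        ],
        voting="soft",  # average predicted probabilities across heads
    )

# Toy fused feature matrix: [CNN embedding dims ... | net charge | GRAVY | Boman]
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 11))
y = (X[:, -3] + X[:, -1] > 0).astype(int)  # synthetic labels, illustration only
clf = build_voting_head().fit(X, y)
print(clf.predict_proba(X[:2]).shape)  # (2, 2)
```

Soft voting requires every base estimator to expose `predict_proba`, which is why the SVC is constructed with `probability=True`.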

Fig. 1. Schematic overview of the proposed framework for ADP discovery. It comprises modules for biochemical feature extraction, CNN encoding, handcrafted feature fusion, and multi-classifier evaluation.

The framework can be used both to predict ADPs and to generate candidate peptides, which are used only for training-data augmentation. Input peptide sequences are encoded with biochemical properties. The CNN extracts high-level features from the sequences, which are concatenated with essential handcrafted features (net charge, hydrophobicity, and Boman index). The best model is selected using five-fold cross-validation on the training/validation partition. Its performance is reported on an experimentally verified test set that is independent of the training/validation split and was created with homology-aware splitting to avoid data leakage; no generated or pseudo-labeled sequence was included in the validation or test sets.

Dataset

Our initial corpus of experimentally verified ADPs was obtained from PEP-Lab (Antidiabetic activity) and serves as the primary positive set (n = 238) [37]. For negatives, we used SATPdb and THPdb2 together with curated neutral literature reporting no anti-diabetic/DPP-IV activity [38,39]. All sequences are provided in a consistent one-letter amino-acid code, with additional biochemical descriptors (net charge, hydrophobicity/GRAVY, Boman index) computed in a corpus-wide consistent manner. Positives are labeled only on the basis of explicit experimental evidence. APD6 is not used as a labeling source; we use its prediction/calculator tools only as a heuristic screen during Train-only augmentation to triage generated candidates [40,41].

Notably, negatives were curated as peptides with no reported anti-diabetic or DPP-IV inhibitory activity based on available evidence, sampled from external peptide databases and neutral literature cases. To reduce label noise and prevent distributional shortcuts, negatives were matched to the positive set as closely as possible in basic statistics (sequence length distribution, amino-acid composition profile, and molecular weight range). We then performed decontamination by removing any negative sequence exhibiting high homology to any positive (global sequence identity ≥ 70% against the curated ADP positives, computed using the same clustering/similarity protocol used for splitting), ensuring that "easy negatives" do not arise from near-duplicate positives. In addition, negatives were screened to avoid ADP-like characteristics by applying the same physicochemical plausibility checks used elsewhere in the pipeline (net charge, GRAVY/hydrophobicity, and Boman index bounds) and by excluding sequences flagged as ADP-like by independent predictor consensus. Finally, we adopt a 2:1 negative-to-positive ratio in each split to reflect realistic class imbalance while maintaining stable training; sensitivity to this ratio is reported in the calibration/robustness analysis.

Moreover, all synthetic peptide candidates were generated once from a fixed set of training-only seeds prior to any dataset partitioning. No split-specific regeneration was performed. After generation, the combined dataset (real and synthetic peptides) was subjected to homology-aware clustering using CD-HIT to define cluster-wise Train/Validation/Test splits, ensuring that no cluster was shared across splits.

While a 70% global sequence identity threshold was used as an operational CD-HIT parameter, we additionally performed a post hoc nearest-neighbor identity audit across splits. This analysis showed that no peptide pairs across Train, Validation, and Test exceeded 70% identity; moreover, the maximum observed identity was below 50%, with a median nearest-neighbor identity of approximately 35%. Thus, the effective separation between splits is substantially stricter than the nominal 70% threshold, mitigating concerns regarding homology-driven data leakage, particularly given the limited dataset size. Table 1 summarizes the anti-diabetic peptide (ADP) dataset, comprising 238 experimentally validated positive sequences from PEP-Lab and a curated negative pool constructed to maintain a 2:1 class ratio across all data splits.
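The nearest-neighbor identity audit described above can be sketched as follows. This is a simplified stand-in: `difflib.SequenceMatcher.ratio()` approximates global sequence identity cheaply, whereas the paper's audit uses the same alignment-based protocol as CD-HIT, so absolute percentages will differ.

```python
from difflib import SequenceMatcher

def nn_identity_audit(query_set, train_set):
    """For each query peptide, find its maximum (nearest-neighbor)
    similarity to any training peptide, then summarize the distribution.
    SequenceMatcher.ratio() is a proxy for alignment-based global identity."""
    stats = []
    for q in query_set:
        nn = max(SequenceMatcher(None, q, t).ratio() for t in train_set)
        stats.append(100.0 * nn)
    stats.sort()
    n = len(stats)
    median = stats[n // 2] if n % 2 else 0.5 * (stats[n // 2 - 1] + stats[n // 2])
    return {"median_nn_pct": median,
            "max_nn_pct": stats[-1],
            "pct_ge_70": 100.0 * sum(s >= 70 for s in stats) / n}

# Toy sequences for illustration only (not from the dataset)
train = ["GLSDGEWQLVLNVWGK", "IKLSPETKDNQ", "AAWKTFLTS"]
query = ["VLSEGEWQLVLHVWAK", "MKTAYIAKQR"]
print(nn_identity_audit(query, train))
```

A split passes the audit reported in the text when `pct_ge_70` is 0 and `max_nn_pct` stays below 50.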

Table 1.

Overview of the anti-diabetic peptide dataset.

Attribute: Description
Number of positive sequences (experimental): 238 ADPs (PEP-Lab, Antidiabetic category)
Negative pool (curated): Sufficient to support a 2:1 ratio in all splits (sampled from SATPdb/THPdb2/neutral literature; no reported ADP/DPP-IV activity)
Data type: Amino-acid sequences + biochemical descriptors (net charge, GRAVY, Boman index); optional structural notes; provenance (PMID/DOI)
Average sequence length: ~25 amino acids
Sequence format: One-letter amino-acid codes (standard residues only)
Biological function labels: ADP (experimental) for positives; non-ADP by available evidence for negatives
Screening utilities (augmentation only): APD6 calculators/prediction for heuristic triage of Train-only generated candidates (not used for evaluation labels)
Model usage: Train/Validation/Internal-Test (cluster-wise); Independent Unseen Test for final reporting only

Duplicates were removed, and homology-aware, cluster-wise splitting was enforced (e.g., CD-HIT at 70% identity), with entire clusters assigned to a single partition (Train/Validation/Internal-Test) [41] (split sizes and the enforced 2:1 negative-to-positive ratio are summarized in Table 2). Final reporting uses an independent, unseen experimental test set drawn from 2024–2025 literature (DPP-IV inhibitory peptides and clinically relevant GLP-1RAs), source- and time-disjoint from Train/Val [39,40] (the composition and provenance of this cohort are detailed in Table 3). Generated or pseudo-labeled sequences never enter Validation or any Test.

Table 2.

Overview of corpus partitioning strategy and maintained class balance across training, validation, internal-test, and unseen sets.

Partition Positives (experimental) Base negatives (2×) Augmented positives (Train-only) Augmented negatives (to keep 2:1) Effective total (pos/neg) Usage
Train 167 334 + 412 kept (from 835 generated) + 824 579 / 1158 → 2:1 Model fitting (augmented = weak/confidence-weighted)
Validation 24 48 0 0 24 / 48 → 2:1 Model selection and calibration
Internal-test 47 94 0 0 47 / 94 → 2:1 Auxiliary hold-out (sanity check)

Table 3.

Description of independent unseen evaluation set with experimentally verified peptides.

Subset Positives (experimental) Negatives Notes
Unseen-experimental 60 120 Recent DPP-IV peptides + GLP-1RAs; no overlap with training sources/time; 2:1 ratio

From the 167 experimental positives in the Train split, we produced 5 × 167 = 835 candidate peptides (motif-guided edits/recombinations + generative modeling). Candidates were retained if they (1) met physicochemical bounds (charge, hydrophobicity, Boman) and (2) passed APD6 screening (antidiabetic predictor/calculators) and a consensus of ADP-specific predictors.

Of these, 412 candidates are retained as weak-label, Train-only positives (confidence-weighted), and 824 Train-only negatives (curated/synthetic, screened to avoid ADP-like signals) are added to maintain the same 2:1 ratio after augmentation [42–45] (augmentation outcomes and the resulting Train class counts are reported in Table 2; the independent unseen set remains unaffected, see Table 3). Augmented samples are excluded from Validation and all Tests.

To ensure a rigorous and unbiased assessment of generalization capability, the independent test set was constructed using a strict temporal-split protocol. We conducted a systematic, date-restricted query of literature databases (PubMed and Google Scholar) specifically for studies published in 2024–2025—a period completely disjoint from the training data sources39,40,42. The search strategy employed broad keywords, including ‘novel anti-diabetic peptides’, ‘DPP-IV inhibitors’, and ‘GLP-1 receptor agonists’. To eliminate selection bias, we adhered to an inclusive protocol: every experimentally verified sequence identified in these search results was admitted to the external panel (n = 180), provided it consisted of standard amino acids. Crucially, unlike the training phase, we deliberately abstained from applying physicochemical pre-filters (e.g., net charge or Boman index constraints) to this dataset. This ensures that the evaluation reflects the model’s performance on the raw, uncurated distribution of pharmacologically relevant peptides as they appear in current literature, rather than a subset artificially aligned with the model’s priors.

As an additional integrity check on the data partitioning, we quantified sequence similarity between the Training set and each held-out set using a nearest-neighbor global identity audit (Table 4). For every peptide in Validation, Internal-Test, and the External experimental panel, we computed its maximum (nearest-neighbor) global sequence identity to any Training peptide and summarized the distribution. The observed median nearest-neighbor identities are ~ 35% across all three query sets and the maximum nearest-neighbor identities remain below 50%, with 0% of sequences in each query set reaching ≥ 70% identity to any Training sequence. This provides measurable evidence that the splits are free of cross-set near-duplicates at the 70% homology threshold.

Table 4.

Quantitative nearest-neighbor identity audit to the training set.

Comparison Query set (N; Pos/Neg) Reference set (N; Pos/Neg) Identity metric definition Median NN identity (%) Max NN identity (%) Evidence type Interpretation
Validation → train 72 (24/48) 501 (167/334) For each query peptide, compute the maximum (nearest-neighbor) global sequence identity to any Training peptide; summarize across the query set 34.89 44.44 Measured post-hoc audit Quantifies Train–Validation separation; no near-duplicates ≥ 70% identity (0% with NN ≥ 70%).
Internal-test → train 141 (47/94) 501 (167/334) Same as above 35.71 47.83 Measured post-hoc audit Quantifies Train–Test separation; no near-duplicates ≥ 70% identity (0% with NN ≥ 70%).
External experimental test → train 180 (60/120) 501 (167/334) Same as above 35.15 44.44 Measured post-hoc audit Quantifies external-to-train proximity; no near-duplicates ≥ 70% identity (0% with NN ≥ 70%).

Complementing this, we report clustering statistics under the same 70% identity criterion (Table 5). All sequences form singleton clusters at this threshold (median and maximum cluster size = 1), confirming that the dataset contains no groups of peptides with ≥ 70% identity and that cluster-wise assignment prevents family-level leakage between Training and the held-out splits.

Table 5.

Cluster statistics per split under the 70% identity threshold (cluster-wise split).

Split N sequences (experimental only) Positives Negatives CD-HIT threshold Cluster-wise assignment #Clusters Cluster size (median) Cluster size (max) Singleton clusters (%) Interpretation
Train 501 167 334 70% identity Entire clusters assigned to a single split 501 1 1 100.0 All Training sequences are singletons at the 70% threshold (no intra-Train pairs ≥ 70%).
Validation 72 24 48 70% identity Entire clusters assigned to a single split 72 1 1 100.0 Validation sequences are singletons at the 70% threshold.
Internal-Test 141 47 94 70% identity Entire clusters assigned to a single split 141 1 1 100.0 Internal-Test sequences are singletons at the 70% threshold.

Methods

To mitigate data sparsity, limited diversity, and variable biological relevance in ADP discovery, we propose a provenance-aware, multi-phase pipeline. The workflow (1) curates an experimentally validated core corpus (positives from PEP-Lab; negatives sampled to a fixed 2:1 negative-to-positive ratio), (2) enforces homology-aware, cluster-wise splitting (e.g., CD-HIT at 70% identity) and evaluates generalization on an independent, source- and time-disjoint unseen experimental test set, and (3) performs Train-only augmentation, generating 5× the number of training positives via motif-guided edits/recombinations and generative modeling, followed by multi-stage triage using physicochemical constraints (net charge, GRAVY, Boman index) and APD6 predictors/calculators as heuristic screens (in consensus with ADP-specific predictors). Augmented sequences are admitted only as weak-label, confidence-weighted training samples and are never allowed to enter Validation or any Test. To strengthen translational relevance, we also track organismal provenance for experimental peptides and infer the lowest common ancestor (LCA) for designed peptides via sequence-similarity analysis when possible.

Although the individual building blocks used here (CNN encoders, self-attention, physicochemical descriptors, deep generative models, and Bayesian hyperparameter optimization) are established, our primary methodological contribution is their leakage-safe, end-to-end integration into a reproducible ADP discovery workflow that couples Train-only generation and biological screening with a multi-view classifier and systematic tuning. In contrast to prior ADP predictors that often emphasize prediction of known sequences under internal resampling and may not fully rule out family-level leakage or uncontrolled use of synthetic variants, our framework makes leakage control explicit (homology-aware splitting; Train-only confinement of generated/pseudo-labeled sequences; deduplication and novelty/homology constraints before training admission). Finally, we fuse residue-level sequence representations (CNN–self-attention) with interpretable global descriptors and optimize downstream classifier heads via OptimizedTPE, with evaluation emphasizing PR-AUC, calibration, and rigorous statistical testing on the independent unseen cohort.

Initial peptide design

Deep learning–based generative models can propose previously unseen peptide sequences that expand chemical space while preserving biological regularities. The key ingredient of our pipeline is the generation module, which is used strictly for Train-only augmentation and is never permitted to enter Validation or any Test split. We use a portfolio of generators, including RNN/LSTM autoencoders, variational autoencoders (VAEs), and transformer language models (e.g., ProGen, ProtGPT2), to learn latent sequence patterns and propose candidate peptides. Importantly, these generated peptides are treated as proposals (weak-label candidates) rather than experimentally validated positives.

  1. VAE training data and preprocessing (Train-only): Our experimentally verified positive corpus contains n = 238 ADPs. After homology-aware, cluster-wise splitting (e.g., CD-HIT at 70% identity) with whole clusters assigned to a single partition, the Train split contains n = 167 experimental positives. The VAE is trained only on these Train positives to avoid leakage. Sequences are represented in a consistent one-letter amino-acid code and standardized to a fixed maximum length (e.g., 30 residues) via truncation/padding using the placeholder token “X”, enabling mini-batched training.
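The truncation/padding step with the placeholder token "X" can be sketched as follows, assuming the fixed maximum length of 30 residues given as the example in the text:

```python
def standardize(seq, max_len=30, pad_token="X"):
    """Truncate or right-pad a one-letter peptide sequence to a fixed
    length using the placeholder token "X", enabling mini-batched
    training (max_len = 30 as in the text's example)."""
    seq = seq.upper()[:max_len]
    return seq + pad_token * (max_len - len(seq))

print(standardize("IKLSPETK"))       # "IKLSPETK" followed by 22 "X" tokens
print(len(standardize("A" * 40)))    # 30
```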

  2. VAE architecture: In the VAE setting, a peptide sequence $x$ is generated from a latent vector $z$ sampled from a prior $p(z)$. The marginal likelihood is:

$$p(x) = \int p(x \mid z)\, p(z)\, dz \qquad (1)$$

We use a standard sequence VAE in which the encoder maps $x$ to a diagonal-Gaussian approximate posterior,

$$q_\phi(z \mid x) = \mathcal{N}\!\big(z;\ \mu_\phi(x),\ \operatorname{diag}(\sigma_\phi^2(x))\big) \qquad (2)$$

Latent sampling is performed using the reparameterization trick:

$$z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \qquad (3)$$

The decoder defines a conditional distribution over sequences given $z$, factorized autoregressively over positions:

$$p_\theta(x \mid z) = \prod_{t=1}^{L} p_\theta(x_t \mid x_{<t}, z) \qquad (4)$$

The marginal likelihood of a sequence under the generative model is:

$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz \qquad (5)$$

Training objective and optimization. The VAE is trained by maximizing the evidence lower bound (ELBO), which balances reconstruction quality and latent regularization:

$$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - \mathrm{KL}\big(q_\phi(z \mid x)\,\big\|\,p(z)\big) \qquad (6)$$

The reconstruction term is implemented as token-level cross-entropy (teacher forcing during training), and the KL term regularizes the latent space to support meaningful sampling.
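The negative ELBO used as the training loss (token-level cross-entropy plus a β-weighted closed-form KL term for a diagonal-Gaussian posterior) can be sketched in NumPy; shapes and the vocabulary size (20 amino acids plus the "X" pad) are assumptions for illustration:

```python
import numpy as np

def elbo_loss(logits, targets, mu, log_var, beta=1.0):
    """Negative ELBO for one sequence (Eq. 6), NumPy sketch.
    logits: (L, V) decoder scores per position (teacher forcing);
    targets: (L,) token ids; mu, log_var: (d,) encoder outputs;
    beta: KL warm-up weight (annealed 0 -> 1 during training)."""
    # reconstruction term: token-level cross-entropy
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    recon = -logp[np.arange(len(targets)), targets].sum()
    # closed-form KL between N(mu, diag(exp(log_var))) and N(0, I)
    kl = 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)
    return recon + beta * kl, recon, kl

rng = np.random.default_rng(0)
loss, recon, kl = elbo_loss(rng.normal(size=(30, 21)),   # 20 AAs + "X" pad
                            rng.integers(0, 21, size=30),
                            rng.normal(size=64), rng.normal(size=64))
print(round(float(loss), 2))
```

With `mu = 0` and `log_var = 0` the KL term vanishes, which is the sanity check that the latent posterior matches the prior.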

  3. Concrete training and architecture settings (used in our implementation): Unless otherwise stated, the VAE uses a lightweight RNN-based encoder–decoder backbone (GRU/LSTM family) suited to the relatively short peptide lengths: embedding size 128, hidden size 256, latent dimension 64, and 2 layers in both encoder and decoder, with dropout 0.20. Optimization uses AdamW with learning rate 1 × 10⁻³, batch size 64, and a maximum of 150 epochs. Early stopping is applied on a small held-out subset carved from the Train positives (distinct from the downstream Validation/Test splits) with a patience of 15 epochs. KL warm-up linearly anneals the KL weight $\beta$ from 0 to 1 over the first 30 epochs to improve stability and mitigate posterior collapse. (Implementation details and scripts are provided in the Supplementary File and the accompanying code repository.)

  4. Candidate generation and Train-only augmentation: After training, we generate candidates by sampling $z \sim p(z)$ and decoding autoregressively, using multinomial sampling from the decoder softmax with temperature 1.0 and top-$k$ = 20. Generated sequences are treated as proposals rather than experimentally validated positives. Each candidate is assigned physicochemical properties (net charge, hydrophobicity/GRAVY, Boman index, pI) using computational calculators, and is then subjected to a multi-stage Train-only filtering protocol prior to augmentation.

Net charge is computed at physiological pH (7.4) as:

$$Q = \sum_{i \in \text{basic}} \frac{1}{1 + 10^{\,\mathrm{pH} - \mathrm{p}K_{a,i}}} \;-\; \sum_{j \in \text{acidic}} \frac{1}{1 + 10^{\,\mathrm{p}K_{a,j} - \mathrm{pH}}} \qquad (7)$$

where $\mathrm{p}K_{a,i}$ and $\mathrm{p}K_{a,j}$ are the dissociation constants of the basic and acidic groups, respectively.
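Eq. (7) can be implemented directly. The pKa table below is a common textbook (Lehninger-style) set, not taken from the paper; APD-style calculators use slightly different tables, so absolute charges may differ by a few tenths:

```python
# pKa values: textbook set (assumption; published calculators differ slightly)
PKA_BASIC = {"Nterm": 9.0, "K": 10.5, "R": 12.5, "H": 6.0}
PKA_ACIDIC = {"Cterm": 3.1, "D": 3.65, "E": 4.25, "C": 8.3, "Y": 10.1}

def net_charge(seq, ph=7.4):
    """Eq. (7): sum of fractional Henderson-Hasselbalch charges of the
    basic groups minus those of the acidic groups at the given pH."""
    basic = [PKA_BASIC["Nterm"]] + [PKA_BASIC[a] for a in seq if a in "KRH"]
    acidic = [PKA_ACIDIC["Cterm"]] + [PKA_ACIDIC[a] for a in seq if a in "DECY"]
    q_pos = sum(1.0 / (1.0 + 10 ** (ph - pka)) for pka in basic)
    q_neg = sum(1.0 / (1.0 + 10 ** (pka - ph)) for pka in acidic)
    return q_pos - q_neg

print(round(net_charge("KKLR"), 2))  # ~ +2.97 at pH 7.4
```

At pH 7.4 each Lys/Arg contributes nearly +1 and each Asp/Glu nearly −1, while His (pKa ≈ 6.0) contributes only a small fractional positive charge.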

  5. Hybrid generation sources and screening gates: In addition to VAE sampling, we diversify the pool via (1) BLOSUM62-guided substitutions (e.g., Lys→Arg) and (2) motif recombination, where known peptides are segmented into functional motifs and recombined while maintaining plausible active domains. Candidates from all sources are merged and deduplicated, then must satisfy novelty constraints to prevent collapse onto the seed set and reduce homology-leakage risk: global identity < 70% to any Train positive and minimum edit distance ≥ 3–5 to the nearest Train neighbor. Surviving candidates are then triaged using (1) descriptor gates (net charge/GRAVY/Boman within admissible bounds), (2) APD6 calculators/prediction tools as a screen only (APD6 is not used as a labeling source), and (3) consensus with independent ADP predictors. Candidates passing all checks are admitted as weak-label, confidence-weighted Train-only positives (synthetic positives are down-weighted in the training loss with weight 0.5, while experimentally validated positives retain weight 1.0).
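The novelty constraints (global identity < 70% to any Train positive; edit distance ≥ 3 to the nearest Train neighbor) can be sketched with a plain Levenshtein distance. As a simplification, identity is approximated here as 1 − distance/length rather than the alignment-based global identity used in the paper's protocol:

```python
def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_novel(candidate, train_positives, max_identity=0.70, min_edit=3):
    """Novelty gate from the text: reject a candidate if it is >= 70%
    identical to, or fewer than min_edit edits away from, any Train
    positive. Identity here is a distance-based proxy."""
    for ref in train_positives:
        d = edit_distance(candidate, ref)
        identity = 1.0 - d / max(len(candidate), len(ref))
        if identity >= max_identity or d < min_edit:
            return False
    return True

seeds = ["IKLSPETKDNQ", "GLSDGEWQLVL"]  # toy seed set for illustration
print(is_novel("IKLSPETKDNQ", seeds))   # False: identical to a seed
print(is_novel("MWTAYRRQPGH", seeds))
```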

  6. Augmentation counts and class balance: From the 167 experimental positives in the Train split, we produce 5 × 167 = 835 Train-only candidate peptides (motif-guided edits/recombinations + generative modeling). After filtering, 412 candidates are retained as weak-label Train-only positives. To maintain the fixed 2:1 negative-to-positive ratio after augmentation, we add 824 matched Train-only negatives (curated/synthetic) screened to avoid ADP-like signals. Augmented samples are excluded from Validation and all Test splits; the independent unseen experimental evaluation set is unaffected by augmentation.

Provenance assignment for interpretability (post hoc). For each retained synthetic candidate, we also assign a nearest natural taxon estimate using sequence similarity search (e.g., DIAMOND/BLASTp) and report the inferred lowest common ancestor (LCA) along with identity and coverage. This provenance signal is used for interpretability only and is not used for training, tuning, or evaluation.

Figure 2 summarizes our hybrid Train-only augmentation pipeline: candidates are generated via BLOSUM62-guided substitutions, motif recombination, and VAE latent sampling, then pooled and deduplicated. The pooled set is filtered by leakage-safe novelty constraints (global identity < 70% and edit distance ≥ 3–5) and by physicochemical/predictor screening; only passing sequences are kept as confidence-weighted weak-label Train-only positives while Validation/Test splits remain generation-free.

Fig. 2.

Fig. 2

Hybrid train-only generation and screening workflow. Candidates from guided mutation (BLOSUM62), motif recombination and VAE sampling pass descriptor gates, APD6-based heuristic screening, and predictor consensus; novelty and homology constraints enforce non-redundancy.

After candidate generation (VAE sampling, motif recombination, and guided mutation), we applied a multi-stage Train-only triage prior to augmentation. Candidates were first filtered by physicochemical plausibility using explicit numeric thresholds: net charge at pH 7.4 in the range 0.9–4.0 and molecular weight in the range 2900–3300 Da. Surviving sequences were then subjected to an APD6-style heuristic screen based on sequence similarity to known anti-diabetic peptides, retaining candidates with similarity ≥ 80%. These thresholds were used strictly to enrich the Train-only proposal pool and to reject implausible or weakly related sequences; they were not used for evaluation, and no generated/pseudo-labeled peptides were allowed to enter Validation or any Test set.
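The numeric triage gates described above can be expressed as a single predicate; the function name and argument order are hypothetical:

```python
def passes_triage(charge, mw, similarity):
    """Train-only triage gates from the text: net charge at pH 7.4 in
    [0.9, 4.0], molecular weight in [2900, 3300] Da, and similarity >= 80%
    to a known anti-diabetic peptide (APD6-style heuristic screen)."""
    return (0.9 <= charge <= 4.0
            and 2900.0 <= mw <= 3300.0
            and similarity >= 0.80)
```

Only candidates passing all three checks would enter the Train-only weak-label pool; any failure (e.g., charge +5, or MW 2500 Da) rejects the sequence.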

Sequence encoding

The next step in building a deep learning model for anti-diabetic peptide classification is numerical encoding, in which the preprocessed amino acid sequences are converted into vectors that serve as input to the neural network. For example, given a raw amino acid sequence such as "IKLSPETK…", each residue is replaced by five numbers corresponding to hydrophobicity, charge, polarity, molecular mass, and aromaticity; in other words, each residue is encoded as a five-dimensional vector.

A custom lookup table for the 20 standard amino acids enables this conversion into structured numerical matrices. Sequences shorter than a predefined fixed length (e.g., 30 residues) are padded with a placeholder amino acid "X," assigned a zero vector to maintain dimensional consistency. Mathematically, let the peptide sequence be denoted by s = (r_1, r_2, …, r_L), where L is the sequence length. Each residue r_i is mapped to a 5-dimensional vector v_i ∈ R^5. The final matrix representation of the sequence is then defined as:

X = \left[v_1,\ v_2,\ \ldots,\ v_L\right]^{\mathsf{T}} \in \mathbb{R}^{L \times 5} \quad (8)

Here, the matrix X is the encoding of the peptide sequence that will be given as input to the CNN.

This encoding preserves the biological properties of the amino acids as well as the sequence order. The resulting encoded data has the shape of a three-dimensional array [L × 5 × 1], which can be thought of as a numerical image of a peptide. This encoding allows the CNN to learn local motifs and latent features important for peptide function, as biological activity is strongly dependent on the order and composition of amino acids. The encoding is also beneficial for concatenation with other descriptors (such as net charge and Boman index) in downstream modeling approaches. Table 6 lists the five biochemical features (hydrophobicity, charge, polarity, molecular mass, and aromaticity) that were used to encode the 20 amino acids of the standard genetic code, describing important structural and functional characteristics for use in deep learning. Table 7 provides a concise summary of these five features using brief conceptual definitions, as a basis for biologically informed sequence encoding.

Table 6.

A five-dimensional encoding scheme represents each of the 20 standard amino acids by key biochemical features—hydrophobicity, charge, polarity, molecular mass (Daltons), and aromaticity (0/1).

Amino acid Hydrophobicity Charge Polarity Molecular mass (Da) Aromaticity (0/1)
A 1.8 0 0 89.1 0
C 2.5 0 0 121.2 0
D − 3.5 − 1 1 133.1 0
E − 3.5 − 1 1 147.1 0
F 2.8 0 0 165.2 1
G − 0.4 0 0 75.1 0
H − 3.2 1 1 155.2 1
I 4.5 0 0 131.2 0
K − 3.9 1 0 146.2 0
L 3.8 0 0 131.2 0
M 1.9 0 0 149.2 0
N − 3.5 0 1 132.1 0
P − 1.6 0 0 115.1 0
Q − 3.5 0 1 146.2 0
R − 4.5 1 0 174.2 0
S − 0.8 0 1 105.1 0
T − 0.7 0 1 119.1 0
V 4.2 0 0 117.1 0
W − 0.9 0 1 204.2 1
Y − 1.3 0 1 181.2 1

This forms the basis for embedding peptide sequences as input to the deep learning model.

Table 7.

Intuitive description of the five salient biochemical properties used for numerical encoding of amino acid sequences in the bioactive peptide model.

Feature Explanation
Hydrophobicity Indicates the amino acid’s tendency to remain in lipid-friendly (nonpolar) environments. Higher values imply greater lipophilicity.
Charge Net electric charge of the amino acid at physiological pH (~ 7.4). Can be + 1 (positive), − 1 (negative), or 0 (neutral).
Polarity Reflects the amino acid’s tendency to form hydrogen bonds. Polar amino acids are typically water-soluble.
Molecular Mass Approximate molecular weight of the amino acid in Daltons (g/mol), often used for structural identification.
Aromaticity Indicates whether the amino acid contains an aromatic ring (e.g., benzene). Encoded as 1 if aromatic, 0 otherwise.
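Using a few rows of Table 6, the lookup, encoding, and zero-padding described above can be sketched as follows (the dictionary is truncated to four residues for brevity; a full implementation would cover all 20):

```python
# Selected rows of Table 6: (hydrophobicity, charge, polarity,
# molecular mass in Da, aromaticity). "X" is the padding placeholder.
AA_FEATURES = {
    "A": (1.8, 0, 0, 89.1, 0),
    "K": (-3.9, 1, 0, 146.2, 0),
    "F": (2.8, 0, 0, 165.2, 1),
    "D": (-3.5, -1, 1, 133.1, 0),
    "X": (0.0, 0, 0, 0.0, 0),  # padding -> zero vector
}

def encode(seq, length=30):
    """Map a peptide to an L x 5 matrix (Eq. 8), padding short sequences
    with 'X' and truncating long ones to the fixed length."""
    padded = (seq + "X" * length)[:length]
    return [list(AA_FEATURES[aa]) for aa in padded]

matrix = encode("KAF")
```

The result is the fixed-size L × 5 matrix fed to the CNN; position 0 carries Lys features (charge +1) and padded positions are all-zero rows.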

Biological origin prediction

For each accepted synthetic peptide, we also assign an estimate of putative biological source at the family level (e.g., "mammalian cathelicidin-like", "amphibian skin host-defense peptide", "insect cecropin-like"). This operation is an interpretability layer rather than a supervisory label: we do not use it for training, tuning, or evaluation, but only to place each de novo peptide in the context of potential host biology (e.g., plausible innate immune environment, likely secretion route, protease exposure). To do this, each generated sequence s is aligned against experimentally validated peptides in APD6 using BLOSUM62-guided global alignment and high-sensitivity local matching. For every known peptide family f (cathelicidin/LL-37-like, magainin/aurein-like, cecropin-like, etc.), we identify the closest reference sequence r_f* and record classical similarity features such as percent identity and coverage. This yields a sequence-distance term

d_{\mathrm{seq}}(s, f) = 1 - \mathrm{sim}\!\left(s, r_f^{*}\right) \quad (9)

which is low when s is closely related to a known representative of family f. In parallel, each candidate peptide is embedded into a compact biochemical descriptor vector that encodes lineage-informative traits: overall length; net charge at physiological pH; Lys/Arg enrichment (cationicity); cysteine count; hydrophobicity (GRAVY); and Boman index (binding propensity). For each family f, we compute the Euclidean distance between the candidate's descriptor profile and that family's centroid in this physicochemical space, denoted d_desc(s, f). We then define a family affinity score that balances explicit sequence homology with physicochemical resemblance:

A_f(s) = -\left[\alpha \, d_{\mathrm{seq}}(s, f) + (1 - \alpha)\, d_{\mathrm{desc}}(s, f)\right] \quad (10)

with α > 0.5 to ensure that true evolutionary/functional proximity dominates over superficial physicochemical similarity. Consequently, the affinities A_f(s) for the top-k candidate families are normalized (softmax over families) to obtain p_f, and the highest-probability family assignment is reported as the peptide's putative origin. This origin label is included in downstream reporting (e.g., "mammalian cathelicidin-like", with its probability p_f), and is used to (1) deprioritize chemically implausible outliers, (2) anticipate likely biological context (e.g., amphibian skin vs. insect hemolymph vs. mammalian innate immunity), and (3) guide selection of experimental follow-up models.
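The affinity-plus-softmax assignment can be sketched as below; the trade-off weight (0.7) and the per-family distances are made-up illustrative values, not the authors' calibrated parameters:

```python
import math

def affinity(d_seq, d_desc, alpha=0.7):
    """Family affinity: negated weighted sum of sequence and descriptor
    distances; alpha > 0.5 lets sequence homology dominate (assumed value)."""
    return -(alpha * d_seq + (1 - alpha) * d_desc)

def origin_probabilities(scores):
    """Softmax over per-family affinity scores (max-shifted for stability)."""
    m = max(scores.values())
    exp = {f: math.exp(v - m) for f, v in scores.items()}
    total = sum(exp.values())
    return {f: v / total for f, v in exp.items()}

# Hypothetical distances for one candidate against three families.
scores = {
    "cathelicidin-like": affinity(0.10, 0.30),
    "magainin-like": affinity(0.45, 0.20),
    "cecropin-like": affinity(0.80, 0.60),
}
probs = origin_probabilities(scores)
best = max(probs, key=probs.get)
```

The family with the smallest weighted distance receives the highest posterior probability and is reported as the putative origin.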

Attention-enhanced CNN architecture

After numerical encoding, we employ a CNN–attention hybrid classifier to predict anti-diabetic activity from peptide sequences. The model is designed to (1) extract local sequence motifs (e.g., short residue patterns) with 1D convolutions, (2) capture long-range residue dependencies (e.g., interactions between distant charged/hydrophobic positions) using a lightweight self-attention block, and (3) fuse these learned sequence features with global biochemical descriptors (net charge, Boman index, and hydrophobicity/GRAVY). This multi-view design improves both predictive robustness and interpretability, because the attention weights provide a residue-level explanation of which sequence regions contribute most to the classification decision.

a) CNN feature extraction and pooling: Let a_{t,c} denote the activation at sequence position t in convolutional channel c after the ReLU nonlinearity. Convolutional filters act as motif detectors by scanning along the residue axis and producing position-wise features. To reduce dimensionality and introduce invariance to small local shifts, we apply max-pooling over a window W_j (anchored at position j) for each channel:

p_{j,c} = \max_{t \in W_j} a_{t,c} \quad (11)

where W_j is a pooling window of size w and stride s. Pooling yields a compressed feature map P that retains the strongest motif evidence per region. The pooled map is then flattened or globally aggregated to obtain a fixed-length CNN representation that summarizes motif-level evidence across the peptide.
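The channel-wise max-pooling of Eq. (11) can be sketched in a few lines; the toy feature map below is illustrative:

```python
def max_pool_1d(activations, window=2, stride=2):
    """Channel-wise 1D max pooling: `activations` is a list of per-position
    feature vectors (positions x channels); each output row keeps, per
    channel, the maximum over one pooling window."""
    n, channels = len(activations), len(activations[0])
    pooled = []
    for start in range(0, n - window + 1, stride):
        pooled.append([
            max(activations[t][c] for t in range(start, start + window))
            for c in range(channels)
        ])
    return pooled

# Toy feature map: 4 positions, 2 channels.
feature_map = [[0.1, 1.0], [0.4, 0.2], [0.9, 0.3], [0.2, 0.8]]
pooled = max_pool_1d(feature_map)
```

With window and stride 2, the 4 × 2 map is compressed to 2 × 2 while keeping the strongest activation per window and channel.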

  • g.

    b) Fusion with biochemical descriptors (global context): In parallel to the learned CNN features, we compute a descriptor vector d ∈ R^3 containing net charge (pH 7.4), Boman index, and GRAVY. These descriptors capture global physicochemical context that is often predictive of peptide bioactivity (e.g., cationic character and amphipathicity). We concatenate the descriptor vector with the learned sequence representation to form a combined feature vector that integrates local motif evidence with global biochemical priors.

  • h.

    c) Self-attention for long-range dependencies and interpretability. While convolutions excel at detecting local motifs, peptide activity can also depend on non-local patterns (e.g., distributed charge/hydrophobic arrangements and residue co-occurrence across distant positions). To model such dependencies, we append a lightweight self-attention block operating on the residue-level CNN features. Let H ∈ R^{L′×d} denote the sequence feature matrix after convolution/pooling (or after a suitable projection back to a residue-aligned representation). We compute query, key, and value matrices by linear projections Q = HW_Q, K = HW_K, and V = HW_V. A standard single-head self-attention is:

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\mathsf{T}}}{\sqrt{d_k}}\right) V \quad (12)

where d_k is the key dimensionality. The softmax term produces an attention matrix A, in which A_{ij} quantifies how strongly position i attends to position j. This mechanism reweights residue representations by relevance, allowing the model to emphasize discriminative regions and to combine evidence across distant positions into a context-aware representation.

Crucially, the attention matrix also provides direct interpretability: by inspecting A (or summarizing it into per-residue importance scores, e.g., by averaging attention weights across query positions and/or heads), we can visualize which residues or motifs the model focuses on when predicting anti-diabetic activity. This helps distinguish biologically meaningful decision patterns (e.g., concentration of attention on cationic/hydrophobic anchors) from diffuse or non-specific attention profiles.
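The single-head attention of Eq. (12), together with the per-residue importance summary just described, can be sketched in pure Python; using identity Q/K/V projections (Q = K = V = H) is an assumption made purely for illustration:

```python
import math

def softmax(row):
    m = max(row)
    e = [math.exp(x - m) for x in row]
    s = sum(e)
    return [v / s for v in e]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def self_attention(q, k, v):
    """Scaled dot-product attention: A = softmax(QK^T / sqrt(d_k)),
    output = A V. Returns (output, A)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]          # K^T
    scores = matmul(q, k_t)                        # Q K^T
    a = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(a, v), a

# Toy residue-level features: 3 positions, d = 2; identity projections.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, attn = self_attention(h, h, h)
# Per-residue importance: average attention each position receives
# across all query positions.
importance = [sum(attn[i][j] for i in range(len(attn))) / len(attn)
              for j in range(len(attn[0]))]
```

Each attention row is a probability distribution over positions, and the column averages give the per-residue importance scores used for interpretability.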

Classification head and prediction. The attention output is pooled (or flattened) and concatenated with the CNN representation and the biochemical descriptor vector d. The resulting fused vector is passed through a small fully connected (MLP) head to produce a final probability score for anti-diabetic activity (sigmoid output). In this way, the classifier uses:

  1. CNN features to detect local motifs,

  2. attention to model long-range residue dependencies and provide residue-level explanations, and

  3. biochemical descriptors to incorporate global physicochemical constraints into the decision.

Decision layer tuning

To enhance classification accuracy, we introduce OptimizedTPE, an improved Tree-structured Parzen Estimator that incorporates variance-aware priors, adaptive bandwidth, and memory-efficient sampling. Unlike classical TPE, it dynamically adjusts its density estimators based on variance and structural similarity, making it effective for tuning classifiers such as SVM, KNN, Decision Tree, and XGBoost. The overarching optimization goal is to minimize the average validation loss over K folds:

\lambda^{*} = \arg\min_{\lambda} \frac{1}{K} \sum_{k=1}^{K} \mathcal{L}\!\left(M_{\lambda}, D_{k}\right) \quad (13)

where λ denotes the hyperparameter configuration, M_λ is the classifier model with those parameters, and D_k is the k-th validation fold. Moreover, for SVM, the decision function is based on the kernel trick:

f(x) = \mathrm{sign}\!\left(\sum_{i=1}^{n} \alpha_i y_i K(x_i, x) + b\right) \quad (14)

The associated hyperparameter space is defined as:

\Lambda_{\mathrm{SVM}} = \{\, C,\ \gamma,\ \mathrm{kernel\_type} \,\} \quad (15)

where C is the regularization parameter, γ is the kernel width (in Radial Basis Function or RBF), and kernel_type defines the kernel function (e.g., linear, polynomial, or RBF). Additionally, in KNN, predictions are made based on majority voting among neighbors:

\hat{y} = \arg\max_{c} \sum_{x_i \in N_K(x)} \mathbb{1}\!\left(y_i = c\right) \quad (16)

The tuned parameters include:

\Lambda_{\mathrm{KNN}} = \{\, K,\ \mathrm{distance\_metric},\ \mathrm{weight\_scheme} \,\} \quad (17)

Here, K is the number of neighbors, distance_metric specifies the similarity measure (e.g., Euclidean, cosine), and weight_scheme controls vote weighting. In addition, for Decision Trees, predictions depend on reaching a terminal leaf:

\hat{y}(x) = \arg\max_{c}\ p\!\left(c \mid \mathrm{leaf}(x)\right) \quad (18)

Hyperparameters tuned include:

\Lambda_{\mathrm{DT}} = \{\, \mathrm{max\_depth},\ \mathrm{min\_samples\_leaf},\ \mathrm{criterion} \,\} \quad (19)

where criterion can be Gini impurity or entropy. XGBoost combines multiple weak learners (trees) via gradient boosting:

\hat{y}(x) = \sum_{t=1}^{T} \eta\, f_t(x) \quad (20)

And its hyperparameters are:

\Lambda_{\mathrm{XGB}} = \{\, \eta,\ T,\ \mathrm{max\_depth},\ \mathrm{subsample\_ratio} \,\} \quad (21)

With η as the learning rate, T as the number of trees, and subsample_ratio for stochastic regularization. The OptimizedTPE algorithm selects the next configuration based on Expected Improvement (EI) over a threshold quantile γ:

\mathrm{EI}_{\gamma}(\lambda) \propto \left(\gamma + \frac{g(\lambda)}{\ell(\lambda)}\,(1 - \gamma)\right)^{-1} \quad (22)

Here, γ denotes the performance quantile threshold, and ℓ(λ) represents the posterior density over the promising configurations (with g(λ) modeling the remainder). Notably, traditional hyperparameter optimization methods such as grid search, random search, and even standard Bayesian optimization often suffer from inefficiency in high-dimensional search spaces and a tendency to converge prematurely to local optima. While traditional TPE enhances Bayesian optimization by modeling promising regions, it remains exploitative and lacks gradient awareness or structural penalties. OptimizedTPE addresses these limitations through an entropy-aware, gradient-informed acquisition strategy that balances exploration and exploitation. It prioritizes sensitive hyperparameters while penalizing over-sampled or low-information regions. The expected improvement (EI) is first refined using a directional gradient:

g_t = \nabla_{\lambda}\, \mathrm{EI}(\lambda_t) \quad (23)

This leads to a guided update rule:

\lambda_{t+1} = \lambda_t + \eta_t \,\nabla_{\lambda}\, \mathrm{EI}(\lambda_t) + \epsilon_t \quad (24)

Here, η_t is an adaptive step size, and ε_t is Gaussian noise with ε_t ∼ N(0, σ_t² I), introducing stochasticity to escape shallow local optima. Furthermore, to avoid entrapment in flat regions, OptimizedTPE applies a decaying penalty to over-sampled regions using an upper confidence bound (UCB) regularizer:

\mathrm{EI}_{\mathrm{UCB}}(\lambda) = \mathrm{EI}(\lambda) - \beta_t \sqrt{\frac{\ln t}{1 + N(\lambda)}} \quad (25)

Finally, the optimization goal becomes:

\lambda^{*} = \arg\max_{\lambda}\left[\, \mathrm{EI}_{\mathrm{UCB}}(\lambda) - \Omega(\lambda) \,\right] \quad (26)

where Ω(λ) is a complexity penalty on undesirable hyperparameter configurations. Unlike conventional TPE or random search, which lack any directional guidance, OptimizedTPE maintains a probabilistic model of the improvement landscape and adapts its search via information-theoretic feedback. Convergence is governed by monitoring the entropy reduction of the posterior distribution:

\Delta H_t = H(p_{t-1}) - H(p_t) < \varepsilon \quad (27)

Stagnation detection is mitigated via variance-preserving resampling:

\lambda_{\mathrm{new}} \sim \mathcal{N}\!\left(\lambda_{\mathrm{best}},\ \max(\sigma_t, \sigma_{\min})^{2} I\right) \quad (28)

This ensures that the algorithm maintains diversity in the proposal distribution, especially in high-performing regions.
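The overall selection loop can be sketched as a classical TPE-style acquisition with noise injection; this is an illustrative 1-D approximation under assumed settings (Gaussian kernel bandwidth, quantile, noise scale), not the authors' full OptimizedTPE:

```python
import math
import random

def tpe_select(history, candidates, gamma=0.25, sigma=0.1, rng=random):
    """TPE-style acquisition sketch: split observed (config, loss) pairs at
    the gamma-quantile of loss, score candidates by the density ratio
    l(x)/g(x) using 1-D Gaussian kernel density estimates, then perturb the
    winner with Gaussian noise (the noise-injection step of Eq. 24)."""
    history = sorted(history, key=lambda p: p[1])      # best (lowest loss) first
    cut = max(1, int(gamma * len(history)))
    good = [x for x, _ in history[:cut]]               # modeled by l(x)
    bad = [x for x, _ in history[cut:]] or good        # modeled by g(x)

    def kde(x, pts, bw=0.5):
        return sum(math.exp(-((x - p) / bw) ** 2 / 2) for p in pts) / len(pts)

    best = max(candidates, key=lambda x: kde(x, good) / (kde(x, bad) + 1e-12))
    return best + rng.gauss(0.0, sigma)                # stochastic escape

# Hypothetical 1-D search: losses are lowest near x = 0.5.
history = [(0.1, 0.9), (0.5, 0.2), (0.55, 0.25), (0.9, 0.8)]
choice = tpe_select(history, candidates=[0.1, 0.5, 0.9],
                    rng=random.Random(1))
```

The density ratio steers the proposal toward the low-loss region around 0.5, while the injected noise keeps the proposal distribution diverse.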


Algorithm 1: The OptimizedTPE procedure for unified and adaptive hyperparameter tuning across four machine learning classifiers.

By introducing both gradient-informed updates and entropy-aware noise injection, OptimizedTPE exhibits superior capacity to explore multimodal spaces, effectively reducing the risk of being trapped in local minima and ensuring better global convergence behavior.

As shown in Algorithm 1, OptimizedTPE enhances classical TPE by integrating gradient-based exploration, entropy-preserving noise, and UCB penalization. This hybrid approach improves escape from local optima and mitigates premature convergence. Its design supports joint hyperparameter tuning of diverse classifiers (SVM, KNN, Decision Tree, XGBoost), ensuring stable and adaptable optimization across different models. By dynamically modeling the posterior distribution ℓ(λ), the algorithm effectively exploits high-performing configurations, while the injected noise and UCB regularization encourage continued exploration in under-sampled or uncertain regions of the space. Overall, OptimizedTPE provides a principled and resource-efficient framework for deriving robust, globally informed hyperparameter configurations in complex machine learning pipelines.

Results

The training set comprised experimentally validated peptides, while the weak-label set of synthetic Train-only candidates was added following physicochemical filtering and APD6-based heuristic screening (augmented data were not used for validation/test). Our model used a CNN-based encoder with a descriptor fusion layer (e.g., net charge, hydrophobicity/GRAVY, Boman index), and we tuned classifier heads using OptimizedTPE. Datasets were partitioned with homology-aware, cluster-wise splitting, while maintaining the fixed 2:1 negative-to-positive ratio across the a priori-defined splits. Standard classification evaluation was performed using commonly reported metrics (e.g., accuracy, F1-score), together with comparative analyses and selected error cases to characterize behavior under the aforementioned settings. Implementation details, partition definitions, and scripts are listed in the Supplementary File and on GitHub46.

Implementation details

We benchmarked the described system with a prototypical multi-stage pipeline: Preprocessing → Numerical Encoding → Deep Feature Extraction → Hyperparameter Search using OptimizedTPE. The training set is composed of experimentally validated ADPs (primary positives from PEP-Lab) and manually curated negatives at a fixed 2:1 ratio (negatives with no reported anti-diabetic/DPP-IV activity). Train-only augmentation was applied to improve data efficiency and avoid evaluation contamination: candidates were synthetically generated (VAE sampling, motif edits, etc.) and kept only after physicochemical gating (net charge, GRAVY, Boman index) and APD6-based heuristic screening; augmented training samples are never used in the Validation or any Test split. Homology-aware, cluster-wise splitting (e.g., CD-HIT at 70% identity) grouped whole clusters and assigned them to a single split (Train/Validation/Internal-Test). We report final claims only on the independent, unseen experimental test set (Tables 1, 2 and 3 for an overview and split details).

Peptide sequences were first truncated/padded to fixed length L and converted to residue-level vectors (one-hot or learned embeddings), then concatenated with global biochemical descriptors (net charge, GRAVY, Boman index). A shallow 1D-CNN (2–3 convolutional blocks with ReLU, max/average pooling) was used for local motif learning; a self-attention layer (optional) served to aggregate context information. The pooled CNN/attention features were concatenated with the descriptor vector to form a hybrid representation, followed by a small fully connected head.

We further benchmarked SVM, KNN, Decision Tree, and XGBoost on top of the hybrid representation. Each head was first trained with default settings to provide a reference baseline, then tuned using OptimizedTPE within the predefined search spaces under five-fold cross-validation on the training split (fixed budget). The search favors configurations that improve validation performance and mitigate overfitting (early stopping/regularization as applicable). The fixed 2:1 class imbalance is handled using class weights; augmented (screened) samples in Train are treated as weak-label with lower per-sample weight. Model selection is based on the training CV/Validation as appropriate; the final evaluation is always conducted on the independent, unseen experimental test set.
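The weighting policy can be sketched as below; halving weak-label positives (weight 0.5) follows the text, while treating the 2:1 ratio as a positive-class weight, along with the function and source labels, is an illustrative assumption:

```python
def sample_weights(labels, sources, neg_pos_ratio=2.0, weak_weight=0.5):
    """Per-sample training weights: positives are up-weighted to offset the
    fixed 2:1 negative:positive imbalance, and screened synthetic positives
    (source 'weak') carry half the weight of experimentally validated ones."""
    weights = []
    for y, src in zip(labels, sources):
        w = neg_pos_ratio if y == 1 else 1.0   # class weighting
        if y == 1 and src == "weak":
            w *= weak_weight                    # weak-label confidence weighting
        weights.append(w)
    return weights

# One experimental positive, one weak-label positive, two curated negatives.
w = sample_weights([1, 1, 0, 0],
                   ["experimental", "weak", "curated", "curated"])
```

Such a weight vector would be passed to the loss (or to `sample_weight` in a scikit-learn-style `fit`) so that augmented positives contribute less than validated ones.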

We report standard classification metrics commonly used in peptide prediction (accuracy, precision, recall, F1-score, etc.), and include the implementation details and search spaces in Tables 8 and 9 (global configuration, software version, hyperparameter ranges/conditioned hyperparameters).

Table 8.

Summarizes the system architecture, feature representation, optimization strategy, and evaluation metrics used for ADP classification.

Parameter Value/range Description
Optimization algorithm OptimizedTPE Tree-structured Parzen Estimator (budgeted search) for hyperparameter tuning of heads/backbone.
Cross-validation (training only) 5-fold CV on Train (cluster-wise, stratified) Folds built on CD-HIT@70% clusters; used for model selection/tuning, not for final reporting.
Repeats/seeds Repeat on train-CV only (fixed seeds) Improves selection stability; no “independent test runs” on the Test set.
Data partitioning Cluster-wise train/validation/internal-test + independent unseen test Entire clusters assigned to a single split; final Test = independent experimental cohort (see Table 3).
Class balance 2:1 (neg: pos) across splits Enforced in Train/Val/Internal-Test and Unseen Test; class weights used during training.
Augmentation policy Train-only; APD6-screened Generated data pass physicochemical gates + APD6 as a screen; never used in Validation/Test.
Implementation CPU environment GPU details, where applicable, are given in the Supplement.
Input representation Sequence (L × C_in) + global descriptors Fixed-length padded/trimmed sequences; descriptors (charge, GRAVY, Boman) appended after pooling.
Biochemical features Net charge, Boman index, GRAVY Computed with a single consistent implementation across the corpus.
CNN configuration 2 conv blocks (e.g., 16–32 filters), ReLU, Max/Avg Pool, Dropout Local motif extraction and feature compression; hyperparameters tuned via OptimizedTPE.
Context module (optional) Lightweight self-attention (1–2 heads) Captures longer-range dependencies; concatenated with CNN features and descriptors.
Classifier heads SVM, KNN, Decision Tree, XGBoost Trained on the hybrid representation; search spaces summarized in Table 9.
Imbalance handling Class weights; weak-label weights 2:1 class ratio handled by class weighting; Train-only augmented samples are confidence-weighted.

Table 9.

Detailed summary of the hyperparameters tuned for each classifier using the OptimizedTPE algorithm under a unified cross-validation framework.

Classifier Hyperparameters Tuned Description
SVM C, γ, kernel_type C: Regularization strength; γ: RBF kernel width; kernel: {linear, RBF, poly}
KNN K, distance_metric, weight_scheme K: Number of neighbors; distance: {Euclidean, cosine}; weight: {uniform, distance-based}
Decision Tree max_depth, min_samples_leaf, criterion max_depth: Tree depth; criterion: {Gini, Entropy}; min_samples_leaf: Pruning control
XGBoost η, T, max_depth, subsample_ratio η: Learning rate; T: Number of trees; subsample_ratio: Row sampling rate

Table 8 summarizes the global system configuration, while Table 9 lists the hyperparameters tuned for each classifier with OptimizedTPE: SVM (C, γ, kernel type), KNN (number of neighbors, distance metric, weighting scheme), Decision Tree (depth, minimum leaf size, split criterion), and XGBoost (learning rate, number of trees, depth, subsample ratio).

Ablation study

To contextualize the proposed architecture against both internal design choices and standard transfer-learning approaches, Table 10 summarizes performance on the independent external unseen test set. The upper portion of the table reports selected ablation variants of the proposed model to highlight the contribution of individual architectural components, while the lower portion presents baselines based on pre-trained pLM embeddings combined with lightweight classifiers. For the pLM baselines, sequence representations were obtained using a frozen pre-trained model, followed by mean pooling or mean–variance pooling (with padding tokens excluded), and classification via a shallow multilayer perceptron. This setup reflects a widely adopted and computationally efficient transfer-learning protocol. Across all settings, evaluation is performed on the same external unseen set, ensuring a fair comparison. The results indicate that while pLM-based representations achieve strong performance, the proposed CNN–attention model remains competitive, supporting the value of task-specific inductive biases in addition to general-purpose sequence representations.

Table 10.

Performance comparison on the independent external unseen test set, including ablations and protein Language model (pLM)–based baselines.

Category Model/setting Input features Pooling Classifier/head External accuracy (%)
Ablation CNN (no attention) Sequence + biochemical descriptors CNN head 97.8
Ablation Attention only (no CNN) Sequence + biochemical descriptors Attention head 96.9
Ablation CNN + Attention (no biochemical features) Sequence only End-to-end 97.5
Baseline (DL) MLP on handcrafted features Biochemical descriptors only MLP 92.3
Baseline (pLM) ESM2 + MLP ESM2 embeddings Mean MLP 96.4
Baseline (pLM) ESM2 + MLP ESM2 embeddings Mean + Variance MLP 97.2
Proposed CNN + Attention (full model) Sequence + biochemical descriptors End-to-end CNN + attention 99.1

To quantify the contribution of each design choice, and to separately assess both architectural components and data-centric choices, we performed a multistage ablation study on the cluster-wise internal hold-out (fixed 2:1 negative: positive ratio). The full configuration—CNN backbone with a lightweight self-attention context block, biochemical descriptor fusion (net charge, Boman index, hydrophobicity/GRAVY), and OptimizedTPE-tuned classifier heads—achieved 98.66% accuracy and 98.61% F1-score. As summarized in Table 11, removing the biochemical descriptors (“CNN only”) reduces accuracy to 94.50%, demonstrating that global physicochemical cues provide complementary information beyond local motif patterns learned from sequence. Conversely, using descriptors alone (“Biochemical features only”) further decreases accuracy to 91.15%, confirming that residue-level motif learning is essential. Disabling hyperparameter tuning (“Without hyperparameter tuning”) lowers accuracy to 94.42%, indicating that OptimizedTPE contributes measurable gains over default classifier settings. Replacing the convolutional backbone with a shallow dense network (“Dense network”) results in 92.33% accuracy, supporting the importance of convolution for capturing motif-like spatial patterns. Removing feature normalization produces the largest degradation among training/protocol changes (90.84%), highlighting the necessity of stable scaling when fusing heterogeneous feature types. Importantly, we additionally include an experimental-only training condition (no synthetic augmentation) under otherwise identical settings (placed immediately before the full configuration), which isolates the net effect of the augmentation procedure while keeping the architecture, feature set, and evaluation protocol fixed. The observed performance drop relative to the full configuration indicates that the augmentation strategy contributes to generalization beyond architectural choices alone.

Table 11.

Multistage ablation analysis of the CNN–self-attention peptide classifier on the cluster-wise internal hold-out.

Scenario Description Accuracy (%) F1-Score (%) Precision (%) Recall / Sensitivity (%) Specificity (%) Interpretation
CNN only (no biochemical features) Removed net charge, Boman index, and hydrophobicity 94.50 94.20 95.00 93.10 96.06 Handcrafted descriptors add complementary global signal beyond motif features.
Biochemical features only (no CNN) Only global numerical descriptors used 91.15 91.07 92.37 90.11 92.56 Without sequence-level motif learning, performance weakens.
Without hyperparameter tuning Classifiers with default settings 94.42 95.17 95.89 94.28 96.11 OptimizedTPE contributes measurable gains over defaults.
Dense network (no convolution) CNN replaced by a shallow dense network 92.33 91.93 93.18 90.85 93.67 Dense-only underfits motif/spatial patterns captured by convolution.
Without feature normalization Raw features provided without scaling 90.84 90.44 91.56 89.68 92.17 Normalization is needed for stable training and fused-feature learning.
Experimental-only training (no augmentation) Model trained using only experimentally validated ADPs and curated negatives; no generated or filtered synthetic peptides included in training 96.12 96.05 96.44 95.32 97.01 Removing augmentation leads to a consistent drop in performance, indicating that the augmentation strategy contributes additional generalization beyond architectural and feature-level effects.
No attention (CNN + descriptors + tuned heads) Self-attention removed; CNN + descriptor fusion + tuned heads unchanged 97.85 97.59 98.30 96.90 99.20 Isolates incremental value of attention for long-range dependencies + interpretability
Full configuration (proposed) CNN + attention + biochemical features + tuned classifiers 98.66 98.61 98.62 97.64 99.56 Best performance with integrated sequence + domain features.

When the self-attention block is removed while keeping the CNN backbone, descriptor fusion, and tuned heads unchanged (“No attention”), performance decreases to 97.85% accuracy and 97.59% F1-score, relative to the full model. This shows that attention provides an incremental but consistent benefit by modeling long-range residue dependencies (i.e., distributed cationic–hydrophobic patterns that are not captured by local convolutions alone). Beyond performance, the attention matrix enables residue-level interpretability: attention weights can be summarized into per-position importance scores to highlight the sequence regions most responsible for the ADP/non-ADP decision, improving the transparency of the classifier.

Quantitative analytical assessment

This section quantifies screening and retention for synthetic candidates generated by guided mutation, motif recombination, and a hybrid VAE-based approach. Generation is used strictly for Train-only augmentation; no synthetic sequence enters Validation or any Test. Candidates were filtered by (1) novelty (global identity < 70% to any training positive and minimum edit-distance ≥ 3–5), (2) physicochemical descriptor bounds (net charge, GRAVY, Boman index) computed consistently, and (3) APD6 calculators/prediction tools used as a heuristic screen (not experimental validation). Sequence-similarity and provenance summaries are reported distributionally; classification metrics are not reported for synthetic sequences and do not contribute to claimed performance.

From 835 generated candidates (all methods combined), 412 passed all gates and were admitted as Train-only, weak-label positives (confidence-weighted). Method-level counts and pass rates are provided in Supplementary Table 12. The novelty rule (identity < 70% vs. Train positives) ensures that augmentation does not collapse onto seeds or leak homologs across splits.
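The gating logic above can be sketched in a few lines. The identity proxy, Levenshtein edit distance, and descriptor bounds below are simplified illustrations; the study's exact thresholds and APD-based calculators are not reproduced here, and the default ranges in `passes_gates` are placeholder assumptions.

```python
# Illustrative sketch of the Train-only augmentation gates: novelty,
# minimum edit distance, and physicochemical descriptor bounds.

KD = {  # Kyte-Doolittle hydropathy values, used for GRAVY
    'A': 1.8, 'R': -4.5, 'N': -3.5, 'D': -3.5, 'C': 2.5, 'Q': -3.5,
    'E': -3.5, 'G': -0.4, 'H': -3.2, 'I': 4.5, 'L': 3.8, 'K': -3.9,
    'M': 1.9, 'F': 2.8, 'P': -1.6, 'S': -0.8, 'T': -0.7, 'W': -0.9,
    'Y': -1.3, 'V': 4.2,
}

def gravy(seq: str) -> float:
    return sum(KD[a] for a in seq) / len(seq)

def net_charge(seq: str) -> int:
    # crude integer charge near neutral pH: K/R positive, D/E negative
    return sum(seq.count(a) for a in 'KR') - sum(seq.count(a) for a in 'DE')

def edit_distance(a: str, b: str) -> int:
    # standard Levenshtein dynamic program, row by row
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def identity(a: str, b: str) -> float:
    # simple identity proxy: 1 - normalized edit distance
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def passes_gates(cand, train_positives, max_id=0.70, min_edit=3,
                 charge_rng=(0, 8), gravy_rng=(-1.0, 1.0)):
    # all bound values are illustrative placeholders
    if any(identity(cand, p) >= max_id for p in train_positives):
        return False                       # novelty gate
    if any(edit_distance(cand, p) < min_edit for p in train_positives):
        return False                       # minimum edit-distance gate
    if not (charge_rng[0] <= net_charge(cand) <= charge_rng[1]):
        return False                       # charge bound
    return gravy_rng[0] <= gravy(cand) <= gravy_rng[1]
```

In the actual pipeline, a candidate passing these gates would additionally need to clear the APD6-style heuristic screen before being admitted as a weak-label positive.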

Of the generation strategies, the VAE-based hybrid had the highest pass and retention rates when applying the novelty and physicochemical constraints, reflecting a better plausibility–diversity trade-off. Filtering candidates before training results in fewer spurious positives and a cleaner learning signal.

Table 12 provides a stratified view of the independent unseen experimental test set, breaking results down by sequence length bins, net-charge/hydrophobicity (GRAVY) bands, mechanistic subset (e.g., DPP-IV vs. GLP-1R), and taxonomic provenance. Two non-obvious patterns emerge. First, recall softens for very short peptides (≤ 12 aa) and for highly hydrophobic, strongly basic sequences (GRAVY↑, charge ≥ + 4), suggesting the model’s motif filters are conservative at the extremes; however, probability calibration transfers well from Validation (minimal threshold drift), so precision remains stable after operating-point adjustment. Second, provenance stratification shows comparable precision across major taxa, with most disagreements concentrated in borderline DPP-IV–like motifs where heads disagree (XGBoost vs. SVM) rather than in any specific lineage—consistent with a motif-level ambiguity rather than dataset bias. Attention maps (summarized in the table) align with enriched N-terminal/basic clusters in true positives, while false positives more often lack those contiguous motif cues despite similar descriptor profiles. Together, these slices indicate that residual errors are feature-regime specific (length/chemistry) rather than source- or taxon-driven, and that post-hoc calibration effectively stabilizes precision–recall without re-tuning on the test cohort.

Table 12.

Multidimensional screening/retention of Train-only synthetic candidates by generation strategy; heuristic APD6 checks are not experimental validation.

Sequence type Generation method Avg. similarity to APD (%) Biological validation (APD Calc) Boman index range (kcal/mol) Classification result Final status
Guided mutation Targeted substitution 80.2 ± 4.5 Validated 1.21–1.53 Accuracy & F1: 96–100% Accepted
Motif recombination Motif reshuffling 75.6 ± 3.2 Validated 1.18–1.44 Accuracy: 94–98%; F1: 92–98% Accepted
Hybrid model VAE + filtering 82.5 ± 3.7 Validated 1.30–1.48 Accuracy & F1: 98–100% Accepted
Low similarity sequences All methods < 70% Rejected Variable Not evaluated Removed

Error analysis

Figure 3 reports confusion matrices for KNN, SVM, XGBoost, and Decision Tree on the cluster-wise internal hold-out (2:1 Non: ADP; n = 141). Decision Tree achieved the best accuracy (98.58%) with 1 FP and 1 FN (46 TP, 93 TN). KNN and XGBoost both reached 97.87% (KNN: 1 FP, 2 FN; XGBoost: 2 FP, 1 FN). SVM obtained 97.16% with 2 FP and 2 FN. Errors are sparse and concentrated at the ADP/Non boundary: a few ADPs with near-neutral charge/hydrophobicity were classified as Non (FN), and a small number of Non peptides with ADP-like descriptors triggered FP calls. These patterns are consistent with class imbalance (2:1) and overlapping biophysical profiles, and they motivate stronger boundary regularization and descriptor-aware calibration in future iterations.

Fig. 3.

Fig. 3

Confusion matrices and per-model accuracy on the internal, cluster-wise hold-out set (n = 141; 47 ADP / 94 Non). Decision Tree achieved 98.58% accuracy, KNN and XGBoost 97.87%, and SVM 97.16%, illustrating similar overall performance with small, model-specific error patterns.

Across all four models the diagonal dominates—136–139 of 141 samples are correct—yet the patterns reveal where to push. The Decision Tree is best (46 TP, 93 TN; 1 FP/1 FN; 98.58%), while KNN and XGBoost tie just behind (97.87%: KNN 1 FP/2 FN, XGBoost 2 FP/1 FN); SVM trails slightly (97.16%, 2 FP/2 FN). Per-class performance shows a consistent asymmetry: specificity is high (Non = 97.9–98.9%), but ADP recall is the bottleneck (ADP = 95.7–97.9%). In practice, that means the rare errors are more often missed ADPs (FN) than spurious positives. Two concrete levers follow: (1) threshold/weight calibration toward recall (e.g., class-weighted loss or post-hoc operating-point shift) to shave off an FN or two without materially hurting precision; and (2) error-driven hard-example mining, folding the handful of borderline ADPs/Non into augmentation to sharpen the boundary. Given how similar the mistake profiles are, a light vote/stacked ensemble of Tree + KNN/XGB is also promising: it pairs the Tree’s high specificity with KNN/XGB’s complementary FN/FP patterns, likely nudging accuracy and, more importantly, ADP recall upward on the same test distribution.
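The first lever, a recall-oriented operating-point shift, can be sketched as follows. The precision floor and the example scores are illustrative assumptions, not values from the study.

```python
# Sketch of a post-hoc operating-point shift: lower the decision threshold
# to recover false negatives while enforcing a precision floor.

def precision_recall(scores, labels, thr):
    tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < thr and y == 1)
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return prec, rec

def recall_oriented_threshold(scores, labels, precision_floor=0.90):
    # scan candidate thresholds and keep the one maximizing recall
    # subject to the precision floor
    best_thr, best_rec = 0.5, -1.0
    for thr in sorted(set(scores)):
        prec, rec = precision_recall(scores, labels, thr)
        if prec >= precision_floor and rec > best_rec:
            best_thr, best_rec = thr, rec
    return best_thr
```

On the internal hold-out, such a shift would be fit on Validation scores only, then applied unchanged to the test cohort.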

Interpretability and feature robustness

To further assess whether the model’s predictions could be explained by simple physicochemical shortcuts, we compared the distributions of key sequence-level properties between positive and negative peptides. Figure 4 shows the empirical cumulative distribution functions (ECDFs) of sequence length, net charge (approximate pH ~ 7), and hydrophobicity (GRAVY) for the external unseen set. While systematic differences in central tendency are observed—most notably a higher average net charge among bioactive peptides—there is substantial overlap across all three properties. Importantly, no single feature exhibits a sharp threshold that cleanly separates the two classes, indicating that classification cannot be trivially reduced to length, charge, or hydrophobicity alone.

Fig. 4.

Fig. 4

Physicochemical distributions (length, net charge, and hydrophobicity) of positive and negative peptides.
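The ECDF comparison underlying Fig. 4 can be reproduced with a short sketch; the helper names and the KS-style maximum-gap summary are our own illustrative additions, not the paper's exact analysis code.

```python
# Minimal ECDF utilities for comparing a descriptor (e.g., net charge)
# between positive and negative peptides, as in Fig. 4.

def ecdf(values):
    # empirical CDF: sorted values and their cumulative fractions
    xs = sorted(values)
    n = len(xs)
    return xs, [(i + 1) / n for i in range(n)]

def max_ecdf_gap(a, b):
    # largest vertical distance between two ECDFs (a KS-style overlap check)
    F = lambda v, x: sum(1 for t in v if t <= x) / len(v)
    return max(abs(F(a, x) - F(b, x)) for x in sorted(set(a) | set(b)))
```

A gap near 1.0 would indicate a clean single-feature threshold between the classes; the substantial overlap reported above corresponds to gaps well below that.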

Consistent with the univariate analyses, the joint distribution of net charge and hydrophobicity further highlights the absence of simple low-dimensional decision boundaries. As shown in Fig. 5, positive and negative peptides are strongly intermingled in the charge–GRAVY space, with no clear linear or axis-aligned separation. This extensive overlap suggests that the model is not acting as a simple detector of cationic or amphipathic peptides. Instead, the observed performance likely reflects the integration of higher-order sequence patterns captured by the proposed architecture, supporting the robustness of the learned representations beyond basic physicochemical heuristics.

Fig. 5.

Fig. 5

Joint distribution of net charge and hydrophobicity for positive and negative peptides.

Moreover, to rigorously quantify the residue-level decision boundaries of the deep learning backbone, we performed a systematic in silico alanine scanning analysis, measuring the mean drop in prediction confidence (ΔScore) upon mutation of specific amino acid types. As visualized in Fig. 6, the model demonstrates a pronounced and selective sensitivity to positively charged residues.

Fig. 6.

Fig. 6

Robustness of residue-level interpretability across independent evaluation scenarios.

Figure 6 displays the drop in prediction confidence (ΔScore) upon mutating residues across three distinct evaluation scenarios (A–C). The consistent sensitivity to cationic (Lys, Arg) and hydrophobic (Trp, Leu) residues confirms that the model prioritizes these specific, stable functional motifs regardless of the data partition. Specifically, the mutation of Lysine (K) and Arginine (R) to Alanine results in a substantial confidence penalty, with average ΔScore values consistently exceeding 0.20 across high-confidence candidates (Scenario A). This behavior quantitatively confirms that the network has internalized the requirement for cationicity—a critical physicochemical determinant for peptide–membrane electrostatic interactions and receptor binding affinity in GLP-1 analogues—identifying it as a primary driver of bioactivity rather than a background correlation.

Complementing the electrostatic signal, Fig. 6 reveals a distinct structural selectivity within the hydrophobic group. Unlike a generalized preference for all non-polar residues, the model exhibits high-magnitude sensitivity specifically towards Tryptophan (W, ΔScore ≈ 0.23) and Leucine (L, ΔScore ≈ 0.20), while remaining relatively insensitive to aliphatic residues such as Valine (V) or Alanine (A) (ΔScore < 0.05). This pattern aligns remarkably well with the established pharmacophore of DPP-IV inhibitory peptides, where N-terminal Tryptophan and Leucine anchors are known to be essential for occupying the enzyme’s hydrophobic S1/S2 pockets. The stark contrast between the high impact of functional anchors (W/L) and the low impact of structural spacers (G, P) indicates that the attention mechanism is effectively masking sequence noise and focusing on bioactive motifs.
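The alanine-scanning procedure behind Fig. 6 can be sketched as follows. Here `score` is a hypothetical stand-in for the trained classifier's predicted ADP probability; in practice it would wrap the CNN–attention model's output.

```python
# Sketch of an in silico alanine scan: mutate each non-Ala position to Ala
# and record the drop in model confidence (ΔScore).

from collections import defaultdict

def alanine_scan(seq, score):
    base = score(seq)
    deltas = {}
    for i, aa in enumerate(seq):
        if aa == 'A':
            continue  # already alanine; nothing to mutate
        mutant = seq[:i] + 'A' + seq[i + 1:]
        deltas[(i, aa)] = base - score(mutant)   # ΔScore at position i
    return deltas

def mean_delta_by_residue(deltas):
    # average ΔScore per amino-acid type, as plotted in Fig. 6
    sums, counts = defaultdict(float), defaultdict(int)
    for (_, aa), d in deltas.items():
        sums[aa] += d
        counts[aa] += 1
    return {aa: sums[aa] / counts[aa] for aa in sums}
```

Aggregating `mean_delta_by_residue` over a peptide panel yields the per-residue sensitivity profile summarized in the figure.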

Crucially, the multi-panel comparison in Fig. 6 demonstrates the stability of these learned features across independent evaluation contexts. We replicated the sensitivity analysis across a high-confidence subset (Scenario A) and two non-overlapping random partitions of the test set (Scenarios B and C). Despite the stochastic diversity of sequences in each fold, the specific dependency on the cationic-amphipathic signature (K, R, W, L) remains invariant, with the critical feature importance consistently surpassing the significance threshold of 0.15. The reproducibility of these peaks across Scenarios B and C confirms that the interpretability results are not artifacts of a specific data subset, but rather represent global, robust structure-activity rules learned by the model, validating its generalization capability beyond mere statistical pattern matching.

Crucially, the model’s pronounced reliance on tryptophan (Trp) and cationic residues aligns with established structure-activity relationships (SAR) for metabolic peptides. N-terminal Trp and hydrophobic anchors are recognized as potent pharmacophores for DPP-IV inhibition, facilitating the necessary occupancy of the enzyme’s hydrophobic pockets47. Similarly, cationic residues (Lys/Arg) play a pivotal role in the electrostatic stability and receptor affinity of GLP-1 analogues, as demonstrated in recent high-resolution structural studies48, and are fundamental to the bioactivity of various membrane-interacting peptides49. This consistency confirms that the deep learning backbone has autonomously learned to prioritize biologically valid functional motifs.

Molecular docking validation

To assess the target-level plausibility of the generated peptide candidates in the absence of experimental wet-lab validation, we conducted a structure-based in silico docking screen against the human glucagon-like peptide-1 receptor (GLP-1R; PDB ID: 6X18, chain R). GLP-1R represents a biologically relevant target for anti-diabetic peptide activity, and docking-based interaction analysis provides a practical and widely used proxy for preliminary evaluation of receptor engagement. A representative subset of de novo generated peptides was evaluated under a fixed and consistent docking protocol, and the resulting physicochemical characteristics and interaction metrics are summarized in Table 13.

Table 13.

Physicochemical profiling and molecular docking energetics for a set of candidate peptides randomly selected from the de novo generated sequences and evaluated against the human GLP-1 receptor.

Peptide ID Amino Acid Sequence MW (Da) Net charge (pH 7.4) Binding affinity (ITScore) Predicted binding site Hydrogen bonds
ADP-2 GVRIDWLKGAAKTVAAELLRKAHCKLTNSC 3253.87 + 3.82 − 232.10 Orthosteric Site 6
ADP-6 ALWKDILKNVGKAAGIAVLNTVTDMVNQ 2983.52 + 0.99 − 241.11 Orthosteric Pocket 7
ADP-8 GVIIDTLKGAAKTVAAELLRKNHCKLTNSC 3168.76 + 2.82 − 253.57 Orthosteric (TM) 8
ADP-12 GVIIDTLKGAAKTVAAELLRKAHAKLTNSC 3093.68 + 2.93 − 218.45 Surface/Entry 5
ADP-13 GVIIDTLKGAAKTVNAELLRKAHCKRTNSC 3211.79 + 3.82 − 347.32 Orthosteric (TM Core) 12

Table 13 reports molecular weight, estimated net charge at physiological pH, docking score (ITScore; lower values indicate more favorable predicted interactions), predicted docking region, and the number of predicted hydrogen bonds derived from the top-ranked pose for each peptide. Docking scores are used here strictly as relative ranking metrics under identical conditions and do not represent experimental binding affinities. Across the screened candidates, predicted poses recurrently localized within the orthosteric or entry regions of GLP-1R, supporting the use of docking as an in silico screening and prioritization step.
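Because ITScore is used purely for relative ranking under identical conditions, prioritization reduces to a sort. The sketch below reranks the Table 13 entries (values copied from the table as illustrative data).

```python
# Rank candidates by docking score: lower ITScore = more favorable pose.
# Tuples are (peptide ID, ITScore, predicted hydrogen bonds) from Table 13.

candidates = [
    ("ADP-2", -232.10, 6),
    ("ADP-6", -241.11, 7),
    ("ADP-8", -253.57, 8),
    ("ADP-12", -218.45, 5),
    ("ADP-13", -347.32, 12),
]

# ascending sort puts the most favorable predicted pose first
ranked = sorted(candidates, key=lambda c: c[1])
lead = ranked[0][0]                       # lead candidate by ITScore
gap = ranked[1][1] - ranked[0][1]         # margin to the runner-up
```

Running this reproduces the ordering discussed below: ADP-13 leads, with a margin of roughly 94 score units over ADP-8.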

Figure 7 provides a structural overview of the docking setup and representative results. Figure 7a shows the predicted three-dimensional conformation of the lead candidate ADP-13 (GVIIDTLKGAAKTVNAELLRKAHCKRTNSC), while Fig. 7b highlights the GLP-1R orthosteric pocket used for docking. Visual inspection of the predicted complexes indicates consistent placement of peptide backbones within the transmembrane binding environment, with side-chain orientations compatible with polar and electrostatic contacts in the receptor cavity.

Fig. 7.

Fig. 7

Structural characterization of the lead anti-diabetic peptide and its target receptor: (a) Predicted three-dimensional conformation of ADP-13 (GVIIDTLKGAAKTVNAELLRKAHCKRTNSC) and (b) the orthosteric binding pocket of the human GLP-1 receptor (PDB ID: 6X18, chain R).

Among the screened candidates, ADP-13 exhibited the most favorable docking score (− 347.32) under the applied protocol, compared to secondary candidates such as ADP-8 (− 253.57), corresponding to an improvement of approximately 94 score units (Table 13). We interpret this difference as indicative of a stronger predicted interaction propensity rather than definitive evidence of binding or receptor activation. Analysis of the predicted pose suggests plausible contributors to the improved score, including the presence of an additional polar residue (Asn15), which is consistent with increased hydrogen-bonding opportunities, and a C-terminal arginine that may enhance electrostatic complementarity with negatively charged regions of the receptor.

In sum, these docking results provide preliminary, target-specific in silico evidence that the proposed deep learning framework can generate peptide sequences with favorable predicted interaction profiles against a biologically relevant receptor. While docking alone cannot confirm functional activity, the reported analyses support the prioritization of selected candidates—such as ADP-13—for subsequent computational refinement and experimental validation, and are consistent with the framing of this study as an in-silico screening and candidate prioritization effort.

Figure 8 shows the predicted docking pose of the generated peptide ADP-13 (sequence: GVIIDTLKGAAKTVNAELLRKAHCKRTNSC; shown in yellow) in complex with the human glucagon-like peptide-1 receptor (GLP-1R; PDB ID: 6X18, chain R; shown in brown), illustrating the placement of ADP-13 within the receptor binding environment as obtained from structure-based in silico docking. This complex represents one illustrative example selected from multiple generated peptide candidates evaluated against the same receptor and is intended to provide qualitative structural insight supporting the reported docking scores.

Fig. 8.

Fig. 8

Docked complex of the lead peptide candidate ADP-13 with the human GLP-1 receptor.

Discussion

Our three-stage framework (peptide generation → biochemical validation → classification) is a promising approach for fast discovery of anti-diabetic peptides (ADPs), effectively bridging the gap between de novo design and biological reality.

Performance repeatability

Repeatability was evaluated on the cluster-wise internal hold-out (n = 141; 47 ADP / 94 Non) using each classifier (KNN, SVM, Decision Tree, XGBoost) with 10 random seeds. Accuracy, sensitivity (recall), specificity, and F1-score are reported per run (Table 14). Test accuracy and F1 were consistently in the high-90s for all models across seeds, with relatively low run-to-run variance. Decision Tree had the highest single-run accuracy on this split most of the time, and KNN and XGBoost had the narrowest distributions across seeds (more stable). SVM consistently had the best precision, but sometimes sacrificed recall on ADPs, reflecting the errors shown in Fig. 3. Notably, OptimizedTPE hyperparameter search improved performance from default parameters (see ablation), increasing median F1/accuracy and decreasing seed sensitivity. Overall, these results suggest the pipeline is reproducible with repeated training/evaluation and robust to random initialization and data ordering.

Table 14.

Per-run metrics for the four classifiers over 10 independent runs on the cluster-wise internal hold-out (n = 141).

Classifier Metric Run1 (%) Run2 (%) Run3 (%) Run4 (%) Run5 (%) Run6 (%) Run7 (%) Run8 (%) Run9 (%) Run10 (%)
KNN F1-Score 97.89 96.58 97.88 97.85 96.99 96.92 97.86 96.96 97.91 96.90
Precision 99.29 98.58 99.29 99.29 98.58 98.65 99.36 98.51 99.22 98.44
Sensitivity 96.52 94.68 96.52 96.47 95.74 95.60 96.45 95.81 96.59 95.88
Specificity 98.94 97.87 98.94 98.94 98.36 98.42 98.94 98.36 98.94 98.29
Test Acc 97.87 96.45 97.87 97.87 97.16 97.16 97.87 97.16 97.87 97.16
SVM F1-Score 97.13 97.77 97.79 96.53 97.80 97.06 97.78 96.47 97.82 96.45
Precision 99.08 99.21 99.18 99.05 99.31 99.02 99.25 99.10 99.28 99.00
Sensitivity 95.24 96.40 96.45 94.68 96.38 95.51 96.35 94.61 96.42 94.58
Specificity 98.78 99.15 99.08 98.71 99.10 98.82 99.12 98.70 99.14 98.68
Test Acc 97.16 97.87 97.87 96.45 97.87 97.16 97.87 96.45 97.87 96.45
Decision tree F1-Score 98.58 97.11 98.55 96.50 98.57 96.46 96.16 97.83 97.03 96.48
Precision 99.20 98.70 99.24 98.40 99.22 98.33 97.90 99.00 98.55 98.37
Sensitivity 97.98 95.62 97.92 94.91 97.95 94.98 95.42 96.65 95.71 94.94
Specificity 99.12 98.85 99.10 98.66 99.13 98.60 98.18 98.92 98.78 98.63
Test Acc 98.58 97.16 98.58 96.45 98.58 96.45 96.45 97.87 97.16 96.45
XGBoost F1-Score 97.82 97.79 97.04 96.48 97.78 97.81 97.77 97.01 97.00 96.46
Precision 99.19 99.23 98.66 98.35 99.27 99.21 99.30 98.58 98.61 98.40
Sensitivity 96.47 96.41 95.62 94.96 96.39 96.44 96.36 95.68 95.65 94.93
Specificity 99.05 99.09 98.84 98.62 99.11 99.06 99.13 98.80 98.82 98.64
Test Acc 97.87 97.87 97.16 96.45 97.87 97.87 97.87 97.16 97.16 96.45

Values are % for F1-score, precision, sensitivity (recall), specificity and test accuracy.

These results highlight the value of fine-tuning (OptimizedTPE) for reducing classification errors and stabilizing performance across seeds. Table 14 reports metrics over 10 independent runs on the cluster-wise internal hold-out (n = 141; 47 ADP / 94 Non). Single-run test accuracy across models remains in the high-90s (≈ 96.45–98.58%), with correspondingly strong F1-scores. The Decision Tree achieves the best single-run accuracy (≈ 98.58%), while KNN and XGBoost demonstrate the lowest run-to-run variance (highest stability). SVM is very high-precision and sometimes sacrifices recall on ADPs, yielding the same types of errors in Fig. 3. On the whole, the pipeline is both robust and repeatable under realistic, source-disjoint evaluation and fine-tuning does shift both median accuracy and F1-score upward compared to untuned baselines.
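The run-to-run stability summarized here can be computed directly from the Table 14 rows. The sketch below uses the KNN test-accuracy row as its data; the same call applies to any other classifier/metric row.

```python
# Summarize seed-to-seed variability: mean and sample standard deviation
# of test accuracy across the 10 runs (KNN row of Table 14).

from statistics import mean, stdev

knn_acc = [97.87, 96.45, 97.87, 97.87, 97.16,
           97.16, 97.87, 97.16, 97.87, 97.16]

summary = {
    "mean": round(mean(knn_acc), 2),   # central tendency across seeds
    "sd": round(stdev(knn_acc), 2),    # run-to-run spread
}
```

A small standard deviation relative to the mean (well under one percentage point here) is what the text refers to as low seed sensitivity.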

Bioinformatic assessment and APD similarity

For each peptide, we calculated net charge, hydrophobicity (GRAVY), and Boman index using APD tools (at pH 7.4) and required each value to fall within physicochemically plausible ranges (e.g., charge and hydrophobicity must fall within literature-supported ranges, and the Boman index was kept out of exceptionally high values for protein-binding activity). We then used the APD6 Calculator/Prediction as a triage step (alongside a simple consensus of ADP-specific predictors) to weed out implausible candidates before labeling; APD6 itself was not used as a labeling function. We also limited family leakage by keeping, for train-only augmentation, only sequences with ≥ 70% identity to a member of the curated set of ADP families, while all lower-similarity variants were quarantined and excluded from Validation/Test. This pre-screening step filtered out a number of unstable or unlikely designs and allowed more reliable downstream predictions without label bias.
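A fractional net-charge calculation at pH 7.4 can be sketched via the Henderson–Hasselbalch relation. The pKa set below is one common textbook choice and may differ from the constants used by the APD calculator, so the results should be treated as approximate.

```python
# Approximate net charge of a free peptide at a given pH using
# Henderson-Hasselbalch fractions for each ionizable group.

PKA_POS = {'Nterm': 9.0, 'K': 10.5, 'R': 12.5, 'H': 6.0}
PKA_NEG = {'Cterm': 3.6, 'D': 3.9, 'E': 4.1, 'C': 8.3, 'Y': 10.1}

def net_charge(seq: str, ph: float = 7.4) -> float:
    # positive groups carry fraction 1 / (1 + 10^(pH - pKa))
    pos = 1 / (1 + 10 ** (ph - PKA_POS['Nterm']))        # free N-terminus
    pos += sum(1 / (1 + 10 ** (ph - PKA_POS[a]))
               for a in seq if a in PKA_POS)
    # negative groups carry fraction 1 / (1 + 10^(pKa - pH))
    neg = 1 / (1 + 10 ** (PKA_NEG['Cterm'] - ph))        # free C-terminus
    neg += sum(1 / (1 + 10 ** (PKA_NEG[a] - ph))
               for a in seq if a in PKA_NEG)
    return pos - neg
```

For lysine-rich sequences this yields the strongly positive values reported for the curated ADPs, while acidic sequences come out net negative.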

In Fig. 9 (right), average sequence similarity to APD shows that the hybrid generator produced the most ADP-like sequences (~ 82.5% ± 3.7), followed by guided mutation (~ 80.2% ± 4.5) and motif recombination (~ 75.6% ± 3.2); sequences < 70% were discarded. In Fig. 9 (left), Boman index distributions fall within a mid-range binding regime across methods: hybrid and guided mutation center around ~ 1.39 and ~ 1.37 kcal/mol, respectively, while motif recombination is slightly lower (~ 1.31 kcal/mol). We treat these Boman values as plausibility filters, not as an “optimal” antidiabetic signature, thereby avoiding circular validation. Overall, the screen-then-filter policy yields candidates that are biophysically reasonable and family-consistent, which in turn stabilizes classifier behavior on the independent evaluation sets.

Fig. 9.

Fig. 9

Left: Boman index value ranges for peptides from different generator methods. Right: APD set similarity scores (APD6) for peptides from different generator methods. For both plots, the error bars (min–max) are computed from the original, uncurated generator output.

Figure 10, complementing Fig. 9, shows that the generated/curated ADPs form a coherent physicochemical cluster: they skew positively in net charge and congregate at mid-range Boman values. This is consistent with higher protein-binding propensity and greater electrostatic engagement, while non-ADPs tend to cluster at low charge/Boman values. The partial class separation is most visible in the (Net Charge, Boman Index) and (Net Charge, Hydrophobicity) planes, with the residual overlap explaining the few FN/FP cases identified in the confusion matrices (Fig. 3). On the whole, Fig. 10 provides feature-level evidence that the four selected biochemical descriptors contain a discriminative signal while also being biophysically reasonable, which supports the model’s interpretability and its consistently high (and repeatable) performance on the internal evaluation.

Fig. 10.

Fig. 10

Pairwise scatter-plot matrix of 4 biochemical features: Net Charge, Hydrophobicity, Boman Index, and Molecular Weight, for peptides in the internal set – ADPs (blue) vs. non-ADPs (green). The diagonal panels show overlapped histograms for both classes while off-diagonal panels show the class-colored scatter plots with the axis limits shared per feature (across classes).

Experimental analysis of peptide discovery

Table 15 summarizes four representative de novo candidates generated by guided mutation and by the hybrid+filter route. All candidates passed physicochemical gates (net charge, GRAVY/hydrophobicity, Boman index at pH 7.4) and an APD6 Calculator/Prediction triage used only as a heuristic screen (not for labeling). Nearest-neighbor similarity against APD prototypes (BLOSUM62, default gap costs) shows 75–89% identity to validated families, with net positive charge between + 5 and + 7 and Boman values clustered near zero to low-positive (− 0.76 to 0.39 kcal/mol). For example, Seq-1 (AP00082-like) reaches 89.13% identity and 98.00% model confidence with + 7 charge and 0.21 kcal/mol Boman—features consistent with moderate binding propensity. Seq-2 retains high confidence (96.50%) despite a slightly negative Boman (− 0.76 kcal/mol), a pattern that can arise from calculator scale/parameterization and may be offset by strong cationicity (+ 7) and conserved helical motifs. None of these generated sequences were used in Validation/Test; they are proposed for prospective wet-lab assays.

Table 15.

Representative generated peptide candidates with APD similarity, key physicochemical indices, model confidence, and putative biological origin.

No. Peptide sequence Generation method Matched APD peptide / similarity (%) Boman index (kcal/mol) Net charge Prediction (model confidence, %) APD6 screen Biological origin (prob.)
1 GIFSKLAGKKLKMLLISGLKNVGKEKGMDVVRTRIDIAMCKIKIEC Guided Mutation AP00082 / 89.13 0.21 + 7 98.00 Passed Mammalian cathelicidin-like peptide (≈ 0.86)
2 GLNKIKKVRQGVHEAIKLNNHVK Hybrid + Filter AP00476 / 75.00 −0.76 + 7 96.50 Passed Amphibian skin HDP, magainin/aurein-like (≈ 0.88)
3 ALWKDIVKAVGKAAGKAVLNTVTDMVNQ Guided Mutation AP00476 / 86.33 −0.11 + 7 97.14 Passed Insect cecropin-like α-helical peptide (≈ 0.72)
4 GFSAIFRAVAKFASKGLGKDLAKLGVDLIAKKISQ Hybrid + Filter AP02111 / 83.78 0.39 + 5 97.53 Passed Mammalian cathelicidin-like/LL-37-family fragment (≈ 0.81)

We provide a family-level, probabilistic origin for each candidate (see “Putative Origin (prob.)” in Table 15). Origins were not used for labeling or training; they are hypotheses derived as follows:

  1. Family mapping: for each sequence we computed its nearest APD family using global alignment (BLOSUM62) and recorded the top-k family prototypes (e.g., cathelicidin/LL-37-like, magainin/aurein-like, cecropin-like).

  2. Descriptor profile: we formed a low-dimensional feature vector (sequence length; net charge; Lys/Arg frequency; cysteine count; GRAVY; simple helix-favoring motif flags) that characterizes known host-defense peptide families (e.g., mammalian cathelicidins vs. amphibian skin peptides vs. insect cecropins).

  3. Probability scoring: distances to family centroids (alignment + descriptor space) were converted to normalized scores via a softmax over the top-k families, yielding a putative origin probability.

Applying this procedure yielded: mammalian cathelicidin-like for Seq-1 (≈ 0.86) and Seq-4 (≈ 0.81), amphibian skin HDP (magainin/aurein-like) for Seq-2 (≈ 0.88), and insect cecropin-like for Seq-3 (≈ 0.72). These assignments are family-level and intended to guide experimental design (e.g., receptor targets, secretion milieu, protease context); species-level provenance requires empirical validation (e.g., tissue expression, targeted homology searches).
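Step 3 of the procedure above (converting top-k family distances into normalized origin probabilities via a softmax) can be sketched as follows; the family names and distance values are illustrative placeholders, not the study's fitted centroids.

```python
# Softmax over negated family distances: smaller distance -> higher
# putative-origin probability.

import math

def origin_probabilities(distances, temperature=1.0):
    # negate distances so the nearest family gets the largest logit
    logits = {fam: -d / temperature for fam, d in distances.items()}
    m = max(logits.values())                  # subtract max for stability
    exps = {fam: math.exp(l - m) for fam, l in logits.items()}
    z = sum(exps.values())
    return {fam: e / z for fam, e in exps.items()}

probs = origin_probabilities({
    "cathelicidin-like": 0.4,      # nearest family (illustrative)
    "magainin/aurein-like": 1.8,
    "cecropin-like": 2.3,
})
```

The `temperature` parameter controls how sharply probability mass concentrates on the nearest family; the study's exact scaling is not specified here.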

Together, the similarity, descriptor, model-confidence, and putative-origin evidence indicate that the generator prioritizes biophysically plausible peptides that align with known host-defense families—a property desirable for translational screening (e.g., DPP-IV inhibition/GLP-1 modulation). This section therefore links computational performance to biological credibility, while keeping origin claims explicitly probabilistic and methodologically conservative.

As shown in Fig. 11, we assessed descriptor-wise similarity (%) between each of the 16 generated peptides and its nearest APD analogue across three biologically salient properties: net charge (NetC Sim), Boman index (BomI Sim), and hydrophobicity (HydP Sim). Each triplet of bars corresponds to one peptide and reports its percent match to the reference along electrostatic, protein-binding, and membrane-affinity axes, respectively. Across the panel, most sequences achieve ≥ 80% similarity on all descriptors, with several—Seq 2, Seq 3, Seq 7, Seq 12, and Seq 13—showing consistently ≥ 90% concordance, indicating strong biochemical realism. Local dips in a single descriptor (e.g., modest HydP or BomI deviations for a few sequences) flag candidates for targeted motif edits rather than wholesale redesign.

Fig. 11.

Fig. 11

Descriptor-wise similarity of 16 generated peptides to their nearest APD analogs, evaluated on net charge (NetC Sim), Boman index (BomI Sim), and hydrophobicity (HydP Sim). Most candidates achieve > 80% concordance across all three properties, indicating strong biochemical consistency with validated families.

In general, the charge/Boman/hydrophobicity profiles’ agreement with APD prototypes provided property-level, interpretable evidence that the designs conform to structural motifs of known antidiabetic peptide families and supported their prioritization for downstream assays. A few cases—most notably Seq 1 and Seq 11—exhibit a modest decrease in HydP similarity but still maintain high NetC and BomI agreement, preserving overall biological plausibility. This multi-metric, property-level validation supports the structural plausibility of the designs and demonstrates that the proposed framework preserves functional hallmarks (electrostatics and moderate binding propensity) critical for antidiabetic peptide discovery.

Generalization

Performance was validated on an external, fully unseen experimental panel of peptide sequences (n = 180), including 60 experimentally supported anti-diabetic peptides (ADPs; positive class) and 120 peptides with no reported anti-diabetic activity (negative class) (Table 3). This panel was assembled from recent publications on DPP-IV inhibitory peptides, incretin/GLP-1 receptor agonist–like fragments, and related insulinotropic peptides, and is source- and time-disjoint from all training/validation/internal hold-out sets. It also contains no generated, augmented, or pseudo-labeled sequences.

On this fully external set, the final tuned model (a CNN + attention backbone with biochemical descriptor fusion of net charge, hydrophobicity, and Boman index, and classifier heads optimized via OptimizedTPE) achieved an external accuracy of approximately 98.75%, with an estimated F1-score of ≈ 0.985, precision of ≈ 0.99, recall (sensitivity) of ≈ 0.98, and specificity of ≈ 0.99. Taken together, these values suggest that the model correctly recovers nearly all biologically validated ADPs while keeping false positives among inactive peptides extremely low, despite the natural 2:1 class imbalance in the external panel (120 Non vs. 60 ADP). In other words, performance is not an artifact of balanced resampling; it is preserved under realistic screening ratios.

To gain further insight into the behavior of the model over decision thresholds, not just point performance, we evaluated Receiver Operating Characteristic (ROC) and Precision–Recall (PR) curves on the same unseen panel (Fig. 12). The proposed model demonstrates a ROC curve with near-saturated separation and a ROC AUC of ~ 0.99, as well as a PR AUC of ~ 0.97. A ROC AUC of this magnitude indicates that, for nearly any discrimination threshold, the model prioritizes true ADPs over non-functional peptides with extremely high reliability.

Fig. 12.

Fig. 12

Receiver operating characteristic (ROC) and Precision–Recall curves on an external unseen anti-diabetic peptide panel, comparing the proposed tuned model against two baselines without tuning or without sequence-aware features.

Similarly, a PR AUC in the high 0.9s shows that even if recall is driven toward 1.0 — i.e. if the model is forced to be aggressive and “not miss any potentially anti-diabetic peptide” — precision is maintained at high levels, rather than collapsing into a cascade of false positives. Combined with the ~ 98.75% external-set accuracy and F1 ≈ 0.985, this indicates the model can be relied upon to produce high recall for true anti-diabetic candidates without sacrificing selectivity, and that it does so on completely unseen biology. This is direct evidence of generalization, rather than memorization. As a point of reference, we also evaluated two ablations on the same external panel: (1) a “No Tuning” baseline that uses the same CNN + attention + descriptor architecture, but omits OptimizedTPE-driven hyperparameter search, and (2) a “Descriptors Only” baseline that uses global physicochemical descriptors (e.g., net charge, hydrophobicity, Boman index) but omits sequence-aware feature extraction (no convolutional motif modeling, no attention). As visualized in Fig. 12, both ablations underperform the proposed model in all respects. The untuned architecture shows reduced discrimination ability (ROC AUC dropping toward ~ 0.95, PR AUC toward ~ 0.93), and correspondingly lower accuracy (~ 93–95%), indicating that the automated hyperparameter exploration is not cosmetic but improves external behavior in a material way. The descriptors-only model underperforms even more severely (ROC AUC ≈ 0.92, PR AUC ≈ 0.91, accuracy in the ~ 92–93% range), and its PR curve visibly collapses in the high-recall regime: to achieve a high fraction of positives it must begin over-calling negatives as putative ADPs. By contrast, the final tuned model preserves both precision and recall across thresholds.
In practical terms, this means the full pipeline can be safely used as an in silico prioritization filter for novel anti-diabetic peptide discovery: it defends precision at high recall on an out-of-distribution panel of 180 biologically sourced peptides, thereby demonstrating strong generalization to independent biology.
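The threshold-free comparison above can be made concrete with a small, self-contained sketch. The labels and scores below are synthetic illustrations (not the study's predictions); ROC AUC is computed via its Mann–Whitney interpretation, the probability that a randomly chosen positive outranks a randomly chosen negative.

```python
def roc_auc(y_true, scores):
    """ROC AUC via the Mann-Whitney view: the probability that a randomly
    drawn positive is scored above a randomly drawn negative (ties count
    as half a win)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = ADP) and scores for a tuned model and a
# weaker ablation on the same tiny "panel".
y = [1, 1, 1, 0, 0, 0, 0, 0]
tuned = [0.97, 0.91, 0.88, 0.40, 0.22, 0.15, 0.09, 0.05]
ablation = [0.80, 0.55, 0.60, 0.58, 0.30, 0.45, 0.20, 0.10]

print(roc_auc(y, tuned))     # perfect separation -> 1.0
print(roc_auc(y, ablation))  # one positive-negative pair reversed -> 14/15
```

The tuned scores separate the classes perfectly (AUC = 1.0); in the ablation, one negative outranks one positive, so exactly 14 of the 15 positive–negative pairs are ordered correctly.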

Notably, in a nearest-neighbor global identity audit, each external peptide was compared against the full training set and summarized by its maximum (nearest-neighbor) identity. The external panel shows a low median nearest-neighbor identity (~ 35%) and a maximum below 50%, with no external sequence reaching ≥ 70% identity to any training sequence (Table 4), indicating that external performance is not driven by near-duplicates or close homologs of training peptides. Consistent with this, the external panel is also internally diverse (low pairwise identity; predominantly singleton behavior under common redundancy thresholds), suggesting that it is not dominated by repeated sequence families. Finally, targeted motif/fragment checks against canonical incretin-related fragments (e.g., common GLP-1/DPP-IV-associated signatures) did not indicate over-representation of such motifs in the external set, supporting the view that the model captures broader ADP-like sequence determinants rather than memorizing a narrow motif class. Together, these results provide quantitative evidence that the reported external performance reflects genuine out-of-distribution generalization to diverse, non-redundant peptide patterns rather than an artificially "easy" external distribution.
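A minimal sketch of such a nearest-neighbor audit follows. Real pipelines typically use alignment-based global identity (e.g., Needleman–Wunsch, or CD-HIT clustering); here a longest-common-subsequence ratio serves as a simple stand-in, and the peptide strings are invented for illustration only.

```python
def identity(a, b):
    """LCS length / length of the longer sequence -- a crude proxy for
    global alignment identity."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if a[i] == b[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    return dp[m][n] / max(m, n)

def nearest_neighbor_identity(external, training):
    """Max identity of each external peptide to any training peptide."""
    return {p: max(identity(p, t) for t in training) for p in external}

train = ["GPAGAP", "LKPNM", "IPAVF"]   # invented training peptides
external = ["VPLGTQ", "LKPQM"]         # invented external peptides
audit = nearest_neighbor_identity(external, train)
flagged = [p for p, v in audit.items() if v >= 0.70]  # near-homolog flag
print(audit, flagged)  # LKPQM is 80% identical to LKPNM -> flagged
```

In the paper's audit the analogous flag (≥ 70% identity to any training sequence) fires for none of the 180 external peptides, which is the basis of the near-duplicate claim above.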

Comparison

In recent years, several computational strategies have been put forward for anti-diabetic peptide (ADP) prediction and, in some cases, in silico design. While some of these studies represent valuable efforts, most models in the literature suffer from critical limitations that constrain their translational potential. Common shortcomings include: (1) very small labeled datasets and the absence of statistically robust partitioning; (2) reliance on cross-validation alone, without assessing performance on an independent, source-disjoint test set; (3) potential data leakage, with closely related or even augmented variants of the same peptide family present in both training and evaluation data; (4) no explicit biological filtering, leaving room for prioritization of candidates with unrealistic physicochemical profiles; (5) limited interpretability, with the model largely treated as a black box; and (6) incomplete reproducibility, with missing or partial hyperparameter details, no code, or no data release. Most importantly, prior work develops models that classify known peptide sequences only, without providing an end-to-end framework for in silico generation and prioritization of new candidate peptides under biologically meaningful constraints.

The framework developed in this work is specifically designed to address these issues. First, instead of evaluating models only on resampled internal data, we report performance on a strictly held-out external experimental panel of 180 peptides (60 experimentally supported ADPs and 120 non-ADPs) collected from recent literature and explicitly disjoint in both source and time from any training material. Second, synthetic/augmented sequences are used solely to broaden the training distribution and are never admitted to the validation or test sets, eliminating circularity and metric inflation. Third, sequence generation is coupled to biological plausibility: rather than accepting raw decoder output, we apply motif-guided editing/recombination followed by screening on physicochemical properties (net charge, hydrophobicity, Boman index) and rejection of implausible peptides. Fourth, the predictive backbone fuses local sequence features (CNN + attention) with interpretable biochemical descriptors, and we go beyond a black-box classifier by quantifying feature-level separability and class structure, which helps explain results and guide model improvement rather than treating the architecture as opaque. Fifth, we optimize the hyperparameters of the final classifier heads rather than leaving them at ad hoc defaults, and we report F1-score, sensitivity, specificity, and ROC AUC on the external panel in addition to raw accuracy. Finally, we provide full methodological detail for the model architecture, data partitioning (including homology-aware, cluster-wise splitting), and evaluation criteria to support reproducibility.
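The physicochemical triage step referenced throughout can be sketched as follows. GRAVY hydrophobicity uses the standard Kyte–Doolittle scale; net charge is a crude integer count at neutral pH (K/R positive, D/E negative, termini and histidine ignored). The Boman index and the study's actual gating thresholds are omitted here, so the windows below are purely illustrative.

```python
# Kyte-Doolittle hydropathy values (standard published scale).
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def net_charge(seq):
    """Crude net charge at neutral pH: +1 per K/R, -1 per D/E."""
    return sum(seq.count(r) for r in "KR") - sum(seq.count(r) for r in "DE")

def gravy(seq):
    """Grand average of hydropathy (mean Kyte-Doolittle value)."""
    return sum(KD[r] for r in seq) / len(seq)

def passes_gate(seq, charge_range=(-2, 6), gravy_range=(-2.0, 2.5)):
    """Reject candidates whose scalar descriptors fall outside loose,
    illustrative plausibility windows (not the study's thresholds)."""
    return (charge_range[0] <= net_charge(seq) <= charge_range[1]
            and gravy_range[0] <= gravy(seq) <= gravy_range[1])

print(net_charge("GKKLF"), round(gravy("GKKLF"), 2), passes_gate("GKKLF"))
```

A generated candidate that fails either window would be discarded before ever reaching the classifier, which is what keeps implausible decoder output out of the training pool.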
Beyond ADP-specific pipelines, recent peptide-function predictors increasingly report explainable modeling and task-general frameworks, supporting the broader methodological direction taken here50,51. Taken together, our pipeline overcomes the limitations of past work and makes progress on several fronts. It moves beyond pure "prediction of what already exists" toward a design–filter–validate process: it can design novel anti-diabetic peptide candidates, it filters for biochemical and structural realism, it ensures leakage-free evaluation, and it demonstrates high generalization performance on a held-out, unseen biological test set. Table 16 summarizes six representative works in anti-diabetic peptide discovery, prediction, and prioritization (five influential prior studies and the present framework) and contrasts them along key dimensions, including data curation strategy, use (or misuse) of synthetic sequences, external validation, interpretability, optimization strategy, and biological credibility.

Table 16.

Comparative analysis of prior anti-diabetic peptide discovery and prediction frameworks versus the proposed method.

Study | Model / core approach | De novo sequence design capability | Biological/physicochemical filtering of candidates | Evaluation protocol / leakage control | Interpretability / feature transparency | External, source-disjoint generalization performance
Basith et al.29 | ADP-Fuse (stacked/ensemble ML over engineered features) | No | Partial (curated biological features, but no explicit rejection of implausible sequences) | Primarily internal evaluation; no explicit external, time-disjoint test set reported; risk of family overlap not fully ruled out | Engineered biological descriptors; limited analysis of per-feature biological influence | Not reported
Liu et al.30 | iPADD (hybrid feature selection and multi-classifier voting) | No | No explicit post-generation plausibility screen | High accuracy reported (≈ 98.3%), but mainly via cross-validation; no fully independent external panel disclosed | 20+ handcrafted descriptors; limited structural interpretability beyond feature ranking | Not reported
Guan et al.31 | BERT–DPPIV (transformer/language-model-style predictor for DPP-IV inhibitory peptides) | Yes (targeted peptide proposal/prioritization) | Generally no explicit physicochemical rejection of unstable sequences | Evaluated on held-out splits derived from known DPP-IV inhibitors; external generalization beyond the training corpus not emphasized | Attention weights discussed, but limited linkage to classical biochemical indices | Not reported as a source-disjoint external benchmark
Yue et al.33 | BiLSTM + CNN + attention (deep sequence predictor) | No (prediction only) | No | 10-fold cross-validation (≈ 90.5% accuracy); no independent external test set; possible intra-family leakage | Primarily model-side attention; minimal biochemical interpretability | Not reported
Arshad et al.34 | XGB-SFS with multiple sequence encoders (gradient boosting + feature selection) | No | No | Reports ~95.4% accuracy on an "independent evaluation," but details of source/time disjointness and augmentation exclusion are limited | Feature importance via boosting mentioned, but biochemical grounding not deeply analyzed | No explicit source-/time-disjoint external panel
Proposed method | CNN + attention + biochemical descriptor fusion, with OptimizedTPE-tuned classifier heads | Yes (directed mutation, motif recombination, hybrid generative strategy) | Yes (explicit plausibility filtering: net charge, hydrophobicity, Boman index; APD-style screening; unstable/low-plausibility variants rejected) | Cluster-wise, homology-aware splitting for internal evaluation; generated/pseudo-labeled sequences confined to training only; final performance reported on an external experimental panel of 180 peptides (60 ADP / 120 Non) that is source- and time-disjoint from training | High: interpretable biochemical descriptors (charge, hydrophobicity, Boman index) fused with sequence motifs; attention highlights discriminative regions | Yes: ~98.75% accuracy, F1 ≈ 0.985, precision ≈ 0.99, recall ≈ 0.98, specificity ≈ 0.99, ROC AUC ≈ 0.99 on the external 180-peptide set

A defining strength of our framework is its emphasis on a design–filter–classify pipeline for de novo anti-diabetic peptide discovery. Candidates are first generated via three complementary strategies: (a) guided mutation, (b) motif recombination, and (c) a hybrid deep-learning generator. They are then passed through a rigorous biological filter comprising sequence-level de-duplication, homology control (cluster-wise splitting), physicochemical gating, and APD-style calculators/predictors used strictly as screeners for train-only augmentation rather than as classifiers. Generated or pseudo-labeled sequences are never admitted to the validation or test sets. The result is a unified, leakage-aware workflow that advances beyond mere prediction to biologically credible in silico design. Compared with prior methods such as iPADD30, BERT–DPPIV31, and ADP-Fuse29, which focus either on statistical descriptors or on the prediction of known sequences, our approach fuses sequence-aware embeddings (CNN + attention) with interpretable biochemical descriptors (net charge, hydrophobicity, Boman index) and further improves the model by optimizing its hyperparameters via OptimizedTPE.

Our model achieves high-90s performance internally (≈ 98–99% on the stratified folds), and on the source- and time-disjoint external experimental set (n = 180; 60 ADP / 120 Non; Table 16) it reaches an accuracy of ~98.75%, F1 ≈ 0.985, precision ≈ 0.99, recall ≈ 0.98, specificity ≈ 0.99, and ROC AUC ≈ 0.99. The accuracy and F1 reflect high recall for true ADPs, while the specificity, precision, and ROC AUC show that false positives among non-functional peptides are kept very low despite the natural 2:1 class imbalance in the data. Together, these numbers indicate that the model genuinely generalizes rather than overfitting to the training set. Beyond the headline figures, we also validate the biological consistency of the designed peptides via APD-style screening and similarity analysis, so that structural plausibility accompanies statistical accuracy. In contrast to many pipelines that stop at feature selection or classification, our method is a full end-to-end, translationally oriented system: it combines generative design, explicit biological post-filtering, and an optimized classifier.
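For readers who want to connect the headline metrics to underlying counts, the sketch below recomputes them as plain confusion-matrix arithmetic. The tallies are hypothetical, chosen only to mirror the panel's 60/120 class split; they are not the study's actual error counts.

```python
def external_metrics(tp, fp, tn, fn):
    """Headline classification metrics from a 2x2 confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)              # sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"accuracy": accuracy, "f1": f1, "precision": precision,
            "recall": recall, "specificity": specificity}

# Hypothetical tallies mirroring the 60 ADP / 120 non-ADP imbalance:
# 59 of 60 positives recovered, 1 of 120 negatives mis-called.
m = external_metrics(tp=59, fp=1, tn=119, fn=1)
print({k: round(v, 4) for k, v in m.items()})
```

With these illustrative counts, accuracy is 178/180 ≈ 0.989 while precision, recall, and specificity all sit near 0.98–0.99, showing how a 2:1 imbalance makes the joint precision/specificity figures more informative than accuracy alone.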

Limitations

While the proposed framework enforces physicochemical plausibility through rigorous gating (e.g., net charge, Boman index), we acknowledge that these scalar descriptors do not fully guarantee higher-order structural stability or bioavailability. Specifically, the current pipeline acts as a high-throughput screening layer and does not yet incorporate explicit secondary-structure prediction (e.g., via AlphaFold), protease susceptibility modeling, or aggregation/toxicity analysis. Consequently, while generated candidates are biophysically reasonable, their translational viability requires downstream structural validation. Future iterations of this work will aim to integrate structure-aware scoring functions—such as molecular docking and ADMET profiling—to further bridge the gap between in silico generation and clinical applicability.

In addition, although we use homology-aware splitting to reduce overlap between training and evaluation sets, the chosen similarity threshold can materially affect the strictness of separation—especially for short peptides where small substitutions may change function. Accordingly, reported metrics should be interpreted in the context of the clustering cutoff and the limited number of experimentally validated positives. Future work will evaluate stricter cutoffs and additional independent datasets to more rigorously assess generalization.
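The cluster-wise splitting logic discussed above can be sketched as follows. Python's difflib similarity ratio is a rough stand-in for alignment-based identity (real pipelines typically use CD-HIT43); the cutoff, split fraction, and peptides are illustrative. The key invariant is that whole clusters, never individual members, cross the train/test boundary.

```python
from difflib import SequenceMatcher

def greedy_clusters(seqs, cutoff=0.7):
    """Greedy single-pass clustering: each sequence joins the first
    representative it matches at >= cutoff, else founds a new cluster."""
    reps, groups = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if SequenceMatcher(None, s, r).ratio() >= cutoff:
                groups[i].append(s)
                break
        else:
            reps.append(s)
            groups.append([s])
    return groups

def cluster_split(seqs, test_frac=0.3, cutoff=0.7):
    """Assign entire clusters to test until the quota is met, then train,
    so homologs never straddle the split."""
    train, test = [], []
    for g in greedy_clusters(seqs, cutoff):
        (test if len(test) < test_frac * len(seqs) else train).extend(g)
    return train, test

peps = ["GLLDFLK", "GLLDFLR", "AAPHWTV", "KVIPSTQ"]  # invented peptides
train, test = cluster_split(peps)
print(train, test)  # the homologous pair lands on one side only
```

Raising the cutoff merges fewer sequences per cluster and thus makes the separation stricter, which is exactly the sensitivity the limitation above flags for short peptides.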

Another limitation is potential dataset-specific bias arising from construction of the negative pool. Despite curating negatives to avoid known anti-diabetic activity, subtle distributional differences (e.g., length, charge, hydrophobicity) or unobserved biological factors may remain and could inflate performance via shortcut learning. We will continue to audit class distributions, introduce harder negatives, and strengthen baselines to reduce this risk.

Relatedly, our augmentation introduces de novo sequences followed by filtering, and some filtering criteria may overlap with physicochemical descriptors used by the model, raising a risk of partial circularity. Although synthetic sequences are restricted to training, we interpret augmentation gains conservatively and treat them as improved coverage of plausible sequence space rather than evidence of functional novelty. We also report experimental-only training as an ablation to contextualize these effects.

From a modeling standpoint, the small labeled dataset constrains the complexity that can be learned without overfitting, even with regularization and early stopping. Moreover, protein language models (pLMs) are strong transfer-learning baselines for peptide tasks, so the advantage of a bespoke CNN–attention architecture may be regime-dependent. Broader benchmarking across multiple pLM families, pooling strategies, and calibration settings will help clarify when task-specific inductive biases provide measurable benefit.

Finally, our structural validation (e.g., docking-based screening against a nominated receptor) is intended for qualitative prioritization rather than confirmation of binding or agonism. Docking scores and predicted contacts cannot replace biochemical assays, receptor signaling measurements, or in vivo efficacy tests. Therefore, we position generated sequences as computationally prioritized candidates and view wet-lab validation—including stability assays, DPP-IV resistance testing, GLP-1R functional assays, and potentially molecular dynamics refinement—as essential next steps.

Conclusion

Here we present an integrated design–filter–classify pipeline for de novo anti-diabetic peptide (ADP) discovery that combines CNN + attention sequence models with interpretable biochemical descriptors (net charge, hydrophobicity/GRAVY, Boman index) and OptimizedTPE-tuned classifier heads. In addition to internal, homology-aware train/validation/test splits (high-90s AUPRC/AUC), we established a strictly source- and time-disjoint experimental test set for external generalization (n = 180; 60 ADP / 120 Non) and achieved accuracy ≈ 98.75%, F1 ≈ 0.985, precision ≈ 0.99, recall ≈ 0.98, specificity ≈ 0.99, and ROC AUC ≈ 0.99. Critically, augmented or weak-label sequences were used exclusively for training, and biologically implausible designs were rigorously gated out using physicochemical filters and APD-style screening, mitigating common issues of leakage, over-optimism, and poor interpretability. This biophysically grounded, data-driven workflow offers a generalizable and scalable pipeline for the in silico prioritization of ADPs.

Limitations of the current study include the small size and narrow coverage of labeled positives (both experimentally confirmed and screened from augmentations), heuristic physicochemical filters, the lack of prospective, blinded wet-lab evaluation, and an experimental test panel that does not exhaustively capture ADP biology beyond DPP-IV/incretin-centric mechanisms. To address these points, future work will (1) expand the size and external diversity of multicenter test cohorts with assay-standardized labels, (2) leverage transformer-scale pretraining (e.g., ProGen-style protein LLMs) to complement our learnable descriptors, (3) incorporate additional structure- and target-aware scoring (e.g., docking, receptor-binding surrogates, protease-stability and ADMET/toxicity screens), (4) add uncertainty quantification and calibration (e.g., ECE/Brier scores, cost-sensitive thresholds) and attribution analyses to improve model interpretability, and (5) close the loop with active learning and prospective, blinded wet-lab tests to iteratively improve both the generator and the classifier. Through these steps, we hope to mature this high-fidelity screen into a translational peptide-therapeutics engine that extends beyond diabetes toward multi-target metabolic interventions.

Acknowledgements

The authors thank Meybod University for technical support and computational resources.

Author contributions

Zahra Rahmani Asl and Khosro Rezaee designed the study and implemented the computational framework. Mojtaba Ansari reviewed and edited the manuscript. Hadi Zare-Zardini and Hossein Eslami contributed to data curation and analysis. All authors read and approved the final manuscript.

Data availability

All datasets used or generated during the current study are derived from publicly accessible peptide databases (e.g., PEP-Lab, SATPdb, THPdb2) as described in the Methods section. Additional processed data are available from the corresponding author on reasonable request (https://github.com/KhosroRezaee/Anti-Diabetic-Peptide-Prediction-).

Declarations

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Khosro Rezaee, Email: Kh.rezaee@meybod.ac.ir.

Mojtaba Ansari, Email: ansari@meybod.ac.ir.

References

  • 1.Hajfathalian, M., Ghelichi, S. & Jacobsen, C. Anti-obesity peptides from food: Production, evaluation, sources, and commercialization. Compr. Rev. Food Sci. Food Saf.24 (2), e70158. 10.1111/1541-4337.70158 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2.Ge, F., Zhou, J., Zhang, M. & Yu, D. J. MFP-MFL: leveraging graph attention and Multi-Feature integration for superior multifunctional bioactive peptide prediction. Int. J. Mol. Sci.26 (3), 1317. 10.3390/ijms26031317 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3.Yang, Y., Hu, Y., Zhang, X. & Wang, S. Two-stage selective ensemble of CNN via deep tree training for medical image classification. IEEE Trans. Cybern. 52 (9), 9194–9207. 10.1109/TCYB.2021.3061147 (2021). [DOI] [PubMed] [Google Scholar]
  • 4.Zhou, X., Yuan, W., Gao, Q. & Yang, C. An efficient ensemble learning method based on multi-objective feature selection. Inf. Sci.679, 121084. 10.1016/j.ins.2024.121084 (2024). [Google Scholar]
  • 5.Zhu, L., Chen, Z. & Yang, S. EndM-CPP: A multi-view explainable framework based on deep learning and machine learning for identifying cell-penetrating peptides. Interdiscip. Sci. Comput. Life Sci. 1–26. 10.1007/s12539-024-00673-4 (2024). [DOI] [PubMed]
  • 6.Xuan, X., Sun, M., Hu, D. & Lu, C. Identification of mitochondria-related feature genes for predicting type 2 diabetes mellitus using machine learning methods. Front. Endocrinol.16, 1501159. 10.3389/fendo.2025.1501159 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7.Basith, S., Manavalan, B. & Lee, G. AntiT2DMP-Pred: leveraging feature fusion and optimization for superior machine learning prediction of type 2 diabetes mellitus. Methods234, 264–274. 10.1016/j.ymeth.2025.01.003 (2025). [DOI] [PubMed] [Google Scholar]
  • 8.Chen, X. et al. De Novo design of G protein-coupled receptor 40 peptide agonists for type 2 diabetes mellitus using AI and mutagenesis. Front. Bioeng. Biotechnol.9, 694100. 10.3389/fbioe.2021.694100 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 9.Kumar, V. & Singh, D. Multiview and decision fusion in stacking ensemble to predict anti-cancer peptides. 2024 OPJU Int. Tech. Conf. 1–7. 10.1109/OTCON60325.2024.10687578 (2024).
  • 10.Ashaolu, T. J., Le, T. D., Suttikhana, I. & Olatunji, O. J. Regulatory mechanisms of biopeptides in insulin and glucose uptake. J. Funct. Foods. 104, 105552. 10.1016/j.jff.2023.105552 (2023). [Google Scholar]
  • 11.Henaux, L. et al. Glucoregulatory activity of peptide fractions from salmon protein hydrolysate. Membranes11 (7), 528. 10.3390/membranes11070528 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12.Jahandideh, F. & Wu, J. Mechanisms of action of bioactive peptides against glucose intolerance and insulin resistance. Food Sci. Hum. Wellness. 11 (6), 1441–1454. 10.1016/j.fshw.2022.06.001 (2022). [Google Scholar]
  • 13.Yaribeygi, H., Sathyapalan, T. & Sahebkar, A. Molecular mechanisms by which GLP-1 RA and DPP-4i induce insulin sensitivity. Life Sci.234, 116776. 10.1016/j.lfs.2019.116776 (2019). [DOI] [PubMed] [Google Scholar]
  • 14.Xu, X. et al. Novel approaches to drug discovery for treatment of T2DM. Expert Opin. Drug Discov. 9 (9), 1047–1058. 10.1517/17460441.2014.941352 (2014). [DOI] [PubMed] [Google Scholar]
  • 15.Antony, P. & Vijayan, R. Bioactive peptides as nutraceuticals for diabetes therapy: A review. Int. J. Mol. Sci.22 (16), 9059. 10.3390/ijms22169059 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16.Ashaolu, T. J. et al. Anti-obesity and anti-diabetic bioactive peptides: A review. Food Res. Int. 114427. 10.1016/j.foodres.2024.114427 (2024). [DOI] [PubMed]
  • 17.Fülöp, F., Martinek, T. A. & Tóth, G. K. Application of alicyclic β-amino acids in peptide chemistry. Chem. Soc. Rev.35 (4), 323–334. 10.1039/B501173F (2006). [DOI] [PubMed] [Google Scholar]
  • 18.Samajdar, R. et al. Secondary structure determines electron transport in peptides. Proc. Natl. Acad. Sci. USA. 121 (32), e2403324121. 10.1073/pnas.2403324121 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19.Ji, D., Xu, M., Udenigwe, C. C. & Agyei, D. Physicochemical characterisation, molecular docking, and drug-likeness evaluation of hypotensive peptides encrypted in flaxseed proteome. Curr. Res. Food Sci.3, 41–50. 10.1016/j.crfs.2020.03.001 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 20.Wan, F., Kontogiorgos-Heintz, D. & de la Fuente-Nunez, C. Deep generative models for peptide design. Digit. Discovery. 1 (3), 195–208. 10.1039/D1DD00024A (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21.Dean, S. N., Alvarez, J. A. E., Zabetakis, D., Walper, S. A. & Malanoski, A. P. PepVAE: variational autoencoder framework for antimicrobial peptide generation and activity prediction. Front. Microbiol.12, 725727. 10.3389/fmicb.2021.725727 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Grisoni, F. et al. Designing anticancer peptides by constructive machine learning. ChemMedChem13 (13), 1300–1302. 10.1002/cmdc.201800204 (2018). [DOI] [PubMed] [Google Scholar]
  • 23.Chang, S., Chen, J. Y., Chuang, Y. J. & Chen, B. S. Systems approach to pathogenic mechanism of type 2 diabetes and drug discovery design based on deep learning and drug design specifications. Int. J. Mol. Sci.22 (1), 166. 10.3390/ijms22010166 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Casey, R. et al. Discovery through machine learning and preclinical validation of novel anti-diabetic peptides. Biomedicines9 (3), 276. 10.3390/biomedicines9030276 (2021). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Zhu, S., Bai, Q., Li, L. & Xu, T. Drug repositioning in drug discovery of T2DM and repositioning potential of antidiabetic agents. Comput. Struct. Biotechnol. J.20, 2839–2847. 10.1016/j.csbj.2022.05.057 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Huang, W., Zhang, L. & Li, Z. Advances in computer-aided drug design for type 2 diabetes. Expert Opin. Drug Discov. 17 (5), 461–472. 10.1080/17460441.2022.2047644 (2022). [DOI] [PubMed] [Google Scholar]
  • 27.Zhao, J. et al. Application of machine learning methods for the development of antidiabetic drugs. Curr. Pharm. Des.28 (4), 260–271. 10.2174/1381612827666210622104428 (2022). [DOI] [PubMed] [Google Scholar]
  • 28.Ilari, L. et al. Unraveling the factors determining development of type 2 diabetes in women with gestational diabetes using machine learning. Front. Physiol.13, 789219. 10.3389/fphys.2022.789219 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 29.Basith, S. et al. ADP-Fuse: A two-layer machine learning predictor to identify antidiabetic peptides and diabetes types using multiview information. Comput. Biol. Med.165, 107386. 10.1016/j.compbiomed.2023.107386 (2023). [DOI] [PubMed] [Google Scholar]
  • 30.Liu, X. W. et al. iPADD: A computational tool for predicting potential antidiabetic drugs using ML algorithms. J. Chem. Inf. Model.63 (15), 4960–4969. 10.1021/acs.jcim.3c00564 (2023). [DOI] [PubMed] [Google Scholar]
  • 31.Guan, C. et al. Exploration of DPP-IV inhibitory peptide design rules using deep learning and enzyme site prediction. ACS Omega. 8 (42), 39662–39672. 10.1021/acs.jcim.3c00564 (2023). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 32.Cai, K. et al. Predicting antidiabetic peptide activity: A machine learning perspective on type 1 and type 2 diabetes. Int. J. Mol. Sci.25 (18), 10020. 10.3390/ijms251810020 (2024). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Yue, J. et al. Discovery of potential antidiabetic peptides using deep learning. Comput. Biol. Med.180, 109013. 10.1016/j.compbiomed.2024.109013 (2024). [DOI] [PubMed] [Google Scholar]
  • 34.Arshad, F., Ahmed, S., Amjad, A. & Kabir, M. An explainable stacking-based approach for accelerating the prediction of antidiabetic peptides. Anal. Biochem.691, 115546. 10.1016/j.ab.2024.115546 (2024). [DOI] [PubMed] [Google Scholar]
  • 35.Tuo, S., Zhu, Y., Lin, J. & Jiang, J. AMHF-TP: multifunctional therapeutic peptides prediction based on multi‐granularity hierarchical features. Quant. Biol.13 (1), e73. 10.1002/qub2.73 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36.Niu, Y., Qin, P. & Lin, P. Advances of deep neural networks in the development of peptide drugs. Future Med. Chem. 1–15. 10.1080/17568919.2025.2463319 (2025). [DOI] [PMC free article] [PubMed]
  • 37.PEP-Lab. Antidiabetic peptides [Internet]. Pep-Lab. [cited 2025 Oct 24]. (2025). Available from: https://www.pep-lab.info/Antidiabetic-activity-peptides.
  • 38.University of Nebraska Medical Center. Antimicrobial Peptide Database (APD6) [Internet]. 2025 [cited 2025 Oct 24]. Available from: https://aps.unmc.edu/.
  • 39.Singh, S. et al. SATPdb: a database of structurally annotated therapeutic peptides. Nucleic Acids Res.44 (D1), D1119–D1126. 10.1093/nar/gkv1114 (2016). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Jain, S., Gupta, S., Patiyal, S. & Raghava, G. P. S. THPdb2: compilation of FDA approved therapeutic peptides and proteins. Drug Discov Today. 29 (7), 104047. 10.1016/j.drudis.2024.104047 (2024). [DOI] [PubMed] [Google Scholar]
  • 41.Minkiewicz, P., Iwaniak, A. & Darewicz, M. The BIOPEP-UWM database of bioactive peptides: current opportunities. Int. J. Mol. Sci.20 (23), 5978. 10.3390/ijms20235978 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Wang, G., Schmidt, C., Li, X. & Wang, Z. APD6: the antimicrobial peptide database is expanded to promote research and development by deploying an unprecedented information pipeline. Nucleic Acids Res. gkaf860. 10.1093/nar/gkaf860 (2025). [DOI] [PMC free article] [PubMed]
  • 43.Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics22 (13), 1658–1659. 10.1093/bioinformatics/btl158 (2006). [DOI] [PubMed] [Google Scholar]
  • 44.Chen, X., Huang, J. & He, B. AntiDMPpred: a web service for identifying anti-diabetic peptides. PeerJ10, e13581. 10.7717/peerj.13581 (2022). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Xie, X. et al. BertADP: a fine-tuned protein Language model for anti-diabetic peptide prediction. BMC Biol.23 (1), 210. 10.1186/s12915-025-02312-w (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Rezaee, K. Anti-Diabetic Peptide Prediction [Internet]. GitHub; (2025). Available from: https://github.com/KhosroRezaee/Anti-Diabetic-Peptide-Prediction-.
  • 47.Nongonierma, A. B. & FitzGerald, R. J. Dipeptidyl peptidase IV (DPP-IV) inhibitory properties of a Whey protein isolate hydrolysate: influence of fractionation, enzyme specificity and hydrolysis time. Food Funct.4 (12), 1843–1853. 10.1016/j.peptides.2016.03.005 (2013). [DOI] [PubMed] [Google Scholar]
  • 48.Zhang, X. et al. Structure and dynamics of semaglutide-and taspoglutide-bound GLP-1R-Gs complexes. Cell. Rep.36 (2). 10.1016/j.celrep.2021.109374 (2021). [DOI] [PubMed]
  • 49.Huan, Y., Kong, Q., Mou, H. & Yi, H. Antimicrobial peptides: classification, design, application and research progress in multiple fields. Front. Microbiol.11, 582779. 10.3389/fmicb.2020.582779 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50.Yao, L. et al. Identifying antitubercular peptides via deep forest architecture with effective feature representation. Anal. Chem.96 (4), 1538–1546. 10.1021/acs.analchem.3c04196 (2024). [DOI] [PubMed] [Google Scholar]
  • 51.Xie, P. et al. Toward high-efficiency, low-resource, and explainable neuropeptide prediction with MSKDNP. Brief. Bioinform. 26 (5), bbaf466. 10.1093/bib/bbaf466 (2025). [DOI] [PMC free article] [PubMed] [Google Scholar]



Articles from Scientific Reports are provided here courtesy of Nature Publishing Group
