BMC Biology. 2026 Jan 26;24:43. doi: 10.1186/s12915-026-02520-y

Aegis: a transformer-based deep learning framework for the accurate identification of anticancer peptides

Zexu Zhou 1,2,#, Lei Xie 3,#, Xiaolong Li 1,4, Yijie Wei 1, Xinwei Luo 1, Feitong Hong 1, Sijia Xie 1, Hao Lyu 1, Fuying Dao 5, Chengbing Huang 4, Hui Ding 1, Huan Yang 2
PMCID: PMC12918072  PMID: 41588397

Abstract

Background

Anticancer peptides (ACPs) are promising therapeutic agents with selective cytotoxicity toward cancer cells and minimal toxicity toward normal cells. However, the experimental identification and characterization of ACPs are often costly, time-consuming, and inefficient. Computational approaches provide promising alternatives for the rapid and accurate prediction of ACPs.

Results

Here, we introduce Aegis, a novel transformer-based deep learning framework designed for precise ACP identification. We systematically evaluated various machine learning and deep learning models via multiple feature extraction methods, including the composition of k-spaced amino acid pairs (CKSAAP), CTD composition (CTDC), CTD transition (CTDT), CTD distribution (CTDD), and pseudo amino acid composition (PAAC) methods. Comprehensive feature importance analyses via analysis of variance (ANOVA), ReliefF, and SHapley Additive exPlanations (SHAP) methods were performed, followed by incremental feature selection (IFS) to determine the optimal subset of discriminative features. Using the 103 optimal features identified via SHAP, Aegis achieves state-of-the-art (SOTA) performance on an independent testing dataset, outperforming existing ACP prediction models. Furthermore, compositional analysis revealed that ACP sequences are significantly enriched in positively charged and hydrophobic residues.

Conclusions

Overall, our study demonstrates the exceptional potential of transformer-based deep learning for ACP identification, laying a foundation for future computational screening and the clinical development of novel ACPs.

Keywords: Anticancer peptides, Transformer, Deep learning, Incremental feature selection, SHAP

Background

Cancer remains one of the most critical public health challenges worldwide, causing millions of deaths annually due to the limitations of current therapeutic options [1–3]. At present, the primary clinical treatments for cancer include surgery, chemotherapy, radiotherapy, and targeted drug therapy [4–6]. Although these approaches play significant roles in cancer treatment, they are often accompanied by severe side effects and may lead to drug resistance [7–10]. Therefore, novel, safe, and effective cancer treatment strategies are urgently needed. Recent advances in immunotherapy, particularly immune checkpoint inhibitors and combination strategies, have further reshaped the oncology landscape, underscoring the need for new anticancer agents with enhanced specificity and reduced toxicity [11].

In recent years, anticancer peptides (ACPs) have attracted increasing attention as a promising class of therapeutic agents. Typically composed of 10–50 amino acid residues, ACPs exhibit high selectivity and low toxicity, targeting cancer cells while sparing normal cells, thus demonstrating great potential for clinical application [12]. However, only a limited number of ACPs have advanced to clinical use [13]. Traditional experimental methods for ACP functional identification are often time-consuming, costly, and inefficient [14]. As a result, leveraging computational approaches for the rapid and accurate identification of ACPs holds substantial practical significance.

Computational methods, particularly those based on machine learning and deep learning, have shown remarkable advantages in ACP identification in recent years [15–20]. For example, AntiCP 2.0, developed by Agrawal et al., utilizes extremely randomized trees based on dipeptide and amino acid compositions for efficient ACP prediction [21]. Chen et al. introduced iACP, which employs optimized pseudo amino acid composition and g-gap dipeptide mode to improve predictive accuracy [22]. ACPred, proposed by Schaduangrat et al., integrates physicochemical properties, amino acid composition, and dipeptide composition features, achieving high accuracy via support vector machine and random forest algorithms [23]. ACPred-LAF, developed by He et al., further enhances ACP prediction by incorporating multisense-scaled attention-based embeddings, providing superior performance over traditional handcrafted features [24]. Recently, MLACP 2.0, an updated version of MLACP, employed a combination of conventional classifiers and a convolutional neural network, significantly boosting predictive performance and generalizability [25, 26]. In addition to these methods, ensemble-based ACP prediction strategies have shown that aggregating heterogeneous learners can enhance predictive stability and provide an alternative methodological perspective alongside deep-learning-based models [27]. Recent studies have further expanded this landscape by introducing a variety of innovative ACP prediction frameworks. For instance, an intelligent sequence-evolutionary profiling model was proposed to improve ACP discrimination performance [28], while PLMACPred incorporated protein language model embeddings together with wavelet-based denoising to enhance feature representation [29]. More recently, StackACPred combined optimized multi-descriptor features with a stacked ensemble architecture, achieving competitive performance across benchmark datasets [30]. Zhong and Deng developed ACPScanner, an integrated machine learning framework utilizing sequence, physicochemical, secondary structure, and deep representation learning features, along with models such as the graph attention network and LightGBM, to predict ACPs and their specific anticancer functional types [31]. Although ACPScanner has demonstrated promising performance, it focuses primarily on multilevel ACP function prediction and still suffers from feature space redundancy, limited model interpretability, and potential overfitting.

To address these limitations, we developed a transformer-based deep learning model named “Aegis” for precise ACP identification. We utilized a publicly available dataset from the ACPScanner study, consisting of 701 experimentally validated ACP and non-anticancer peptide (non-ACP) sequences. The dataset was divided into 563 sequences for training and cross-validation and 138 sequences for independent testing. As illustrated in Fig. 1, we first performed comparative analyses of amino acid and dipeptide compositions to identify sequence differences between ACPs and non-ACPs. Next, multiple sequence features were extracted via the iLearnPlus platform. Various machine learning and deep learning models were trained and optimized, followed by feature importance ranking via analysis of variance (ANOVA), ReliefF, and SHapley additive exPlanations (SHAP) methods. Incremental feature selection (IFS) was then applied to determine the optimal feature subset. The final Aegis model achieved an accuracy (ACC) of 0.942 on the independent testing dataset, outperforming existing ACP prediction methods. Our study not only demonstrates the effectiveness of transformer-based models in ACP prediction but also provides a valuable computational tool for the precise identification and screening of ACPs in future research.

Fig. 1. Flowchart of our work

Results

Dataset construction

The dataset used in this study was derived from previously validated ACPs and non-ACPs. After removing redundant sequences with more than 80% identity via CD-HIT software, the dataset was divided into two subsets: 563 sequences for training and cross-validation and 138 sequences reserved as an independent testing dataset.

Compositional analysis

To investigate the compositional characteristics distinguishing ACPs from non-ACPs, we analyzed amino acid composition (AAC) and dipeptide composition (DPC) frequencies. As shown in Fig. 2a, AAC analysis revealed notable differences in amino acid distributions between ACPs and non-ACPs. Specifically, lysine (K), leucine (L), and arginine (R) were significantly enriched in ACPs, with lysine showing the most pronounced increase. This enrichment pattern is consistent with established anticancer mechanisms, in which cationic residues enable strong electrostatic interactions with negatively charged cancer cell membranes, while hydrophobic residues facilitate membrane insertion and subsequent membrane disruption [32]. Conversely, aspartic acid (D), glutamic acid (E), and serine (S) were more abundant in non-ACPs. These results suggest that ACPs preferentially contain positively charged residues (K, R) and hydrophobic residues (L), whereas non-ACPs have a greater proportion of negatively charged residues (D, E) and smaller polar residues (S). These compositional features likely play critical roles in determining their biological activities.

Fig. 2. Comparison of AAC and DPC features between ACPs and non-ACPs. a The percentage of each amino acid in ACPs and non-ACPs. b Percentages of dipeptides in ACPs and non-ACPs. Only dipeptides with absolute composition differences greater than 0.4% between ACPs and non-ACPs are shown. The red bars represent ACPs, whereas the blue bars represent non-ACPs

DPC analysis focused on dipeptides whose absolute compositional differences exceeded 0.4% between ACPs and non-ACPs (Fig. 2b). Among these dipeptides, KK, KL, AK, and KW showed prominent enrichment in ACPs, with KK displaying the highest relative abundance. Conversely, dipeptides such as ED, EL, LD, SG, and VE were more frequent in non-ACPs. These observations are consistent with those of the AAC analysis, further emphasizing that ACPs favor positively charged or positively charged-hydrophobic residue combinations, whereas non-ACPs are characterized by dipeptides comprising negatively charged or aliphatic-polar residues.
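For readers who wish to reproduce this comparison, the minimal Python sketch below (our own illustration, not the authors' code; the toy sequences are hypothetical, whereas the published analysis uses the 701 curated peptides) computes class-averaged AAC and DPC frequencies and ranks the dipeptides most enriched in ACPs.

```python
# Minimal sketch of the AAC/DPC comparison; toy sequences only.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def aac(sequences):
    """Class-averaged amino acid composition (%) over a list of peptides."""
    counts, total = Counter(), 0
    for seq in sequences:
        counts.update(seq)
        total += len(seq)
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}

def dpc(sequences):
    """Class-averaged dipeptide composition (%) over a list of peptides."""
    counts, total = Counter(), 0
    for seq in sequences:
        counts.update(seq[i:i + 2] for i in range(len(seq) - 1))
        total += max(len(seq) - 1, 0)
    return {a + b: 100.0 * counts[a + b] / total
            for a, b in product(AMINO_ACIDS, repeat=2)}

acps = ["KWKLFKKIEK", "FLGALFKALSKLL"]        # hypothetical ACP-like peptides
non_acps = ["DSEEDGSGSE", "SGSEDLVEEA"]       # hypothetical non-ACPs
aac_acp, aac_non = aac(acps), aac(non_acps)
print(max(AMINO_ACIDS, key=lambda aa: aac_acp[aa] - aac_non[aa]))  # 'K' here
dpc_acp, dpc_non = dpc(acps), dpc(non_acps)
diff = {dp: dpc_acp[dp] - dpc_non[dp] for dp in dpc_acp}
print(sorted(diff.items(), key=lambda kv: kv[1], reverse=True)[:5])
```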

Benchmarking machine learning models and feature encodings reveals SVM with CKSAAP as the top performer

Given the distinct amino acid compositions of ACPs and non-ACPs, we employed computational strategies to extract discriminative features for classification. Feature extraction and model training were performed via iLearnPlus, a web-based tool for automated predictive analysis of biological sequences [33]. Specifically, five feature extraction methods were implemented: composition of k-spaced amino acid pairs (CKSAAP), CTD composition (CTDC), CTD transition (CTDT), CTD distribution (CTDD), and pseudo amino acid composition (PAAC). Two variants of CKSAAP (CKSAAP4 and CKSAAP5) were generated by setting the k-space parameter to 4 and 5, respectively. For PAAC, the lambda parameter was set to 2 and the weight factor to 0.05, yielding six distinct feature sets for model construction.

Four machine learning algorithms—XGBoost, random forest (RF), LightGBM, and support vector machine (SVM)—were trained using each feature set with fivefold cross-validation. These models represent two major categories of classical machine learning paradigms: tree-based ensemble learners (XGBoost, RF, and LightGBM), which excel at modeling nonlinear relationships in high-dimensional feature spaces, and a kernel-based classifier (SVM), which is well suited for margin-maximizing classification in sparse peptide representations. The cross-validated models were then ensembled to enhance robustness and evaluated on an independent testing dataset. Confidence intervals were computed using non-parametric bootstrapping (1000 iterations) over the independent testing set.
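A hedged sketch of this evaluation protocol follows, using the SVM as the example classifier; averaging the five folds' predicted probabilities is our assumption about the ensembling step, and all variable names are placeholders rather than the authors' identifiers.

```python
# Sketch: fivefold CV ensemble plus 1000-iteration bootstrap CIs.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import SVC

def cv_ensemble_predict(X_train, y_train, X_test, n_splits=5, seed=0):
    """Train one SVM per fold and average their test-set probabilities."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    probs = []
    for tr_idx, _ in skf.split(X_train, y_train):
        clf = SVC(probability=True).fit(X_train[tr_idx], y_train[tr_idx])
        probs.append(clf.predict_proba(X_test)[:, 1])
    return np.mean(probs, axis=0)

def bootstrap_ci(y_true, y_score, metric=roc_auc_score, n_boot=1000, seed=0):
    """Non-parametric bootstrap 95% CI for a metric on the test set."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y_true), len(y_true))
        if len(np.unique(y_true[idx])) < 2:   # resample must contain both classes
            continue
        stats.append(metric(y_true[idx], y_score[idx]))
    return np.percentile(stats, [2.5, 97.5])
```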

As shown in Table 1 and Fig. 3, model performance varies depending on the feature extraction method and classifier used. Overall, the SVM classifiers consistently outperformed the other classifiers, particularly with the CKSAAP-based feature sets (CKSAAP4 and CKSAAP5). Specifically, the SVM model trained with CKSAAP4 features achieved the highest overall performance, with an ACC of 0.942 (95% confidence interval (CI), 0.899–0.978), a sensitivity (SN) of 0.963 (95% CI 0.922–0.991), a specificity (SP) of 0.867 (95% CI 0.739–0.971), a Matthews correlation coefficient (MCC) of 0.83 (95% CI 0.708–0.940), and an area under the receiver operating characteristic curve (AUC) of 0.981 (95% CI 0.961–0.996), demonstrating excellent discriminative capability.

Table 1.

Performance of the ML models trained with different types of features on the testing dataset

Model      Feature   SN     SP     ACC    MCC    AUC
XGBoost    CKSAAP4   0.972  0.600  0.891  0.657  0.972
XGBoost    CKSAAP5   0.991  0.500  0.884  0.632  0.976
XGBoost    CTDC      0.889  0.767  0.862  0.621  0.951
XGBoost    CTDD      0.944  0.733  0.899  0.695  0.946
XGBoost    CTDT      0.870  0.867  0.870  0.670  0.939
XGBoost    PAAC      0.944  0.733  0.899  0.695  0.946
RF         CKSAAP4   0.963  0.567  0.877  0.608  0.962
RF         CKSAAP5   0.954  0.567  0.870  0.586  0.956
RF         CTDC      0.935  0.800  0.906  0.727  0.945
RF         CTDD      0.935  0.667  0.877  0.626  0.939
RF         CTDT      0.889  0.767  0.862  0.621  0.931
RF         PAAC      0.972  0.733  0.920  0.756  0.949
LightGBM   CKSAAP4   0.991  0.467  0.877  0.606  0.956
LightGBM   CKSAAP5   0.981  0.467  0.870  0.577  0.954
LightGBM   CTDC      0.898  0.833  0.884  0.687  0.952
LightGBM   CTDD      0.954  0.767  0.913  0.739  0.957
LightGBM   CTDT      0.880  0.900  0.884  0.709  0.944
LightGBM   PAAC      0.935  0.767  0.899  0.702  0.944
SVM        CKSAAP4   0.963  0.867  0.942  0.830  0.981
SVM        CKSAAP5   0.963  0.867  0.942  0.830  0.980
SVM        CTDC      0.926  0.800  0.899  0.709  0.953
SVM        CTDD      0.954  0.700  0.899  0.690  0.930
SVM        CTDT      0.935  0.900  0.928  0.799  0.961
SVM        PAAC      0.935  0.867  0.920  0.775  0.959

Fig. 3. Performance of machine learning models in classifying ACPs and non-ACPs. a, b ROC and PR curves of XGBoost models trained with 6 different features. c, d ROC and PR curves of RF models trained with 6 different features. e, f ROC and PR curves of LightGBM models trained with 6 different features. g, h ROC and PR curves of SVM models trained with 6 different features

The SVM-CTDD model delivered strong and reasonably balanced performance (ACC of 0.899, 95% CI 0.848–0.949; SN of 0.954, 95% CI 0.912–0.990; SP of 0.767, 95% CI 0.538–0.848; MCC of 0.739, 95% CI 0.528–0.827), and the SVM-CTDT model also performed well (ACC of 0.928, 95% CI 0.884–0.964; SP of 0.9, 95% CI 0.781–1.000), indicating strong generalization ability.

Among the tree-based algorithms, the LightGBM-CTDD model exhibited robust performance (ACC of 0.913, 95% CI 0.862–0.957; SN of 0.954, 95% CI 0.909–0.991; MCC of 0.739, 95% CI 0.574–0.859), whereas XGBoost achieved competitive predictive results (ACC of 0.891, 95% CI 0.841–0.942; SN of 0.972, 95% CI 0.937–1.000) when CKSAAP4 features were used. However, both LightGBM and XGBoost showed lower specificity than the SVM models did, revealing a sensitivity–specificity trade-off of the tree-based methods on this dataset. RF-based models performed best with PAAC features (ACC of 0.92, 95% CI 0.877–0.964; MCC of 0.756, 95% CI 0.620–0.880), demonstrating reliability.

The receiver operating characteristic (ROC) and precision-recall (PR) curves in Fig. 3 further confirmed these findings. The SVM-based classifiers maintained superior AUC (0.953–0.981) and average precision (AP, 0.989–0.995) values, clearly demonstrating superior predictive capability. In contrast, models utilizing CTDD or CTDT features generally exhibited moderate performance (AUC, 0.939–0.957), underscoring the importance of the choice of feature encoding for optimizing model outcomes.

Benchmarking deep learning models and feature encodings reveals transformer with CKSAAP as the top performer

To further enhance the predictive performance, we explored deep learning (DL) techniques using the same six feature sets previously employed in machine learning models. Specifically, we utilized several DL architectures available in iLearnPlus, including convolutional models (CNN and attention-based CNN, ABCNN), recurrent sequence models (RNN and bidirectional RNN, BRNN), and a representation-learning model (autoencoder, AE), alongside a transformer model that we independently constructed. These architectures embody complementary inductive biases, spanning fine-grained local motif detection to broader long-range dependency modeling, thus supporting a more comprehensive and meaningful benchmark. Each DL model was evaluated via fivefold cross-validation, with the final performance assessed on the independent testing dataset. Confidence intervals were computed using non-parametric bootstrapping (1000 iterations) over the independent testing set.

Table 2 summarizes the performance of different DL architectures, while Fig. 4 presents the corresponding ROC and PR curves. Among these models, the transformer architecture demonstrated superior overall performance, particularly when trained on CKSAAP-derived features. Notably, the transformer trained with the CKSAAP5 feature set yielded the highest performance, with an ACC of 0.942 (95% CI 0.906–0.978), SN of 0.954 (95% CI 0.910–0.990), SP of 0.9 (95% CI 0.784–1.000), MCC of 0.834 (95% CI 0.717–0.936), and an outstanding AUC of 0.978 (95% CI 0.957–0.994), demonstrating strong predictive capability and balanced classification accuracy. Despite its strong performance, the transformer model remained computationally manageable, with training performed on a single NVIDIA RTX 3060 Laptop GPU and requiring approximately 10 s per fold, reflecting a parameter scale that is moderately larger than CNN- and RNN-based models but still accessible to typical research computing environments.
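To make the transformer setup concrete, the sketch below shows one plausible way to apply a transformer encoder to fixed-length CKSAAP vectors; the chunking of the feature vector into tokens, the CLS-token pooling, and all hyperparameters (d_model, heads, layers) are our assumptions, not the published Aegis architecture.

```python
# A minimal PyTorch sketch of a transformer classifier over CKSAAP vectors.
import torch
import torch.nn as nn

class FeatureTransformer(nn.Module):
    def __init__(self, n_features=2400, chunk=48, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        assert n_features % chunk == 0
        self.chunk = chunk
        self.embed = nn.Linear(chunk, d_model)            # each chunk becomes a token
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=128, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)                 # ACP vs non-ACP logits

    def forward(self, x):                                 # x: (batch, n_features)
        tokens = self.embed(x.view(x.size(0), -1, self.chunk))
        tokens = torch.cat([self.cls.expand(x.size(0), -1, -1), tokens], dim=1)
        h = self.encoder(tokens)
        return self.head(h[:, 0])                         # classify from the CLS token

model = FeatureTransformer()
logits = model(torch.randn(8, 2400))                      # toy batch of CKSAAP5 vectors
print(logits.shape)                                       # torch.Size([8, 2])
```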

Table 2.

Performance of the DL models trained with different types of features on the testing dataset

Model        Feature   SN     SP     ACC    MCC    AUC
CNN          CKSAAP4   0.944  0.667  0.884  0.645  0.941
CNN          CKSAAP5   0.833  0.867  0.841  0.620  0.929
CNN          CTDC      0.843  0.967  0.870  0.708  0.948
CNN          CTDD      0.815  0.800  0.812  0.544  0.929
CNN          CTDT      0.806  0.900  0.826  0.611  0.925
CNN          PAAC      0.898  0.833  0.884  0.687  0.946
ABCNN        CKSAAP4   0.935  0.567  0.855  0.546  0.938
ABCNN        CKSAAP5   0.944  0.800  0.913  0.744  0.958
ABCNN        CTDC      0.861  0.833  0.855  0.631  0.944
ABCNN        CTDD      0.935  0.300  0.797  0.303  0.798
ABCNN        CTDT      0.796  0.967  0.833  0.652  0.948
ABCNN        PAAC      0.833  0.833  0.833  0.594  0.918
RNN          CKSAAP4   0.769  0.467  0.703  0.215  0.607
RNN          CKSAAP5   0.972  0.500  0.870  0.578  0.848
RNN          CTDC      0.806  0.867  0.819  0.585  0.923
RNN          CTDD      0.750  0.800  0.761  0.470  0.855
RNN          CTDT      0.750  0.867  0.775  0.523  0.891
RNN          PAAC      0.843  0.833  0.841  0.606  0.915
BRNN         CKSAAP4   0.963  0.400  0.841  0.468  0.890
BRNN         CKSAAP5   0.981  0.533  0.884  0.631  0.890
BRNN         CTDC      0.815  0.833  0.819  0.570  0.917
BRNN         CTDD      0.750  0.767  0.754  0.443  0.847
BRNN         CTDT      0.685  0.833  0.717  0.432  0.868
BRNN         PAAC      0.833  0.900  0.848  0.645  0.913
AE           CKSAAP4   0.963  0.667  0.899  0.685  0.953
AE           CKSAAP5   0.963  0.700  0.906  0.710  0.953
AE           CTDC      0.972  0.500  0.870  0.578  0.888
AE           CTDD      0.926  0.533  0.841  0.500  0.861
AE           CTDT      0.889  0.900  0.891  0.723  0.937
AE           PAAC      0.926  0.733  0.884  0.659  0.950
Transformer  CKSAAP4   0.954  0.800  0.920  0.763  0.974
Transformer  CKSAAP5   0.954  0.900  0.942  0.834  0.978
Transformer  CTDC      0.926  0.900  0.920  0.783  0.958
Transformer  CTDD      0.935  0.767  0.899  0.702  0.952
Transformer  CTDT      0.917  0.900  0.913  0.767  0.950
Transformer  PAAC      0.907  0.867  0.899  0.727  0.964

Fig. 4. Performance of deep learning models in classifying ACPs and non-ACPs. a, b ROC and PR curves of the CNN models trained with 6 different features. c, d ROC and PR curves of ABCNN models trained with 6 different features. e, f ROC and PR curves of RNN models trained with 6 different features. g, h ROC and PR curves of the BRNN models trained with 6 different features. i, j ROC and PR curves of the AE models trained with 6 different features. k, l ROC and PR curves of transformer models trained with 6 different features

The CNN and ABCNN also displayed promising results, although they were slightly inferior to the transformer. Specifically, the CNN model trained with CTDC features exhibited robust predictive accuracy (ACC = 0.87, 95% CI 0.819–0.920), excellent specificity (SP = 0.967, 95% CI 0.893–1.000), a favorable MCC (0.708, 95% CI 0.596–0.815), and an AUC of 0.948 (95% CI 0.905–0.985). The ABCNN performed well when trained on the CKSAAP5 feature set, achieving high sensitivity (SN = 0.944, 95% CI 0.903–0.982), reasonable specificity (SP = 0.8, 95% CI 0.652–0.935), an ACC of 0.913 (95% CI 0.869–0.957), and an AUC of 0.958 (95% CI 0.922–0.987), indicating strong but slightly less balanced predictive capability than the CNN.

In contrast, the RNN, BRNN, and AE models achieved varied and generally lower predictive performance than the CNN, ABCNN, and transformer models. Specifically, the RNN model demonstrated moderate discriminative ability across most feature sets, achieving balanced performance with PAAC features (ACC = 0.841, 95% CI 0.782–0.899; SN = 0.843, 95% CI 0.770–0.908; SP = 0.833, 95% CI 0.704–0.963; MCC = 0.606, 95% CI 0.458–0.751; AUC = 0.915, 95% CI 0.844–0.970). BRNN models improved on the standard RNNs, notably with CKSAAP5 features, attaining higher overall accuracy (ACC = 0.884, 95% CI 0.826–0.935) but with a pronounced sensitivity–specificity imbalance (SN = 0.981, 95% CI 0.951–1.000; SP = 0.533, 95% CI 0.364–0.719; MCC = 0.631, 95% CI 0.472–0.781; AUC = 0.89, 95% CI 0.821–0.948). Moreover, both the RNN and BRNN models suffered from inconsistent performance across different feature sets, highlighting challenges in generalization.

The autoencoder (AE) models also yielded varying results, with relatively balanced and improved predictive capabilities when trained with CTDT features (ACC = 0.891, 95% CI 0.840–0.942; SN = 0.889, 95% CI 0.827–0.944; SP = 0.9, 95% CI 0.784–1.000; MCC = 0.723, 95% CI 0.589–0.847; AUC = 0.937, 95% CI 0.893–0.975). Nonetheless, AE models frequently exhibited lower specificity with certain feature sets, such as CKSAAP and CTDC, indicating potential limitations in their discriminative power.

The ROC and PR curves presented in Fig. 4 further illustrate these observations. The transformer architecture consistently outperformed the other DL models, with ROC curves close to the ideal upper-left corner and PR curves indicating high precision and recall across various feature sets. Compared with the transformer, the CNN and ABCNN showed robust but slightly less balanced performance. The RNN, BRNN, and AE models displayed moderate and less consistent discriminative capabilities, underscoring the necessity of careful feature selection and model optimization.

Multi-strategy feature selection uncovers complementary importance landscapes in CKSAAP feature spaces

After constructing the machine learning and deep learning models, we observed substantial redundancy in the original high-dimensional CKSAAP feature matrices (CKSAAP4, 2000 features; CKSAAP5, 2400 features), largely because many k-spaced dipeptide combinations do not occur in real peptide sequences. To mitigate this redundancy and improve both interpretability and predictive robustness, we systematically performed feature selection. Specifically, we ranked the extracted features according to their relative importance in distinguishing ACPs from non-ACPs, facilitating subsequent IFS.

CKSAAP features represent the frequency of amino acid pairs separated by k intervening residues. Given 20 standard amino acids, CKSAAP generates 20 × 20 = 400 possible amino acid pairs for each specific k. Accordingly, features were named in the following format: {AminoAcid1}{AminoAcid2}_{k}, clearly indicating the amino acid pair and its separation distance.
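The sketch below illustrates this encoding and naming scheme; the normalization (pair count divided by the number of k-spaced positions) follows the common definition of CKSAAP and may differ in minor details from the iLearnPlus internals.

```python
# Sketch of CKSAAP encoding with the {AA1}{AA2}_{k} naming convention.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def cksaap(seq, k_max=4):
    """Return {'{AA1}{AA2}_{k}': normalized frequency} for k = 0..k_max."""
    features = {}
    for k in range(k_max + 1):
        # pairs of residues separated by exactly k intervening positions
        counts = Counter(seq[i] + seq[i + k + 1] for i in range(len(seq) - k - 1))
        n_pairs = max(len(seq) - k - 1, 1)
        for a, b in product(AMINO_ACIDS, repeat=2):
            features[f"{a}{b}_{k}"] = counts[a + b] / n_pairs
    return features

feats = cksaap("KWKLFKKIEKVGQNIRDG")   # toy peptide
print(len(feats))                      # 2000 features for CKSAAP4 (5 gaps x 400 pairs)
print(feats["KK_3"])                   # K..K pairs separated by 3 residues
```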

Because IFS requires a ranked feature list, we employed three independent feature-ranking methods—ANOVA, ReliefF, and SHAP—to generate distinct importance rankings for each feature set (Fig. 5).

Fig. 5. Feature importance ranking and SHAP value visualization for the CKSAAP4 and CKSAAP5 feature sets. a SHAP value summary plot showing the feature impact and value distribution of the top 20 features from the CKSAAP4 set. b, c Top 20 features from the CKSAAP4 set ranked by ANOVA and ReliefF. d–f Same as above for the CKSAAP5 feature set

The SHAP ranking method utilizes an XGBoost model trained on the entire feature set, subsequently computing Shapley values to quantify the impact of each feature on model predictions. SHAP values uniquely capture complex nonlinear interactions among features. The SHAP summary plots in Fig. 5a (CKSAAP4) and Fig. 5d (CKSAAP5) visually illustrate feature contributions, where each dot represents an instance. Dot color indicates feature magnitude (high or low), whereas horizontal placement reflects the direction and strength of the feature’s influence on model predictions.
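A minimal sketch of this ranking step is shown below, assuming the shap package's TreeExplainer; the XGBoost hyperparameters are illustrative, not the authors' settings.

```python
# Sketch: global SHAP importance ranking from an XGBoost model.
import numpy as np
import shap
from xgboost import XGBClassifier

def shap_ranking(X, y, feature_names):
    model = XGBClassifier(n_estimators=200, max_depth=4, eval_metric="logloss")
    model.fit(X, y)
    shap_values = shap.TreeExplainer(model).shap_values(X)  # (n_samples, n_features)
    importance = np.abs(shap_values).mean(axis=0)           # mean |SHAP| per feature
    # shap.summary_plot(shap_values, X, feature_names=feature_names)  # cf. Fig. 5a, d
    return [feature_names[i] for i in np.argsort(importance)[::-1]]
```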

The ANOVA method evaluates features on the basis of their ability to discriminate between ACPs and non-ACPs, ranking them by their corresponding F values (Fig. 5b, e). Higher F values indicate stronger discriminatory capacity, highlighting key differentiating features.
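This ranking can be reproduced with scikit-learn's f_classif, as in the hedged sketch below; a ReliefF ranking (described next) can be produced analogously, e.g., with the skrebate package.

```python
# Sketch: ANOVA F-value ranking of features.
import numpy as np
from sklearn.feature_selection import f_classif

def anova_ranking(X, y, feature_names):
    f_values, _ = f_classif(X, y)                      # one F-value per feature
    order = np.argsort(np.nan_to_num(f_values))[::-1]  # descending F-value
    return [feature_names[i] for i in order]
```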

In contrast, ReliefF is a robust, distance-based method that assigns higher scores to features with strong discriminative capability between neighboring samples from different classes (Fig. 5c, f). Unlike ANOVA, ReliefF implicitly considers local feature interactions and the data structure, providing complementary insights into feature importance.

Comparative analyses revealed both consistencies and differences among the three ranking methods. For example, features VQ_3 and SE_0 ranked highly in both ANOVA and ReliefF for CKSAAP4 (Fig. 5b, c), demonstrating robust intrinsic discriminative capabilities. In contrast, SHAP prominently identified KK_3, RR_0, and LK_1 as highly influential features, emphasizing their relevance within a nonlinear model context. Similar patterns were observed for CKSAAP5 (Fig. 5d–f), further highlighting the advantages of integrating multiple ranking approaches for a comprehensive evaluation of feature significance.

IFS highlights SHAP’s efficiency and discriminative power

Having generated comprehensive feature rankings via ANOVA, ReliefF, and SHAP, we proceeded to identify an optimal, compact set of discriminative features via IFS. Owing to its strong predictive ability, the transformer model was chosen as the baseline classifier throughout the IFS procedure. Specifically, beginning with an empty feature set, features were incrementally incorporated one by one following each ranking order. At each incremental step, the models were trained via fivefold cross-validation on the training dataset, and their ensemble predictions were evaluated on the basis of the AUC metric on an independent testing dataset. Through systematic iteration, we generated IFS curves to depict how classification performance evolves with an increasing number of selected features (Fig. 6a, b).
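A sketch of this IFS loop follows; the svc_auc scorer is a lightweight stand-in for the fivefold transformer ensemble actually used, and retraining at every subset size is written naively for clarity.

```python
# Sketch of the IFS loop: add features in ranked order, score each subset.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.svm import SVC

def svc_auc(X_tr, y_tr, X_te, y_te):
    """Stand-in scorer; the paper evaluates a fivefold transformer ensemble."""
    clf = SVC(probability=True).fit(X_tr, y_tr)
    return roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

def incremental_feature_selection(X_tr, y_tr, X_te, y_te, ranked_idx, scorer=svc_auc):
    aucs = []
    for n in range(1, len(ranked_idx) + 1):
        cols = ranked_idx[:n]                           # top-n ranked features
        aucs.append(scorer(X_tr[:, cols], y_tr, X_te[:, cols], y_te))
    best_n = int(np.argmax(aucs)) + 1                   # optimal subset size
    return best_n, aucs
```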

Fig. 6. IFS evaluation via SHAP, ANOVA, and ReliefF on the CKSAAP4 and CKSAAP5 feature sets. a, b AUC trajectories built on the CKSAAP4 and CKSAAP5 feature sets, with features incrementally added according to SHAP, ANOVA, and ReliefF rankings. c–f ROC and PR curves of the best-performing models obtained from each feature ranking method. g–l UMAP visualization of the feature spaces using optimal feature subsets ranked by SHAP, ANOVA, and ReliefF for CKSAAP4 and CKSAAP5

As illustrated in Fig. 6a, b, the predictive performance for both the CKSAAP4 and CKSAAP5 feature sets increased rapidly as the most critical features were incorporated. Notably, the SHAP-based ranking for CKSAAP4 achieved the highest AUC (0.98, 95% CI 0.954–0.997) with only 103 features, far fewer than the 861 features required by ANOVA (AUC of 0.977, 95% CI 0.954–0.995) and the 1548 required by ReliefF (AUC of 0.97, 95% CI 0.943–0.990). These findings highlight the exceptional ability of SHAP to identify a compact yet highly informative feature subset, significantly improving computational efficiency and interpretability.

A similar trend was observed for CKSAAP5, where SHAP also demonstrated superior efficiency, achieving an optimal AUC of 0.979 (95% CI 0.958–0.995) with only 474 features. In contrast, ANOVA and ReliefF required considerably larger feature subsets (853 and 1478 features, respectively) to reach peak performance (ANOVA: AUC of 0.977, 95% CI 0.954–0.994; ReliefF: AUC of 0.969, 95% CI 0.940–0.989). This comparison further emphasizes the strength of SHAP in effectively capturing highly discriminative, nonredundant feature subsets.

ROC and PR analyses further reinforced these trends (Fig. 6c–f). The models trained with SHAP-selected features consistently demonstrated superior predictive performance, with the CKSAAP4 model achieving an AUC of 0.980 (95% CI 0.954–0.997) and an AP of 0.995 (95% CI 0.988–0.999) and the CKSAAP5 model achieving similarly high performance (AUC = 0.979, 95% CI 0.958–0.995; AP = 0.995, 95% CI 0.988–0.999). Models using ANOVA- and ReliefF-selected features also performed strongly but presented slightly lower AUC and AP values, highlighting SHAP’s advantage in identifying a concise, highly discriminative feature space.

To further investigate how the transformer models internally represented class structures under different feature-ranking methods, we applied uniform manifold approximation and projection (UMAP) to visualize sample distributions in the learned feature spaces (Fig. 6g–l) [34]. All UMAP projections demonstrated distinguishable clustering patterns separating ACPs from non-ACPs. The SHAP-derived transformer models for CKSAAP4 and CKSAAP5 (Fig. 6g, h) showed moderate but consistent separation, characterized by smooth boundaries and compact intraclass organization. This pattern may indicate stronger generalization ability, as the learned embeddings avoid overfitting to overly rigid class distinctions. In contrast, the models trained on ANOVA- and ReliefF-selected features (Fig. 6i–l) presented sharper class boundaries and more polarized clusters, suggesting that the selected features may encode more rigid decision rules but with potentially reduced flexibility across varying data distributions. These visualization results highlight how SHAP-guided feature selection promotes balanced and generalizable representations, enabling robust class discrimination while preserving interpretability.
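The projection step can be sketched as follows with umap-learn; n_neighbors and min_dist are illustrative defaults rather than the settings behind Fig. 6g–l, and X_opt stands for whatever representation is being visualized (optimal feature columns or learned transformer embeddings).

```python
# Sketch of the UMAP visualization; y is a 0/1 NumPy label array.
import matplotlib.pyplot as plt
import umap

def plot_umap(X_opt, y, title="UMAP of SHAP-selected features"):
    emb = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X_opt)
    plt.scatter(emb[y == 1, 0], emb[y == 1, 1], s=8, label="ACP")
    plt.scatter(emb[y == 0, 0], emb[y == 0, 1], s=8, label="non-ACP")
    plt.title(title)
    plt.legend()
    plt.show()
```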

Our method outperforms the state-of-the-art methods

To comprehensively validate our transformer-based model, we adopted a classifier trained with the optimal subset of 103 CKSAAP4 features identified via SHAP-based IFS, named “Aegis.” We subsequently compared Aegis with several state-of-the-art (SOTA) ACP predictors, including ACPScanner [31], ACPred-LAF [24], ACPred [23], AntiCP 2.0 [21], iACP [22], MLACP [26], and MLACP 2.0 [25]. For a fair and robust evaluation, we used an independent testing dataset previously described and extensively validated in earlier studies.

Table 3 summarizes the results of this comparative analysis. Aegis demonstrated excellent predictive performance, achieving the highest SN (0.991), ACC (0.942), and MCC (0.824) among all methods evaluated. Although other models, such as ACPScanner, ACPred-LAF, and AntiCP 2.0, produced higher SP (1.0), Aegis offered a better balance between sensitivity and specificity, as reflected by its higher MCC and AUC. Specifically, the MCC of Aegis (0.824) was the highest observed, surpassing all other predictors, and its AUC (0.98) was comparable to that of the best-performing methods, such as AntiCP 2.0 and ACPScanner.

Table 3.

Independent test comparison of Aegis with current typical ACP predictors

Model       SN     SP     ACC    MCC    AUC
Aegis       0.991  0.767  0.942  0.824  0.980
ACPScanner  0.898  1.000  0.920  0.811  0.983
ACPred-LAF  0.426  1.000  0.551  0.373  0.571
ACPred      0.806  0.867  0.819  0.585  0.908
AntiCP 2.0  0.824  1.000  0.862  0.710  0.980
iACP        0.611  0.800  0.652  0.339  0.777
MLACP       0.657  0.933  0.717  0.488  0.874
MLACP 2.0   0.870  0.967  0.891  0.745  0.971

Discussion

In this study, we developed Aegis, a transformer-based deep learning framework designed to accurately identify ACPs. By systematically exploring various sequence-derived feature encoding methods, including CKSAAP, CTDC, CTDT, CTDD, and PAAC, combined with rigorous computational modeling, we demonstrated the significant advantages of transformer models in biological sequence classification. Moreover, the combination of SHAP-guided feature prioritization and incremental feature selection provides an interpretable and efficient feature optimization strategy, offering a methodological distinction relative to existing transformer-based or hybrid ACP predictors.

Compared with traditional machine learning models such as XGBoost, random forest, and SVM, the transformer architecture consistently demonstrated superior predictive performance, reflecting its ability to effectively capture complex sequential patterns and feature interactions within peptide sequences. Notably, through IFS, we identified an optimal subset of only 103 CKSAAP4 features using SHAP values, achieving exceptional predictive accuracy (ACC = 0.942). This result highlights the transformer model’s robustness and efficiency, indicating that a carefully selected compact feature subset can significantly enhance computational efficiency without compromising predictive accuracy. Conceptually, Aegis also differs from recently published transformer- and attention-based peptide predictors by integrating SHAP-guided feature prioritization with an IFS-optimized transformer architecture, enabling more interpretable, feature-efficient, and computationally lightweight prediction [35]. Importantly, the use of SHAP-guided feature prioritization within the IFS workflow did not introduce performance trade-offs; instead, it enabled us to identify a compact and highly informative feature subset that reduced redundancy and computational complexity while retaining the full predictive strength of the transformer model.

Our compositional analysis provided additional biological validation, revealing that ACPs preferentially incorporate positively charged (e.g., lysine and arginine) and hydrophobic residues (e.g., leucine), a pattern consistent with previous studies indicating the structural and functional characteristics required for selective anticancer activity.

When comparing Aegis with existing SOTA ACP predictors, such as ACPScanner, ACPred-LAF, and AntiCP 2.0, our model displayed superior overall performance, particularly in terms of sensitivity and balanced predictive capability. Although certain models exhibited slightly higher specificity, Aegis achieved the highest MCC, indicating its balanced performance and higher reliability for practical applications. In addition, although the ACPScanner dataset provides a reliable benchmark for comparative evaluation, it still carries inherent limitations in peptide diversity and class balance, and future validation on larger or more diverse independent datasets would further enhance the generalizability of Aegis.

However, our study has several limitations. First, despite careful selection and redundancy filtering of peptide sequences, the size and diversity of the dataset remain limited by currently available experimental data. Future research should aim to integrate larger-scale datasets from multiple cancer types to further validate and improve model generalizability. Second, although the transformer architecture demonstrated exceptional performance, the interpretability of deep learning models remains a challenge. Employing SHAP values mitigates this issue by providing insights into feature importance; however, further research into interpretable transformer models would be valuable. In this context, recent advances in explainable AI for clinical decision-support systems have highlighted the importance of transparent and trustworthy predictive models, aligning with our use of SHAP for model interpretability [36]. Moreover, broader developments in AI-driven smart healthcare illustrate how computational frameworks such as Aegis may eventually be integrated into next-generation therapeutic discovery pipelines [37].

Conclusions

In conclusion, our study emphasizes the significant potential of deep learning-based computational frameworks, especially transformer models, in the accurate identification and characterization of ACPs. Aegis represents an important step forward, offering both methodological innovation and practical utility for future biomedical research and clinical therapeutic peptide discovery.

Methods

Dataset collection and preprocessing

The dataset used in this study was derived from the work of Zhong and Deng [31], which was used to build ACPScanner. Experimentally validated ACPs, which specifically target nine common human cancers, were retrieved from the Database of Antimicrobial Activity and Structure of Peptides [38]. The included cancer types were human acute monocytic leukemia, human breast adenocarcinoma, human cervical carcinoma, human colon adenocarcinoma, human hepatocellular carcinoma, human histiocytic lymphoma, human lung adenocarcinoma, human myelogenous leukemia, and human prostate adenocarcinoma. Additionally, non-ACP sequences serving as negative samples were obtained from Agrawal et al. [21].

To ensure dataset diversity and minimize redundancy, sequences with identities greater than 80% were removed via CD-HIT software [39]. After this filtering step, the final curated dataset consisted of 701 peptide sequences, which were partitioned into two subsets: 563 samples utilized for training and cross-validation and the remaining 138 samples reserved as an independent testing dataset. The training dataset incorporated ACP subsets specific to the aforementioned cancer categories, additional ACP sequences not specific to any particular cancer type, and non-ACP sequences. This rigorously curated dataset forms the foundation for developing and validating robust predictive models.
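A hedged sketch of this filtering step via the CD-HIT command line follows (invoked from Python for consistency with the other sketches); the FASTA file names are placeholders, -c 0.8 sets the 80% identity threshold, and -n 5 is the word size commonly recommended for thresholds in this range.

```python
# Sketch: redundancy filtering at 80% sequence identity with CD-HIT.
import subprocess

subprocess.run(
    ["cd-hit", "-i", "all_peptides.fasta", "-o", "nonredundant.fasta",
     "-c", "0.8", "-n", "5"],
    check=True,  # raise if CD-HIT exits with an error
)
```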

iLearnPlus platform

In this study, feature extraction and computational modeling were conducted via iLearnPlus [33], an integrated machine learning platform designed for biological sequence analysis. iLearnPlus provides both graphical and web-based interfaces, enabling streamlined construction of automated machine learning pipelines for nucleic acid and protein sequence analyses without extensive programming requirements. The platform comprises four primary modules: iLearnPlus-Basic, iLearnPlus-Estimator, iLearnPlus-AutoML, and iLearnPlus-LoadModel, which support extensive customizable feature engineering, performance assessment, statistical analysis, and visualization. Notably, iLearnPlus incorporates 21 machine learning algorithms (12 conventional classifiers, two ensemble-learning frameworks, and seven deep-learning methods) along with 19 sequence encoding schemes (147 descriptors in total), significantly surpassing current bioinformatics tools available for sequence analysis.

Feature extraction methods

Five sequence-derived feature encoding methods provided in iLearnPlus, namely, CKSAAP, CTDC, CTDT, CTDD, and PAAC, were utilized in this study. Each method is described below:

  • CKSAAP

    CKSAAP captures the frequency of amino acid pairs separated by k intervening residues in peptide sequences. For each k, CKSAAP generates a feature vector comprising 400 descriptors, representing all possible amino acid pairs from the 20 standard amino acids. Each descriptor value reflects the normalized occurrence of a specific amino acid pair at the defined interval k throughout the peptide sequence [40].

  • CTD features

    The composition, transition, and distribution (CTD) descriptors capture the global and local distribution patterns of amino acid residues on the basis of specific structural or physicochemical properties [41]. These features have been widely applied in various bioinformatics tasks, such as protein folding class prediction [42], enzyme family classification [43], RNA-binding protein prediction [44], protein structure prediction [45], and anticancer peptide identification [46].

    CTD features are computed through the following process: (i) The amino acid sequence is transformed into a property-based sequence, where each residue is mapped to a defined structural or physicochemical attribute. (ii) The 20 standard amino acids are grouped into three categories for each property on the basis of the clustering of amino acid indices proposed by Tomii and Kanehisa [45]. (iii) Thirteen types of physicochemical properties are used to derive CTD features, including hydrophobicity, normalized van der Waals volume, polarity, polarizability, charge, secondary structure, and solvent accessibility (see Table 4).

    Each CTD descriptor quantifies the composition (percentage of residues in each group), transition (frequency of transitions between different groups), and distribution (positional distribution of the first, 25%, 50%, 75%, and last occurrence of each group) for each property, providing a comprehensive representation of sequence characteristics. A minimal computational sketch of these descriptors follows this list.

  • CTDC

    The CTDC encodes the global compositional properties of amino acids on the basis of predefined physicochemical groupings (e.g., hydrophobicity, polarity, charge). For each physicochemical property, the 20 standard amino acids are categorized into three groups, and the CTDC computes the composition (percentage occurrence) of residues from each group in the sequence, resulting in concise descriptors capturing overall sequence characteristics.

  • CTDT

    The CTDT quantifies the frequency of transitions between physicochemically defined amino acid groups. A transition is defined as the occurrence of two adjacent residues belonging to different physicochemical groups. CTDT descriptors measure the normalized frequency of transitions such as polar-to-neutral, neutral-to-hydrophobic, and hydrophobic-to-polar, effectively capturing sequence variability with respect to physicochemical properties [47].

  • CTDD

    CTDD provides positional distribution information for each physicochemical group within peptide sequences. Specifically, for each group (e.g., polar, neutral, hydrophobic), five descriptor values are calculated to represent the positions within the sequence where the first residue of the group occurs, as well as the positions marking 25%, 50%, 75%, and 100% of the occurrences. These descriptors effectively summarize the spatial distributions of amino acid groups [47].

  • PAAC

    PAAC integrates both amino acid composition and sequence-order information. It generates descriptors by calculating correlation factors on the basis of physicochemical properties (e.g., hydrophobicity, hydrophilicity, and side-chain mass) between residues separated by specific distances. The resulting feature vector includes standard amino acid composition values and additional sequence‒order correlation factors, effectively encoding both local and global sequence characteristics [48].
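To make the C, T, and D computations concrete, the sketch referenced above implements them for a single property, the hydrophobicity_PRAM900101 grouping from Table 4; the percentile indexing in the distribution descriptor is approximated here, and library implementations differ in rounding details.

```python
# Sketch of CTDC/CTDT/CTDD for one physicochemical property (Table 4).
GROUPS = {"1": set("RKEDQN"),     # polar
          "2": set("GASTPHY"),    # neutral
          "3": set("CLVIMFW")}    # hydrophobic

def ctd(seq):
    # map each residue to its group label
    enc = "".join(g for aa in seq for g, members in GROUPS.items() if aa in members)
    n = len(enc)
    comp = {g: enc.count(g) / n for g in GROUPS}                        # CTDC
    trans = {p: sum(enc[i:i + 2] in (p, p[::-1]) for i in range(n - 1)) / (n - 1)
             for p in ("12", "13", "23")}                               # CTDT
    dist = {}                                                           # CTDD
    for g in GROUPS:
        pos = [i + 1 for i, c in enumerate(enc) if c == g]
        if not pos:
            dist[g] = [0.0] * 5
            continue
        marks = [pos[0], pos[max(0, len(pos) // 4 - 1)],
                 pos[max(0, len(pos) // 2 - 1)],
                 pos[max(0, 3 * len(pos) // 4 - 1)], pos[-1]]
        dist[g] = [100.0 * m / n for m in marks]   # positions as % of length
    return comp, trans, dist

print(ctd("KWKLFKKIEKVGQNIRDG"))   # toy peptide
```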

Table 4.

Amino acid physicochemical attributes and their classification into three groups

Attribute                        Group 1                Group 2                    Group 3
Hydrophobicity_PRAM900101        Polar: RKEDQN          Neutral: GASTPHY           Hydrophobic: CLVIMFW
Hydrophobicity_ARGP820101        Polar: QSTNGDE         Neutral: RAHCKMV           Hydrophobic: LYPFIW
Hydrophobicity_ZIMJ680101        Polar: QNGSWTDERA      Neutral: HMCKV             Hydrophobic: LPFYI
Hydrophobicity_PONP930101        Polar: KPDESNQT        Neutral: GRHA              Hydrophobic: YMFWLCVI
Hydrophobicity_CASG920101        Polar: KDEQPSRNTG      Neutral: AHYMLV            Hydrophobic: FIWC
Hydrophobicity_ENGD860101        Polar: RDKENQHYP       Neutral: SGTAV             Hydrophobic: CVLIMF
Hydrophobicity_FASG890101        Polar: KERSQD          Neutral: NTPG              Hydrophobic: AYHWVMFLIC
Normalized van der Waals volume  0–2.78: GASTPD         2.95–4.0: NVEQIL           4.03–8.08: MHKFRYW
Polarity                         4.9–6.2: LIFWCMVY      8.0–9.2: PATGS             10.4–13.0: HQRKNED
Polarizability                   0–0.108: GASDT         0.128–0.186: CPNVEQIL      0.219–0.409: KMHFRYW
Charge                           Positive: KR           Neutral: ANCQGHILMFPSTWYV  Negative: DE
Secondary structure              Helix: EALMQKRH        Strand: VIYCWFT            Coil: GNPSD
Solvent accessibility            Buried: ALFCGIVW       Exposed: PKQEND            Intermediate: MPSTHY

Feature selection methods

Feature selection not only identifies relevant modeling variables but also improves the comprehensibility, scalability, and accuracy of the resulting model [49]. In this study, we used three feature ranking methods together with IFS:

  • ANOVA

    ANOVA is a statistical technique that assesses whether the means of different groups are significantly different from one another [50]. It operates by partitioning the total variability in the data into two components: between-group variance and within-group variance. The between-group variance captures the differences between the means of the groups, while the within-group variance reflects the individual variability within each group. The purpose of ANOVA is to compare these two types of variance. If the between-group variance is significantly larger than the within-group variance, it suggests that there is a meaningful difference between the groups in terms of the feature being analyzed.

  • ReliefF

    ReliefF is a supervised feature-ranking algorithm widely used in bioinformatics for identifying discriminative features in high-dimensional datasets while maintaining robustness to noisy and correlated variables [51]. The method evaluates the importance of each feature based on its ability to distinguish an instance from its nearest neighbors of the same class (nearest hits) and those of different classes (nearest misses).

    Given a dataset $D$ consisting of $n$ instances described by $M$ features, ReliefF computes a score $W_i$ for each feature $X_i$. The algorithm randomly selects an instance $R$ from the dataset and identifies its $k$ nearest neighbors within the same class (nearest hits, $H_j$) and its $k$ nearest neighbors from each different class (nearest misses, $M_j(C)$, where $C$ denotes a class other than that of instance $R$). The score $W_i$ is updated iteratively on the basis of the differences observed between instance $R$ and its nearest neighbors:

    $$W_i = W_i - \frac{1}{m \cdot k}\sum_{j=1}^{k} \operatorname{diff}(X_i, R, H_j) + \frac{1}{m \cdot k}\sum_{C \neq \operatorname{class}(R)} \frac{P(C)}{1 - P(\operatorname{class}(R))} \sum_{j=1}^{k} \operatorname{diff}(X_i, R, M_j(C)) \tag{1}$$

    where $W_i$ is the weight or importance score of feature $X_i$, $m$ is the number of randomly selected instances used to estimate feature scores, $k$ is the number of nearest neighbors considered for both hits and misses, $P(C)$ is the prior probability of class $C$, $\operatorname{class}(R)$ is the class of instance $R$, and $\operatorname{diff}(X_i, I_1, I_2)$ measures the difference in the value of feature $X_i$ between two instances $I_1$ and $I_2$. For numeric features, it is typically defined as

    $$\operatorname{diff}(X_i, I_1, I_2) = \frac{\left|X_i(I_1) - X_i(I_2)\right|}{\max(X_i) - \min(X_i)} \tag{2}$$

    which normalizes differences by the range of the feature (a NumPy sketch of this update is provided after this list).

  • SHAP

    SHAP is a method grounded in cooperative game theory, specifically based on Shapley values, which are used to fairly distribute the “payout” among players who collaborate to achieve a result [52].

    In machine learning, SHAP values adapt the Shapley concept to explain feature contributions to model predictions. A SHAP value can be calculated for each feature $X_i$ in relation to the output of the machine learning model $f$ for a specific instance $x$, and the total model prediction decomposes as a sum of the SHAP values:

    $$f(x) = \phi_0 + \sum_{i=1}^{M} \phi_i \tag{3}$$

    where $f(x)$ is the model output for instance $x$, $\phi_0$ is the model’s average output (the base value or expected prediction), $\phi_i$ is the SHAP value representing the contribution of feature $i$ to the prediction, and $M$ is the total number of features.

    Each SHAP value $\phi_i$ measures how much feature $X_i$ shifts the prediction away from the base value (the model’s average prediction in the absence of any specific feature values).

  • IFS

    IFS is a feature selection method designed to efficiently handle datasets characterized by large numbers of features [53]. The primary objective of IFS is to sequentially construct an optimal subset of features by incrementally adding features that contribute significantly to the predictive performance while concurrently minimizing redundancy among selected features. This approach effectively reduces dimensionality, enhances computational efficiency, and maintains or improves predictive accuracy.

    The IFS process begins by ranking all candidate features on the basis of their individual importance or relevance to the target classification task. Features are then added sequentially from the highest-ranked feature downward, with each addition forming a new feature subset. At each step, the performance of the newly constructed feature set is evaluated via predefined metrics such as accuracy. This incremental addition continues iteratively until the inclusion of further features no longer yields a significant improvement in the model’s predictive performance. Formally, if we denote the ranked feature set as

    $$S_n = \{f_1, f_2, \ldots, f_n\} \tag{4}$$

    then the incremental feature subsets can be represented as

    $$S_\tau = \{f_1, f_2, \ldots, f_\tau\} \tag{5}$$

    where $\tau$ is the index of the subset. At each incremental step, a model trained on the current feature subset $S_\tau$ is evaluated to determine whether adding the next-ranked feature significantly enhances performance. The optimal feature set is

    $$S_\Theta = \{f_1, f_2, \ldots, f_\Theta\} \tag{6}$$

    where $\Theta$ is the subset size at which the evaluation metric reaches its peak. This optimal subset is subsequently used for further model training and validation.
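Closing out these methods, the sketch referenced above gives a minimal NumPy implementation of the ReliefF update in Eqs. (1)–(2); for clarity it scans every instance (m = n) and assumes each class has at least k members, omitting edge cases handled by mature implementations such as skrebate.

```python
# Minimal ReliefF scores following Eqs. (1)-(2); X is (n, M), y is labels.
import numpy as np

def relieff(X, y, k=10):
    rng = np.ptp(X, axis=0)
    X = (X - X.min(axis=0)) / np.where(rng == 0, 1, rng)  # so |x1-x2| matches Eq. (2)
    n, M = X.shape
    W = np.zeros(M)
    prior = {c: np.mean(y == c) for c in np.unique(y)}
    for r in range(n):                                    # use every instance (m = n)
        d = np.abs(X - X[r]).sum(axis=1)                  # Manhattan distance to R
        d[r] = np.inf                                     # exclude R itself
        hits = np.argsort(np.where(y == y[r], d, np.inf))[:k]
        W -= np.abs(X[hits] - X[r]).mean(axis=0) / n      # nearest-hit penalty
        for c in prior:
            if c == y[r]:
                continue
            misses = np.argsort(np.where(y == c, d, np.inf))[:k]
            weight = prior[c] / (1 - prior[y[r]])         # class-prior weighting
            W += weight * np.abs(X[misses] - X[r]).mean(axis=0) / n
    return W
```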

Evaluation metrics

In this study, ACC, SN, SP, and MCC were employed to evaluate the performance of the models. These metrics are calculated as follows [24, 25, 31, 54, 55]:

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN} \tag{7}$$

$$\mathrm{SN} = \frac{TP}{TP + FN} \tag{8}$$

$$\mathrm{SP} = \frac{TN}{TN + FP} \tag{9}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{10}$$

where TP, TN, FP, and FN represent the numbers of true positives, true negatives, false positives, and false negatives, respectively.

Additionally, metrics such as the ACC can be biased toward categories with more samples; therefore, we also employed the AUC and AP to assess the models comprehensively. The AUC measures the probability that a randomly chosen positive sample is ranked above a randomly chosen negative sample, whereas the AP summarizes the trade-off between precision and recall across various thresholds. Both the AUC and AP range from 0 to 1, with higher values indicating better predictive performance [56, 57].
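These metrics can be computed as in the sketch below, using scikit-learn equivalents as a cross-check; the 0.5 decision threshold is an illustrative default.

```python
# Sketch of Eqs. (7)-(10) plus AUC/AP; y_true and y_prob are NumPy arrays.
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             matthews_corrcoef, roc_auc_score)

def evaluate(y_true, y_prob, threshold=0.5):
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),    # Eq. (7)
        "SN": tp / (tp + fn),                      # Eq. (8)
        "SP": tn / (tn + fp),                      # Eq. (9)
        "MCC": matthews_corrcoef(y_true, y_pred),  # Eq. (10)
        "AUC": roc_auc_score(y_true, y_prob),
        "AP": average_precision_score(y_true, y_prob),
    }
```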

Acknowledgements

The authors thank all members of the School of Life Science and Technology at the University of Electronic Science and Technology of China for their helpful discussions and support during this study.

Abbreviations

AAC: Amino acid composition
ABCNN: Attention-based convolutional neural network
ACC: Accuracy
ACP: Anticancer peptide
AE: Autoencoder
ANOVA: Analysis of variance
AP: Average precision
AUC: Area under the receiver operating characteristic curve
BRNN: Bidirectional recurrent neural network
CI: Confidence interval
CKSAAP: Composition of k-spaced amino acid pairs
CNN: Convolutional neural network
CTD: Composition, transition, and distribution
CTDC: CTD composition
CTDD: CTD distribution
CTDT: CTD transition
DL: Deep learning
DPC: Dipeptide composition
IFS: Incremental feature selection
MCC: Matthews correlation coefficient
non-ACP: Non-anticancer peptide
PAAC: Pseudo amino acid composition
PR: Precision-recall
RF: Random forest
RNN: Recurrent neural network
ROC: Receiver operating characteristic
SHAP: SHapley additive explanations
SN: Sensitivity
SP: Specificity
SOTA: State-of-the-art
SVM: Support vector machine
UMAP: Uniform manifold approximation and projection

Authors’ contributions

Z.Z.: Data curation, writing—original draft. Z.Z. and L.X.: Methodology, software. X.Li, Y.W., X.Luo, F.H., and S.X.: Visualization and investigation. H.L. and F.D.: Resources and validation. C.H. and H.D.: Supervision, project administration, writing—review and editing. H.Y.: Conceptualization, funding acquisition, supervision, writing—review and editing. All authors read and approved the final manuscript.

Funding

This work was supported by grants from the National Natural Science Foundation of China (No. 62402207 to H.Y.) and the Municipal Government of Quzhou (No. 2024D008 to H.Y.). The study sponsors did not participate in the study design; the collection, analysis, and interpretation of the data; the writing of the manuscript; or the decision to submit the study for publication.

Data availability

All data generated or analysed during this study are included in this published article, its supplementary information files and publicly available repositories. The data and code supporting the findings of this study are publicly available in the Zenodo repository at https://doi.org/10.5281/zenodo.18207177. The underlying datasets were obtained from publicly available resources [31].

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Zexu Zhou and Lei Xie contributed equally to this work.

Contributor Information

Chengbing Huang, Email: 20049607@abtu.edu.cn.

Hui Ding, Email: hding@uestc.edu.cn.

Huan Yang, Email: yangh286@alumni.sysu.edu.cn.

References

1. Siegel RL, Miller KD, Fuchs HE, Jemal A. Cancer statistics, 2022. CA Cancer J Clin. 2022;72(1):7–33.
2. Singhai H, Raikwar S, Rathee S, Jain SK. Emerging combinatorial drug delivery strategies for breast cancer: a comprehensive review. Curr Drug Targets. 2025;26(5):331–49.
3. He H, Yang J, Peng W, Li M, Shuai M, Tan F, et al. MT1JP: a pivotal tumor-suppressing LncRNA and its role in cancer progression and therapeutic potential. Curr Drug Targets. 2025;26(6):394–409.
4. Li T, Ren X, Luo X, Wang Z, Li Z, Luo X, et al. A foundation model identifies broad-spectrum antimicrobial peptides against drug-resistant bacterial infection. Nat Commun. 2024;15(1):7538.
5. Wang R, Jiang Y, Jin J, Yin C, Yu H, Wang F, et al. DeepBIO: an automated and interpretable deep-learning platform for high-throughput biological sequence prediction, functional annotation and visualization analysis. Nucleic Acids Res. 2023;51(7):3017–29.
6. Li P, Zhang K, Liu T, Lu R, Chen Y, Yao X, et al. A deep learning approach for rational ligand generation with toxicity control via reactive building blocks. Nat Comput Sci. 2024;4(11):851–64.
7. Porwal S, Malviya R, Sundram S, Sridhar SB, Shareef J. Diabetic wound healing: factors, mechanisms, and treatment strategies using herbal components. Curr Drug Targets. 2025;26(6):367–81.
8. Deb D, Dhanawat M, Bhushan B, Pachuau L, Das N. Targeting neurodegeneration: the emerging role of hybrid drugs. Curr Drug Targets. 2025;26(6):410–34.
9. Huang Z, Xiao Z, Ao C, Guan L, Yu L. Computational approaches for predicting drug-disease associations: a comprehensive review. Front Comput Sci. 2025;19(5):1–15.
10. Zhang X, Zhou F, Zhang P, Zou Q, Zhang Y. TCM@MPXV: a resource for treating monkeypox patients in traditional Chinese medicine. Curr Bioinform. 2025;20(6):557–63.
11. Mc Neil V, Lee SW. Advancing cancer treatment: a review of immune checkpoint inhibitors and combination strategies. Cancers (Basel). 2025;17(9):1408.
12. Tyagi A, Tuknait A, Anand P, Gupta S, Sharma M, Mathur D, et al. CancerPPD: a database of anticancer peptides and proteins. Nucleic Acids Res. 2015;43(D1):D837–43.
13. Ahmed S, Muhammod R, Khan ZH, Adilina S, Sharma A, Shatabda S, et al. ACP-MHCNN: an accurate multi-headed deep-convolutional neural network to predict anticancer peptides. Sci Rep. 2021;11(1):23676.
14. Balakrishnan A, Mishra SK, Sharma K, Gaglani C, Georrge JJ. Intersecting peptidomics and bioactive peptides in drug therapeutics. Curr Bioinform. 2025;20(2):103–19.
15. Chen X, Qu Q, Zhang X, Nie H, Chao X, Ou W, et al. Prediction of miRNA–disease associations by deep matrix decomposition method based on fused similarity information. Curr Bioinform. 2025;20(6):545–56.
16. Kaniwa F. GenRepAI: utilizing artificial intelligence to identify repeats in genomic suffix trees. Curr Bioinform. 2025;20(6):522–34.
17. Zhang D, Qi R, Lan X, Liu B. A novel multi-slice framework for precision 3D spatial domain reconstruction and disease pathology analysis. Genome Res. 2025;35:1794–808.
18. Jiang Y, Wang R, Feng J, Jin J, Liang S, Li Z, et al. Explainable deep hypergraph learning modeling the peptide secondary structure prediction. Adv Sci. 2023;10(11):2206151.
19. Wang Z, Lei X, Zhang Y, Wu F-X, Pan Y. Recent progress of deep learning methods for RBP binding sites prediction on circRNA. Curr Bioinform. 2025;20(6):487–505.
20. Yan K, Chen S, Liu B, Wu H. Accurate prediction of toxicity peptide and its function using multi-view tensor learning and latent semantic learning framework. Bioinformatics. 2025;41(9):btaf489.
21. Agrawal P, Bhagat D, Mahalwal M, Sharma N, Raghava GP. AntiCP 2.0: an updated model for predicting anticancer peptides. Brief Bioinform. 2021;22(3):bbaa153.
22. Chen W, Ding H, Feng P, Lin H, Chou K-C. iACP: a sequence-based tool for identifying anticancer peptides. Oncotarget. 2016;7(13):16895.
23. Schaduangrat N, Nantasenamat C, Prachayasittikul V, Shoombuatong W. ACPred: a computational tool for the prediction and analysis of anticancer peptides. Molecules. 2019;24(10):1973.
24. He W, Wang Y, Cui L, Su R, Wei L. Learning embedding features based on multisense-scaled attention architecture to improve the predictive performance of anticancer peptides. Bioinformatics. 2021;37(24):4684–93.
25. Park HW, Pitti T, Madhavan T, Jeon Y-J, Manavalan B. MLACP 2.0: an updated machine learning tool for anticancer peptide prediction. Comput Struct Biotechnol J. 2022;20:4473–80.
26. Manavalan B, Basith S, Shin TH, Choi S, Kim MO, Lee G. MLACP: machine-learning-based prediction of anticancer peptides. Oncotarget. 2017;8(44):77121.
27. Abbas Z, Kim S, Lee N, Kazmi SAW, Lee SW. A robust ensemble framework for anticancer peptide classification using multi-model voting approach. Comput Biol Med. 2025;188:109750.
28. Kabir M, Arif M, Ahmad S, Ali Z, Swati ZNK, Yu D-J. Intelligent computational method for discrimination of anticancer peptides by incorporating sequential and evolutionary profiles information. Chemometr Intell Lab Syst. 2018;182:158–65.
29. Arif M, Musleh S, Fida H, Alam T. PLMACPred prediction of anticancer peptides based on protein language model and wavelet denoising transformation. Sci Rep. 2024;14(1):16992.
30. Arif M, Ahmed S, Ge F, Kabir M, Khan YD, Yu D-J, et al. StackACPred: prediction of anticancer peptides by integrating optimized multiple feature descriptors with stacked ensemble approach. Chemometr Intell Lab Syst. 2022;220:104458.
31. Zhong G, Deng L. ACPScanner: prediction of anticancer peptides by integrated machine learning methodologies. J Chem Inf Model. 2024;64(3):1092–104.
32. Hoskin DW, Ramamoorthy A. Studies on anticancer activities of antimicrobial peptides. Biochim Biophys Acta. 2008;1778(2):357–75.
33. Chen Z, Zhao P, Li C, Li F, Xiang D, Chen YZ, et al. iLearnPlus: a comprehensive and automated machine-learning platform for nucleic acid and protein sequence analysis, prediction and visualization. Nucleic Acids Res. 2021;49(10):e60.
34. McInnes L, Healy J, Melville J. UMAP: uniform manifold approximation and projection for dimension reduction. arXiv. 2018;arXiv:1802.03426.
  • 35.Kilimci ZH, Yalcin M. ACP-ESM: a novel framework for classification of anticancer peptides using protein-oriented transformer approach. Artif Intell Med. 2024;156:102951. [DOI] [PubMed] [Google Scholar]
  • 36.Abbas Q, Jeong W, Lee SW. Explainable AI in clinical decision support systems: a meta-analysis of methods, applications, and usability challenges. Healthcare. 2025;13(17):2154. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Abbas SR, Seol H, Abbas Z, Lee SW. Exploring the role of artificial intelligence in smart healthcare: a capability and function-oriented review. Healthcare (Basel). 2025;13(14):1642. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Pirtskhalava M, Amstrong AA, Grigolava M, Chubinidze M, Alimbarashvili E, Vishnepolsky B, et al. DBAASP v3: database of antimicrobial/cytotoxic activity and structure of peptides as a resource for development of new therapeutics. Nucleic Acids Res. 2021;49(D1):D288–97. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Zou X, Ren L, Cai P, Zhang Y, Ding H, Deng K, et al. Accurately identifying hemagglutinin using sequence information and machine learning methods. Front Med (Lausanne). 2023;10:1281880. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Zhu W, Yuan S-S, Li J, Huang C-B, Lin H, Liao B. A first computational frame for recognizing heparin-binding protein. Diagnostics. 2023;13(14):2465. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Dubchak I, Muchnik I, Mayor C, Dralyuk I, Kim SH. Recognition of a protein fold in the context of the SCOP classification. Proteins. 1999;35(4):401–7. [PubMed] [Google Scholar]
  • 43.Cai C, Han L, Ji Z, Chen Y. Enzyme family classification by support vector machines. Proteins. 2004;55(1):66–76. [DOI] [PubMed] [Google Scholar]
  • 44.Han LY, Cai CZ, Lo SL, Chung MC, Chen YZ. Prediction of RNA-binding proteins from primary sequence by a support vector machine approach. RNA. 2004;10(3):355–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Tomii K, Kanehisa M. Analysis of amino acid indices and mutation matrices for sequence comparison and structure prediction of proteins. Protein Eng Des Sel. 1996;9(1):27–36. [DOI] [PubMed] [Google Scholar]
  • 46.Wei L, Zhou C, Chen H, Song J, Su R. ACPred-FL: a sequence-based predictor using effective feature representation to improve the prediction of anti-cancer peptides. Bioinformatics. 2018;34(23):4007–16. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47.Dubchak I, Muchnik I, Holbrook SR, Kim S-H. Prediction of protein folding class using global description of amino acid sequence. Proc Natl Acad Sci U S A. 1995;92(19):8700–4. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 48.Chou KC. Prediction of protein cellular attributes using pseudo‐amino acid composition. Proteins. 2001;43(3):246–55. [DOI] [PubMed] [Google Scholar]
  • 49.Wang Y, Zhai Y, Ding Y, Zou Q. SBSM-Pro: support bio-sequence machine for proteins. Sci China Inf Sci. 2024;67(11):212106. [Google Scholar]
  • 50.Lin H, Deng E-Z, Ding H, Chen W, Chou K-C. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition. Nucleic Acids Res. 2014;42(21):12961–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51.Kira K, Rendell LA. A practical approach to feature selection. In: Derek Sleeman PE, editor. Machine learning proceedings 1992. Morgan Kaufmann: Elsevier; 1992. p. 249–56. [Google Scholar]
  • 52.Lundberg SM, Lee S-I. A unified approach to interpreting model predictions. Proceedings of the 31st International Conference on Neural Information Processing Systems; Long Beach, California, USA: Curran Associates Inc.; 2017. pp. 4768–77.
  • 53.Liu H, Setiono R. Incremental feature selection. Appl Intell. 1998;9:217–30. [Google Scholar]
  • 54.Xie H, Wang L, Qian Y, Ding Y, Guo F. Methyl-GP: accurate generic DNA methylation prediction based on a language model and representation learning. Nucleic Acids Res. 2025;53(6):gkaf223. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 55.Ai C, Yang H, Liu X, Dong R, Ding Y, Guo F. MTMol-GPT: de novo multi-target molecular generation with transformer-based generative adversarial imitation learning. PLoS Comput Biol. 2024;20(6):e1012229. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 56.Huang Z, Guo X, Qin J, Gao L, Ju F, Zhao C, et al. Accurate RNA velocity estimation based on multibatch network reveals complex lineage in batch scRNA-seq data. BMC Biol. 2024;22(1):290. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 57.Guo X, Huang Z, Ju F, Zhao C, Yu L. Highly accurate estimation of cell type abundance in bulk tissues based on single‐cell reference and domain adaptive matching. Adv Sci Weinh. 2024;11(7):2306329. [DOI] [PMC free article] [PubMed] [Google Scholar]

Data Availability Statement

All data generated or analysed during this study are included in this published article, its supplementary information files, and publicly available repositories. The data and code supporting the findings of this study are publicly available in the Zenodo repository at https://doi.org/10.5281/zenodo.18207177. The underlying datasets were obtained from publicly available resources [31].
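For readers who wish to retrieve the archived data and code programmatically rather than through the browser, the following minimal Python sketch illustrates one way to do so. It is not part of the authors' pipeline; it assumes the public Zenodo REST API and that the numeric DOI suffix (18207177) is the Zenodo record identifier, as is conventional for Zenodo-minted DOIs, and it requires only the requests library.

    # Minimal sketch (not from the paper): fetch the Zenodo record cited in the
    # Data Availability Statement and download each archived file.
    # Assumptions: the public Zenodo REST API, and that the DOI suffix is the
    # Zenodo record ID (conventional for Zenodo DOIs).
    import requests

    RECORD_ID = "18207177"
    API = f"https://zenodo.org/api/records/{RECORD_ID}"

    record = requests.get(API, timeout=30)
    record.raise_for_status()

    for entry in record.json().get("files", []):
        name = entry["key"]  # file name within the record
        # File content endpoint per the current Zenodo API documentation.
        url = f"{API}/files/{name}/content"
        print(f"Downloading {name} ...")
        with requests.get(url, stream=True, timeout=60) as resp:
            resp.raise_for_status()
            with open(name, "wb") as out:
                for chunk in resp.iter_content(chunk_size=1 << 20):
                    out.write(chunk)

Streaming the response in chunks keeps memory use bounded even if the archive contains large model checkpoints or datasets.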

