Abstract
Accurate prediction of protein and peptide functions from amino acid sequences is essential for understanding biological processes and advancing biomolecular engineering. Due to the limitations of experimental methods, computational approaches, particularly machine learning, have gained significant attention. However, many existing tools are task-specific and lack adaptability. Here, we propose a BERT–BiLSTM–Attention–TCN Protein Function Prediction Framework (BBATProt), a versatile framework for predicting protein and peptide functions. BBATProt leverages transfer learning with a pretrained bidirectional encoder representations from transformer model to capture high-dimensional features. The custom network integrates bidirectional long short-term memory and temporal convolutional network to align with proteins’ spatial characteristics, combining local and global feature extraction via attention mechanisms to achieve more precise predictions. Evaluations demonstrate that BBATProt consistently outperforms state-of-the-art models in tasks such as hydrolytic catalysis, peptide bioactivity, and post-translational modification (PTM) site prediction. Specifically, BBATProt improves accuracy by 2.96%–41.96% in antimicrobial peptide (AMP) prediction and by 0.64%–23.54% in PTM prediction tasks. In terms of area under the receiver operating characteristic curve, improvements range from 0.71% to 40.51% for AMP prediction and 0.62%–27.82% for PTM prediction. Visualizations of feature evolution and refinement via attention mechanisms validate the framework’s interpretability, providing transparency into the feature-extraction process and offering deeper insights into the basis of property prediction.
Keywords: functional prediction framework, attention mechanisms, BERT, interpretable deep learning
Introduction
Proteins and peptides, biomacromolecules composed of amino acid strings, perform a diverse array of essential biological functions in living organisms. Unraveling how these strings encode the structural and topological characteristics that determine their bioactivities lies at the core of biology. Previous studies, relying heavily on cumbersome wet experiments such as protein crystallography and biochemical assays, have advanced slowly [1]. For instance, of the >100 million sequences recorded in the UniProt database, only 0.5% have been manually annotated in the UniProtKB/Swiss-Prot section [2]. Despite AlphaFold’s demonstration that amino acid sequences contain all the information necessary for folding, accurately predicting protein functions purely from sequences remains difficult [3]. As the number of protein sequences continues to grow, computational methods must be leveraged to tackle large-scale functional prediction challenges [4].
The difficulty in directly mapping a sequence of amino acids to its potential property arises from the complexity and diversity of biological properties [5]. A particular biological function depends not only on the specific composition or configuration of key residues, but is also influenced to varying degrees by other residues in the vicinity or even far away in the molecule [6]. For instance, while the catalytic activity of an enzyme is largely determined by the residues in its catalytic pocket, it can also be modulated by residues distant from the active site. Similarly, the reactivity of a post-translational modification (PTM) site is governed by both the chemical environment of the target group and the local context. Since peptides are considerably smaller than proteins, factors such as composition, sequence patterns, and overall shapes contribute to biological functions to different extents. Therefore, accurately predicting biological functions based on sequences requires an adaptive framework capable of automatically balancing multiple-level features of chemical and structural characteristics [7].
Traditional shallow machine learning methods, such as K-nearest neighbors (KNN), random forests (RF), and support vector machines (SVM) [8–11], have been widely used for predicting peptide and protein functions. However, these methods have notable limitations in generalization, as they rely on high-quality manually crafted features. This dependency restricts their ability to capture the high-dimensional information implicit in sequences, ultimately limiting their adaptability across diverse datasets and unseen environments. In contrast, deep neural network methods have emerged as powerful tools for mapping complex, high-dimensional and nonlinear relationships between sequences and their biological functions. Representative methods include convolutional neural networks (CNN), bidirectional long short-term memory (Bi-LSTM), temporal convolutional networks (TCN), and Transformer [12–15].
Advances in Transformer-based models have significantly enhanced deep learning’s capacity for feature extraction and predictive accuracy. However, existing approaches are often constrained by task-specific designs and a reliance on auxiliary data, which limits their adaptability across diverse functional prediction scenarios. For instance, MITNet integrates Transformer and CNN architectures to achieve precise epitope prediction via binary classification, yet its framework cannot be generalized to multitask settings [16]. TransFun, developed by Boadu et al. [17], distills information from both protein sequences and structures to predict protein function, restricting its use to scenarios where reliable structural predictions are available. Additionally, a three-stage network combining Transformer with intermolecular attention mechanisms has been proposed to predict drug–target interactions but fails to capture intra-protein functional determinants [18].
The bidirectional encoder representations from transformers (BERT) model has further advanced sequence-based feature extraction [19], and transfer learning with BERT has proven effective at leveraging contextual information with minimal task-specific data. For example, Rives et al. [20] pretrained a protein language model on 250 million sequences, treating amino acid sequences as sentences, and demonstrated BERT’s ability to capture high-dimensional structural and functional features. Building on these developments, models such as ProteinBERT, ProtTrans, and BERT-Protein have been introduced to extract high-dimensional features from protein sequences through large-scale unsupervised and transfer learning methods [21–23]. Several BERT-based models have been applied for further prediction of specific protein functions. For instance, Li et al. [24] introduced a self-supervised BERT model for molecular prediction, learning representations from unlabeled drug-like molecules. Wang et al. [25] used BERT to predict transcription factor binding sites, capturing long-range dependencies and local features. Lee et al. [26] fine-tuned BERT to extract structural and functional information from antimicrobial peptides (AMP). IDP-LM [27], a PLM-based method for predicting protein intrinsic disorder and related functions, was proposed by Pang and Liu. Although these existing models are effective for individual single-task predictions, they exhibit certain limitations [28]: the extracted embeddings are rigid representations, and the decision-making network architectures lack the adaptability to transfer across different protein prediction tasks.
Given the complexities inherent in protein and peptide functional prediction, it is imperative to develop a novel framework that extracts multilevel features capturing both local and global sequence information. In this study, we present BBATProt, a feature-enhanced network framework that leverages BERT for dynamic word embedding extraction from amino acid sequences. BBATProt is custom-designed with a temporal data-based architecture that aligns with the spatial structural characteristics of proteins, thereby significantly enhancing prediction precision. As illustrated in Fig. 1, BBATProt adeptly captures the intrinsic contextual information in protein sequences, enabling the model to dynamically learn from a wide array of sequences without requiring extensive prior knowledge of protein structures or functional domains.
Figure 1.
Overall framework of BBATProt.
BBATProt’s design emphasizes multilevel feature extraction, which is crucial for understanding the diverse functionalities of proteins and peptides. By integrating a targeted combination of CNN, Bi-LSTM, and TCN, BBATProt effectively encapsulates the complex information embedded in the encoded features, managing long-range dependencies and high-dimensional abstract features while minimizing computational overhead. Moreover, its self-attention mechanism optimally leverages interdependencies among features, highlighting the nonlinear significance of distinct sequence positions for functional realization. To assess the algorithm’s effectiveness, extensive evaluations across five independent datasets—including one carboxylesterase dataset, two peptide datasets, and two PTM site datasets—reveal that BBATProt excels in accuracy, robustness, and generalization compared to current state-of-the-art (SOTA) models. Additionally, to further validate the interpretability of the BBATProt architecture, t-distributed stochastic neighbor embedding (t-SNE) is employed to illustrate the layer-by-layer evolution of features, with the refinement achieved through the attention mechanism demonstrated separately [29]. This study, grounded in dynamic word embeddings and neural networks, not only effectively identifies key functional features but also showcases broad applicability and stability in practical scenarios, thereby laying a solid foundation for future research in protein and peptide function prediction.
Materials and Methods
Dataset construction
Due to the lack of a single benchmark dataset for validating the prediction model’s general versatility, a comprehensive series of comparative experiments was carried out across multiple benchmark datasets to illustrate the effectiveness of BBATProt. By employing the same transfer-learning pretrained model, this study aims to comprehensively assess the robustness and adaptability of the algorithm in a range of biological environments. The framework was strategically designed to accommodate a range of functional prediction needs, encompassing instances such as AMPs, dipeptidyl peptidase-IV (DPP-IV) inhibitory peptides, carboxylesterase hydrolysis prediction, and PTM site prediction [30–33]. To minimize redundancy within these datasets, a two-step Cluster Database at High Identity with Tolerance (CD-HIT) procedure was executed at both the protein and fragment levels [34]. Table 1 presents the specific sources and detailed information of the datasets. The ratio of positive to negative samples was almost always 1:1 for all datasets.
Table 1.
Information on five peptide and protein datasets
In the experiment, a 10-fold cross-validation approach was employed to evaluate the algorithm’s robustness and reliability. This methodology entails partitioning the dataset into 10 distinct, nonoverlapping subsets. During each iteration, training is conducted using nine subsets, while the remaining subset is reserved for independent testing. This iterative process facilitates a thorough assessment of the algorithm’s performance. The adoption of this cross-validation method helps minimize bias in the selection of training and test sets, thereby improving the scientific objectivity of the experimental results and ensuring replicability. In addition, Table S20 provides detailed data on the training process as well as the equipment, which further enhances the transparency and reproducibility of our method.
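To make the evaluation protocol concrete, the following sketch outlines a 10-fold cross-validation loop; scikit-learn is assumed for fold splitting and scoring, and `build_model`, `X`, and `y` are placeholders for the BBATProt classifier and the encoded dataset.

```python
# Minimal sketch of the 10-fold cross-validation protocol described above.
# Assumes scikit-learn; `build_model`, `X`, and `y` are placeholders.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate(build_model, X, y, n_splits=10, seed=42):
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    aurocs = []
    for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
        model = build_model()                        # fresh model for each fold
        model.fit(X[train_idx], y[train_idx])
        scores = model.predict_proba(X[test_idx])[:, 1]
        aurocs.append(roc_auc_score(y[test_idx], scores))
    return float(np.mean(aurocs)), float(np.std(aurocs))
```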
Bidirectional encoder representations from transformer feature extraction
Pretrained natural language processing models have found extensive application across various fields [40, 41]. Transfer learning techniques enable the execution of specific information-processing tasks through a deep understanding of context and semantic relationships. In contrast to traditional recurrent neural networks and long short-term memory (LSTM) models, the Transformer architecture adopted in this study mitigates performance degradation caused by long-term dependencies, enhancing both parallel computation and the capture of long-distance information. Built on this architecture, BERT employs a bidirectional Transformer encoder that fully considers contextual information within the input.
In this study, BERT is leveraged to map each amino acid sequence into a feature vector, treating each molecular sequence in the reference dataset as a sentence and each amino acid as a word. The multihead self-attention mechanism employed by BERT captures intricate relationships between every possible pair of amino acids, thereby enhancing feature representation. By utilizing transfer learning, BERT effectively applies insights from natural language processing to protein sequences, significantly reducing the reliance on extensive task-specific data while facilitating efficient feature extraction even in data-scarce scenarios. The selected pretraining model for this experiment is BERT-Small, consisting of 4 encoder layers, each with 8 attention heads and 512 hidden units. This process encompasses two primary steps: sequence embedding and feature extraction through the encoder layers. During the sequence embedding phase, [CLS] and [SEP] tokens are incorporated to ensure proper embedding within the BERT framework. As BERT processes each amino acid, it generates token, segment, and position embeddings, enabling the model to comprehend relationships among different amino acids in the sequence. Token embeddings represent words based on a specific tokenization methodology, while segment embeddings distinguish different segments of the input. Position embeddings convey both relative and absolute positional information, as articulated in formulas (1) and (2), following the position vector framework proposed by Vaswani et al. [15].
$$\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{1}$$

$$\mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{2}$$
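For reference, formulas (1) and (2) follow the sinusoidal position-encoding scheme of Vaswani et al.; the NumPy sketch below is a minimal illustration under that assumption.

```python
# Sinusoidal position embeddings following formulas (1) and (2):
# PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
# PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))
import numpy as np

def positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    pos = np.arange(seq_len)[:, None]                  # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model / 2)
    angles = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                       # even dimensions
    pe[:, 1::2] = np.cos(angles)                       # odd dimensions
    return pe

pe = positional_encoding(seq_len=512, d_model=512)     # matches the BERT-Small width
```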
At the encoder layer stage, multihead self-attention and feedforward neural networks (FFNN) are employed to process embedding vectors and transform them into more intricate feature representations. The multihead self-attention sublayer performs attention calculations at each position of every input sequence, capturing the relationships between the current position and all other positions.
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$

$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{4}$$

$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q},\, KW_i^{K},\, VW_i^{V}) \tag{5}$$
In the above formulas, $Q$, $K$, and $V$ respectively represent the Query, Key, and Value matrices obtained through different linear transformations of the input sequence. $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ represent weight matrices, where $d_k$ denotes the dimensionality of the key vectors in the $K$ matrix. The FFNN sublayer achieves information processing by applying nonlinear transformations to the contextual representations at each position. This sublayer consists of an activation function, such as $\mathrm{GELU}$, along with two linear transformations, facilitating the efficient extraction of features. The specific formula is as follows:
$$\mathrm{FFNN}(X) = \mathrm{GELU}(XW_1 + b_1)\,W_2 + b_2 \tag{6}$$
where $W_1$ and $W_2$ stand for weight parameters; $b_1$ and $b_2$ signify the bias terms. Additionally, $X$ indicates the representation obtained after passing through the multihead self-attention sublayer.
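A compact NumPy sketch of formulas (3) and (6) is given below, using a single attention head for brevity; all weight matrices are randomly initialized placeholders rather than trained BERT parameters.

```python
# Scaled dot-product attention (formula 3) and the position-wise FFNN (formula 6),
# written with a single head for brevity; weights are random placeholders.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise similarity of positions
    return softmax(scores) @ V                  # formula (3)

def ffnn(X, W1, b1, W2, b2):
    return gelu(X @ W1 + b1) @ W2 + b2          # formula (6)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 512))                 # 100 residues, 512-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
attn_out = attention(X @ Wq, X @ Wk, X @ Wv)    # (100, 64)

W1, b1 = rng.normal(size=(64, 256)), np.zeros(256)
W2, b2 = rng.normal(size=(256, 64)), np.zeros(64)
ff_out = ffnn(attn_out, W1, b1, W2, b2)         # (100, 64)
```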
Network architecture post feature extraction
In this study, we constructed a functional prediction network framework integrating CNN, Bi-LSTM, attention mechanism, and TCN. The protein sequence is first encoded using BERT, and the framework is used to further enhance feature extraction. The CNN, with a kernel size of 3, captures the local structural patterns of the protein through a 1D convolutional layer. This layer focuses on the local space of neighboring amino acids. The output from the CNN is then passed to the Bi-LSTM layer for further processing.
The Bi-LSTM layer employs a hidden state configuration of size 128. This design is intended to balance computational efficiency with the ability to capture long-range dependencies. Each Bi-LSTM cell simultaneously processes sequence features in both directions. This enables the extraction of information from both ends of the sequence, which aligns with the fact that protein functions are often driven by dependencies between the N- and C-termini. This configuration helps reveal the link between protein structure and function more comprehensively.
To further enhance the interpretability of the model, an attention mechanism is incorporated into the network framework. It enables the model to selectively focus on different segments of the input when analyzing protein sequences. The main components of the attention mechanism include query, key, value, attention score, softmax function, and final attention output. The comparison between the query and key is calculated, and the result is normalized using the softmax function. The normalized attention weights are then applied to the values, producing the final output. BBATProt uses self-attention to dynamically adjust the importance of each region in the sequence, thus improving model interpretability. This approach clarifies the contribution of each part of the protein sequence to the final prediction, making the decision-making process more transparent. The flattening layer then transforms the multidimensional input data generated by the attention mechanism into a 1D array, which is passed to the TCN layer for further refinement.
The TCN uses a 1D convolution with a kernel size of 2 and dilated convolutions with increasing dilation factors (1, 2, 4, etc.). This setup exponentially increases the model’s receptive field as the number of layers increases, facilitating the capture of global sequence dependencies. Additionally, causal convolutions ensure the model analyzes the temporal and positional relationships of sequences, which are crucial for protein folding and functional information. Residual connections in TCN help prevent the loss of cross-layer information and ensure the stability of capturing progressive dependencies.
The complementary strengths of Bi-LSTM and TCN in BBATProt allow the model to capture both local and global dependencies. While Bi-LSTM excels in sequential bidirectional learning, TCN’s dilation factor exponentially enhances its receptive field to capture extended sequence relationships essential for protein structure analysis. This design strikes a balance: Bi-LSTM handles intricate dependencies, while TCN efficiently captures long-range relationships, providing a deeper understanding of sequence contributions.
Finally, a dense layer with 64 units integrates the extracted features hierarchically, gradually reducing dimensionality. This design refines the structure–function relationships by distilling important sequence features, especially as dimensionality is reduced without losing critical details. Sigmoid activation functions ensure clear classification outputs, converting the learned features into interpretable binary results that indicate the presence or the absence of functional activity.
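The following Keras-style sketch summarizes the prediction head described above (Conv1D with kernel size 3, Bi-LSTM with 128 hidden units, self-attention, TCN-style causal dilated convolutions with kernel size 2 and dilations 1/2/4, and a 64-unit dense layer with a sigmoid output). The exact layer ordering and tensor shapes are assumptions; in particular, the dilated convolutions are applied before flattening here for shape compatibility.

```python
# Minimal sketch of the post-BERT prediction head, assuming TensorFlow/Keras.
# Layer sizes follow the text; ordering and filter counts are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_head(seq_len=512, emb_dim=512):
    inputs = layers.Input(shape=(seq_len, emb_dim))            # BERT embeddings
    x = layers.Conv1D(128, kernel_size=3, padding="same", activation="relu")(inputs)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    x = layers.Attention()([x, x])                             # self-attention over positions
    # TCN-style causal dilated convolutions with residual connections
    for d in (1, 2, 4):
        res = x
        x = layers.Conv1D(256, kernel_size=2, dilation_rate=d,
                          padding="causal", activation="relu")(x)
        if res.shape[-1] != x.shape[-1]:
            res = layers.Conv1D(256, kernel_size=1)(res)       # match channels
        x = layers.Add()([x, res])
    x = layers.Flatten()(x)
    x = layers.Dense(64, activation="relu")(x)
    outputs = layers.Dense(1, activation="sigmoid")(x)         # binary functional label
    return Model(inputs, outputs)
```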
Model evaluation parameters
Seven metrics were taken into consideration in order to assess the performance of the classification models: accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (SEN), specificity (SPE), precision (PRE), F1 score (FSc), and area under the receiver operating characteristic curve (AUROC). These measures have the following definitions:
$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}} \tag{8}$$

$$\mathrm{SEN} = \frac{TP}{TP + FN} \tag{9}$$

$$\mathrm{SPE} = \frac{TN}{TN + FP} \tag{10}$$

$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{11}$$

$$\mathrm{FSc} = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE} + \mathrm{SEN}} \tag{12}$$

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{13}$$

$$\mathrm{AUROC} = \int_{0}^{1} \mathrm{TPR}\; d(\mathrm{FPR}) \tag{14}$$
In the above formulas (7)–(14), $TP$ and $TN$ designate the counts of correctly forecasted positive and negative instances, respectively, while $FP$ and $FN$ denote the counts of erroneously predicted positive and negative instances, respectively.
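The first six metrics can be computed directly from the confusion-matrix counts, as in the minimal sketch below; AUROC is instead obtained by integrating the ROC curve, for example with a library routine such as scikit-learn's `roc_auc_score`.

```python
# Evaluation metrics (7)-(12) computed directly from confusion-matrix counts.
import math

def classification_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                       # sensitivity / recall
    spe = tn / (tn + fp)                       # specificity
    pre = tp / (tp + fp)                       # precision
    fsc = 2 * pre * sen / (pre + sen)          # F1 score
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_den if mcc_den else float("nan")
    return {"ACC": acc, "MCC": mcc, "SEN": sen, "SPE": spe, "PRE": pre, "FSc": fsc}

print(classification_metrics(tp=95, tn=97, fp=3, fn=5))
```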
Comparison of bidirectional encoder representations from transformer representational capability across different sizes
The selection of an appropriate BERT variant for protein sequence representation is a critical decision that requires a thorough evaluation of several factors, including model complexity, computational efficiency, and the ability to capture biologically relevant features. Given that various BERT architectures differ in depth, parameter count, and representational capacity, it is essential to consider how these variations impact the model’s capacity to extract meaningful biological information while ensuring computational feasibility. To address these considerations, we performed a comprehensive comparative analysis of six pretrained BERT models: BERT-Tiny, BERT-Mini, BERT-Small, BERT-Medium, BERT-Base, and BERT-Protein [42].
As outlined in Table S2, BERT-Small, with its architecture comprising four layers and 512 hidden units, strikes a well-balanced trade-off between model complexity and computational efficiency. This architecture is particularly well-suited for the natural length distribution of protein sequences (ranging from 50 to 500 amino acids), effectively capturing both local and global dependencies within the sequences. Furthermore, Table S3 demonstrates that BERT-Small achieves a robust balance between training efficiency and computational resource demands. In contrast, the simpler architectures of BERT-Tiny and BERT-Mini, with fewer layers and hidden units, fail to capture the intricate sequence interactions, limiting their representational capacity.
On the other hand, while BERT-Base and BERT-Protein feature larger parameter sets, they incur significantly higher computational costs and are more prone to overfitting due to excessive model complexity and information redundancy in the feature extraction process. As shown in Table S3, their training requirements exceed those of BERT-Small by >2.5-fold. Moreover, both larger models exhibit clear overfitting on the test set, with AUROC and related metrics declining noticeably. An overfitting analysis on the lysine glycation (Kgly) dataset is detailed in Table S4. Under the same conditions, the training area under the precision–recall curve (AUPRC) increases monotonically from 0.680 to 0.916 as the model size grows, while the validation AUPRC remains essentially unchanged. In addition, MCC and expected calibration error (ECE) analyses further verify overfitting, showing amplified memorization (lower MCC and higher ECE) without corresponding generalization gains. Taken together, the markedly higher computational demand and reduced generalization argue for balancing representational depth with stability, thereby supporting the selection of BERT-Small as the default backbone for protein sequence representation.
To identify the optimal BERT variant, we conducted a quantitative performance evaluation across the various BERT models. As shown in Fig. 2, BERT-Small consistently outperforms the other models across all evaluation metrics, demonstrating its superior capacity to capture both local and global dependencies in protein sequences. These results highlight that, while larger models may provide enhanced feature representation, the efficiency and balance achieved by BERT-Small make it the most suitable choice for the task at hand.
Figure 2.
Comparative representational ability of BERT pretrained model with different sizes (%).
By optimizing model complexity, training efficiency, and biological relevance, BERT-Small was ultimately selected as the pretrained model in BBATProt. This choice ensures that biologically meaningful features are effectively extracted, while maintaining computational feasibility, making BERT-Small a highly efficient and biologically relevant model for protein sequence analysis.
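As an illustration of how per-residue features can be extracted from a pretrained BERT backbone, the sketch below uses the Hugging Face transformers API; the checkpoint path is a placeholder, and whitespace-separated amino acid tokens are an assumption about preprocessing.

```python
# Sketch of per-residue feature extraction with a pretrained BERT encoder.
# The checkpoint path is a placeholder; amino acids are treated as whitespace-
# separated "words" so that each residue maps to one token.
import torch
from transformers import AutoTokenizer, AutoModel

checkpoint = "path/to/bert-small-protein"            # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

sequence = "M K T A Y I A K Q R"                     # toy peptide, one token per residue
inputs = tokenizer(sequence, return_tensors="pt")    # adds [CLS]/[SEP] automatically
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state       # (1, n_tokens, hidden_size)
features = hidden.squeeze(0)                         # per-token embeddings for the head
```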
Model interpretability analysis
Model interpretability refers to the ability to understand and explain the process by which a model arrives at its predictions. Deep learning models, particularly those with complex multilayer architectures and numerous parameters, often exhibit high intrinsic complexity. This makes it challenging to intuitively understand their behavior, potentially leading to skepticism about the reliability of their predictions. As a result, improving model interpretability and gaining a deeper understanding of prediction mechanisms are crucial research challenges. In this paper, we enhance the interpretability of the BBATProt model by designing a network architecture based on protein spatial structure and incorporating an attention mechanism. This design highlights key sequence regions, providing a transparent view of the feature-extraction process within the model. In this section, we analyze the model’s interpretability by visualizing its feature extraction and training processes. Notably, the regions receiving the most attention coincide with regions that occur frequently in the underlying biological phenomena, suggesting that the model’s predictions can be linked to biological mechanisms.
Interpretability-driven module ablation analysis of BBATProt
The primary goal of deep learning in classification tasks is to iteratively extract distinctive features from input data. This principle forms the core of the BBATProt network design. To demonstrate the network’s capability in feature learning, we employed t-SNE to visualize the feature evolution across different layers. This visualization not only supports the rationality of the network design but also provides insights into how features are abstracted and learned at each layer, thus enhancing model interpretability.
To further investigate BBATProt’s feature extraction mechanism, we conducted ablation experiments by progressively removing the Dense layer, TCN layer, Attention layer, Bi-LSTM layer, and Convolutional layer. These experiments emphasize the critical role of each layer in shaping feature evolution within the network. As shown in Fig. 3, points representing the positive and negative samples are clearly separated with the full BBATProt architecture, and the separation decreases as the Dense, TCN, Attention, and Bi-LSTM layers are ablated. The experimental results indicate that the hierarchical improvement of the network architecture significantly enhances feature extraction. Each layer refines the input features, progressively distilling high-dimensional abstractions from protein sequences. These findings support the rationale behind the BBATProt design and underscore the importance of each layer in feature evolution and overall model performance.
Figure 3.
t-SNE plots show progressive loss of cluster separation from the full BBATProt architecture (a) to successive ablations removing Dense, TCN, Attention, and Bi-LSTM layers, leaving only BERT encoding (b–f).
To further assess the robustness and validity of BBATProt’s design, we performed ablation experiments on the PTM prediction task. As shown in Tables S21 and S22, we systematically removed individual components of the model, either sequentially or independently. Each variant was retrained and evaluated on the same test set. As components were removed, performance declined significantly relative to the intact BBATProt model. Furthermore, the AUROC curves in Fig. S21 provide a clear visual comparison of the ablated models, reaffirming that the full model outperforms all ablated versions and highlighting the necessity of each module for optimal predictive accuracy.
To quantify the effects of layer ablation on feature separability, we computed three cluster separation metrics: the Silhouette Score, Calinski-Harabasz Index (CHI), and Davies-Bouldin Index (DBI). These metrics were calculated for the high-dimensional feature representations at each stage of BBATProt’s progressive ablation, as shown in Table S5. The full architecture achieved a Silhouette Score of 0.522, a CHI of 92,461.164, and a DBI of 0.794, indicating well-separated and cohesive clusters. As the Dense layer, TCN module, attention mechanism, and Bi-LSTM layer were progressively removed, these metrics steadily declined, reflecting a decrease in cluster differentiation. This quantitative analysis aligns with the visual trends observed in Fig. 3 and further confirms the significant contribution of each layer to the model’s ability to extract discriminative features.
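The projection and cluster-separation metrics reported above can be reproduced with standard scikit-learn routines, as in the sketch below; `features` and `labels` stand in for the layer activations and class labels of a given ablation stage.

```python
# Sketch of the t-SNE projection and cluster-separation metrics, assuming scikit-learn;
# `features` and `labels` are random placeholders for layer outputs and class labels.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 64))               # placeholder layer activations
labels = rng.integers(0, 2, size=200)               # placeholder class labels

embedded = TSNE(n_components=2, random_state=0).fit_transform(features)
print("t-SNE embedding shape:", embedded.shape)

print("Silhouette:", silhouette_score(features, labels))
print("CHI:", calinski_harabasz_score(features, labels))
print("DBI:", davies_bouldin_score(features, labels))
```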
Visualization of attention weight allocation by BBATProt on peptide dataset
As shown in Figs 4 and S13, the visualization of attentional weight allocation reveals that BBATProt, leveraging the BERT pretrained model, assigns different weights to various segments of the randomly sampled sequence during computation. With eight independent attention heads, the model can selectively focus on distinct regions, prioritizing features most prominent to the function prediction task. This mechanism effectively filters out irrelevant information, directing the attention of the model to key regions and enhancing its capability to emphasize important sections of the protein sequence. The visual analysis of attention weight allocation further highlights the interpretability of the model in the feature extraction process. This finding clearly illustrates how the model allocates functional focus during prediction and highlights the role of the multihead attention mechanism in improving interpretability and prediction accuracy in BBATProt.
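A per-head attention matrix of this kind can be rendered as a heatmap with a few lines of plotting code; in the sketch below, `attn` is a placeholder for a (sequence length × sequence length) weight matrix taken from one BERT attention head.

```python
# Sketch of visualizing one attention head's weight matrix as a heatmap;
# `attn` is a random placeholder for a (seq_len, seq_len) attention matrix.
import numpy as np
import matplotlib.pyplot as plt

attn = np.random.default_rng(0).random((40, 40))
attn /= attn.sum(axis=-1, keepdims=True)            # row-normalized attention weights

plt.imshow(attn, cmap="viridis", aspect="auto")
plt.colorbar(label="attention weight")
plt.xlabel("key position")
plt.ylabel("query position")
plt.title("Attention head weight allocation")
plt.show()
```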
Figure 4.
Visualization of attentional weight allocation during BERT feature extraction in AMP dataset.
Visualization of feature vectors by BBATProt on the post-translational modification sites dataset
To assess interpretability, we visualized BBATProt’s attention weights over crotonylation sites and their flanking residues. Figures 5 and S14 show that attention peaks coincide with amino acids enriched at Kcr sites, as identified by Liu et al. [43]. This alignment indicates that BBATProt’s attention mechanism highlights biologically relevant sequence motifs, reinforcing the interpretability of its predictions.
Figure 5.
Characterization of BBATProt feature vectors for prediction of PTM sites.
BBATProt shows a pronounced attention peak at position +6, corresponding to the known lysine enrichment in rice crotonylome studies. In contrast, attention at the adjacent +7 position is suppressed, while a secondary focus emerges at position +13. These patterns underscore the model’s sensitivity to both local and distal sequence context. These observations demonstrate that BBATProt not only captures known crotonylation motifs but also learns the extended sequence dependencies governing site recognition.
Results and Discussion
To comprehensively validate the universal applicability of the BBATProt framework in peptide and protein function prediction tasks, this study focused on three key categories: enzyme function prediction, peptide function prediction, and PTM site prediction. A total of five diverse datasets were utilized to ensure a robust evaluation of the model’s versatility, covering the prediction of various biological functions, including carboxyl ester hydrolytic activity, antimicrobial activity, DPP-IV inhibition, and potential sites for post-translational modifications (PTMs) such as lysine crotonylation (Kcr) and Kgly.
Comparison of BBATProt with existing predictors on carboxylesterase datasets
This segment starts with the formulation of a benchmark dataset containing carboxylesterases identified by EC number 3.1.1.1. Known for their ability to act as catalysts, these enzymes facilitate the hydrolysis of esters. Beyond their functions in biological processes like drug metabolism, food digestion, and xenobiotic detoxification, these enzymes play a crucial role in various industries, including renewable energy, food processing, and bioremediation in environmental engineering. Given the extensive time and effort required by traditional experimental methods, computational approaches offer valuable and efficient alternatives.
The distribution of protein lengths in biological systems suggests that natural selection pressures generally favor coding sequences of 50–500 amino acids [44]. To ensure compatibility with the 512-dimensional input space of our network architecture, sequences exceeding 512 residues were intentionally excluded during preprocessing. As summarized in Table S1, this filtering criterion removed only 6.0% of sequences from the carboxylesterase dataset, resulting in a negligible impact on dataset representativeness. The retained sequences exhibit a length distribution, as shown in Table S1 and Fig. S15, that closely aligns with natural protein size patterns documented in [44], ensuring the preservation of biological relevance. Furthermore, sequence redundancy was rigorously controlled using CD-HIT with an 80% similarity threshold, ensuring diversity while eliminating evolutionary bias. The negative samples were randomly selected from Swiss-Prot in the same length range as the positive samples with a similarity threshold of 30%. Ultimately, a training dataset comprising 6960 carboxylesterase samples was obtained, with 3480 each for positive and negative samples.
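A minimal sketch of the CD-HIT redundancy-removal step is shown below, wrapping the command-line tool from Python; the file paths are placeholders, and the word size of 5 corresponds to the 80% identity threshold used here.

```python
# Hypothetical helper wrapping the CD-HIT command line used for redundancy removal
# at the 80% identity threshold; file paths are placeholders.
import subprocess

def run_cdhit(fasta_in: str, fasta_out: str, identity: float = 0.8, word_size: int = 5):
    """Cluster sequences with CD-HIT and keep one representative per cluster."""
    subprocess.run(
        ["cd-hit", "-i", fasta_in, "-o", fasta_out,
         "-c", str(identity), "-n", str(word_size)],
        check=True,
    )

# run_cdhit("carboxylesterase_raw.fasta", "carboxylesterase_nr80.fasta")
```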
To demonstrate the effectiveness of the proposed network model, a comparison was drawn against various conventional machine learning classification strategies, including KNN, RF, SVM, and XGBoost. Carboxylesterases were selected specifically for a 10-fold cross-validation, the comprehensive results of which are presented in Table 2. As these data reveal, BBATProt excels in the intricate domain of enzyme protein function prediction, exhibiting peak performance across a range of evaluation metrics.
Table 2.
Ten-fold cross-validation comparison between BBATProt and other machine learning methods on the hydrolysis enzyme dataset with the same BERT encoding features (%)
| Predictor | ACC | MCC | SEN | SPE | PRE | FSc | AUROC |
|---|---|---|---|---|---|---|---|
| BBATProt | 96.60 | 93.33 | 95.59 | 97.79 | 97.97 | 96.71 | 96.63 |
| BERT+RF | 75.57 | 51.30 | 71.50 | 79.65 | 77.87 | 74.53 | 97.97 |
| BERT+XGBOOST | 80.63 | 61.28 | 79.25 | 82.02 | 81.51 | 80.34 | 99.58 |
| BERT+KNN | 79.35 | 59.87 | 88.92 | 69.88 | 74.66 | 81.13 | 86.35 |
| BERT+SVM | 83.79 | 67.61 | 82.38 | 85.23 | 84.78 | 83.54 | 84.60 |
Note: The optimal performance for each metric is indicated in bold.
Additionally, a comparative analysis was conducted to evaluate the encoding effectiveness of two NLP models, BERT and Word2Vec [45], on the hydrolase dataset. Both models were applied to encode the data, and their performance was assessed using an identical neural network architecture for classification. Word2Vec offers two training strategies: continuous bag of words (CBOW) and skip-gram. CBOW predicts a target word from its surrounding context, whereas skip-gram predicts the surrounding words from a given target word. Table 3 shows the efficacy measures of each system via a 10-fold cross-validation. The outcome decisively attests that the pretrained BERT model is superior to both Word2Vec variants in every performance measurement, strongly underscoring the notable effectiveness of BERT encoding compared with Word2Vec.
Table 3.
Comparative results of 10-fold cross-validation on the hydrolysis enzyme dataset using different NLP-based feature representation models (%)
| Predictor | ACC | MCC | SEN | SPE | PRE | FSc | AUROC |
|---|---|---|---|---|---|---|---|
| BBATProt | 96.60 | 93.33 | 95.59 | 97.79 | 97.97 | 96.71 | 96.63 |
| Word2Vec-CBOW+Conv+BiLSTM+Att+TCN+Dense | 48.36 | NA | 50.00 | 50.00 | 24.18 | 32.59 | 48.36 |
| Word2Vec-Skip-gram+Conv+BiLSTM+Att+TCN+Dense | 48.22 | NA | 60.00 | 40.00 | 29.11 | 39.19 | 48.22 |
Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.
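For completeness, the Word2Vec baselines in Table 3 can be reproduced along the following lines, treating each residue as a word and each sequence as a sentence; gensim ≥ 4 and the listed hyperparameters are assumptions.

```python
# Sketch of the Word2Vec baselines compared in Table 3, assuming gensim >= 4;
# each amino acid is treated as a word and each sequence as a sentence.
from gensim.models import Word2Vec

sequences = ["MKTAYIAKQR", "GLFDIVKKVV"]             # toy peptides
sentences = [list(seq) for seq in sequences]         # one "word" per residue

cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)   # skip-gram

embedding_K = cbow.wv["K"]                           # static per-residue embedding
```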
Comparison of BBATProt with existing predictors on peptide datasets
The assessment of BBATProt’s predictive capability on peptide datasets necessitates a comprehensive examination of both the AMP and DPP-IV inhibitory peptide datasets.
Comparison of BBATProt with existing predictors on antimicrobial peptide datasets
In this section, we provide a detailed comparison of BBATProt with 13 SOTA methods [36, 46–55] using the XUAMP dataset. This comparison was conducted through a comprehensive evaluation in an independent testing environment. The results demonstrate that BBATProt significantly outperforms the other methods across key metrics, including ACC, MCC, SEN, FSc, and AUROC. As shown in Table 4, BBATProt shows an improvement ranging from 2.81% to 31.96% in ACC, 5.42% to 63.68% in MCC, 0.56% to 64.4% in SEN, 2.39% to 49.46% in FSc, and 0.71% to 40.51% in AUROC when compared with its counterparts.
Table 4.
Performance comparison results of BBATProt and existing methods on independent AMP dataset (%)
| Predictor | ACC | MCC | SEN | SPE | PRE | FSc | AUROC |
|---|---|---|---|---|---|---|---|
| BBATProt | 85.96 | 71.98 | 88.20 | 83.71 | 84.41 | 86.26 | 93.81 |
| ADAM-HMM | 68.40 | 39.00 | 52.10 | 84.70 | 77.30 | 62.30 | 68.40 |
| ADAM-SVM | 61.20 | 26.40 | 34.60 | 87.80 | 73.90 | 47.10 | 61.20 |
| AmpGram | 56.40 | 13.10 | 44.50 | 68.20 | 58.40 | 50.50 | 54.70 |
| AMAP | 60.20 | 25.00 | 31.40 | 89.10 | 74.20 | 44.10 | 60.20 |
| AMPEP | 65.80 | 42.50 | 32.50 | 99.20 | 97.50 | 48.70 | 72.70 |
| AMPfun | 67.40 | 41.40 | 40.60 | 94.30 | 87.70 | 55.50 | 73.50 |
| AMPScannerV2 | 56.80 | 13.70 | 52.30 | 61.30 | 57.50 | 54.80 | 58.50 |
| APIN | 57.90 | 16.30 | 44.60 | 71.20 | 60.70 | 51.40 | 57.50 |
| APIN-fusion | 56.00 | 12.30 | 45.70 | 66.30 | 57.60 | 51.00 | 55.40 |
| iAMP-Attenpred | 83.15 | 66.56 | 87.64 | 78.65 | 80.41 | 83.87 | 93.10 |
| iAMPpred | 54.00 | 8.30 | 65.20 | 42.80 | 53.30 | 58.70 | 57.50 |
| Deep-AMPEP30 | 55.80 | 11.80 | 45.70 | 65.80 | 57.20 | 50.80 | 53.30 |
| sAMPpred-GAT | 71.50 | 46.40 | 53.00 | 90.00 | NA | NA | 77.70 |
Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.
In order to verify the validity of BBATProt, six additional independent datasets were selected for evaluation. As shown in Fig. S16, the Upset plot was used to perform a correlation analysis across all seven datasets. The detailed results are presented in Tables S8–S19, and the statistical metrics, histograms, and standard deviation fluctuation plots are presented in Figs S1–S12. It is noteworthy that the model achieves an excellent balance between SEN and SPE while ensuring the stability of ACC. In addition, the mean AUROC value under the five-fold cross-validation for all datasets reaches 0.954. The low coefficient of variation (CV) values and narrow confidence intervals demonstrate the stability and repeatability of our results, thereby underscoring the reliability of our model’s performance across diverse dataset subsets.
Additionally, BBATProt ranked sixth and second in the SPE and PRE evaluations, respectively. While AMPEP achieved the highest scores in these two metrics, its overall average performance across the other five evaluation metrics was 32.80% lower than that of BBATProt. This indicates that BBATProt provides a more comprehensive extraction of feature information, emphasizing its strengths across various predictive tasks and suggesting potential avenues for further improvement in protein analysis. By examining the model characteristics of AMPEP in more detail, valuable insights can be gained for optimizing BBATProt in future studies.
To comprehensively assess the effectiveness of BBATProt in AMP prediction, a benchmark dataset was constructed. During dataset construction, sequences containing residues outside the 20 natural standard amino acids were excluded to ensure data consistency. The AMP dataset was compiled from various sources, including the AMPer, APD3, and ADAM databases [45, 56, 57]. Data cleaning involved setting a CD-HIT threshold of 90% to remove redundant sequence information. Correspondingly, non-AMP data were sourced from UniProt, with protein fragment lengths constrained to between 5 and 100 residues to maintain sequence lengths similar to those of the AMP dataset for experimental validity. The CD-HIT redundancy-removal threshold for the non-AMP dataset was set to 40%. In addition, sequences with annotations containing terms such as “Defensin,” “Antimicrobial,” “Antibiotic,” or “Antifungal” were excluded to ensure data purity.
Using the benchmark dataset, multiple repetitions of 10-fold cross-validation were performed to accurately evaluate the performance and generalization capability of the predictor. The average performance metrics across five repetitions are presented in Fig. 6 and detailed in Table S6. In addition, we computed several consistency metrics, including the mean, standard deviation, CV, and 95% confidence intervals, for each evaluation metric, as summarized in Table S7. The low CV values and narrow confidence intervals demonstrate the stability and repeatability of our results, thereby underscoring the reliability of our model’s performance across diverse dataset subsets.
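The consistency statistics summarized in Table S7 reduce to a few lines of arithmetic over the repeated cross-validation scores, as sketched below; the score values and the normal-approximation confidence interval are illustrative assumptions.

```python
# Sketch of the consistency statistics (mean, standard deviation, CV, 95% CI)
# computed over repeated cross-validation scores; values are placeholders.
import numpy as np

scores = np.array([0.951, 0.957, 0.949, 0.955, 0.958])    # placeholder AUROC values
mean, std = scores.mean(), scores.std(ddof=1)
cv = std / mean                                            # coefficient of variation
ci95 = 1.96 * std / np.sqrt(len(scores))                   # normal-approximation 95% CI
print(f"mean={mean:.3f}, std={std:.3f}, CV={cv:.3%}, 95% CI=±{ci95:.3f}")
```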
Figure 6.
Five times average performance of 10-fold cross-validation method on the AMP benchmark dataset (%).
Comparison of BBATProt with existing predictors on inhibitory peptide datasets
Due to their crucial role in pharmaceutical development and diabetes treatment, distinguishing between inhibitory and noninhibitory peptides of DPP-IV has become a key focus. The BBATProt model was trained and evaluated on a dataset designed to differentiate these two classes, including both inhibitory DPP-IV peptides and noninhibitory counterparts [37]. The dataset was split into a benchmark set and an independent test set, each containing an equal distribution of both peptide types. Specifically, the benchmark set comprised 532 samples, while the independent set contained 133 samples. As presented in Table 5, BBATProt outperforms other models in terms of ACC, MCC, SEN, and SPE and shows comparable AUROC performance to the current SOTA models. This result demonstrates the rationality of BBATProt’s design based on interpretability architecture, which can ensure prediction accuracy while maintaining robustness.
Table 5.
Performance comparison results of BBATProt and existing methods on the DPP-IV inhibitory peptide dataset (%)
| Dataset | Predictor | ACC | MCC | SEN | SPE | PRE | FSc | AUROC |
|---|---|---|---|---|---|---|---|---|
| Benchmark | BBATProt | 95.6 | 91.64 | 94.23 | 97.69 | 93.68 | 95.75 | 95.58 |
| | iDPPIV | 81.9 | 64.3 | NA | 81.9 | 81.8 | NA | 87 |
| | iDPPIV-SCM | 85.8 | 71.7 | NA | 87.7 | 83.9 | NA | 94 |
| | IPPF-FE | NA | NA | NA | NA | NA | NA | NA |
| Independent | BBATProt | 89.1 | 78.25 | 90.62 | 87.22 | 90.98 | 88.89 | 93.04 |
| | iDPPIV | 79.7 | 59.4 | NA | 78.9 | 80.5 | NA | 84.7 |
| | iDPPIV-SCM | 86.5 | 73.1 | NA | 87.4 | 85.60 | NA | 93.90 |
| | IPPF-FE | 86.64 | 73.7 | NA | 85.71 | 87.97 | NA | 94.25 |
Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.
Comparison of BBATProt with existing predictors on post-translational modification site prediction datasets
Concerning PTM prediction challenges such as protein lysine crotonylation and protein lysine glycation, these alterations can affect the functionality, stability, and affinity of proteins, subsequently steering their biological operations. However, experimental methods for site prediction are both expensive and time-consuming. In contrast, computational methods can provide reasonable prediction in a highly efficient and cost-effective manner. Using bioinformatics tools and algorithms, protein sequences and structures can be analyzed to predict potential modification sites.
To validate the advanced nature of the designed network structure, BBATProt was compared with some classical machine learning classification methods, including KNN, RF, SVM, and XGBoost. A 10-fold cross-validation was implemented for Kcr site prediction, with the outcomes detailed in Table 6. The evidence indicates the superior effectiveness of BBATProt, most notably in brief site prediction duties. This can be attributed to the fact that BBATProt is based on hierarchical learning, which allows it to adapt to protein data of different lengths and extract high-level features, making the model more effective in handling complex classification tasks.
Table 6.
Ten-fold cross-validation comparison between BBATProt and other machine learning methods on the Kcr site prediction dataset with consistent BERT encoding features (%)
| Predictor | ACC | MCC | SEN | SPE | PRE | FSc | AUROC |
|---|---|---|---|---|---|---|---|
| BBATProt | 94.95 | 89.93 | 94.05 | 95.86 | 95.80 | 94.91 | 94.95 |
| BERT+RF | 74.34 | 49.49 | 83.36 | 65.34 | 70.63 | 76.46 | 96.01 |
| BERT+XGBOOST | 76.86 | 54.10 | 82.80 | 70.91 | 74.00 | 78.15 | 97.38 |
| BERT+KNN | 70.12 | 40.23 | 69.48 | 70.74 | 70.36 | 69.91 | 78.70 |
| BERT+SVM | 79.47 | 59.30 | 84.97 | 73.97 | 76.56 | 80.53 | 92.40 |
Note: The optimal performance for each metric is indicated in bold.
To avoid overestimating the performance of BBATProt, a comparison was made between its training results and those of other existing SOTA models on independent test sets for both Kcr and Kgly site prediction problems. In the Kgly site prediction task presented in Fig. 7, BBATProt demonstrates a clear superiority over other advanced SOTA predictors across all evaluated metrics [39, 58–60]. Notably, for the Sensitivity metric, BBATProt shows an improvement range of 18.9%–56.9% compared with other predictors, underscoring its robust capability to accurately capture positive samples. This performance highlights the significant feasibility of BBATProt in PTM site prediction, emphasizing its effectiveness in identifying relevant sites with high accuracy.
Figure 7.
Performance comparison results of BBATProt and existing methods on the Kgly site prediction dataset (%).
As shown in Fig. 8 and Table S23, BBATProt also outperforms other Kcr-specific predictors across nearly all metrics [33, 38, 61, 62]. Nevertheless, its SEN metric remains marginally lower than that of CKSAPP_CrotSite, reflecting the intrinsic challenge of modeling highly localized residue contexts. This limitation defines a clear avenue for future enhancements, particularly toward more effective capture of fine-grained sequence dependencies.
Figure 8.
Performance comparison results of BBATProt and existing methods on independent Kcr site prediction dataset (%).
Across enzyme and peptide datasets featuring relatively long sequences, BBATProt delivers consistent improvements in ACC, MCC, AUROC, and AUPRC, reflecting an enhanced capacity to retain long-range dependencies. On short-sequence PTM benchmarks, where precise localization of the residue-level context is paramount, the attention module adaptively concentrates weights on local windows, balancing sensitivity and specificity. Taken together, these dataset-specific trends indicate that BBATProt’s cross-dataset generalization primarily stems from steering the pretrained backbone toward task-relevant semantic patterns rather than model scale.
Conclusion
Predicting protein and peptide function from amino acid sequences is crucial in both academic research and industrial applications. This study introduces BBATProt, an innovative framework designed to tackle this challenge. Diverse biological datasets were organized into peptide and protein datasets with varying amino acid sequence lengths, targeting prediction tasks such as hydrolase activity, AMPs, DPP-IV inhibitory peptides, and PTMs at Kcr and Kgly sites. Comprehensive evaluations across multiple datasets confirm its superior accuracy, robustness, and generalization compared to SOTA models. Furthermore, this study enhances the interpretability of what is typically a black-box model by leveraging visualization techniques. The sequential network architecture, built around the spatial conformations of proteins, further validates the model’s effectiveness. Additionally, transfer learning strengthens feature extraction in our framework and yields measurable gains in predictive performance. BBATProt combines high accuracy with interpretability: it provides transparent attributions to sequence features and clarifies how specific amino acid patterns drive functional outcomes. These attributions help identify candidate sites for site-directed mutagenesis to improve antimicrobial activity, catalytic efficiency, or regulation of PTMs. The predictions are readily verifiable and suitable for deployment in protein and peptide engineering. For example, BBATProt could be applied to mine large marine metagenomic datasets to prioritize candidates with targeted functions, such as PET-hydrolyzing enzymes or AMPs, and output score-ranked lists for experimental testing [63]. After activity confirmation, residue-level interpretability can guide rational mutagenesis to enhance function, and the validated sequences can serve as a feedback set for subsequent model adaptation. The model can also be adapted to specific applications through continued training on high-quality experimental datasets, which further improves task-specific performance and practical utility.
Although BBATProt performs well in protein and peptide function prediction, improvements are still necessary. The current 512-amino-acid sequence length limit of BBATProt aligns with natural protein length distributions and the fixed input dimensions of BERT, ensuring computational efficiency. However, for long sequences, the sparsity introduced by zero-padding may hinder the model’s capability to capture long-range dependencies. Similarly, while BBATProt excels in functional prediction tasks with large-scale datasets, its capability to discern intrinsic sequence patterns in small-sample scenarios remains a challenge. To address these limitations, we plan to integrate sparse attention for long-sequence processing and to combine meta-learning with expert priors to improve prediction performance in small-sample settings. Furthermore, plans include the development of an intuitive and user-friendly web server to offer public prediction services, enhancing accessibility for a wider research community and expanding its practical applications.
Key Points
BBATProt is an interpretable neural network framework for protein and peptide function prediction.
The model outperforms state-of-the-art methods in various protein and peptide prediction tasks.
Visualization confirms model interpretability, highlighting attention focus on relevant data.
Supplementary Material
Contributor Information
Youqing Wang, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Xukai Ye, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Yue Feng, College of Life Science and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Haoqian Wang, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Xiaofan Lin, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Xin Ma, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Yifei Zhang, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.
Conflict of interest: None declared.
Funding
This work was supported by the National Natural Science Funds of China under Grants 32371325, 62303039, and 62433004; in part by the National Science Fund for Distinguished Young Scholars of China under Grant 62225303; in part by the China Postdoctoral Science Foundation BX20230034 and 2023M730190; in part by the Fundamental Research Funds for the Central Universities buctr20120201, QNTD2023-01; in part by Beijing Natural Science Foundation (L241014).
Data availability
Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/Xukai-YE/BBATProt.
References
- 1. Attique M, Farooq MS, Khelifi A. et al. Prediction of therapeutic peptides using machine learning: computational models, datasets, and feature encodings. IEEE Access 2020;11:148570–94. [Google Scholar]
- 2. Gligorijević V, Renfrew PD, Kosciolek T. et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168. 10.1038/s41467-021-23303-9 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4. Samet Özdilek A, Atakan A, Özsari G. et al. ProFAB—Open protein functional annotation benchmark. Brief Bioinform 2023;24:bbac627. 10.1093/bib/bbac627 [DOI] [PubMed] [Google Scholar]
- 5. Mitchell AL, Attwood TK, Babbitt PC. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 2019;47:D351–60. 10.1093/nar/gky1100 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 6. Lai B, Jinbo X. Accurate protein function prediction via graph attention networks with predicted structure information. Brief Bioinform 2022;23:bbab502. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 7. Han Y, Luo X. IPPF-FE: an integrated peptide and protein function prediction framework based on fused features and ensemble models. Brief Bioinform 2023;24:bbac476. [DOI] [PubMed] [Google Scholar]
- 8. Cover TM, Hart PE. Nearest neighbor pattern classification. IEEE Trans Inform Theory 1967;13:21–7. 10.1109/TIT.1967.1053964 [DOI] [Google Scholar]
- 9. Breiman L. Random forests. Mach Learn 2001;36:105–39. [Google Scholar]
- 10. Vapnik VN. The nature of statistical learning theory. IEEE Trans Neural Netw 1997;8:1564. 10.1109/TNN.1997.641482 [DOI] [PubMed] [Google Scholar]
- 11. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). San Francisco, California, USA: Association for Computing Machinery (ACM), 2016; pp. 785–794.
- 12. Wang S, Li G, Liao Z. et al. CnnPOGTP: a novel CNN-based predictor for identifying the optimal growth temperatures of prokaryotes using only genomic k-mers distribution. Bioinformatics 2022;38:3106–8. 10.1093/bioinformatics/btac289 [DOI] [PubMed] [Google Scholar]
- 13. Zhang S, Zheng D, Hu X. et al. Bidirectional long short-term memory networks for relation classification. In: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (PACLIC 2015). Shanghai, China: Association for Computational Linguistics, 2015; pp. 73–78.
- 14. Shaojie, Bai J, Kolter Z, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint 2018;arXiv:1803.01271 [Google Scholar]
- 15. Vaswani A, Shazeer N, Parmar N. Attention is all you need. In: Proceedings of 31st International Conference on Neural Information Processing Systems (NeurIPS 2017). Long Beach, California, USA: Curran Associates, Inc., 2017; pp. 5998–6008.
- 16. Darmawan JT, Leu J-S, Avian C. et al. MITNet: a fusion transformer and convolutional neural network architecture approach for T-cell epitope prediction. Brief Bioinform 2023;24:bbad202. [DOI] [PubMed] [Google Scholar]
- 17. Boadu F, Cao H, Cheng J. Jianlin Cheng combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 2023;39:i318–25. 10.1093/bioinformatics/btad208 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18. Liu S, Wang Y, Deng Y. et al. Improved drug–target interaction prediction with intermolecular graph transformer. Brief Bioinform 2022;23:bba162. 10.1093/bib/bbac162 [DOI] [PubMed] [Google Scholar]
- 19. Devlin J, Chang M-W, Lee K. et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019). Minneapolis, Minnesota, USA: Association for Computational Linguistics, 2019; pp. 4171–4186.
- 20. Rives A, Meier J, Sercu T. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118. 10.1073/pnas.2016239118 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 21. Brandes N, Ofer D, Peleg Y. et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10. 10.1093/bioinformatics/btac020 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27. 10.1109/TPAMI.2021.3095381 [DOI] [PubMed] [Google Scholar]
- 23. Zhang Y, Lin J, Zhao L. et al. A novel antibacterial peptide recognition algorithm based on BERT. Brief Bioinform 2021;22:bbab200. 10.1093/bib/bbab200 [DOI] [PubMed] [Google Scholar]
- 24. Li B, Lin M, Chen T. et al. FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Brief Bioinform 2023;24:bbad398. [DOI] [PubMed] [Google Scholar]
- 25. Wang K, Zeng X, Zhou J. et al. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform 2024;25:bbae195. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26. Lee H, Lee S, Lee I. et al. AMP-BERT: prediction of antimicrobial peptide function based on a BERT model. Protein Sci 2023;32:e4529. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27. Pang Y, Liu B. IDP-LM: prediction of protein intrinsic disorder and disorder functions based on language models. PLoS Comput Biol 2023;19:e1011657. 10.1371/journal.pcbi.1011657 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28. Castro E, Godavarthi A, Rubinfien J. et al. Transformer-based protein generation with regularized latent space optimization. Nat Mach Intell 2022;4:840–51. 10.1038/s42256-022-00532-1 [DOI] [Google Scholar]
- 29. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605. [Google Scholar]
- 30. Jing X, Li F, Leier A. et al. Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides. Brief Bioinform 2021;22:bbab083. 10.1093/bib/bbab083 [DOI] [PubMed] [Google Scholar]
- 31. Zou H, Yin Z. Identifying dipeptidyl peptidase-IV inhibitory peptides based on correlation information of physicochemical properties. Int J Pept Res Therapeut 2021;27:2651–9. 10.1007/s10989-021-10280-2 [DOI] [Google Scholar]
- 32. Wang D, Zou L, Jin Q. et al. Human carboxylesterases: a comprehensive review. Acta Pharm Sin B 2018;8:699–712. 10.1016/j.apsb.2018.05.005 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 33. Lv H, Dao F-Y, Guan Z-X. et al. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform 2021;22:1–10. [DOI] [PubMed] [Google Scholar]
- 34. Huang Y, Niu B, Gao Y. et al. CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010;26:680–2. 10.1093/bioinformatics/btq003 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35. UniProt Consortium . UniProt: a hub for protein information. Nucleic Acids Res 2015;43:D204–12. 10.1093/nar/gku989 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 36. Yan K, Lv H, Guo Y. et al. sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 2023;39:1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 37. Charoenkwan P, Nantasenamat C, Hasan MM. et al. iBitter-fuse: a novel sequence-based bitter peptide predictor by fusing multi-view features. Int J Mol Sci 2021;22:8958. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 38. Qiao Y, Zhu X, Gong H. BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2022;38:648–54. 10.1093/bioinformatics/btab712 [DOI] [PubMed] [Google Scholar]
- 39. Liu Y, Liu Y, Wang G-A. et al. BERT-Kgly: a bidirectional encoder representations from transformers (BERT)-based model for predicting lysine glycation site for Homo sapiens. Front Bioinform 2022;2:834153. 10.3389/fbinf.2022.834153 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40. Lee J, Yoon W, Kim S. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40. 10.1093/bioinformatics/btz682 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41. Yu J, Shi S, Zhang F. et al. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics 2019;35:2749–56. [DOI] [PubMed] [Google Scholar]
- 42. Zhang Y, Lin J, Zhao L. et al. A novel antibacterial peptide recognition algorithm based on BERT. Brief Bioinform 2021;22:bbab200. 10.1093/bib/bbab200 [DOI] [PubMed] [Google Scholar]
- 43. Liu S, Xue C, Fang Y. et al. Global involvement of lysine crotonylation in protein modification and transcription regulation in Rice. Mol Cell Proteomics 2018;17:1922–36. 10.1074/mcp.RA118.000640 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 44. Nevers Y, Glover NM, Dessimoz C. et al. Protein length distribution is remarkably uniform across the tree of life. Genome Biol 2023;24:135. 10.1186/s13059-023-02973-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 45. Mikolov T, Chen K, Corrado G. et al. Efficient estimation of word representations in vector space. arXiv preprint 2013; arXiv:1301.3781 [Google Scholar]
- 46. Lee H-T, Lee C-C, Yang J-R. et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:475062. 10.1155/2015/475062 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 47. Burdukiewicz M, Sidorczuk K, Rafacz D. et al. Proteomic screening for prediction and design of antimicrobial peptides with AmpGram. Int J Mol Sci 2020;21:E4310. 10.3390/ijms21124310 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 48. Gull S, Shamim N, Minhas F. AMAP: hierarchical multi-label prediction of biologically active and antimicrobial peptides. Comput Biol Med 2019;107:172–81. 10.1016/j.compbiomed.2019.02.018 [DOI] [PubMed] [Google Scholar]
- 49. Bhadra P, Yan J, Li J. et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697. 10.1038/s41598-018-19752-w [DOI] [PMC free article] [PubMed] [Google Scholar]
- 50. Chung C-R, Kuo T-R, Wu L-C. et al. Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform 2020;21:1098–114. 10.1093/bib/bbz043 [DOI] [PubMed] [Google Scholar]
- 51. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018;34:2740–7. 10.1093/bioinformatics/bty179 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 52. Su X, Xu J, Yin Y. et al. Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinformatics 2019;20:730. 10.1186/s12859-019-3327-y [DOI] [PMC free article] [PubMed] [Google Scholar]
- 53. Xing W, Zhang J, Li C. et al. iAMP-Attenpred: a novel antimicrobial peptide predictor based on BERT feature extraction method and CNN-BiLSTM-attention combination model. Brief Bioinform 2023;25:bbad443. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 54. Meher PK, Sahu TK, Saini V. et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 2017;7:42362. 10.1038/srep42362 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 55. Yan J, Bhadra P, Li A. et al. Deep-AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids 2020;20:882–94. 10.1016/j.omtn.2020.05.006 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 56. Fjell CD, Hancock REW, Cherkasov A. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 2007;23:1148–55. 10.1093/bioinformatics/btm068 [DOI] [PubMed] [Google Scholar]
- 57. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 2016;44:D1087–93. 10.1093/nar/gkv1278 [DOI] [PMC free article] [PubMed] [Google Scholar]
- 58. Johansen MB, Kiemer L, Brunak S. Analysis and prediction of mammalian protein glycation. Glycobiology 2006;16:844–53. 10.1093/glycob/cwl009 [DOI] [PubMed] [Google Scholar]
- 59. Ju Z, Sun J, Li Y. et al. Predicting lysine glycation sites using bi-profile Bayes feature extraction. Comput Biol Chem 2017;71:98–103. 10.1016/j.compbiolchem.2017.10.004 [DOI] [PubMed] [Google Scholar]
- 60. Xu Y, Li L, Ding J. et al. Gly-PseAAC: identifying protein lysine glycation through sequences. Gene 2017;602:1–7. [DOI] [PubMed] [Google Scholar]
- 61. Ju Z, He J-J. Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou’s general PseAAC. J Mol Graph Model 2017;77:200–4. 10.1016/j.jmgm.2017.08.020 [DOI] [PubMed] [Google Scholar]
- 62. Liu Y, Yu Z, Chen C. et al. Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net. Anal Biochem 2020;609:113903. 10.1016/j.ab.2020.113903 [DOI] [PubMed] [Google Scholar]
- 63. Chen J, Jia Y, Sun Y. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 2024;633:371–9. 10.1038/s41586-024-07891-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
Data Availability Statement
Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/Xukai-YE/BBATProt.