Briefings in Bioinformatics. 2025 Nov 10;26(6):bbaf593. doi: 10.1093/bib/bbaf593

BBATProt: a framework predicting biological function with enhanced feature extraction via interpretable deep learning

Youqing Wang 1,2, Xukai Ye 3, Yue Feng 4, Haoqian Wang 5, Xiaofan Lin 6,7, Xin Ma 8, Yifei Zhang 9,10
PMCID: PMC12599320  PMID: 41212592

Abstract

Accurate prediction of protein and peptide functions from amino acid sequences is essential for understanding biological processes and advancing biomolecular engineering. Due to the limitations of experimental methods, computational approaches, particularly machine learning, have gained significant attention. However, many existing tools are task-specific and lack adaptability. Here, we propose a BERT–BiLSTM–Attention–TCN Protein Function Prediction Framework (BBATProt), a versatile framework for predicting protein and peptide functions. BBATProt leverages transfer learning with a pretrained bidirectional encoder representations from transformer model to capture high-dimensional features. The custom network integrates bidirectional long short-term memory and temporal convolutional network to align with proteins’ spatial characteristics, combining local and global feature extraction via attention mechanisms to achieve more precise predictions. Evaluations demonstrate that BBATProt consistently outperforms state-of-the-art models in tasks such as hydrolytic catalysis, peptide bioactivity, and post-translational modification (PTM) site prediction. Specifically, BBATProt improves accuracy by 2.96%–41.96% in antimicrobial peptide (AMP) prediction and by 0.64%–23.54% in PTM prediction tasks. In terms of area under the receiver operating characteristic curve, improvements range from 0.71% to 40.51% for AMP prediction and 0.62%–27.82% for PTM prediction. Visualizations of feature evolution and refinement via attention mechanisms validate the framework’s interpretability, providing transparency into the feature-extraction process and offering deeper insights into the basis of property prediction.

Keywords: functional prediction framework, attention mechanisms, BERT, interpretable deep learning

Introduction

Proteins and peptides, biomacromolecules composed of amino acid chains, execute a diverse array of essential biological functions in living organisms. Unraveling how these chains encode the structural and topological characteristics that determine their bioactivities is a core question of biology. Previous studies, relying heavily on cumbersome wet experiments such as protein crystallography and biochemical assays, have advanced slowly [1]. For instance, of the >100 million sequences recorded in the UniProt database, only ~0.5% have been manually annotated in the UniProtKB/Swiss-Prot section [2]. Despite AlphaFold’s demonstration that amino acid sequences contain all the information necessary for folding, accurately predicting protein functions purely from sequences remains difficult [3]. As the number of protein sequences continues to grow, computational methods must be leveraged to tackle large-scale functional prediction challenges [4].

The difficulty in directly mapping a sequence of amino acids to its potential property arises from the complexity and diversity of biological properties [5]. A particular biological function depends not only on the specific composition or configuration of key residues, but is also influenced to varying degrees by other residues in the vicinity or even far away in the molecule [6]. For instance, while the catalytic activity of an enzyme is largely determined by the residues in its catalytic pocket, it can also be modulated by residues distant from the active site. Similarly, the reactivity of a post-translational modification (PTM) site is governed by both the chemical environment of the target group and the local context. Since peptides are considerably smaller than proteins, factors such as composition, sequence patterns, and overall shapes contribute to biological functions to different extents. Therefore, accurately predicting biological functions based on sequences requires an adaptive framework capable of automatically balancing multiple-level features of chemical and structural characteristics [7].

Traditional shallow machine learning methods, such as K-nearest neighbors (KNN), random forests (RF), and support vector machines (SVM) [8–11], have been widely used for predicting peptide and protein functions. However, these methods have notable limitations in generalization, as they rely on high-quality manually crafted features. This dependency restricts their ability to capture the high-dimensional information implicit in sequences, ultimately limiting their adaptability across diverse datasets and unseen environments. In contrast, deep neural network methods have emerged as powerful tools for mapping complex, high-dimensional and nonlinear relationships between sequences and their biological functions. Representative methods include convolutional neural networks (CNN), bidirectional long short-term memory (Bi-LSTM), temporal convolutional networks (TCN), and Transformer [12–15].

Advances in Transformer-based models have significantly enhanced deep learning’s capacity for feature extraction and predictive accuracy. However, existing approaches are often constrained by task-specific designs and a reliance on auxiliary data, which limits their adaptability across diverse functional prediction scenarios. For instance, MITNet integrates Transformer and CNN architectures to achieve precise epitope prediction via binary classification, yet its framework cannot be generalized to multitask settings [16]. TransFun, developed by Boadu et al. [17], distills information from both protein sequences and structures to predict protein function, restricting its use to scenarios where reliable structural predictions are available. Additionally, a three-stage network combining Transformer with intermolecular attention mechanisms has been proposed to predict drug–target interactions but fails to capture intra-protein functional determinants [18].

The bidirectional encoder representations from transformers (BERT) model has further advanced sequence-based feature extraction [19], and transfer learning with BERT has proven effective at leveraging contextual information with minimal task-specific data. For example, Rives et al. [20] pretrained a protein language model on 250 million sequences, treating amino acid sequences as sentences, and demonstrated BERT’s ability to capture high-dimensional structural and functional features. Building on these developments, models such as ProteinBERT, ProtTrans, and BERT-Protein have been introduced to extract high-dimensional features from protein sequences through large-scale unsupervised and transfer learning methods [21–23]. Several BERT-based models have been applied for further prediction of specific protein functions. For instance, Li et al. [24] introduced a self-supervised BERT model for molecular prediction, learning representations from unlabeled drug-like molecules. Wang et al. [25] used BERT to predict transcription factor binding sites, capturing long-range dependencies and local features. Lee et al. [26] fine-tuned BERT to extract structural and functional information from antimicrobial peptides (AMP). IDP-LM [27], a protein language model-based method for predicting protein intrinsic disorder and related functions, was proposed by Pang and Liu. Although these existing models are effective for individual single-task predictions, they also exhibit certain limitations [28]: the extracted embeddings are rigid in their representations, and the decision-making network structures lack the adaptability to transfer across different protein prediction tasks and application scenarios.

Given the complexities inherent in protein and peptide functional prediction, it is imperative to develop a novel framework that extracts multilevel features capturing both local and global sequence information. In this study, we present BBATProt, a feature-enhanced network framework that leverages BERT for dynamic word embedding extraction from amino acid sequences. BBATProt is custom-designed with a temporal data-based architecture that aligns with the spatial structural characteristics of proteins, thereby significantly enhancing prediction precision. As illustrated in Fig. 1, BBATProt adeptly captures the intrinsic contextual information in protein sequences, enabling the model to dynamically learn from a wide array of sequences without requiring extensive prior knowledge of protein structures or functional domains.

Figure 1.

Alt text: Three-panel schematic of the BBATProt workflow. (A) sequence embedding treats each amino acid as a token with token, segment and positional encodings; (B) a BERT multi-head attention module learns high-dimensional feature patterns embedded in text information; and (C) a function prediction classification framework based on interpretable deep learning design.

Overall framework of BBATProt.

BBATProt’s design emphasizes multilevel feature extraction, which is crucial for understanding the diverse functionalities of proteins and peptides. By integrating a targeted combination of CNN, Bi-LSTM, and TCN, BBATProt effectively encapsulates the complex information embedded in the encoded features, managing long-range dependencies and high-dimensional abstract features while minimizing computational overhead. Moreover, its self-attention mechanism optimally leverages interdependencies among features, highlighting the nonlinear significance of distinct sequence positions for functional realization. To assess the algorithm’s effectiveness, extensive evaluations across five independent datasets—including one carboxylesterase dataset, two peptide datasets, and two PTM site datasets—reveal that BBATProt excels in accuracy, robustness, and generalization compared to current state-of-the-art (SOTA) models. Additionally, to further validate the interpretability of the BBATProt architecture, t-distributed stochastic neighbor embedding (t-SNE) is employed to illustrate the layer-by-layer evolution of features, with the refinement achieved through the attention mechanism demonstrated separately [29]. This study, grounded in dynamic word embeddings and neural networks, not only effectively identifies key functional features but also showcases broad applicability and stability in practical scenarios, thereby laying a solid foundation for future research in protein and peptide function prediction.

Materials and Methods

Dataset construction

Due to the lack of a benchmark dataset for validating the prediction model’s general versatility, a comprehensive series of comparative experiments was carried out across multiple benchmark datasets to illustrate the effectiveness of BBATProt. By employing the same transfer learning pretrained model, this study aims to comprehensively assess the robustness and adaptability of the algorithm in a range of biological environments. The framework was strategically designed to accommodate a range of functional prediction needs, encompassing instances such as AMP, dipeptidyl peptidase-IV (DPP-IV) inhibitory peptides, carboxylesterase hydrolysis prediction, and PTM site prediction [30–33]. To minimize redundancy within these datasets, a unique two-step operation deploying Cluster Database at High Identity with Tolerance (CD-HIT) was executed at both protein and fragment levels [34]. Table 1 presents the specific sources and detailed information of the datasets. The ratio of positive to negative samples was almost always 1:1 for all datasets.

Table 1.

Information on five peptide and protein datasets

Dataset | Training Pos | Training Neg | Independent Pos | Independent Neg
Carboxylesterases [35] | 3480 | 3480 | 696 | 696
Antimicrobial peptide [36] | 3935 | 3905 | 1536 | 1536
Inhibitory peptide [37] | 532 | 532 | 133 | 133
Lysine crotonylation [38] | 6975 | 6975 | 2989 | 2989
Lysine glycation [39] | 3573 | 3573 | 200 | 200

In the experiment, a 10-fold cross-validation approach was employed to evaluate the algorithm’s robustness and reliability. This methodology entails partitioning the dataset into 10 distinct, nonoverlapping subsets. During each iteration, training is conducted using nine subsets, while the remaining subset is reserved for independent testing. This iterative process facilitates a thorough assessment of the algorithm’s performance. The adoption of this cross-validation method helps minimize bias in the selection of training and test sets, thereby improving the scientific objectivity of the experimental results and ensuring replicability. In addition, Table S20 provides detailed data on the training process as well as the equipment, which further enhances the transparency and reproducibility of our method.
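To make the protocol concrete, the sketch below shows one way to run the 10-fold cross-validation described above; `X`, `y`, and `build_model` are illustrative placeholders (pre-encoded features, binary labels, and a factory returning a scikit-learn-style classifier), not part of the released BBATProt code.

```python
# Minimal sketch of the 10-fold cross-validation protocol described above.
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score

def cross_validate(X, y, build_model, n_splits=10):
    """Train on nine folds, test on the held-out fold, and average AUROC."""
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        model = build_model()                       # fresh model for every fold
        model.fit(X[train_idx], y[train_idx])       # train on nine subsets
        proba = model.predict_proba(X[test_idx])[:, 1]
        scores.append(roc_auc_score(y[test_idx], proba))
    return float(np.mean(scores)), float(np.std(scores))
```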

Bidirectional encoder representations from transformer feature extraction

Pretrained natural language processing models have found extensive application across various fields [40, 41]. Transfer learning techniques enable the execution of specific information processing tasks through a deep understanding of context and semantic relationships. In contrast to traditional recurrent neural networks and long short-term memory (LSTM) models, the Transformer architecture mitigates performance degradation caused by long-term dependencies, enhancing both parallel computation and the capture of long-distance information. Building on this architecture, BERT employs a bidirectional Transformer encoder that fully considers contextual information within the input.

In this study, BERT is leveraged to map each amino acid sequence into a feature vector, treating each molecular sequence in the reference dataset as a sentence and each amino acid as a word. The multihead self-attention mechanism employed by BERT captures intricate relationships between every possible pair of amino acids, thereby enhancing feature representation. By utilizing transfer learning, BERT effectively applies insights from natural language processing to protein sequences, significantly reducing the reliance on extensive task-specific data while facilitating efficient feature extraction even in data-scarce scenarios. The selected pretraining model for this experiment is BERT-Small, consisting of 4 encoder layers, each with 8 attention heads and 512 hidden units. This process encompasses two primary steps: sequence embedding and feature extraction through the encoder layers. During the sequence embedding phase, [CLS] and [SEP] tokens are incorporated to ensure proper embedding within the BERT framework. As BERT processes each amino acid, it generates token, segment, and position embeddings, enabling the model to comprehend relationships among different amino acids in the sequence. Token embeddings represent words based on a specific tokenization methodology, while segment embeddings distinguish different segments of the input. Position embeddings convey both relative and absolute positional information, as articulated in formulas (1) and (2), following the position vector framework proposed by Vaswani et al. [15].

$$\mathrm{PE}_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{1}$$
$$\mathrm{PE}_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\mathrm{model}}}}\right) \tag{2}$$
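To illustrate the embedding step concretely, the following sketch extracts per-residue BERT features for a toy sequence. The checkpoint name is an assumption (any BERT-Small-scale model with 4 layers and 512 hidden units would play the same role) and is not necessarily the exact model released with BBATProt.

```python
# Illustrative sketch of per-residue BERT feature extraction as described above.
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "prajjwal1/bert-small"               # assumed 4-layer, 512-hidden BERT variant
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT)
model.eval()

sequence = "MKTAYIAKQR"                           # toy amino acid sequence
spaced = " ".join(sequence)                       # treat each residue as a "word"
inputs = tokenizer(spaced, return_tensors="pt")   # adds [CLS] and [SEP] automatically

with torch.no_grad():
    outputs = model(**inputs)

# One 512-dimensional contextual vector per token, including [CLS] and [SEP].
residue_features = outputs.last_hidden_state      # shape: (1, num_tokens, 512)
```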

At the encoder layer stage, multihead self-attention and feedforward neural networks (FFNN) are employed to process embedding vectors and transform them into more intricate feature representations. The multihead self-attention sublayer performs attention calculations at each position of every input sequence, capturing the relationships between the current position and all other positions.

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \tag{3}$$
$$\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O} \tag{4}$$
$$\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \tag{5}$$

In the above formulas, $Q$ (Query), $K$ (Key), and $V$ (Value) represent the matrices obtained through different linear transformations of the input sequence. $W_i^{Q}$, $W_i^{K}$, $W_i^{V}$, and $W^{O}$ represent weight matrices, and $d_k$ denotes the dimensionality of the key vectors in the $K$ matrix. The FFNN sublayer achieves information processing by applying nonlinear transformations to the contextual representations at each position. This sublayer consists of an activation function $f$ (GELU in standard BERT implementations) along with two linear transformations, facilitating the efficient extraction of features. The specific formula is as follows:

$$\mathrm{FFN}(Z) = f\!\left(Z W_1 + b_1\right) W_2 + b_2 \tag{6}$$

where $W_1$ and $W_2$ stand for weight parameters, and $b_1$ and $b_2$ signify the bias terms. Additionally, $Z$ indicates the representation obtained after passing through the multihead self-attention sublayer.
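A compact NumPy sketch of Eqs (3)–(6) follows; a single attention head is shown for clarity, and the tanh approximation of GELU is assumed as the FFNN activation, consistent with standard BERT implementations.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Eq (3): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)        # pairwise position-position scores
    return softmax(scores, axis=-1) @ V

def gelu(x):
    """Tanh approximation of GELU, the activation used in standard BERT."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(Z, W1, b1, W2, b2):
    """Eq (6): FFN(Z) = f(Z W1 + b1) W2 + b2, applied position-wise."""
    return gelu(Z @ W1 + b1) @ W2 + b2

# Toy single-head example: 10 positions, model width 512, head width 64.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 512))
Wq, Wk, Wv = (rng.normal(size=(512, 64)) for _ in range(3))
Z = attention(X @ Wq, X @ Wk, X @ Wv)      # Eq (5): one head_i; Eq (4) would concatenate heads
```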

Network architecture post feature extraction

In this study, we constructed a functional prediction network framework integrating CNN, Bi-LSTM, attention mechanism, and TCN. The protein sequence is first encoded using BERT, and the framework is used to further enhance feature extraction. The CNN, with a kernel size of 3, captures the local structural patterns of the protein through a 1D convolutional layer. This layer focuses on the local space of neighboring amino acids. The output from the CNN is then passed to the Bi-LSTM layer for further processing.

The Bi-LSTM layer employs a hidden state configuration of size 128. This design is intended to balance computational efficiency with the ability to capture long-range dependencies. Each Bi-LSTM cell simultaneously processes sequence features in both directions. This enables the extraction of information from both ends of the sequence, which aligns with the fact that protein functions are often driven by dependencies between the N- and C-termini. This configuration helps reveal the link between protein structure and function more comprehensively.

To further enhance the interpretability of the model, an attention mechanism is incorporated into the network framework. It enables the model to selectively focus on different segments of the input when analyzing protein sequences. The main components of the attention mechanism include query, key, value, attention score, softmax function, and final attention output. The comparison between the query and key is calculated, and the result is normalized using the softmax function. The normalized attention weights are then applied to the values, producing the final output. BBATProt uses self-attention to dynamically adjust the importance of each region in the sequence, thus improving model interpretability. This approach clarifies the contribution of each part of the protein sequence to the final prediction, making the decision-making process more transparent. The flattening layer then transforms the multidimensional input data generated by the attention mechanism into a 1D array, which is passed to the TCN layer for further refinement.

The TCN uses a 1D convolution with a kernel size of 2 and dilated convolutions with increasing dilation factors (1, 2, 4, etc.). This setup exponentially increases the model’s receptive field as the number of layers increases, facilitating the capture of global sequence dependencies. Additionally, causal convolutions ensure the model analyzes the temporal and positional relationships of sequences, which are crucial for protein folding and functional information. Residual connections in TCN help prevent the loss of cross-layer information and ensure the stability of capturing progressive dependencies.
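As a concrete illustration of the dilated causal convolutions and residual connections just described, here is a minimal PyTorch sketch of one TCN residual block; the channel count and ReLU activation are illustrative choices, not the released configuration.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1D convolution that pads on the left only, so outputs never see future positions."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                            # x: (batch, channels, length)
        x = nn.functional.pad(x, (self.pad, 0))      # left-pad only (causal)
        return self.conv(x)

class TCNBlock(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.conv1 = CausalConv1d(channels, dilation=dilation)
        self.conv2 = CausalConv1d(channels, dilation=dilation)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        return self.relu(out + x)                    # residual connection

# Stacking blocks with dilations 1, 2, 4 grows the receptive field exponentially.
tcn = nn.Sequential(*[TCNBlock(64, d) for d in (1, 2, 4)])
```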

The complementary strengths of Bi-LSTM and TCN in BBATProt allow the model to capture both local and global dependencies. While Bi-LSTM excels in sequential bidirectional learning, TCN’s dilation factor exponentially enhances its receptive field to capture extended sequence relationships essential for protein structure analysis. This design strikes a balance: Bi-LSTM handles intricate dependencies, while TCN efficiently captures long-range relationships, providing a deeper understanding of sequence contributions.

Finally, a dense layer with 64 units integrates the extracted features hierarchically, gradually reducing dimensionality. This design refines the structure–function relationships by distilling important sequence features, especially as dimensionality is reduced without losing critical details. Sigmoid activation functions ensure clear classification outputs, converting the learned features into interpretable binary results that indicate the presence or the absence of functional activity.
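The following PyTorch sketch assembles the prediction head described in this subsection (convolution, Bi-LSTM, self-attention, flattening, TCN, dense layer, sigmoid). Layer sizes follow the text where stated (kernel 3, hidden 128, dense 64); everything else, including the simplified single-convolution TCN stage and the class name `BBATHead`, is an assumption rather than the authors' released implementation.

```python
import torch
import torch.nn as nn

class BBATHead(nn.Module):
    def __init__(self, emb_dim=512, hidden=128):
        super().__init__()
        self.conv = nn.Conv1d(emb_dim, hidden, kernel_size=3, padding=1)
        self.bilstm = nn.LSTM(hidden, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.MultiheadAttention(embed_dim=2 * hidden, num_heads=1, batch_first=True)
        self.tcn = nn.Conv1d(2 * hidden, 2 * hidden, kernel_size=2, dilation=1)
        self.head = nn.Sequential(nn.LazyLinear(64), nn.ReLU(), nn.Linear(64, 1), nn.Sigmoid())

    def forward(self, x):                                    # x: (batch, length, emb_dim) BERT features
        h = self.conv(x.transpose(1, 2)).transpose(1, 2)     # local patterns, kernel size 3
        h, _ = self.bilstm(h)                                # bidirectional context, hidden size 128
        h, _ = self.attn(h, h, h)                            # self-attention over sequence positions
        h = self.tcn(h.transpose(1, 2))                      # simplified dilated-convolution (TCN) stage
        return self.head(h.flatten(start_dim=1))             # dense(64) -> sigmoid probability

model = BBATHead()
probs = model(torch.randn(2, 512, 512))                      # two toy sequences of length 512
```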

Model evaluation parameters

Seven metrics were taken into consideration in order to assess the performance of the classification models: accuracy (ACC), Matthews correlation coefficient (MCC), sensitivity (SEN), specificity (SPE), precision (PRE), F1 score (FSc), and area under the receiver operating characteristic curve (AUROC). These measures have the following definitions:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \tag{7}$$
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{8}$$
$$\mathrm{SEN} = \frac{TP}{TP + FN} \tag{9}$$
$$\mathrm{SPE} = \frac{TN}{TN + FP} \tag{10}$$
$$\mathrm{PRE} = \frac{TP}{TP + FP} \tag{11}$$
$$\mathrm{FSc} = \frac{2 \times \mathrm{PRE} \times \mathrm{SEN}}{\mathrm{PRE} + \mathrm{SEN}} \tag{12}$$
$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{13}$$
$$\mathrm{AUROC} = \int_{0}^{1} \mathrm{SEN}\,\mathrm{d}(\mathrm{FPR}) \tag{14}$$

In the above formulas (7)–(14), $TP$ and $TN$ designate the counts of correctly predicted positive and negative instances, respectively, while $FP$ and $FN$ denote the quantities of erroneously predicted positive and negative instances, respectively. The AUROC corresponds to the area under the curve obtained by plotting SEN against FPR across classification thresholds.
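For reference, a minimal implementation of Eqs (7)–(14) computed directly from the confusion-matrix counts is given below (AUROC delegated to scikit-learn); `y_true` and `y_score` are illustrative inputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def binary_metrics(y_true, y_score, threshold=0.5):
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    tn = int(np.sum((y_pred == 0) & (y_true == 0)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)                                   # sensitivity / recall
    spe = tn / (tn + fp)
    pre = tp / (tp + fp)
    fsc = 2 * pre * sen / (pre + sen)
    mcc = (tp * tn - fp * fn) / np.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    auroc = roc_auc_score(y_true, y_score)
    return dict(ACC=acc, MCC=mcc, SEN=sen, SPE=spe, PRE=pre, FSc=fsc, AUROC=auroc)
```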

Comparison of bidirectional encoder representations from transformer representational capability across different sizes

The selection of an appropriate BERT variant for protein sequence representation is a critical decision that requires a thorough evaluation of several factors, including model complexity, computational efficiency, and the ability to capture biologically relevant features. Given that various BERT architectures differ in depth, parameter count, and representational capacity, it is essential to consider how these variations impact the model’s capacity to extract meaningful biological information while ensuring computational feasibility. To address these considerations, we performed a comprehensive comparative analysis of six pretrained BERT models: BERT-Tiny, BERT-Mini, BERT-Small, BERT-Medium, BERT-Base, and BERT-Protein [42].

As outlined in Table S2, BERT-Small, with its architecture comprising four layers and 512 hidden units, strikes a well-balanced trade-off between model complexity and computational efficiency. This architecture is particularly well-suited for the natural length distribution of protein sequences (ranging from 50 to 500 amino acids), effectively capturing both local and global dependencies within the sequences. Furthermore, Table S3 demonstrates that BERT-Small achieves a robust balance between training efficiency and computational resource demands. In contrast, the simpler architectures of BERT-Tiny and BERT-Mini, with fewer layers and hidden units, fail to capture the intricate sequence interactions, limiting their representational capacity.

On the other hand, while BERT-Base and BERT-Protein feature larger parameter sets, they incur significantly higher computational costs and are more prone to overfitting due to excessive model complexity and information redundancy in the feature extraction process. As shown in Table S3, their training requirements exceed those of BERT-Small by >2.5-fold. Moreover, both larger models exhibit clear overfitting on the test set, with AUROC and related metrics declining noticeably. An overfitting analysis on the lysine glycation (Kgly) dataset is detailed in Table S4. Under the same conditions, the training metric Area under the Precision-Recall Curve (AUPRC) increases monotonically from 0.680 to 0.916 as the model size grows, while the validation AUPRC remains essentially unchanged. In addition, MCC and expected calibration error (ECE) analyses further verify overfitting, indicating amplified memorization, lower MCC and higher ECE, without corresponding generalization gains. Taken together, the markedly higher computational demand and reduced generalization argue for balancing representational depth with stability, thereby supporting the selection of BERT-Small as the default backbone for protein sequence representation.
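A minimal sketch of one common binary-classification formulation of the ECE check mentioned above is given below (equal-width probability bins, observed positive fraction versus mean predicted probability); the bin count and exact binning scheme used in the paper's analysis are assumptions.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (y_prob > lo) & (y_prob <= hi)
        if mask.any():
            conf = y_prob[mask].mean()        # mean predicted positive probability in the bin
            obs = y_true[mask].mean()         # observed positive fraction in the bin
            ece += mask.mean() * abs(obs - conf)
    return float(ece)
```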

To identify the optimal BERT variant, we conducted a quantitative performance evaluation across the various BERT models. As shown in Fig. 2, BERT-Small consistently outperforms the other models across all evaluation metrics, demonstrating its superior capacity to capture both local and global dependencies in protein sequences. These results highlight that, while larger models may provide enhanced feature representation, the efficiency and balance achieved by BERT-Small make it the most suitable choice for the task at hand.

Figure 2.

Alt text: Heatmap comparing representation quality across BERT pretrained model sizes.

Comparative representational ability of BERT pretrained model with different sizes (%).

By optimizing model complexity, training efficiency, and biological relevance, BERT-Small was ultimately selected as the pretrained model in BBATProt. This choice ensures that biologically meaningful features are effectively extracted, while maintaining computational feasibility, making BERT-Small a highly efficient and biologically relevant model for protein sequence analysis.

Model interpretability analysis

Model interpretability refers to the ability to understand and explain the process by which a model achieves its function. Deep learning models, particularly those with complex multilayer architectures and numerous parameters, often exhibit high intrinsic complexity. This makes it challenging to intuitively understand their functions, potentially leading to skepticism about the reliability of their predictions. As a result, improving model interpretability and gaining a deeper understanding of prediction mechanisms are crucial research challenges. In this paper, we enhance the interpretability of the BBATProt model by designing a network architecture based on protein spatial structure and incorporating an attention mechanism. This design highlights key sequence regions, providing a transparent view of the feature extraction process within the model. In this section, we analyze the model’s interpretability by visualizing its feature extraction and training processes. Notably, the regions the model attends to most strongly coincide with regions that occur at high frequency in biological observations, suggesting that the model’s predictions can be linked to underlying biological phenomena.

Interpretability-driven module ablation analysis of BBATProt

The primary goal of deep learning in classification tasks is to iteratively extract distinctive features from input data. This principle forms the core of the BBATProt network design. To demonstrate the network’s capability in feature learning, we employed t-SNE to visualize the feature evolution across different layers. This visualization not only supports the rationality of the network design but also provides insights into how features are abstracted and learned at each layer, thus enhancing model interpretability.

To further investigate BBATProt’s feature extraction mechanism, we conducted ablation experiments by progressively removing the Dense layer, TCN layer, Attention layer, Bi-LSTM layer, and Convolutional layer. These experiments emphasize the critical role of each layer in shaping feature evolution within the network. As shown in Fig. 3, points representing the positive and negative samples are clearly separated with the full BBATProt architecture, and the separation decreases as the Dense, TCN, Attention, and Bi-LSTM layers are ablated. The experimental results indicate that the hierarchical improvement of the network architecture significantly enhances feature extraction. Each layer refines the input features, progressively distilling high-dimensional abstractions from protein sequences. These findings support the rationale behind the BBATProt design and underscore the importance of each layer in feature evolution and overall model performance.

Figure 3.

Alt text: Six t-SNE plots comparing feature representations under progressive ablation. Clusters are most separated with the full BBATProt architecture and gradually overlap as Dense, TCN, attention and Bi-LSTM layers are removed, leaving only BERT raw feature encoding.

t-SNE plots show progressive loss of cluster separation from the full BBATProt architecture (a) to successive ablations removing Dense, TCN, Attention, and Bi-LSTM layers, leaving only BERT encoding (b–f).
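A sketch of how such layer-wise t-SNE panels can be produced with scikit-learn is given below; `layer_outputs` and `labels` are illustrative placeholders for the intermediate feature matrices (one per ablation stage) and the binary sample labels.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_feature_evolution(layer_outputs, labels):
    """layer_outputs: dict mapping stage name -> (n_samples, n_features) array."""
    fig, axes = plt.subplots(1, len(layer_outputs), figsize=(4 * len(layer_outputs), 4))
    for ax, (name, feats) in zip(np.atleast_1d(axes), layer_outputs.items()):
        emb = TSNE(n_components=2, random_state=0).fit_transform(feats)  # 2-D projection
        ax.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="coolwarm", s=5)
        ax.set_title(name)
    fig.tight_layout()
    return fig
```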

To further assess the robustness and validity of BBATProt’s design, we performed ablation experiments on the PTM prediction task. As shown in Tables S21 and S22, we systematically removed individual components of the model, either sequentially or independently. Each variant was retrained and evaluated on the same test set. As components were removed, performance declined significantly relative to the intact BBATProt model. Furthermore, the AUROC curves in Fig. S21 provide a clear visual comparison of the ablated models, reaffirming that the full model outperforms all ablated versions and highlighting the necessity of each module for optimal predictive accuracy.

To quantify the effects of layer ablation on feature separability, we computed three cluster separation metrics: the Silhouette Score, Calinski-Harabasz Index (CHI), and Davies-Bouldin Index (DBI). These metrics were calculated for the high-dimensional feature representations at each stage of BBATProt’s progressive ablation, as shown in Table S5. The full architecture achieved a Silhouette Score of 0.522, a CHI of 92,461.164, and a DBI of 0.794, indicating well-separated and cohesive clusters. As the Dense layer, TCN module, attention mechanism, and Bi-LSTM layer were progressively removed, these metrics steadily declined, reflecting a decrease in cluster differentiation. This quantitative analysis aligns with the visual trends observed in Fig. 3 and further confirms the significant contribution of each layer to the model’s ability to extract discriminative features.
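The three separation metrics can be computed directly with scikit-learn, as in the sketch below; `feats` and `labels` are placeholders for the layer-wise feature matrices and class labels.

```python
from sklearn.metrics import (silhouette_score,
                             calinski_harabasz_score,
                             davies_bouldin_score)

def separation_metrics(feats, labels):
    return {
        "silhouette": silhouette_score(feats, labels),                # higher is better
        "calinski_harabasz": calinski_harabasz_score(feats, labels),  # higher is better
        "davies_bouldin": davies_bouldin_score(feats, labels),        # lower is better
    }
```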

Visualization of attention weight allocation by BBATProt on peptide dataset

As shown in Figs 4 and S13, the visualization of attentional weight allocation reveals that BBATProt, leveraging the BERT pretrained model, assigns different weights to various segments of the randomly sampled sequence during computation. With eight independent attention heads, the model can selectively focus on distinct regions, prioritizing features most prominent to the function prediction task. This mechanism effectively filters out irrelevant information, directing the attention of the model to key regions and enhancing its capability to emphasize important sections of the protein sequence. The visual analysis of attention weight allocation further highlights the interpretability of the model in the feature extraction process. This finding clearly illustrates how the model allocates functional focus during prediction and highlights the role of the multihead attention mechanism in improving interpretability and prediction accuracy in BBATProt.

Figure 4.

Alt text: Process diagram showing how BERT computes attention on an AMP sequence: token embeddings yield attention weights that reweight value vectors to produce contextualized residue representations.

Visualization of attentional weight allocation during BERT feature extraction in AMP dataset.
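For readers who wish to reproduce this kind of attention map, the sketch below pulls per-head attention matrices from a Hugging Face BERT encoder; the checkpoint name and the toy peptide are assumptions, and the plotting step is omitted.

```python
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "prajjwal1/bert-small"              # illustrative small BERT variant (8 heads)
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT, output_attentions=True)
model.eval()

peptide = " ".join("GLFDIVKKVVGALGSL")           # toy AMP-like sequence, one token per residue
inputs = tokenizer(peptide, return_tensors="pt")
with torch.no_grad():
    attentions = model(**inputs).attentions      # tuple: one tensor per encoder layer

last_layer = attentions[-1][0]                   # (num_heads, seq_len, seq_len)
attention_received = last_layer.mean(dim=1)      # per head, average attention each position receives
```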

Visualization of feature vectors by BBATProt on the post-translational modification sites dataset

To assess interpretability, we visualized BBATProt’s attention weights over crotonylation sites and their flanking residues. Figures 5 and S14 show that attention peaks coincide with amino acids enriched at Kcr sites, as identified by Liu et al. [43]. This alignment indicates that BBATProt’s attention mechanism highlights biologically relevant sequence motifs, reinforcing the interpretability of its predictions.

Figure 5.

Alt text: Visualization of feature representations for PTM-site prediction, contrasting embedding-based and attention-derived features for modified versus unmodified residues.

Characterization of BBATProt feature vectors for prediction of PTM sites.

BBATProt shows a pronounced attention peak at position +6, corresponding to the known lysine enrichment in rice crotonylome studies. In contrast, attention at the adjacent +7 position is suppressed, while a secondary focus emerges at position +13. These patterns underscore the model’s sensitivity to both local and distal sequence context. These observations demonstrate that BBATProt not only captures known crotonylation motifs but also learns the extended sequence dependencies governing site recognition.

Results and Discussion

To comprehensively validate the universal applicability of the BBATProt framework in peptide and protein function prediction tasks, this study focused on three key categories: enzyme function prediction, peptide function prediction, and PTM site prediction. A total of five diverse datasets were utilized to ensure a robust evaluation of the model’s versatility, covering the prediction of various biological functions, including carboxyl ester hydrolytic activity, antimicrobial activity, DPP-IV inhibition, and potential sites for post-translational modifications (PTMs) such as lysine crotonylation (Kcr) and Kgly.

Comparison of BBATProt with existing predictors on carboxylesterase datasets

This section begins with the construction of a benchmark dataset containing carboxylesterases identified by EC number 3.1.1.1. These enzymes catalyze the hydrolysis of esters. Beyond their roles in biological processes like drug metabolism, food digestion, and xenobiotic detoxification, they play a crucial role in various industries, including renewable energy, food processing, and bioremediation in environmental engineering. Given the extensive time and effort required by traditional experimental methods, computational approaches offer valuable and efficient alternatives.

The distribution of protein lengths in biological systems suggests that natural selection pressures generally favor coding sequences of 50–500 amino acids [44]. To ensure compatibility with the 512-dimensional input space of our network architecture, sequences exceeding 512 residues were intentionally excluded during preprocessing. As summarized in Table S1, this filtering criterion removed only 6.0% of sequences from the carboxylesterase dataset, resulting in a negligible impact on dataset representativeness. The retained sequences exhibit a length distribution, as shown in Table S1 and Fig. S15, that closely aligns with natural protein size patterns documented in [44], ensuring the preservation of biological relevance. Furthermore, sequence redundancy was rigorously controlled using CD-HIT with an 80% similarity threshold, ensuring diversity while eliminating evolutionary bias. The negative samples were randomly selected from Swiss-Prot in the same length range as the positive samples with a similarity threshold of 30%. Ultimately, a training dataset comprising 6960 carboxylesterase samples was obtained, with 3480 each for positive and negative samples.
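A minimal sketch of the length filter described above is shown below, assuming FASTA input; the file names are placeholders, and the CD-HIT redundancy-removal step is performed externally.

```python
from Bio import SeqIO

MAX_LEN = 512   # matches the fixed BERT input dimension

def filter_by_length(in_fasta: str, out_fasta: str, max_len: int = MAX_LEN) -> int:
    kept = [rec for rec in SeqIO.parse(in_fasta, "fasta") if len(rec.seq) <= max_len]
    SeqIO.write(kept, out_fasta, "fasta")
    return len(kept)

n_kept = filter_by_length("carboxylesterases_raw.fasta", "carboxylesterases_filtered.fasta")
```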

To demonstrate the effectiveness of the proposed network model, a comparison was drawn against various conventional machine learning classification strategies, including KNN, RF, SVM, and XGBoost. Carboxylesterases were selected specifically for a 10-fold cross-validation, the comprehensive results of which are presented in Table 2. As these data reveal, BBATProt excels in the intricate domain of enzyme protein function prediction, exhibiting peak performance across a range of evaluation metrics.

Table 2.

Ten-fold cross-validation comparison between BBATProt and other machine learning methods on the hydrolysis enzyme dataset with the same BERT encoding features (%)

Predictor ACC MCC SEN SPE PRE FSc AUROC
BBATProt 96.60 93.33 95.59 97.79 97.97 96.71 96.63
BERT+RF 75.57 51.30 71.50 79.65 77.87 74.53 97.97
BERT+XGBOOST 80.63 61.28 79.25 82.02 81.51 80.34 99.58
BERT+KNN 79.35 59.87 88.92 69.88 74.66 81.13 86.35
BERT+SVM 83.79 67.61 82.38 85.23 84.78 83.54 84.60

Note: The optimal performance for each metric is indicated in bold.

Additionally, a comparative analysis was conducted to evaluate the encoding effectiveness of two NLP models, BERT and Word2Vec [45], on the hydrolase dataset. Both models were applied to encode the data, and their performance was assessed using an identical neural network architecture for classification. Word2Vec offers two training strategies, skip-gram and continuous bag of words (CBOW): CBOW predicts a target word from its surrounding context, whereas skip-gram predicts the surrounding context from the target word. Table 3 shows the performance of each system under 10-fold cross-validation. The results clearly show that the pretrained BERT model is superior to both Word2Vec variants in every performance measurement, strongly underscoring the effectiveness of BERT encoding compared with Word2Vec.

Table 3.

Comparative results of 10-fold cross-validation on the hydrolysis enzyme dataset using different NLP-based feature representation models (%)

Predictor ACC MCC SEN SPE PRE FSc AUROC
BBATProt 96.60 93.33 95.59 97.79 97.97 96.71 96.63
Word2Vec-CBOW+Conv+BiLSTM+Att+TCN+Dense 48.36 NA 50.00 50.00 24.18 32.59 48.36
Word2Vec-Skip-gram+Conv+BiLSTM+Att+TCN+Dense 48.22 NA 60.00 40.00 29.11 39.19 48.22

Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.
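As a point of reference for the Word2Vec baselines compared above, they can be trained with gensim as sketched below, where `sg` switches between CBOW and skip-gram; the hyperparameters and toy corpus are illustrative.

```python
from gensim.models import Word2Vec

sequences = ["MKTAYIAKQR", "GLFDIVKKVV"]          # toy training corpus
corpus = [list(seq) for seq in sequences]         # one "word" per residue

cbow = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=0)
skipgram = Word2Vec(corpus, vector_size=100, window=5, min_count=1, sg=1)

# Per-residue vectors for one sequence (static, i.e. not context-dependent like BERT).
embedding = [cbow.wv[aa] for aa in sequences[0]]
```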

Comparison of BBATProt with existing predictors on peptide datasets

The assessment of BBATProt’s predictive capability on peptide datasets necessitates a comprehensive examination of both the AMP and DPP-IV inhibitory peptide datasets.

Comparison of BBATProt with existing predictors on antimicrobial peptide datasets

In this section, we provide a detailed comparison of BBATProt with 13 SOTA methods [36, 46–55] using the XUAMP dataset. This comparison was conducted through a comprehensive evaluation in an independent testing environment. The results demonstrate that BBATProt significantly outperforms the other methods across key metrics, including ACC, MCC, SEN, Fsc, and the AUROC. As shown in Table 4, BBATProt shows an improvement ranging from 2.81% to 31.96% in ACC, 5.42% to 63.68% in MCC, 0.56% to 64.4% in SEN, 2.39% to 49.46% in FSc, and 0.71% to 40.51% in AUROC when compared with its counterparts.

Table 4.

Performance comparison results of BBATProt and existing methods on independent AMP dataset (%)

Predictor ACC MCC SEN SPE PRE FSc AUROC
BBATProt 85.96 71.98 88.20 83.71 84.41 86.26 93.81
ADAM-HMM 68.40 39.00 52.10 84.70 77.30 62.30 68.40
ADAM-SVM 61.20 26.40 34.60 87.80 73.90 47.10 61.20
AmpGram 56.40 13.10 44.50 68.20 58.40 50.50 54.70
AMAP 60.20 25.00 31.40 89.10 74.20 44.10 60.20
AMPEP 65.80 42.50 32.50 99.20 97.50 48.70 72.70
AMPfun 67.40 41.40 40.60 94.30 87.70 55.50 73.50
AMPScannerV2 56.80 13.70 52.30 61.30 57.50 54.80 58.50
APIN 57.90 16.30 44.60 71.20 60.70 51.40 57.50
APIN-fusion 56.00 12.30 45.70 66.30 57.60 51.00 55.40
iAMP-Attenpred 83.15 66.56 87.64 78.65 80.41 83.87 93.10
iAMPpred 54.00 8.30 65.20 42.80 53.30 58.70 57.50
Deep-AMPEP30 55.80 11.80 45.70 65.80 57.20 50.80 53.30
sAMPpred-GAT 71.50 46.40 53.00 90.00 NA NA 77.70

Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.

In order to verify the validity of BBATProt, six additional independent datasets were selected for evaluation. As shown in Fig. S16, the Upset plot was used to perform a correlation analysis across all seven datasets. The detailed results are presented in Tables S8–S19, and the statistical metrics, histograms, and standard deviation fluctuation plots are presented in Figs S1–S12. It is noteworthy that the model achieves an excellent balance between SEN and SPE while ensuring the stability of ACC. In addition, the mean AUROC value under the five-fold cross-validation for all datasets reaches 0.954. The low coefficient of variation (CV) values and narrow confidence intervals demonstrate the stability and repeatability of our results, thereby underscoring the reliability of our model’s performance across diverse dataset subsets.

Additionally, BBATProt ranked sixth and second in the SPE and PRE evaluations, respectively. While AMPEP achieved the highest scores in these two metrics, its overall average performance across the other five evaluation metrics was 32.80% lower than that of BBATProt. This indicates that BBATProt provides a more comprehensive extraction of feature information, emphasizing its strengths across various predictive tasks and suggesting potential avenues for further improvement in protein analysis. By examining the model characteristics of AMPEP in more detail, valuable insights can be gained for optimizing BBATProt in future studies.

To comprehensively assess the effectiveness of BBATProt in AMP prediction, a benchmark dataset was constructed. During dataset construction, sequences containing amino acids outside the 20 natural standard residues were excluded to ensure data consistency. The AMP dataset was compiled from various sources, including the AMPer, APD3, and ADAM databases [45, 56, 57]. Data cleaning involved setting a CD-HIT threshold of 90% to remove redundant sequence information. Correspondingly, non-AMP data were sourced from UniProt, with the residue length of protein fragments constrained to between 5 and 100 to keep sequence lengths similar to those of the AMP dataset for experimental validity. The CD-HIT redundancy removal threshold for the non-AMP dataset was set to 40%. In addition, sequences with annotations containing terms such as “Defensin,” “Antimicrobial,” “Antibiotic,” or “Antifungal” were excluded to ensure data purity.
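A sketch of the non-AMP filtering rules described above (length 5–100 residues and exclusion of antimicrobial-related annotations) is given below, assuming FASTA records with UniProt-style descriptions; the file names are placeholders.

```python
from Bio import SeqIO

EXCLUDE = ("defensin", "antimicrobial", "antibiotic", "antifungal")

def keep_negative(record) -> bool:
    desc = record.description.lower()
    return 5 <= len(record.seq) <= 100 and not any(term in desc for term in EXCLUDE)

negatives = [r for r in SeqIO.parse("uniprot_fragments.fasta", "fasta") if keep_negative(r)]
SeqIO.write(negatives, "non_amp_candidates.fasta", "fasta")
```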

Using the benchmark dataset, multiple repetitions of 10-fold cross-validation were performed to accurately evaluate the performance and generalization capability of the predictor. The average performance metrics across five repetitions are presented in Fig. 6 and detailed in Table S6. In addition, we computed several consistency metrics, including the mean, standard deviation, CV, and 95% confidence intervals, for each evaluation metric, as summarized in Table S7. The low CV values and narrow confidence intervals demonstrate the stability and repeatability of our results, thereby underscoring the reliability of our model’s performance across diverse dataset subsets.

Figure 6.

Alt text: Violin plots showing the distribution of cross-validation performance metrics on the AMP dataset for several models.

Five times average performance of 10-fold cross-validation method on the AMP benchmark dataset (%).

Comparison of BBATProt with existing predictors on inhibitory peptide datasets

Due to their crucial role in pharmaceutical development and diabetes treatment, distinguishing between inhibitory and noninhibitory peptides of DPP-IV has become a key focus. The BBATProt model was trained and evaluated on a dataset designed to differentiate these two classes, including both inhibitory DPP-IV peptides and noninhibitory counterparts [37]. The dataset was split into a benchmark set and an independent test set, each containing an equal distribution of both peptide types. Specifically, the benchmark set comprised 532 samples, while the independent set contained 133 samples. As presented in Table 5, BBATProt outperforms other models in terms of ACC, MCC, SEN, and SPE and shows comparable AUROC performance to the current SOTA models. This result demonstrates the rationality of BBATProt’s design based on interpretability architecture, which can ensure prediction accuracy while maintaining robustness.

Table 5.

Performance comparison results of BBATProt and existing methods on the DPP-IV inhibitory peptide dataset (%)

Dataset Predictor ACC MCC SEN SPE PRE FSc AUROC
Benchmark BBATProt 95.6 91.64 94.23 97.69 93.68 95.75 95.58
iDPPIV 81.9 64.3 NA 81.9 81.8 NA 87
iDPPIV-SCM 85.8 71.7 NA 87.7 83.9 NA 94
IPPF-FE NA NA NA NA NA NA NA
independent-data BBATProt 89.1 78.25 90.62 87.22 90.98 88.89 93.04
iDPPIV 79.7 59.4 NA 78.9 80.5 NA 84.7
iDPPIV-SCM 86.5 73.1 NA 87.4 85.60 NA 93.90
IPPF-FE 86.64 73.7 NA 85.71 87.97 NA 94.25

Note: The optimal performance for each metric is indicated in bold. NA is used as a representation when the result is not available.

Comparison of BBATProt with existing predictors on post-translational modification site prediction datasets

PTM prediction challenges include modifications such as protein lysine crotonylation and protein lysine glycation; these alterations can affect the functionality, stability, and binding affinity of proteins, subsequently steering their biological operations. However, experimental methods for site identification are both expensive and time-consuming. In contrast, computational methods can provide reasonable predictions in a highly efficient and cost-effective manner. Using bioinformatics tools and algorithms, protein sequences and structures can be analyzed to predict potential modification sites.

To validate the advanced nature of the designed network structure, BBATProt was compared with some classical machine learning classification methods, including KNN, RF, SVM, and XGBoost. A 10-fold cross-validation was implemented for Kcr site prediction, with the outcomes detailed in Table 6. The results indicate the superior effectiveness of BBATProt, most notably in short-sequence site prediction tasks. This can be attributed to the fact that BBATProt is based on hierarchical learning, which allows it to adapt to protein data of different lengths and extract high-level features, making the model more effective in handling complex classification tasks.

Table 6.

Ten-fold cross-validation comparison between BBATProt and other machine learning methods on the Kcr site prediction dataset with consistent BERT encoding features (%)

Predictor ACC MCC SEN SPE PRE FSc AUROC
BBATProt 94.95 89.93 94.05 95.86 95.80 94.91 94.95
BERT+RF 74.34 49.49 83.36 65.34 70.63 76.46 96.01
BERT+XGBOOST 76.86 54.10 82.80 70.91 74.00 78.15 97.38
BERT+KNN 70.12 40.23 69.48 70.74 70.36 69.91 78.70
BERT+SVM 79.47 59.30 84.97 73.97 76.56 80.53 92.40

Note: The optimal performance for each metric is indicated in bold.

To avoid overestimating the performance of BBATProt, a comparison was made between its training results and those of other existing SOTA models on independent test sets for both Kcr and Kgly site prediction problems. In the Kgly site prediction task presented in Fig. 7, BBATProt demonstrates a clear superiority over other advanced SOTA predictors across all evaluated metrics [39, 58–60]. Notably, for the Sensitivity metric, BBATProt shows an improvement range of 18.9%–56.9% compared with other predictors, underscoring its robust capability to accurately capture positive samples. This performance highlights the significant feasibility of BBATProt in PTM site prediction, emphasizing its effectiveness in identifying relevant sites with high accuracy.

Figure 7.

Alt text: Radar chart comparing multiple models on the Kgly prediction dataset across several metrics. BBATProt achieved the best overall performance.

Performance comparison results of BBATProt and existing methods on the Kgly site prediction dataset (%).

As shown in Fig. 8 and Table S23, BBATProt also outperforms other Kcr-specific predictors across nearly all metrics [33, 38, 61, 62]. Nevertheless, its SEN metric remains marginally lower than that of CKSAPP_CrotSite, reflecting the intrinsic challenge of modeling highly localized residue contexts. This limitation defines a clear avenue for future enhancements, particularly toward more effective capture of fine-grained sequence dependencies.

Figure 8.

Alt text: Bar chart comparing model performance on the independent Kcr prediction dataset across standard evaluation metrics. BBATProt achieved the best overall performance.

Performance comparison results of BBATProt and existing methods on independent Kcr site prediction dataset (%).

Across enzyme and peptide datasets featuring relatively long sequences, BBATProt delivers consistent improvements in ACC, MCC, AUROC, and AUPRC, reflecting an enhanced capacity to retain long-range dependencies. On short-sequence PTM benchmarks, where precise localization of the residue-level context is paramount, the attention module adaptively concentrates weights on local windows, balancing sensitivity and specificity. Taken together, these dataset-specific trends indicate that BBATProt’s cross-dataset generalization primarily stems from steering the pretrained backbone toward task-relevant semantic patterns rather than model scale.

Conclusion

Function prediction of proteins and peptides from amino acid sequences is crucial in both academic research and industrial applications. This study introduces BBATProt, an innovative framework designed to tackle this challenge. Diverse biological datasets were transformed into peptide and protein datasets with varying amino acid sequence lengths, targeting prediction tasks such as hydrolases, AMP, DPP-IV inhibitory peptides, and PTM at Kcr and Kgly sites. Comprehensive evaluations across multiple datasets confirm its superior accuracy, robustness, and generalization capabilities compared to SOTA models. Furthermore, this study enhances the interpretability of what is typically a black-box model by leveraging visualization techniques. The sequential network architecture, built around the spatial conformations of proteins, further validates the model’s effectiveness. Additionally, transfer learning strengthens feature extraction in our framework and yields measurable gains in predictive performance. BBATProt combines high accuracy with interpretability. It provides transparent attributions to sequence features and clarifies how specific amino-acid patterns drive functional outcomes. These attributions help identify candidate sites for site-directed mutagenesis to improve antimicrobial activity, catalytic efficiency, or regulation of PTMs. The predictions are readily verifiable and suitable for deployment in protein and peptide engineering. For example, BBATProt could be applied to mine large marine metagenomic datasets to prioritize candidates with targeted functions, such as PET-hydrolyzing enzymes or AMPs, and output score-ranked lists for experimental testing [63]. After activity confirmation, residue-level interpretability can guide rational mutagenesis to enhance function, and the validated sequences can serve as a feedback set for subsequent model adaptation. The model can also be adapted to specific applications through continued training on high-quality experimental datasets, which further improves task-specific performance and practical utility.

Although BBATProt performs well in protein and peptide function prediction, improvements are still necessary. BBATProt’s current sequence length limit of 512 amino acids aligns with natural protein distributions and the fixed input dimensions of BERT, ensuring computational efficiency. However, for long sequences, the sparsity introduced by zero-padding may hinder the capability of the model to capture long-range dependencies. Similarly, while BBATProt excels in functional prediction tasks with large-scale datasets, its capability to discern intrinsic sequence patterns in small-sample scenarios remains a challenge. To address these limitations, we are integrating sparse attention into long-sequence processing and combining meta-learning with expert priors to improve model prediction performance in small-sample settings. Furthermore, plans include the development of an intuitive and user-friendly web server to offer public prediction services, enhancing accessibility to a wider researcher base and expanding its practical applications.

Key Points

  • BBATProt proposes an interpretable neural network framework for protein function prediction.

  • The model outperforms state-of-the-art methods in various protein and peptide prediction tasks.

  • Visualization confirms model interpretability, highlighting attention focus on relevant data.

Supplementary Material

Supplement_Information_bbaf593

Contributor Information

Youqing Wang, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Xukai Ye, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Yue Feng, College of Life Science and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Haoqian Wang, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Xiaofan Lin, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Xin Ma, College of Information Sciences and Technology, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Yifei Zhang, State Key Laboratory of Chemical Resource Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China; Beijing Advanced Innovation Center for Soft Matter Science and Engineering, Beijing University of Chemical Technology, North Third Ring Road 15, 100029 Beijing, China.

Conflict of interest: None declared.

Funding

This work was supported by the National Natural Science Funds of China under Grants 32371325, 62303039, and 62433004; in part by the National Science Fund for Distinguished Young Scholars of China under Grant 62225303; in part by the China Postdoctoral Science Foundation BX20230034 and 2023M730190; in part by the Fundamental Research Funds for the Central Universities buctr20120201, QNTD2023-01; in part by Beijing Natural Science Foundation (L241014).

Data availability

Publicly available datasets were analyzed in this study. These data can be found here: https://github.com/Xukai-YE/BBATProt.

References

1. Attique M, Farooq MS, Khelifi A. et al. Prediction of therapeutic peptides using machine learning: computational models, datasets, and feature encodings. IEEE Access 2020;8:148570–94.
2. Gligorijević V, Renfrew PD, Kosciolek T. et al. Structure-based protein function prediction using graph convolutional networks. Nat Commun 2021;12:3168. 10.1038/s41467-021-23303-9
3. Jumper J, Evans R, Pritzel A. et al. Highly accurate protein structure prediction with AlphaFold. Nature 2021;596:583–9. 10.1038/s41586-021-03819-2
4. Samet Özdilek A, Atakan A, Özsari G. et al. ProFAB—open protein functional annotation benchmark. Brief Bioinform 2023;24:bbac627. 10.1093/bib/bbac627
5. Mitchell AL, Attwood TK, Babbitt PC. et al. InterPro in 2019: improving coverage, classification and access to protein sequence annotations. Nucleic Acids Res 2019;47:D351–60. 10.1093/nar/gky1100
6. Lai B, Xu J. Accurate protein function prediction via graph attention networks with predicted structure information. Brief Bioinform 2022;23:bbab502.
7. Han Y, Luo X. IPPF-FE: an integrated peptide and protein function prediction framework based on fused features and ensemble models. Brief Bioinform 2023;24:bbac476.
8. Cover TM, Hart PE. Nearest neighbor pattern classification. IEEE Trans Inform Theory 1967;13:21–7. 10.1109/TIT.1967.1053964
9. Breiman L. Random forests. Mach Learn 2001;45:5–32.
10. Vapnik VN. The nature of statistical learning theory. IEEE Trans Neural Netw 1997;8:1564. 10.1109/TNN.1997.641482
11. Chen T, Guestrin C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). San Francisco, California, USA: Association for Computing Machinery (ACM), 2016; pp. 785–794.
12. Wang S, Li G, Liao Z. et al. CnnPOGTP: a novel CNN-based predictor for identifying the optimal growth temperatures of prokaryotes using only genomic k-mers distribution. Bioinformatics 2022;38:3106–8. 10.1093/bioinformatics/btac289
13. Zhang S, Zheng D, Hu X. et al. Bidirectional long short-term memory networks for relation classification. In: Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation (PACLIC 2015). Shanghai, China: Association for Computational Linguistics, 2015; pp. 73–78.
14. Bai S, Kolter JZ, Koltun V. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint 2018;arXiv:1803.01271.
15. Vaswani A, Shazeer N, Parmar N. et al. Attention is all you need. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NeurIPS 2017). Long Beach, California, USA: Curran Associates, Inc., 2017; pp. 5998–6008.
16. Darmawan JT, Leu J-S, Avian C. et al. MITNet: a fusion transformer and convolutional neural network architecture approach for T-cell epitope prediction. Brief Bioinform 2023;24:bbad202.
17. Boadu F, Cao H, Cheng J. Combining protein sequences and structures with transformers and equivariant graph neural networks to predict protein function. Bioinformatics 2023;39:i318–25. 10.1093/bioinformatics/btad208
18. Liu S, Wang Y, Deng Y. et al. Improved drug–target interaction prediction with intermolecular graph transformer. Brief Bioinform 2022;23:bbac162. 10.1093/bib/bbac162
19. Devlin J, Chang M-W, Lee K. et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019). Minneapolis, Minnesota, USA: Association for Computational Linguistics, 2019; pp. 4171–4186.
20. Rives A, Meier J, Sercu T. et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci USA 2021;118:e2016239118. 10.1073/pnas.2016239118
21. Brandes N, Ofer D, Peleg Y. et al. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics 2022;38:2102–10. 10.1093/bioinformatics/btac020
22. Elnaggar A, Heinzinger M, Dallago C. et al. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 2022;44:7112–27. 10.1109/TPAMI.2021.3095381
23. Zhang Y, Lin J, Zhao L. et al. A novel antibacterial peptide recognition algorithm based on BERT. Brief Bioinform 2021;22:bbab200. 10.1093/bib/bbab200
24. Li B, Lin M, Chen T. et al. FG-BERT: a generalized and self-supervised functional group-based molecular representation learning framework for properties prediction. Brief Bioinform 2023;24:bbad398.
25. Wang K, Zeng X, Zhou J. et al. BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning. Brief Bioinform 2024;25:bbae195.
26. Lee H, Lee S, Lee I. et al. AMP-BERT: prediction of antimicrobial peptide function based on a BERT model. Protein Sci 2023;32:e4529.
27. Pang Y, Liu B. IDP-LM: prediction of protein intrinsic disorder and disorder functions based on language models. PLoS Comput Biol 2023;19:e1011657. 10.1371/journal.pcbi.1011657
28. Castro E, Godavarthi A, Rubinfien J. et al. Transformer-based protein generation with regularized latent space optimization. Nat Mach Intell 2022;4:840–51. 10.1038/s42256-022-00532-1
29. van der Maaten L, Hinton G. Visualizing data using t-SNE. J Mach Learn Res 2008;9:2579–605.
30. Xu J, Li F, Leier A. et al. Comprehensive assessment of machine learning-based methods for predicting antimicrobial peptides. Brief Bioinform 2021;22:bbab083. 10.1093/bib/bbab083
31. Zou H, Yin Z. Identifying dipeptidyl peptidase-IV inhibitory peptides based on correlation information of physicochemical properties. Int J Pept Res Ther 2021;27:2651–9. 10.1007/s10989-021-10280-2
32. Wang D, Zou L, Jin Q. et al. Human carboxylesterases: a comprehensive review. Acta Pharm Sin B 2018;8:699–712. 10.1016/j.apsb.2018.05.005
33. Lv H, Dao F-Y, Guan Z-X. et al. Deep-Kcr: accurate detection of lysine crotonylation sites using deep learning method. Brief Bioinform 2021;22:1–10.
34. Huang Y, Niu B, Gao Y. et al. CD-HIT suite: a web server for clustering and comparing biological sequences. Bioinformatics 2010;26:680–2. 10.1093/bioinformatics/btq003
35. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res 2015;43:D204–12. 10.1093/nar/gku989
36. Yan K, Lv H, Guo Y. et al. sAMPpred-GAT: prediction of antimicrobial peptide by graph attention network and predicted peptide structure. Bioinformatics 2023;39:1–8.
37. Charoenkwan P, Nantasenamat C, Hasan MM. et al. iBitter-fuse: a novel sequence-based bitter peptide predictor by fusing multi-view features. Int J Mol Sci 2021;22:8958.
38. Qiao Y, Zhu X, Gong H. BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models. Bioinformatics 2022;38:648–54. 10.1093/bioinformatics/btab712
39. Liu Y, Liu Y, Wang G-A. et al. BERT-Kgly: a bidirectional encoder representations from transformers (BERT)-based model for predicting lysine glycation site for Homo sapiens. Front Bioinform 2022;2:834153. 10.3389/fbinf.2022.834153
40. Lee J, Yoon W, Kim S. et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020;36:1234–40. 10.1093/bioinformatics/btz682
41. Yu J, Shi S, Zhang F. et al. PredGly: predicting lysine glycation sites for Homo sapiens based on XGboost feature optimization. Bioinformatics 2019;35:2749–56.
42. Zhang Y, Lin J, Zhao L. et al. A novel antibacterial peptide recognition algorithm based on BERT. Brief Bioinform 2021;22:bbab200. 10.1093/bib/bbab200
43. Liu S, Xue C, Fang Y. et al. Global involvement of lysine crotonylation in protein modification and transcription regulation in rice. Mol Cell Proteomics 2018;17:1922–36. 10.1074/mcp.RA118.000640
44. Nevers Y, Glover NM, Dessimoz C. et al. Protein length distribution is remarkably uniform across the tree of life. Genome Biol 2023;24:135. 10.1186/s13059-023-02973-2
45. Mikolov T, Chen K, Corrado G. et al. Efficient estimation of word representations in vector space. arXiv preprint 2013;arXiv:1301.3781.
46. Lee H-T, Lee C-C, Yang J-R. et al. A large-scale structural classification of antimicrobial peptides. Biomed Res Int 2015;2015:475062. 10.1155/2015/475062
47. Burdukiewicz M, Sidorczuk K, Rafacz D. et al. Proteomic screening for prediction and design of antimicrobial peptides with AmpGram. Int J Mol Sci 2020;21:E4310. 10.3390/ijms21124310
48. Gull S, Shamim N, Minhas F. AMAP: hierarchical multi-label prediction of biologically active and antimicrobial peptides. Comput Biol Med 2019;107:172–81. 10.1016/j.compbiomed.2019.02.018
49. Bhadra P, Yan J, Li J. et al. AmPEP: sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Sci Rep 2018;8:1697. 10.1038/s41598-018-19752-w
50. Chung C-R, Kuo T-R, Wu L-C. et al. Characterization and identification of antimicrobial peptides with different functional activities. Brief Bioinform 2020;21:1098–114. 10.1093/bib/bbz043
51. Veltri D, Kamath U, Shehu A. Deep learning improves antimicrobial peptide recognition. Bioinformatics 2018;34:2740–7. 10.1093/bioinformatics/bty179
52. Su X, Xu J, Yin Y. et al. Antimicrobial peptide identification using multi-scale convolutional network. BMC Bioinformatics 2019;20:730. 10.1186/s12859-019-3327-y
53. Xing W, Zhang J, Li C. et al. iAMP-Attenpred: a novel antimicrobial peptide predictor based on BERT feature extraction method and CNN-BiLSTM-attention combination model. Brief Bioinform 2024;25:bbad443.
54. Meher PK, Sahu TK, Saini V. et al. Predicting antimicrobial peptides with improved accuracy by incorporating the compositional, physico-chemical and structural features into Chou’s general PseAAC. Sci Rep 2017;7:42362. 10.1038/srep42362
55. Yan J, Bhadra P, Li A. et al. Deep-AmPEP30: improve short antimicrobial peptides prediction with deep learning. Mol Ther Nucleic Acids 2020;20:882–94. 10.1016/j.omtn.2020.05.006
56. Fjell CD, Hancock REW, Cherkasov A. AMPer: a database and an automated discovery tool for antimicrobial peptides. Bioinformatics 2007;23:1148–55. 10.1093/bioinformatics/btm068
57. Wang G, Li X, Wang Z. APD3: the antimicrobial peptide database as a tool for research and education. Nucleic Acids Res 2016;44:D1087–93. 10.1093/nar/gkv1278
58. Johansen MB, Kiemer L, Brunak S. Analysis and prediction of mammalian protein glycation. Glycobiology 2006;16:844–53. 10.1093/glycob/cwl009
59. Ju Z, Sun J, Li Y. et al. Predicting lysine glycation sites using bi-profile Bayes feature extraction. Comput Biol Chem 2017;71:98–103. 10.1016/j.compbiolchem.2017.10.004
60. Xu Y, Li L, Ding J. et al. Gly-PseAAC: identifying protein lysine glycation through sequences. Gene 2017;602:1–7.
61. Ju Z, He J-J. Prediction of lysine crotonylation sites by incorporating the composition of k-spaced amino acid pairs into Chou’s general PseAAC. J Mol Graph Model 2017;77:200–4. 10.1016/j.jmgm.2017.08.020
62. Liu Y, Yu Z, Chen C. et al. Prediction of protein crotonylation sites through LightGBM classifier based on SMOTE and elastic net. Anal Biochem 2020;609:113903. 10.1016/j.ab.2020.113903
63. Chen J, Jia Y, Sun Y. et al. Global marine microbial diversity and its potential in bioprospecting. Nature 2024;633:371–9. 10.1038/s41586-024-07891-2


Supplementary Materials

Supplement_Information_bbaf593
