Computational and Structural Biotechnology Journal. 2025 Dec 18;31:94–100. doi: 10.1016/j.csbj.2025.12.014

BLMPred: Predicting linear B-cell epitopes using pre-trained protein language models and machine learning

Barnali Das 1, Dmitrij Frishman 1,
PMCID: PMC12795683  PMID: 41536693

Abstract

B-cells are activated through interaction with B-cell epitopes, specific portions of an antigen. Identification of B-cell epitopes is crucial for a wide range of clinical applications, including disease diagnostics, vaccine and antibody development, and immunotherapy. While experimental B-cell epitope identification is expensive and time-consuming, computational tools are emerging that can generate lists of high-confidence epitopes for experimental validation. In this paper, we present BLMPred, a sequence-based linear B-cell epitope prediction tool that exploits pre-trained protein language model embeddings to derive local and global structural features from the protein primary structure. BLMPred is a binary classifier that predicts whether an input peptide sequence is an antibody epitope without relying on 3D protein structures. BLMPred outperformed comparable tools when tested on multiple independent datasets. It is freely available at https://github.com/bdbarnalidas/BLMPred.git.

Keywords: B-cell epitopes, Machine learning, Protein language models, Embedders, Per-protein embeddings

Graphical Abstract


1. Introduction

B-cells provide long-term immunity against cancerous cells and pathogens and are therefore a vital component of the adaptive immune system. B-cells are activated when B-cell receptors, transmembrane proteins on the B-cell surface, interact with B-cell epitopes, specific portions of the antigen. Linear B-cell epitopes are stretches of adjacent residues along the antigen primary sequence, whereas non-contiguous residues spatially co-localized by protein folding form discontinuous or conformational B-cell epitopes. Both linear and discontinuous B-cell epitopes play significant roles in binding immunoglobulins (B-cell receptors or antibodies) for recognizing foreign antigens [8].

Identification of B-cell epitopes is crucial in clinical and biotechnological applications such as disease diagnostics [25] and vaccine and antibody design [19], [2]. Since B-cell epitope identification by assay screening is expensive and time-consuming, computational tools have become essential for reducing development time and cost [35], [8].

The accuracy of in-silico B-cell epitope prediction methods has improved significantly over the past decades. Available approaches can be broadly classified into two categories. The first category comprises methods accepting protein sequences as input, such as Bepipred-3.0 [6], Bepipred-2.0 [16], epiDope [7], GraphBepi [45], Emini surface accessibility prediction [12], Parker hydrophilicity prediction [30], Kolaskar & Tongaonkar antigenicity prediction [18], Bepitope [29], and BcePred [33]. The second category includes tools trained explicitly on exact linear B-cell epitope sequences, such as SVMTriP [42], LBEEP [34], epitope1D [8], LBCE-XGB [22], and LBCE-BERT [21]. These tools are beneficial when the user wants to know whether an input peptide is a potential linear B-cell epitope, rather than to locate potential B-cell epitopes within a given protein sequence. Methods from the first group perform poorly when tested on short peptides, as observed in a recent study [8]. Recent studies have explored various computational strategies for epitope prediction, highlighting advances in machine learning, protein language model embeddings, and integrative sequence-feature approaches [14], [17], [26], [36], [37], [39], [40], [41], [44], [46].

Here, we propose BLMPred for predicting whether an input peptide sequence of 5–60 residues is a linear B-cell epitope. BLMPred was developed by training a Support Vector Machine on the numerical embeddings generated by the ProtTrans protein language model (pLM) from a large, experimentally curated linear epitope dataset. On a comprehensive benchmark dataset, BLMPred outperformed SVMTriP, LBEEP, and epitope1D on most performance metrics.

2. Materials and methods

2.1. Dataset of linear B-cell epitopes

We downloaded 208265 experimentally validated linear B-cell epitope sequences (positive samples) and 487127 non-B-cell epitope sequences (negative samples) from the Immune Epitope Database (IEDB, version: March 2023) [38]. We removed 1075 duplicate entries, 215 peptides containing non-standard amino acid symbols (Z, B, J, O, U, X), 512 peptides shorter than the typical minimum length of 5 amino acids for linear B-cell epitopes [9], as well as 95971 peptides found in both the positive and the negative dataset. Furthermore, all peptides in the positive dataset longer than 60 amino acids were excluded from consideration, since the negative dataset contained peptides with lengths of up to 60 amino acids. This dataset was named BLMPred_5–60.
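The filtering steps above can be sketched as follows. This is an illustrative helper, not the published BLMPred pipeline code; the function name and thresholds are our own.

```python
# Sketch of the data-cleaning filters described above (illustrative only).
NON_STANDARD = set("ZBJOUX")  # non-standard / ambiguous amino acid symbols

def clean_peptides(positives, negatives, min_len=5, max_len=60):
    """Deduplicate, drop peptides with non-standard residues or
    out-of-range lengths, and remove peptides labelled both ways."""
    pos, neg = set(positives), set(negatives)
    shared = pos & neg          # present in both classes -> ambiguous label
    pos -= shared
    neg -= shared

    def keep(pep):
        return min_len <= len(pep) <= max_len and not NON_STANDARD & set(pep)

    return sorted(filter(keep, pos)), sorted(filter(keep, neg))
```

Applying the function to a handful of toy peptides removes the duplicate, the peptide with an `X`, the too-short peptide, and the peptide shared between the two classes.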

Recent reports suggest that the length of B-cell epitopes varies between 5–8 and 25 amino acid residues [1], [24], [32], [8]. This length range is dictated by the structural requirements of epitope binding to the Complementarity Determining Regions (CDRs) of B-cell receptors. According to the INDI database [10], CDR1, CDR2, and CDR3 vary in length between 4–17, 5–17, and 5–38 amino acids, respectively. We therefore created an alternative dataset, BLMPred_8–25, consisting of peptides varying in length between 8 and 25 residues.

The host protein sequences corresponding to the B-cell epitopes and non-B-cell epitopes were retrieved in FASTA format using Dbfetch [23]. To minimize sequence redundancy and potential bias due to homologous proteins, the sequences were clustered using CD-HIT [15] with an 80 % sequence identity threshold, and a representative protein sequence was selected from each cluster. We then performed peptide-level redundancy checks on the peptides belonging to these representative proteins and removed exact duplicates and highly similar (≥80 % identity) peptide sequences. After these data cleaning steps, BLMPred_5–60 and BLMPred_8–25 contained 111015 (390589) and 102023 (387155) positive (negative) samples, respectively.

2.2. Preparation of training and test datasets

Since the BLMPred_5–60 and BLMPred_8–25 datasets were initially imbalanced, with a positive-to-negative sample ratio of 1:3, we balanced them by drawing 111015 and 102023 negative samples from the pools of 390589 and 387155 negative samples in these two datasets, respectively. Additionally, we ensured that the sequence length distribution of the sampled negatives closely matched that of the positive samples. Both datasets were split into training (90 %) and test (10 %) datasets while retaining similar length distributions (Supplementary Figure S1). The training and test datasets constructed from the BLMPred_5–60 (BLMPred_8–25) dataset are referred to as BLMPred_5–60_training (BLMPred_8–25_training) and BLMPred_5–60_test (BLMPred_8–25_test), respectively. The training datasets were used for cross-validation, while the test datasets were used solely as independent datasets for assessing the performance of the final trained methods.
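Length-matched undersampling of the negatives can be sketched as below. The paper does not specify its exact matching procedure, so `sample_length_matched` is an assumption: it bins negatives by length and draws one negative per positive from the matching bin.

```python
import random
from collections import defaultdict

def sample_length_matched(positives, negatives, seed=0):
    """Draw one negative per positive so the sampled negatives mirror
    the positive length distribution (illustrative sketch)."""
    rng = random.Random(seed)
    by_len = defaultdict(list)
    for pep in negatives:           # bin the negative pool by length
        by_len[len(pep)].append(pep)
    sampled = []
    for pep in positives:
        pool = by_len[len(pep)]
        if pool:                    # skip lengths with no negatives left
            sampled.append(pool.pop(rng.randrange(len(pool))))
    return sampled
```

Because each negative is removed from its bin once drawn, the result contains no duplicates and its length histogram matches that of the positives wherever the pool allows.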

2.3. Preparation of benchmarking dataset

We prepared a separate independent dataset (BLMPred_benchmark) for comparing the performance of BLMPred with other existing tools. Since the training and test datasets described above are based on the IEDB release of March 2023, we downloaded sequences of linear B-cell epitopes and non-B-cell epitopes deposited with IEDB after April 2023. After data cleaning and filtering, the final BLMPred_benchmark dataset contained 2928 positive and 1000 negative samples.

2.4. Language model (LM) embeddings

For each peptide in our dataset, we generated average embeddings of length 1024 by utilizing the ProtT5-XL-U50 model of the ProtTrans protein language model [11].
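The "average embedding" is obtained by mean-pooling the per-residue embedding matrix over the sequence dimension. A minimal sketch follows, with a random matrix standing in for the ProtT5-XL-U50 encoder output (running the actual encoder requires downloading the multi-gigabyte model):

```python
import numpy as np

def mean_pool(residue_embeddings):
    """Collapse an (L, 1024) per-residue embedding matrix into a single
    1024-dimensional per-peptide vector by averaging over the L residues."""
    return np.asarray(residue_embeddings, dtype=float).mean(axis=0)

# In practice the (L, 1024) matrix comes from the ProtT5-XL-U50 encoder;
# here random numbers stand in for it.
rng = np.random.default_rng(0)
embedding = mean_pool(rng.normal(size=(12, 1024)))  # a 12-residue peptide
assert embedding.shape == (1024,)
```

The resulting fixed-length vector is what the classifiers in the next section consume, regardless of peptide length.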

2.5. Machine learning

Identification of linear B-cell epitopes was cast as a binary classification problem in which an input peptide is classified as a B-cell epitope or a non-epitope. A broad range of traditional machine learning models implemented in the Scikit-learn package [31] was tested, including the AdaBoost classifier, bagging classifier, extra trees classifier, Gaussian Naïve Bayes, histogram-based gradient boosting classifier, k-nearest neighbors, linear discriminant analysis, logistic regression, multi-layer perceptron, quadratic discriminant analysis, random forest, and support vector machine. We also tested the XGBoost classifier [3] from the Python XGBoost package, and trained an Explainable Boosting Machine (EBM) classifier [27] provided by the open-source Python package InterpretML [28]. For accelerated training of the machine learning models we utilized RAPIDS, a data science framework capable of executing end-to-end pipelines entirely on the GPU [13].

2.6. Performance evaluation metrics

To assess the model performance on the test dataset, we calculated several performance metrics including accuracy (ACC), precision (P), sensitivity or recall (R), F1 score (F1), specificity (S), Matthews Correlation Coefficient (MCC), area under the ROC curve (AUROC), and area under the precision-recall curve (AUPRC) as follows:

ACC = (TP + TN) / (TP + TN + FP + FN)
P = TP / (TP + FP)
R = TP / (TP + FN)
F1 = 2 × P × R / (P + R)
S = TN / (TN + FP)
MCC = (TP × TN − FP × FN) / √((TP + FP) × (TP + FN) × (TN + FP) × (TN + FN))

where TP, TN, FP, and FN denote the number of true positives, true negatives, false positives, and false negatives, respectively. Among all the evaluation metrics, MCC has been reported to be more informative in evaluating binary classification problems [4], [5]. Hence, although we report our results based on the full set of metrics, we selected the trained model with the highest MCC as the optimal classifier for further processing.
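The definitions above translate directly into code. A small helper (the function name is ours, not from the BLMPred source):

```python
from math import sqrt

def binary_metrics(tp, tn, fp, fn):
    """Evaluation metrics computed from the confusion counts as defined above."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "P": precision,
        "R": recall,
        "F1": 2 * precision * recall / (precision + recall),
        "S": tn / (tn + fp),
        "MCC": (tp * tn - fp * fn)
               / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),
    }
```

Note that MCC, unlike accuracy, stays near zero for a classifier that ignores one class, which is why it drives model selection here.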

3. Results and discussion

3.1. Overview of the methodology

A high-quality dataset of experimentally verified linear B-cell epitopes (BLMPred) was derived while keeping similar peptide length distributions among the positive and negative samples. Peptide sequences were converted into feature vectors of 1024 numerical values using the ProtTrans embedder. The generalization capability of the models was first evaluated on the training data using 10-fold cross-validation, while the final accuracy assessment was conducted on the test data. We trained a broad range of machine learning models and selected the best performing one, which was then applied to predict linear B-cell epitopes in the independent dataset. The entire methodology is summarized in Fig. 1.

Fig. 1. Flowchart of our proposed methodology.

3.2. Models trained on the BLMPred_5–60 dataset

Machine learning models listed in Section 2.5 were trained on the embeddings derived from the BLMPred_5–60 dataset, and their performance was assessed by 10-fold cross-validation (Supplementary Table S1, Supplementary Figure S2). The best results were achieved with the Support Vector Machine, with mean ± std values of accuracy, precision, recall, F1 score, specificity, MCC, AUROC and AUPRC of 0.839 ± 0.0027, 0.849 ± 0.0046, 0.824 ± 0.0037, 0.837 ± 0.0033, 0.855 ± 0.0041, 0.679 ± 0.0055, 0.839 ± 0.0028, and 0.788 ± 0.0047, respectively. For each classifier, the 10 models resulting from the folds of the 10-fold cross-validation were tested on the independent BLMPred_5–60_test dataset (Supplementary Table S2, Supplementary Figure S3), and SVM outperformed all other classifiers. Thus, when utilizing ProtTrans embeddings as input, SVM was clearly the best performing method on the BLMPred_5–60 dataset. In general, we found that none of the models were overfitted and their results were quite robust, as evidenced by the low standard deviations of the performance metrics (Supplementary Tables S1 and S2).

3.3. Models trained on the BLMPred_8–25 dataset

SVM was also the best performing method among the machine learning models trained on the embeddings derived from the BLMPred_8–25 dataset (Supplementary Tables S3, S4, Supplementary Figures S4, S5). Its accuracy was around 5 % higher than that of the next best performing model, k-nearest neighbors. SVMs trained on the BLMPred_8–25 and BLMPred_5–60 datasets achieved similar performance (Supplementary Tables S1-S4). This is not surprising, as the BLMPred_8–25 and BLMPred_5–60 datasets strongly overlap: only approximately 9 % of the BLMPred_5–60 dataset consists of B-cell epitopes with lengths outside the 8–25 range, which were eliminated to create the BLMPred_8–25 dataset.

3.4. BLMPred models

Based on the model assessment presented above, we selected SVM for further analyses. SVM trained on the entire BLMPred_5–60_training and BLMPred_8–25_training datasets will be referred to as BLMPred_5–60 and BLMPred_8–25 models, respectively. The BLMPred_5–60 model, when tested on the independent BLMPred_5–60_test dataset, exhibited the accuracy, precision, recall, F1 score, specificity, MCC, AUROC and AUPRC of 0.846, 0.859, 0.829, 0.844, 0.864, 0.693, 0.846, and 0.798, respectively. Similarly, the values of accuracy, precision, recall, F1 score, specificity, MCC, AUROC and AUPRC achieved by the BLMPred_8–25 model when tested on the independent BLMPred_8–25_test dataset were 0.835, 0.849, 0.814, 0.831, 0.856, 0.671, 0.835, and 0.785, respectively. The BLMPred_5–60 and BLMPred_8–25 models accurately predicted 82.9 % (86.4 %) and 81.4 % (85.6 %) of the B-cell epitopes (non-B-cell epitopes) present in the BLMPred_5–60_test and BLMPred_8–25_test datasets, respectively, with low Type I and Type II error levels (Fig. 2).

Fig. 2. Confusion matrices obtained (A) on the independent BLMPred_5–60_test dataset by the BLMPred_5–60 model, and (B) on the BLMPred_8–25_test dataset by the BLMPred_8–25 model.

3.5. BLMPred performance on homology reduced dataset

Strict homology reduction procedures routinely applied to full-length protein sequences are not directly applicable to short peptides. Nevertheless, we assessed the potential similarity between the peptides in our datasets as well as model performance on a homology-reduced dataset. We downloaded 215912 experimentally validated linear B-cell epitope sequences (positive samples) and 491874 non-B-cell epitope sequences (negative samples) from the Immune Epitope Database (IEDB, version: November 2024) [38]. The B-cell epitopes and non-B-cell epitopes belonged to 10155 and 7925 host proteins, respectively, with unique UniProt accession numbers. Protein sequences of these host proteins were extracted in FASTA format using Dbfetch [23] and clustered at a 70 % sequence similarity threshold by CD-HIT [15]. To reduce redundancy, training and test datasets were generated by extracting epitopes from the representative sequence of each cluster. After extensive data cleaning and filtering, the final dataset comprised epitope sequences with lengths between 5 and 60. Since the datasets were imbalanced (negative-to-positive ratio 4:1), we balanced them by drawing samples from the negative pool while keeping a sequence length distribution similar to that of the positive samples. The final training and test datasets consisted of 115310 and 20926 epitopes, respectively. ProtTrans embeddings [11] of these datasets were generated and used for training machine learning models. The Support Vector Machine again performed best on the test dataset, achieving accuracy, precision, recall, F1 score, specificity, MCC, AUROC and AUPRC of 0.754, 0.92, 0.555, 0.693, 0.952, 0.553, 0.754, and 0.734, respectively. This reduction in performance is expected: stricter redundancy filtering lowers the sequence similarity between training and test peptides, so the model can no longer benefit from near-identical peptides seen during training.

3.6. Performance of BLMPred compared with the reported performance of other linear B-cell epitope prediction tools

We compared the performance of BLMPred with five other models trained to classify an input peptide sequence as a B-cell epitope or not: Support Vector Machine based on Tri-peptide similarity and Propensity scores (SVMTriP) [42], Linear B-Cell Exact Epitope Predictor (LBEEP) [34], LBCE-XGB [22], LBCE-BERT [21], and epitope1D [8]. A detailed summary of these tools is presented in Supplementary Table S5, including the specific algorithms, datasets and features utilized, as well as the performance metrics reported in the corresponding original publications. According to Supplementary Table S5, epitope1D reportedly outperforms other methods in terms of MCC and AUROC, while our method, BLMPred_5–60, performs better than all existing tools in terms of accuracy, precision, recall, specificity, F1 score, and AUPRC. We attribute this high performance of BLMPred_5–60 to the large amount of up-to-date experimentally verified linear B-cell epitope data collected from IEDB, extensive data filtering, and the utilization of ProtTrans embeddings. Although BLMPred shows somewhat lower AUROC values than epitope1D (0.83–0.84 vs 0.93), it achieves higher sensitivity and accuracy in identifying B-cell epitopes, which is critical for selecting experimentally testable candidates. The difference in AUROC likely arises from epitope1D's incorporation of taxonomic, ontology, and graph-based features that enhance peptide ranking, whereas BLMPred focuses on thresholded binary classification using ProtT5 embeddings with an SVM, prioritizing practical sensitivity over rank-based discrimination. In the next section, we utilize a benchmarking dataset and perform a detailed comparative analysis of BLMPred with SVMTriP, LBEEP, and epitope1D.

3.7. BLMPred compared with other existing tools on an independent dataset

As mentioned in Section 1, there are two major varieties of B-cell epitope prediction tools: i) methods predicting epitopic regions from protein sequences, and ii) methods predicting whether an input peptide is a B-cell epitope. Although tools belonging to the first category can also be executed on small peptides, it would be unjustifiable to compare them with tools belonging to the second category. For example, Bepipred-3.0, Bepipred-2.0, and Bepipred-1.0 have excellent predictive capabilities when tested on whole protein sequences, but they do not perform well on short peptides, as observed in a recent study [8]. Our BLMPred model falls into the second category, and here we conduct a detailed comparative performance analysis of BLMPred with the freely available standalone tools of the same category that we were able to install and execute: SVMTriP, LBEEP, and epitope1D. Supplementary Table S6 briefly summarizes the evaluation feasibility, reasons for exclusion from benchmarking, and notes for the existing tools considered for the benchmarking analysis. To ensure fair benchmarking, we executed the standalone versions of epitope1D, LBEEP, and SVMTriP with their recommended default parameters. We note that epitope1D incorporates taxonomic information, which can influence performance across species and may introduce biases in comparative outcomes. For SVMTriP, six separate models were trained on epitopes of length 10, 12, 14, 16, 18, and 20 amino acids [42]; we selected the reportedly best-performing model, trained on epitopes of 20 amino acids. LBEEP, SVMTriP, and epitope1D only accept input peptides within the length ranges 6–15, 10–20, and ≥6, respectively. Although the BLMPred models have no strict length-based restrictions and can be executed on peptides of any length, they were trained on, and thus perform best on, peptides of 5–60 residues.
Hence, we utilized different groups of samples from the BLMPred_benchmark dataset to fulfill the length-based restrictions of the selected tools for an unbiased detailed comparative analysis of their performance (Table 1).

Table 1.

Comparison of BLMPred (name marked in bold font) with the previously published methods. The best achieved values for each of the performance metrics are highlighted in bold font. Lowest achieved values for FP and FN are in bold italics font.

| Model | Length restriction | Partition length range | #Epitopes | #Non-epitopes | Accuracy | Precision | Recall | F1 score | Specificity | AUROC | AUPRC | TP | FP | TN | FN | NPV | Balanced accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BLMPred | none | 6–71 | 2928 | 1000 | 0.72 | 0.86 | 0.75 | 0.80 | 0.64 | 0.69 | 0.83 | 2194 (74.9 %) | 361 (36.1 %) | 639 (63.9 %) | 734 (25.1 %) | 0.47 | 0.69 |
| epitope1D | ≥6 | 6–71 | 2928 | 1000 | 0.31 | 0.89 | 0.09 | 0.16 | 0.97 | 0.53 | 0.76 | 259 (8.8 %) | 31 (3.1 %) | 969 (96.9 %) | 2669 (91.2 %) | 0.27 | 0.53 |
| BLMPred | none | 20 | 188 | 41 | 0.66 | 0.81 | 0.76 | 0.78 | 0.17 | 0.46 | 0.81 | 143 (76 %) | 34 (83 %) | 7 (17 %) | 45 (24 %) | 0.13 | 0.47 |
| epitope1D | ≥6 | 20 | 188 | 41 | 0.21 | 1.00 | 0.03 | 0.06 | 1.00 | 0.52 | 0.83 | 6 (3.2 %) | 0 (0 %) | 41 (100 %) | 182 (96.8 %) | 0.18 | 0.52 |
| SVMTriP | 10–20 | 20 | 188 | 41 | 0.30 | 0.78 | 0.21 | 0.33 | 0.72 | 0.46 | 0.82 | 39 (20.7 %) | 11 (31.7 %) | 28 (68.3 %) | 149 (79.3 %) | 0.16 | 0.46 |
| BLMPred | none | 6–15 | 546 | 400 | 0.63 | 0.78 | 0.50 | 0.60 | 0.80 | 0.65 | 0.68 | 273 (50 %) | 79 (19.8 %) | 321 (80.2 %) | 273 (50 %) | 0.54 | 0.65 |
| LBEEP | 6–15 | 6–15 | 546 | 400 | 0.47 | 0.56 | 0.35 | 0.43 | 0.63 | 0.49 | 0.57 | 192 (35.2 %) | 149 (37 %) | 252 (63 %) | 355 (64.8 %) | 0.42 | 0.49 |
| epitope1D | ≥6 | 6–15 | 546 | 400 | 0.45 | 0.90 | 0.05 | 0.09 | 0.99 | 0.52 | 0.59 | 26 (4.8 %) | 3 (0.5 %) | 398 (99.5 %) | 521 (95.2 %) | 0.43 | 0.52 |

As shown in Table 1, BLMPred correctly identifies 75 %, 50 %, and 76 % of the epitopes in the length ranges 6–71, 6–15, and 20, respectively, in the benchmarking dataset. BLMPred also correctly predicts 64 % and 80 % of the non-B-cell epitopes in the length ranges 6–71 and 6–15, respectively. Although the model identifies 143 of the 188 epitopes of length 20, it fails to identify most of the non-B-cell epitopes, resulting in low specificity for this partition. The AUROC value of 0.46 arises exclusively from a very small, length-specific subset of the benchmark dataset - peptides of exactly 20 amino acids. We attribute this to the scarcity of such peptides during training: only 1.6 % of our training dataset consisted of peptides of exactly 20 aa. Such sparsely represented subsets are inherently more prone to variance and do not reliably reflect the general behavior of the model. Importantly, this isolated AUROC value should not be interpreted as representative of the model's overall performance. For all other peptide-length partitions of the benchmark dataset, BLMPred achieves AUROC values of 0.69 and 0.65 (Table 1), demonstrating performance that is robust and well above random expectation. Also, to account for class imbalance in the benchmarking dataset, we report the F1 score, AUPRC, and balanced accuracy (Table 1), which collectively demonstrate robust predictive performance. In imbalanced real-world settings the F1 score is particularly informative, and based on F1 scores BLMPred consistently outperforms the other methods across all partitions of the benchmark dataset.

4. Conclusions

The existing machine learning-based B-cell epitope prediction tools can be broadly partitioned into two categories. One category, including tools such as Bepipred-3.0, Bepipred-2.0, and epiDope, is trained on complete antigen protein sequences, whereas the other, including tools such as SVMTriP, LBEEP, and epitope1D, is trained on exact linear B-cell epitope sequences. The first category seems to perform poorly on short peptides, as observed in a recent study [8]. Both categories are equally important for antibody epitope prediction: the first is mainly beneficial when the goal is to identify all B-cell epitopes within an entire protein sequence, while the second is applicable when the user seeks to determine whether a specific peptide could serve as a potential B-cell epitope. Previous reports suggest that epitope prediction from conserved protein sequences derived from Multiple Sequence Alignments (MSAs) is more accurate [43]. Peptide-based B-cell epitope prediction tools are particularly well suited for classifying conserved MSA blocks as antibody epitopes or not.

BLMPred is a simple binary classifier developed by training a Support Vector Machine on a large, non-redundant, experimentally validated dataset of 222030 peptides numerically embedded by the ProtTrans pLM. The dataset was carefully constructed by maintaining matching length distributions between the positive and negative samples collected from the IEDB database. A 10-fold cross-validation revealed that the BLMPred model outperformed other traditional machine learning algorithms. The BLMPred model is characterized by low rates of both type-I and type-II errors and therefore has good predictive capability for identifying both B-cell epitopes and non-epitopes, as evident from Fig. 2. BLMPred outperforms the other tested tools on all partitions of the benchmarking dataset in terms of accuracy, recall, and F1 score (Table 1).

Although BLMPred employs a conventional SVM-based classification framework, its novelty arises from the incorporation of protein language model (pLM)-derived embeddings as peptide representations. These embeddings, obtained from ProtTrans models trained on billions of protein sequences, encode contextual, evolutionary, and structural information that classical physicochemical or handcrafted features cannot capture. Unlike conventional descriptors that treat each residue as an independent entity, pLM embeddings model residue co-occurrence patterns, long-range dependencies, and biologically relevant sequence contexts. Such context-aware representations are likely to encompass immunogenic determinants such as surface accessibility, flexibility, and conserved motifs that contribute to antigen-antibody interactions. Therefore, even though the classifier itself is simple, the embeddings provide a rich and biologically meaningful feature space that enhances the model’s ability to distinguish B-cell epitopes from non-epitopes.

The BLMPred model is available as a GitHub repository (https://github.com/bdbarnalidas/BLMPred.git) with thorough instructions for users and can be easily cloned and executed by experts and non-experts alike. The B-cell epitopes predicted by BLMPred can be utilized in immunology and biotechnology for vaccine development and antibody engineering. In the future, we anticipate that combining structure-based and sequence-based embeddings may further improve the predictive potential of BLMPred. We also wish to extend our approach by training BLMPred on entire protein sequences embedded by ProtTrans to predict one or multiple B-cell epitopes within a protein sequence, applying both per-residue and per-protein embeddings for that purpose. We can also test other widely used pLMs such as ESM [20]. Another possible future peptide-based approach would be to map peptides to protein sequences, generate embeddings for the proteins, extract the sub-embeddings corresponding to the peptide of interest, and use them for training machine learning models. We anticipate that this approach may improve overall performance, since pLM embedding vectors for entire protein sequences capture long-range dependencies, possibly making the resulting representation more informative.

Abbreviations

Not applicable.

CRediT authorship contribution statement

Dmitrij Frishman: Writing – review & editing, Supervision, Resources, Project administration, Investigation, Funding acquisition, Conceptualization. Barnali Das: Writing – original draft, Visualization, Validation, Software, Methodology, Investigation, Formal analysis, Data curation, Conceptualization.

Authors’ contributions

B.D.: Conception, Design, Data acquisition, Analysis, Software, Writing original draft; D.F.: Conception, Analysis, Supervision, Funding acquisition, Review manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors consent for publication.

Funding

This work was funded by the grant 031L0292E from the German Federal Ministry of Education and Research (BMBF).

Declaration of Competing Interest

None.

Acknowledgements

Not applicable.


Appendix A

Supplementary data associated with this article can be found in the online version at doi:10.1016/j.csbj.2025.12.014.

Supplementary material

mmc1.docx (1.6MB, docx)

Data availability

We used the experimentally validated linear B-cell epitopes from the IEDB database (https://www.iedb.org). The BLMPred datasets created for this paper are publicly available at https://github.com/bdbarnalidas/BLMPred/tree/main/BLMPred_Datasets. The benchmarking datasets used for a performance assessment between different models are publicly available at https://github.com/bdbarnalidas/BLMPred/tree/main/BLMPred_Datasets.

References

1. Ashford J., Reis-Cunha J., Lobo I., Lobo F., Campelo F. Organism-specific training improves performance of linear B-cell epitope prediction. Bioinformatics. 2021;37(24):4826–4834. doi: 10.1093/bioinformatics/btab536.
2. Behmard E., Soleymani B., Najafi A., Barzegari E. Immunoinformatic design of a COVID-19 subunit vaccine using entire structural immunogenic epitopes of SARS-CoV-2. Sci Rep. 2020;10(1):20864. doi: 10.1038/s41598-020-77547-4.
3. Chen T., Guestrin C. XGBoost: a scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.
4. Chicco D. Ten quick tips for machine learning in computational biology. BioData Min. 2017;10(1):35. doi: 10.1186/s13040-017-0155-3.
5. Chicco D., Jurman G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genom. 2020;21:1–13. doi: 10.1186/s12864-019-6413-7.
6. Clifford J.N., Høie M.H., Deleuran S., Peters B., Nielsen M., Marcatili P. BepiPred-3.0: improved B-cell epitope prediction using protein language models. Protein Sci. 2022;31(12). doi: 10.1002/pro.4497.
7. Collatz M., Mock F., Barth E., Hölzer M., Sachse K., Marz M. EpiDope: a deep neural network for linear B-cell epitope prediction. Bioinformatics. 2020;37(4):448–455. doi: 10.1093/bioinformatics/btaa773.
8. da Silva B.M., Ascher D.B., Pires D.E. epitope1D: accurate taxonomy-aware B-cell linear epitope prediction. Brief Bioinforma. 2023;24(3):bbad114. doi: 10.1093/bib/bbad114.
9. De R.K., Tomar N. Immunoinformatics. Springer; 2014.
10. Deszyński P., Młokosiewicz J., Volanakis A., Jaszczyszyn I., Castellana N., Bonissone S., Ganesan R., Krawczyk K. INDI—integrated nanobody database for immunoinformatics. Nucleic Acids Res. 2022;50(D1):D1273–D1281. doi: 10.1093/nar/gkab1021.
11. Elnaggar A., Heinzinger M., Dallago C., Rehawi G., Wang Y., Jones L., Gibbs T., Feher T., Angerer C., Steinegger M. ProtTrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44(10):7112–7127. doi: 10.1109/TPAMI.2021.3095381.
12. Emini E.A., Hughes J.V., Perlow D.S., Boger J. Induction of hepatitis A virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol. 1985;55(3):836–839. doi: 10.1128/jvi.55.3.836-839.1985.
13. Hricik T., Bader D., Green O. Using RAPIDS AI to accelerate graph data science workflows. 2020 IEEE High Performance Extreme Computing Conference (HPEC). IEEE; 2020.
14. Hu Y., Wang Y., Hu X., Chao H., Li S., Ni Q., Zhu Y., Hu Y., Zhao Z., Chen M. T4SEpp: a pipeline integrating protein language models to predict bacterial type IV secreted effectors. Comput Struct Biotechnol J. 2024;23:801–812. doi: 10.1016/j.csbj.2024.01.015.
15. Huang Y., Niu B., Gao Y., Fu L., Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26(5):680–682. doi: 10.1093/bioinformatics/btq003.
16. Jespersen M.C., Peters B., Nielsen M., Marcatili P. BepiPred-2.0: improving sequence-based B-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 2017;45(W1):W24–W29. doi: 10.1093/nar/gkx346.
17. Kalejaye L., Wu I.-E., Terry T., Lai P.-K. DeepSP: deep learning-based spatial properties to predict monoclonal antibody stability. Comput Struct Biotechnol J. 2024;23:2220–2229. doi: 10.1016/j.csbj.2024.05.029.
18. Kolaskar A.S., Tongaonkar P.C. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990;276(1-2):172–174. doi: 10.1016/0014-5793(90)80535-q.
19. Kozlova E.E.G., Cerf L., Schneider F.S., Viart B.T., NGuyen C., Steiner B.T., de Almeida Lima S., Molina F., Duarte C.G., Felicori L. Computational B-cell epitope identification and production of neutralizing murine antibodies against Atroxlysin-I. Sci Rep. 2018;8(1):14904. doi: 10.1038/s41598-018-33298-x.
20. Lin Z., Akin H., Rao R., Hie B., Zhu Z., Lu W., Smetanin N., Verkuil R., Kabeli O., Shmueli Y. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379(6637):1123–1130. doi: 10.1126/science.ade2574.
  • 21.Liu F., Yuan C., Chen H., Yang F. Prediction of linear B-cell epitopes based on protein sequence features and BERT embeddings. Sci Rep. 2024;14(1):2464. doi: 10.1038/s41598-024-53028-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22.Liu Y., Liu Y., Wang S., Zhu X. LBCE-XGB: a XGBoost model for predicting linear B-Cell epitopes based on BERT embeddings. Interdiscip Sci Comput Life Sci. 2023;15(2):293–305. doi: 10.1007/s12539-023-00549-z. [DOI] [PubMed] [Google Scholar]
  • 23.Madeira F., Pearce M., Tivey A.R., Basutkar P., Lee J., Edbali O., Madhusoodanan N., Kolesnikov A., Lopez R. Search and sequence analysis tools services from EMBL-EBI in 2022. Nucleic Acids Res. 2022;50(W1):W276–W279. doi: 10.1093/nar/gkac240. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24.Manavalan B., Govindaraj R.G., Shin T.H., Kim M.O., Lee G. iBCE-EL: a new ensemble learning framework for improved linear B-cell epitope prediction. Front Immunol. 2018;9:1695. doi: 10.3389/fimmu.2018.01695. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 25.Mucci J., Carmona S.J., Volcovich R., Altcheh J., Bracamonte E., Marco J.D., Nielsen M., Buscaglia C.A., Agüero F. Next-generation ELISA diagnostic assay for Chagas Disease based on the combination of short peptidic epitopes. PLoS Negl Trop Dis. 2017;11(10) doi: 10.1371/journal.pntd.0005972. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 26.Mullan K.A., Zhang J.B., Jones C.M., Goh S.J., Revote J., Illing P.T., Purcell A.W., La Gruta N.L., Li C., Mifsud N.A. TCR_Explore: a novel webtool for T cell receptor repertoire analysis. Comput Struct Biotechnol J. 2023;21:1272–1282. doi: 10.1016/j.csbj.2023.01.046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 27.Nori H., Caruana R., Bu Z., Shen J.H., Kulkarni J. International conference on machine learning. PMLR; 2021. Accuracy, interpretability, and differential privacy via explainable boosting. [Google Scholar]
  • 28.Nori H., Jenkins S., Koch P., Caruana R. InterpretML: a unified framework for machine learning interpretability. arXiv preprint. 2019 arXiv:1909.09223. [Google Scholar]
  • 29.Odorico M., Pellequer J.L. BEPITOPE: predicting the location of continuous epitopes and patterns in proteins. J Mol Recognit. 2003;16(1):20–22. doi: 10.1002/jmr.602. [DOI] [PubMed] [Google Scholar]
  • 30.Parker J.M., Guo D., Hodges R.S. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and X-ray-derived accessible sites. Biochemistry. 1986;25(19):5425–5432. doi: 10.1021/bi00367a013. [DOI] [PubMed] [Google Scholar]
  • 31.Pedregosa F., Varoquaux G., Gramfort A., Michel V., Thirion B., Grisel O., Blondel M., Prettenhofer P., Weiss R., Dubourg V. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825–2830. [Google Scholar]
  • 32.Ras-Carmona A., Lehmann A.A., Lehmann P.V., Reche P.A. Prediction of B cell epitopes in proteins using a novel sequence similarity-based method. Sci Rep. 2022;12(1):13739. doi: 10.1038/s41598-022-18021-1. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 33.Saha S., Raghava G.P.S. International conference on artificial immune systems. Springer; 2004. BcePred: prediction of continuous B-cell epitopes in antigenic sequences using physico-chemical properties. [Google Scholar]
  • 34.Saravanan V., Gautham N. Harnessing computational biology for exact linear B-cell epitope prediction: a novel amino acid composition-based feature descriptor. Omics a J Integr Biol. 2015;19(10):648–658. doi: 10.1089/omi.2015.0095. [DOI] [PubMed] [Google Scholar]
  • 35.Shirai H., Prades C., Vita R., Marcatili P., Popovic B., Xu J., Overington J.P., Hirayama K., Soga S., Tsunoyama K. Antibody informatics for drug discovery. Biochim Biophys Acta (BBA) Proteins Proteom. 2014;1844(11):2002–2015. doi: 10.1016/j.bbapap.2014.07.006. [DOI] [PubMed] [Google Scholar]
  • 36.Teukam Y.G.N., Dassi L.K., Manica M., Probst D., Schwaller P., Laino T. Language models can identify enzymatic binding sites in protein sequences. Comput Struct Biotechnol J. 2024;23:1929–1937. doi: 10.1016/j.csbj.2024.04.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37.Vardaxis I., Simovski B., Anzar I., Stratford R., Clancy T. Deep learning of antibody epitopes using positional permutation vectors. Comput Struct Biotechnol J. 2024;23:2695–2707. doi: 10.1016/j.csbj.2024.06.005. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 38.Vita R., Mahajan S., Overton J.A., Dhanda S.K., Martini S., Cantrell J.R., Wheeler D.K., Sette A., Peters B. The immune epitope database (IEDB): 2018 update. Nucleic Acids Res. 2019;47(D1):D339–D343. doi: 10.1093/nar/gky1006. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39.Wang L., Zeng Z., Xue Z., Wang Y. DeepNeuropePred: a robust and universal tool to predict cleavage sites from neuropeptide precursors by protein language model. Comput Struct Biotechnol J. 2024;23:309–315. doi: 10.1016/j.csbj.2023.12.004. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40.Wang X., Gao X., Fan X., Huai Z., Zhang G., Yao M., Wang T., Huang X., Lai L. WUREN: Whole-modal union representation for epitope prediction. Comput Struct Biotechnol J. 2024;23:2122–2131. doi: 10.1016/j.csbj.2024.05.023. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41.Yadav S., Vora D.S., Sundar D., Dhanjal J.K. TCR-ESM: employing protein language embeddings to predict TCR-peptide-MHC binding. Comput Struct Biotechnol J. 2024;23:165–173. doi: 10.1016/j.csbj.2023.11.037. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 42.Yao B., Zheng D., Liang S., Zhang C. SVMTriP: a method to predict B-cell linear antigenic epitopes. Immunoinformatics. 2020:299–307. doi: 10.1007/978-1-0716-0389-5_17. [DOI] [PubMed] [Google Scholar]
  • 43.Yasmin T., Nabi A.N. B and T cell epitope-based peptides predicted from evolutionarily conserved and whole protein sequences of Ebola virus as vaccine targets. Scand J Immunol. 2016;83(5):321–337. doi: 10.1111/sji.12425. [DOI] [PubMed] [Google Scholar]
  • 44.Ye Y., Shen Y., Wang J., Li D., Zhu Y., Zhao Z., Pan Y., Wang Y., Liu X., Wan J. SIGANEO: Similarity network with GAN enhancement for immunogenic neoepitope prediction. Comput Struct Biotechnol J. 2023;21:5538–5543. doi: 10.1016/j.csbj.2023.10.050. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45.Zeng Y., Wei Z., Yuan Q., Chen S., Yu W., Lu Y., Gao J., Yang Y. Identifying B-cell epitopes using AlphaFold2 predicted structures and pretrained language model. Bioinformatics. 2023;39(4):btad187. doi: 10.1093/bioinformatics/btad187. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46.Zhao W., Luo X., Tong F., Zheng X., Li J., Zhao G., Zhao D. Improving antibody optimization ability of generative adversarial network through large language model. Comput Struct Biotechnol J. 2023;21:5839–5850. doi: 10.1016/j.csbj.2023.11.041. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

Supplementary material

mmc1.docx (1.6MB, docx)

Data Availability Statement

We used the experimentally validated linear B-cell epitopes from the IEDB database (https://www.iedb.org). The BLMPred datasets created for this paper are publicly available at https://github.com/bdbarnalidas/BLMPred/tree/main/BLMPred_Datasets. The benchmarking datasets used for the performance comparison between models are publicly available at https://github.com/bdbarnalidas/BLMPred/tree/main/BLMPred_Datasets.


Articles from Computational and Structural Biotechnology Journal are provided here courtesy of AAAS Science Partner Journal Program
