B-cell epitope prediction in the age of machine learning: advancements and challenges

Fabrizio Gabellieri; Ankita Singh; Sukrit Gupta; Halima Bensmail; Filippo Castiglione; Raghvendra Mall

doi:10.1186/s12967-025-07673-y

2026 Jan 22;24:248. doi: 10.1186/s12967-025-07673-y

PMCID: PMC12908366 PMID: 41572292

Abstract

Background

The identification of B-cell epitopes, the regions that bind to antibodies, is essential for creating effective prophylactic treatments against infectious diseases and cancer, particularly in the realm of reverse vaccinology. While experimental techniques like X-ray crystallography and peptide arrays help identify epitopes, they are expensive, time-consuming and differ in throughput and precision.

Methods

This review examines how predictive techniques and datasets have evolved for the problem, highlighting recent breakthroughs in data-driven algorithms used to predict B-cell epitopes. We specifically examine how methodologies have progressed from traditional machine learning to cutting-edge deep learning models.

Conclusion

The review summarizes significant research contributions in this domain, including linear and conformational epitope prediction techniques, addresses the methodological biases, dataset limitations, and systematic evaluation challenges that plague the field, and explores future opportunities for innovation.

Keywords: B-cell epitopes, Linear and conformational epitopes, Machine learning, Protein Language Models (PLMs)

Introduction

Accurate prediction of B-cell epitopes is a cornerstone in the development of immunotherapy strategies, including vaccines and antibody-based treatments for infectious diseases and cancer. B-cell epitopes are specific regions on antigens that are recognized by B-cell receptors or antibodies, initiating an immune response. It is imperative to identify these epitopes to design targeted interventions which can induce robust and specific immune reactions, thus enhancing the efficacy of immunotherapies.

In the adaptive immune response, the recognition of an epitope by a B-cell receptor is a pivotal event in eliciting a humoral response. Upon antigen binding, B cells internalize and process the antigen, presenting antigen fragments via major histocompatibility complex (MHC, also known as HLA, Human Leukocyte Antigen, in humans) class II molecules to helper T cells. This interaction provides the necessary signals for B cell activation, proliferation, and differentiation into antibody-secreting plasma cells and memory B cells [1, 2]. The antibodies produced can neutralize pathogens or mark them for destruction, underscoring the importance of accurate epitope recognition in shaping the specificity and longevity of the humoral immune response. Traditionally, the identification of B-cell epitopes relied heavily on experimental methods such as X-ray crystallography, cryo-electron microscopy and peptide assays [3–5]. Although informative, these techniques are often labor-intensive, time-consuming, expensive and reliant on trial and error. The emergence of immunoinformatics tools has revolutionized this process by facilitating in silico predictions of B-cell epitopes, thereby accelerating immunotherapeutic design and reducing associated costs [6] (see Fig. 1).

Fig. 1 — The left panel demonstrates epitope prediction through protein sequence analysis by marking probable regions in orange. The right panel demonstrates how predicted regions interact with B cell surface antibodies by showing the epitope (orange) interacting with the antibody binding site. The figure contrasts two complementary paradigms of B-cell epitope prediction: sequence-based linear approaches that identify probable epitope regions directly from amino-acid sequences, and structure-based conformational approaches that capture antibody–antigen interactions in three-dimensional space. Together, the panels highlight that while sequence-based methods efficiently detect linear epitope candidates, structure-based methods are essential for modeling realistic binding interfaces and conformational epitopes involved in antibody recognition

The advent of high-throughput technologies such as mass spectrometry [7] has led to the collection of extensive curated datasets of known B-cell epitopes. These detailed repositories, such as IEDB [8], have paved the way for the application of advanced Artificial Intelligence (AI) methodologies, which complement and upgrade traditional Machine Learning (ML) approaches. In particular, Deep Learning (DL) models have demonstrated impressive capacity in capturing intricate patterns within biological data [9, 10], leading to improved performance for the problem of epitope identification [11].

In the context of cancer immunotherapy, AI-driven approaches have been instrumental in neoantigen identification, facilitating the development of personalized cancer vaccines [12]. ML algorithms can predict tumor-specific antigens by analyzing genomics data such as next generation sequencing or transcriptomics, enabling the design of targeted immunotherapies. Additionally, AI has been applied to predict patient responses to immunotherapy, allowing personalized and effective treatment strategies [13, 14].

Despite these advancements, challenges remain in the accurate prediction of B-cell epitopes. The complexity of antigen-antibody interactions, particularly for conformational epitopes formed by discontinuous amino acid sequences, requires continuous refinement of prediction models [15]. Integrating structural information with AI techniques such as AlphaFold3 [16] holds promise for overcoming some of these challenges, as evidenced by recent efforts to incorporate protein tertiary structure information into epitope prediction models [17].

Moreover, the recent advent of the transformer architecture [18] has brought unprecedented advancements in large language models [19], which fall in the realm of generative AI. Large Language Models (LLMs) are advanced DL models trained on vast amounts of text to understand and generate human language. Their applications to the biological domain [20–22] are also making significant strides. Protein Language Models (PLMs), a subclass of LLMs, are trained on vast amounts of protein sequence and/or structural data. This pretraining allows them to learn intricate contextual relationships and generate more sophisticated and informative numerical representations (embeddings) of amino acid sequences. These embeddings capture subtle patterns and dependencies within the sequence that might be missed by traditional feature engineering, potentially leading to a better understanding of the factors that influence downstream prediction tasks [23, 24]. Several recent studies have demonstrated that incorporating transformer-derived embeddings significantly improves the accuracy of B-cell epitope prediction for both linear and conformational epitopes [25–27]. A brief timeline of B-cell epitope prediction techniques is shown in Fig. 2.

Fig. 2 — The development of B-cell epitope prediction methods shows a progression from linear sequence-based scoring to advanced structure-aware deep learning frameworks. The early methods (1977–1996) utilized physicochemical properties and manually derived scoring scales, such as Parker hydrophilicity, Emini surface accessibility, and Kolaskar–Tongaonkar antigenicity, to predict linear B-cell epitopes. The period from 2003 to 2015 brought about curated datasets and structural features which enabled feature-based machine learning models (e.g., SVMs, ANNs) to be implemented via tools like LBtope and COBEpro. The period from 2015 to 2020 saw deep learning architectures (e.g., CNNs, RNNs) enable end-to-end prediction directly from raw sequence data. The most recent advances (2020–2022) include transformer-based embeddings (e.g., ESM-2, ProtBert, ScanNet) used in tools like BepiPred-3.0, which infer linear epitope regions from sequence with embedded structural signals. Structure-based methods such as DiscoTope-3.0 have in parallel enhanced the prediction of conformational B-cell epitopes by using 3-dimensional protein structures to identify spatially clustered antigenic regions. These innovations demonstrate the transition from linear sequence-only methods to hybrid and structure-integrated prediction strategies

In this review, we examine notable ML and AI-driven contributions to the field of B-cell epitope prediction, and discuss emerging trends and future directions in the integration of AI for in silico B-cell epitope identification. Specifically, the remainder of this review is organized as follows. Section 2 presents selected examples illustrating the role of data in shaping methodological advances in B-cell epitope prediction. Section 3 reviews key modeling approaches, covering linear (3.1) and conformational epitope prediction techniques (3.2). Section 4 discusses primary challenges and possible solutions, including applications of AI, protein language models, and structural biology. Section 5 highlights emerging directions followed by concluding remarks.

Datasets fueling in silico development

In B-cell epitope prediction, the quality and structure of training and evaluation datasets strongly contribute to model success. As ML models become more powerful, their effectiveness increasingly depends on how well the data reflects biological complexity, structural accuracy, and immunological relevance. This principle is well illustrated by the evolution of the SEPPA (SEPPA 1.0, SEPPA 2.0, and SEPPA 3.0) [28], BepiPred (BepiPred-1.0, BepiPred-2.0, and BepiPred-3.0, with the associated BP2, BP3, BP2HR and BP3HR datasets) [29], and DiscoTope (DiscoTope-1 and DiscoTope-2) [30] series, which highlight how thoughtful dataset design can shape the performance of epitope prediction models. The SEPPA series demonstrates how dataset design can adapt over time to specific scientific goals and growing biological complexity. SEPPA 1.0 [31] began with 82 high-resolution antigen–antibody complexes, yielding 84 unique epitopes for training. A non-overlapping test set of 119 antigens was compiled from external sources like DiscoTope [32], IEDB, and Epitome [33]. SEPPA 2.0 [34] expanded to 314 structures for a total of 435 epitopes and introduced a subdivision into nine categories based on subcellular localization and immune host characterization. This was done in an effort to capture the influence of biological context on epitope presentation. An independent test set of 42 complexes was sourced from Bpredictor [35], EPMeta [36], and DiscoTope-2.0 [37]. SEPPA 3.0 [28] was trained on a larger dataset of 832 complexes (767 used for training). These included over 500 glycosylated antigens to investigate the role of glycosylation in antigenicity. To support targeted evaluation, two dedicated test sets were established, one consisting of 130 conformational epitope structures released after 2016, and another of 106 glycosylated antigens.
Across all versions, SEPPA’s dataset design has shifted from general structural criteria to increasingly nuanced biological and molecular relevance. This shows that high-quality predictions require datasets tailored not just to size but to specificity.

The BepiPred [29] series similarly exemplifies the impact of systematic data refinement, including structural rigor and redundancy control, in improving B-cell epitope prediction. BepiPred-1.0 was developed using antibody-binding peptides identified through experimental studies, but these sources often suffered from poor annotation and noise, limiting predictive performance. This shortcoming was directly addressed in BepiPred-2.0 [38], which used only antigen-antibody crystal structures to define epitope residues (i.e., within 4 Å of antibody heavy atoms). From a curated set of 649 PDB entries, 160 non-redundant antigens were retained. This resulted in significantly improved performance when compared to BepiPred-1.0 and LBtope [39]. The superiority of the model over BepiPred-1.0 was further confirmed through evaluation on a linear epitope benchmark comprising over 11,000 confirmed IEDB peptides. In BepiPred-3.0 [29], both the scale and sophistication of data curation were expanded. The dataset used in BepiPred-2.0 was formalized as BP2, and a new, larger BP3 dataset (1,466 antigens) was compiled using the same methodology with additional selection filters. To control redundancy without losing epitope information, an epitope-collapse strategy was introduced: sequences were clustered via MMseqs2 [40], and known epitope labels from cluster members were grafted onto a representative sequence. This reduced BP2 to 238 antigens and BP3 to 603. A more stringent 50% sequence identity threshold yielded the dataset BP3C50ID (358 antigens), while additional 70% filtering produced BP2HR and BP3HR (190 and 398 antigens, respectively). A benchmark study using a random forest classifier confirmed the effectiveness of these refinements: models trained on BP3 and its condensed versions consistently outperformed those trained on earlier datasets, with statistically significant gains in predictive accuracy.
These results underscore how careful data curation, and not just more data, can drive meaningful improvements in model performance. They also highlight that the consistency, quality and completeness of structural annotations vary across studies, which affects the performance of the corresponding B-cell epitope prediction models.
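
The structure-based labeling rule described for BepiPred-2.0 (antigen residues within 4 Å of antibody heavy atoms) can be illustrated with a minimal Python sketch. The function name, the input layout (pre-parsed heavy-atom coordinates) and the inclusive comparison at the cutoff are assumptions made for illustration; they are not taken from the BepiPred implementation.

```python
from math import dist


def label_epitope_residues(antigen_atoms, antibody_atoms, cutoff=4.0):
    """Return sorted antigen residue ids that have at least one heavy atom
    within `cutoff` angstroms of any antibody heavy atom.

    antigen_atoms:  iterable of (residue_id, (x, y, z)) pairs (hypothetical layout)
    antibody_atoms: iterable of (x, y, z) coordinates
    """
    antibody_atoms = list(antibody_atoms)
    epitope = set()
    for residue_id, xyz in antigen_atoms:
        if residue_id in epitope:
            continue  # residue already labeled through another atom
        # Assumption: a distance exactly equal to the cutoff counts as contact.
        if any(dist(xyz, ab_xyz) <= cutoff for ab_xyz in antibody_atoms):
            epitope.add(residue_id)
    return sorted(epitope)


# Toy coordinates, not taken from any real structure.
toy_antigen = [
    (1, (0.0, 0.0, 0.0)),
    (1, (1.0, 0.0, 0.0)),
    (2, (10.0, 0.0, 0.0)),
    (3, (5.0, 0.0, 0.0)),
]
toy_antibody = [(3.5, 0.0, 0.0)]
print(label_epitope_residues(toy_antigen, toy_antibody))
```

In practice the coordinates would come from a structure parser, and the quadratic scan would usually be replaced by a spatial index for large complexes.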

Notably, the BepiPred-2.0 and BepiPred-3.0 datasets have since served as foundational resources for more recent attention-based and large language model (LLM)-based approaches, such as CALIBER [41] and DiscoTope-3 [30]. CALIBER drew on the linear dataset from BepiPred-2.0 to develop its linear epitope predictor, while the BepiPred-3.0 dataset has supported the training of conformational epitope prediction models. Likewise, DiscoTope-3 used the BepiPred-3.0 training dataset as a reference for selecting both experimentally resolved and AlphaFold-predicted structures. This highlights the dataset’s ongoing relevance in modern structural immunoinformatics. Fig. 3 shows how the size of B-cell epitope datasets has evolved over time.

Fig. 3 — Dataset size evolution in B-cell epitope prediction research (2006–2024). Across all method versions, both epitope-related and structure-related signals increase steadily from v1.0 to v3.0, indicating consistent scaling and improved coverage. While structural information (‘blue’) dominates in absolute magnitude, epitope-related contributions (‘red’) grow proportionally with newer versions, suggesting that later methods achieve a more balanced and enriched integration of structural context and epitope information, with v3.0 variants showing the strongest overall signal

The first two papers of the DiscoTope series highlight the importance of benchmark completeness, emphasizing how structural context and full epitope inclusion affect evaluation fidelity. The original DiscoTope-1 [32] included 76 X-ray antigen–antibody complexes, with entries grouped into 25 non-homologous clusters to reduce redundancy. However, its annotation strategy was limited: each structure had only one epitope labeled per PDB file, and only the antibody-interacting chain was considered. The latter issue was well illustrated by the antigen with PDB entry 1AR1 [42], where many residues with low contact numbers actually corresponded to membrane-spanning regions of the antigen. DiscoTope-2 [37] directly addressed these issues by creating two versions of each PDB file: one retaining only the interacting chain, and another incorporating additional information on the biologically relevant antigen unit, when available. This led to significant performance gains: for the KvAP potassium channel [43], the area under the receiver operating characteristic curve (AUC) [44–46] increased from 0.737 to 0.880 upon inclusion of the full biological unit, and further rose to 0.946 when non-accessible residues were excluded. Across the 13 proteins affected by the benchmark adjustment, AUC improved from 0.791 to 0.824, and performance on the full dataset rose from 0.748 to 0.765. This increase in performance is substantial and clinically meaningful. DiscoTope-2 also corrected for incomplete annotation by including all known epitopes from antigens with multiple antibody complexes. This raised lysozyme prediction AUC from 0.682 to 0.847, with 5 of 6 other affected proteins showing performance gains. Together, these changes highlight how structurally realistic and complete benchmarks are essential for fair and informative model evaluation.
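
The AUC values quoted above can be read as the probability that a randomly chosen epitope residue receives a higher score than a randomly chosen non-epitope residue. A minimal sketch of this rank-based computation follows; the function name and the toy labels and scores are hypothetical, and counting ties as half a win is an assumption of the sketch.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positive and
    negative scores (equivalent to the Mann-Whitney U statistic)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    # Each positive/negative pair contributes 1 if ranked correctly, 0.5 if tied.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Toy per-residue labels (1 = epitope) and predicted scores.
print(roc_auc([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]))  # 3 of 4 pairs ranked correctly -> 0.75
```

Because the measure depends on which residues are counted as negatives, removing non-accessible residues or adding previously unannotated epitopes, as in the DiscoTope-2 benchmark, changes the AUC even when the predictor itself is unchanged.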

Finally, Table 1 highlights numerous publicly accessible databases that serve as valuable sources of B-cell epitope data, supplying extensive and specialized data that prediction algorithms can utilize. For instance, IEDB-3D 2.0 [47] offers high-quality three-dimensional structural data focused on infectious diseases, while SDAP 2.0 [48] delivers continuously updated information on allergenic structures and epitopes, along with integrated bioinformatics tools for similarity analysis. Databases such as the HIV immunology database [49] and CEDAR [50] compile HIV-associated and cancer-relevant B-cell epitopes, respectively, as outlined in Table 1. AntiJen [55] is notable for encompassing data on a wide range of topics, making it a valuable resource for both B-cell and T-cell epitopes. Additionally, databases like the Conformational Epitope Database (CED) [53] and Epitome [33] provide curated antigen/antibody complex structures, detailed information on the residues involved in interactions, and insights into their sequence and structural environments. The availability of such diverse, high-quality datasets supports the advancement of conformational B-cell epitope prediction tools.

Table 1.

B-cell epitope associated databases. Accessibility of all databases was evaluated on 03-11-2025. ^b These databases were found to be inaccessible as of 03-11-2025. They are included for two reasons: links often migrate or get repaired, and these databases have been used in the development of tools in the past. The link provided is the last known link used to access the database

Name	Scope	Approximate size	Ref.
IEDB-3D 2.0	Structural B-cell epitopes	5,537 assays (B-cell)	[47]
SDAP 2.0	Allergen-associated B-cell epitopes	4,067 entries	[48]
HIV immunology database	HIV-associated B-cell epitopes	4,583 entries	[49]
CEDAR	Cancer-associated B-cell epitopes	32,825 entries	[50]
IEDB	B-cell epitopes	623,173 entries	[51]
Protegen	Protective antigens	1,631 entries	[52]
CED^b	Conformational B-cell epitopes	225 entries	[53]
Epitome^b	B-cell epitopes	142 entries	[33]
BciPep	B-cell epitopes	3,031 entries	[54]
AntiJen	B-cell epitopes	24,000 entries	[55]

Table 2.

Conformational B-cell epitope prediction tools. We provide web-links to the tools along with the algorithm used and the best performance attained by individual model

Year	Name	Algorithm Type	AUC-ROC
Require structural data
2024	DiscoTope-3.0	XGBoost	0.795 (Solved), 0.783 (AlphaFold)
2024	SEMA 2.0 3D	Language Model	0.731
2022	ScanNet	Deep learning	0.735
2022	Epitope3D	AdaBoost Classifier	0.590
2019	SEPPA 3.0	Logistic Regression	0.740 (general), 0.749 (glycoprotein)
2014	EpiPred	Bespoke	NA
2010	EPSVR	Support Vector Regression	0.597

Historical background and recent progress

Linear B-cell epitope prediction models

B-cell epitope prediction has grown significantly since 1980: it began with simple scoring-based algorithms and has since evolved into complex AI-powered pipelines. During the late 1970s and early 1980s, researchers developed their initial approaches based on the idea that epitopes are typically hydrophilic and flexible, and that they reside on the surface of the protein structure. This idea made possible the development of intuitive, yet manually derived, tools like the Parker hydrophilicity scale [56], Emini surface accessibility [57], and the Kolaskar–Tongaonkar antigenicity method [58]. However, the biological plausibility of these models did not translate into precise results, and they failed to produce consistent predictions for several protein structures.
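
Scale-based methods of this kind typically assign each residue the average of a per-amino-acid propensity value over a short sequence window, and residues scoring above a threshold are flagged as candidate epitope positions. The sketch below illustrates the idea; the two-letter toy scale, the window length and the function name are placeholders chosen for illustration and do not reproduce the published Parker, Emini or Kolaskar–Tongaonkar values.

```python
def window_scores(sequence, scale, window=7):
    """Average a per-residue propensity scale over a centered sliding window.

    Windows are truncated at the sequence ends, so terminal residues are
    averaged over fewer positions (a simplifying assumption of this sketch).
    """
    half = window // 2
    scores = []
    for i in range(len(sequence)):
        start = max(0, i - half)
        stop = min(len(sequence), i + half + 1)
        values = [scale[aa] for aa in sequence[start:stop]]
        scores.append(sum(values) / len(values))
    return scores


# Hypothetical scale: "K" is treated as hydrophilic (1.0), "A" as neutral (0.0).
toy_scale = {"A": 0.0, "K": 1.0}
print(window_scores("AAKAA", toy_scale, window=3))
```

A real application would supply a full 20-residue scale from the relevant publication and tune the window length and decision threshold on annotated epitopes.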

During the mid-1990s, pattern recognition techniques replaced earlier approaches by employing fixed-length peptide windows (e.g., 16-mers) to detect local sequence features. The BCEPred [59] and BepiPred [60] prediction tools adopted hidden Markov models as ML components to achieve superior predictive results.

During the early 2000s, supervised learning became possible due to manually curated datasets, which led to the development of tools such as ABCpred [61] and LBtope [39] that utilized multiple sequence descriptors, including amino acid composition, dipeptide frequency, and positional bias.
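
Two of the descriptors named above, amino acid composition and dipeptide frequency, can be computed directly from a peptide string. The following sketch is a generic illustration of such features; the function name and the feature ordering are assumptions and do not correspond to the exact encodings used by ABCpred or LBtope.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_features(peptide):
    """Return 20 amino acid frequencies followed by 400 dipeptide frequencies."""
    if not peptide:
        raise ValueError("peptide must not be empty")
    # Fraction of each of the 20 standard residues in the peptide.
    aac = [peptide.count(aa) / len(peptide) for aa in AMINO_ACIDS]
    # Overlapping residue pairs, e.g. "AAC" -> ["AA", "AC"].
    pairs = [peptide[i:i + 2] for i in range(len(peptide) - 1)]
    n_pairs = max(len(pairs), 1)  # avoid division by zero for length-1 peptides
    dpc = [pairs.count(a + b) / n_pairs for a, b in product(AMINO_ACIDS, repeat=2)]
    return aac + dpc


features = composition_features("AAC")
print(len(features))  # 20 + 400 = 420
```

The resulting fixed-length vector can be passed to any standard classifier, which is what made such descriptors convenient for early supervised models.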

Subsequently, BepiPred-2.0 [38] employed a random forest classifier with per-residue physicochemical and NetSurfP-predicted [62] structural features across 9-residue windows, using a curated, redundancy-reduced dataset of PDB-extracted antigen–antibody complexes. Evaluation was carried out both on 5 left-out clusters and on linear epitope data from IEDB. Table 3 summarizes the evolution of feature-based linear epitope prediction models using ML techniques as classifiers.

Table 3.

Feature-based machine learning tools for B-cell epitope prediction. Successive versions of the SEPPA program added new features: features marked with ‘*’ were first included in SEPPA 1.0, while features marked with ‘**’ and ‘***’ first appeared in SEPPA 2.0 and SEPPA 3.0, respectively

Machine learning tool	Feature types	Year
ABCpred [61]	Amino acid composition and sequence	2006
COBEpro [63]	Similarity to other epitopes	2009
Epitopia [64]	Amino acid preference, secondary structure preference, surface accessibility, surface structure, evolution rate, polarity scale, flexibility scale, antigenicity scale, hydrophilicity scale	2009
CBTOPE [65]	Amino acid composition, polarity, flexibility, antigenicity, hydrophobicity, sequence, similarity to other epitopes	2010
LBtope [39]	Amino acid composition, sequence, similarity to other epitopes, variable epitope length control	2013
SEPPA [28]	Amino acid propensity, sequence combined with structure, solvent accessible surface areas, antigenicity combined with structure, glycosylation combined with structure***	2019
EPSVR [66]	Amino acid, side-chain energy score, surface exposure, antigenicity combined with surface structure, and secondary structure	2010
ScanNet [67]	Amino acid composition, secondary structure, accessible surface area, coordination number (van der Waals interaction), 2D solvent exposure, backbone and sidechain depth (distance from surface), surface convexity index, amino acid conservation	2022

While these feature-based ML classifiers laid the groundwork for linear epitope prediction, their reliance on hand-crafted representations motivated a shift toward deep learning approaches that can learn rich, contextual sequence features directly from data, culminating in the adoption of transformer-based protein language models. BepiPred-3.0 [29] and EpiDope [68] use sequence embeddings from the ESM-2 [20] and ProteinBert [69] models, respectively, for their predictions. Moreover, LBCE-XGB [25] combined XGBoost with BERT-based embeddings and sequence-derived features selected via SHapley Additive exPlanations (SHAP) [70]. In contrast, BeeTLe [71] applied a BiLSTM-attention-FFNN framework, utilizing amino acid embeddings derived from the eigendecomposition of the exponential of the BLOSUM62 matrix (BLOcks SUbstitution Matrix; the 62 indicates that the matrix was built from protein sequence blocks in which the sequences were no more than 62% identical to each other [72]). To address class imbalance, BeeTLe employed a cross-entropy loss modified via logit adjustment, referred to as the focal loss. Together, these models paved the way for even more recent work leveraging LLMs with attention mechanisms. CALIBER [41] employed a BiLSTM architecture using ESM-2 embeddings to predict both linear and conformational epitopes; LBCE-BERT [26] combined protein sequence features with BERT-based embeddings and XGBoost, while EpitopeTransfer [73] introduced a phylogeny-aware transfer learning framework that fine-tuned ESM-1b on higher-order taxonomic data before training pathogen-specific classifiers, significantly boosting performance for neglected pathogens. Fig. 4 highlights the transformer architecture, which is the backbone of recent advancements in AI-driven B-cell epitope predictors. Additionally, Table 4 highlights the performance of linear B-cell epitope prediction tools along with their corresponding web-links.

Fig. 4 — Encoder-decoder framework at the heart of recent advancements in generative AI and large language models driving the evolution of advanced B-cell epitope predictors. BERT-based models follow an encoder only framework, whereas GPT-based models are predominantly decoder only. We explicitly highlight the input processing, transformer blocks and the output units of the encoder-decoder framework

Table 4.

Linear B-cell epitope prediction tools. We provide web-links to the tools along with the algorithm used and the best performance attained by individual model. Here ‘NA’ means that this quality metric score was not available in the original paper of that particular method

Year	Name	Algorithm Type	AUC-ROC
2024	LBCE-BERT	Language model	0.757
2024	CLBtope	Hybrid Random Forest + BLAST	0.83
2024	SEMA 2.0 1D	Language model	0.731
2023	Epitope1D	Explainable Boosting Model	0.841 (ABCPred-2), 0.935 (Blind)
2022	BepiPred-3.0	Language model	0.771
2021	BCEPS	SVM	NA
2020	EpiDope	DNN	0.670
2013	LBtope	SVM + iBk	0.670
2012	SVMTriP	SVM	0.702
2010	CBTOPE	SVM	0.890
2008	BCPREDS	SVM	0.699
2006	ABCpred	Neural Network	NA
2007	AAPpred	SVM	0.689

These models analyze protein sequences directly to detect contextual cues and long-range relationships, allowing high-throughput end-to-end predictions that do not require manual feature engineering. The transition from empirical rules to context-aware AI systems has led to improved linear epitope prediction with greater scalability, which advances reverse vaccinology [74], as well as therapeutic antibody design and immunodiagnostics.

Conformational B-cell epitope prediction models

The integration of structural data into prediction processes became a crucial milestone in 2005. The combination of sequence data with structural features through ElliPro [75] and DiscoTope [37] enabled researchers to predict conformational (discontinuous) epitopes for the first time beyond the linear scope of previous models.

DiscoTope [32] introduced predictions for discontinuous epitopes by combining structural proximity, log-odds ratios, and contact numbers. However, DiscoTope faced challenges on elongated or membrane-associated antigens due to its reliance on single-chain assumptions. The SEPPA model [31] constructs triangular unit patches of surface residues and uses them to compute a per-residue propensity index, which is then integrated with clustering coefficients reflecting the spatial compactness of neighboring residues for B-cell epitope prediction. DiscoTope-2.0 [37] refined its earlier strategy by using a half-sphere neighbor count and a scoring function that integrates log-odds ratios of amino acids in spatial proximity through a distance-weighted approach. These measures were combined via a weighted sum to produce the final model scores. While improving performance, it also revealed the bias of the model towards surface proteins and underscored the critical role of data completeness and quality. EPSVR [36] utilized support vector regression on an integrated set of six features, i.e., residue epitope propensity, conservation score, side chain energy score, contact number, surface planarity score, and secondary structure composition, for conformational epitope prediction. EpiPred predicts antibody-specific B-cell epitopes by integrating conformational matching of antibody-antigen structures with a dedicated scoring system [76]. Unlike previous approaches that predict general antigenic sites, EpiPred identifies the precise epitope targeted by a given antibody, enhancing both prediction accuracy and antibody-antigen docking performance.
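
The contact number used as a burial measure by several of these structure-based methods can be approximated as the number of neighboring Cα atoms within a fixed radius of each residue. The sketch below is a simplified illustration rather than the DiscoTope or EPSVR implementation; the 10 Å radius, the function name and the toy coordinates are assumptions.

```python
from math import dist


def contact_numbers(ca_coords, radius=10.0):
    """Count, for each residue, the other C-alpha atoms within `radius` angstroms.

    Low counts suggest exposed, protruding residues; high counts suggest burial.
    """
    return [
        sum(
            1
            for j, other in enumerate(ca_coords)
            if j != i and dist(center, other) <= radius
        )
        for i, center in enumerate(ca_coords)
    ]


# Toy C-alpha coordinates: two nearby residues and one isolated residue.
toy_coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
print(contact_numbers(toy_coords))  # [1, 1, 0]
```

A measure of this kind also shows why single-chain benchmarks can mislead: residues that appear exposed in an isolated chain may be buried in a membrane or in a multi-chain biological unit.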

SEPPA 2.0 [34] enhanced SEPPA 1.0 by adding triangle patch-based accessible surface area residue propensity and a consolidated amino acid index, tailoring models to antigen localization and host species. The approach was further improved in SEPPA 3.0 [28], which employed glycosylation triangle ratios and glycosylation-related amino acid indexes, combining them with earlier model features through logistic regression and calibrating predictions using local neighbor context. ScanNet [67] utilized amino acid composition, secondary structure, accessible surface area, van der Waals interactions, 2D solvent exposure, backbone and side-chain depth, and surface convexity index for conformational epitope prediction. DiscoTope-3.0 [30] reframed conformational epitope prediction as a positive-unlabeled (PU) learning problem. This allowed it to address the challenge that only epitope residues are confidently labeled while negatives remain uncertain. By embedding solved and AlphaFold2-predicted structures with ESM-IF1 [20] and employing RSA (relative surface accessibility), pLDDT (predicted Local Distance Difference Test), sequence length, and one-hot encoding, it used an XGBoost ensemble that outperformed both sequence-based models like BepiPred-3.0 and structure-based methods such as ScanNet [67] and SEMA 2.0 3D [77]. Table 2 highlights the performance of conformational B-cell epitope tools along with their corresponding web links.

Challenges and possible solutions

Major challenges

Building accurate ML models for B-cell epitope prediction faces multiple challenges, spanning data limitations, methodological constraints, and validation issues. These challenges affect both linear and conformational epitope prediction, although conformational epitopes present additional complexities owing to their 3D structural dependencies. Below we highlight some of the major challenges:

Data Quality and Labeling Issues: Epitope datasets are often afflicted by noise and inconsistent experimental validation. IEDB and similar repositories are often biased toward specific types of pathogens (e.g., viruses), while data for parasites, bacteria and other microorganisms remain sparse [78, 79]. Manual curation and annotation errors further compound these issues, leading to mislabeled or redundant sequences. For example, a 2025 study [78] on parasite-specific epitopes required aggressive de-duplication (using CD-HIT at 80% similarity) to create a usable dataset, highlighting the noisy nature of unscreened data. Similarly, conformational epitope data are scarce due to the difficulty of obtaining high-resolution antibody-antigen structures [6].

Lack of Pathogen Diversity: Most current AI models for antigen prediction are predominantly trained on datasets enriched with viral and bacterial antigens, which restricts their generalizability to other pathogen types, such as parasites or newly emerging infectious agents [47]. A 2025 study employing a transformer-based model specifically tailored for parasitic pathogens demonstrated that conventional tools exhibit significantly reduced performance on non-viral targets, highlighting a critical bias in existing training data and model design [78]. For conformational epitopes, structural data are skewed toward well-studied proteins like SARS-CoV-2 spike, which may not generalize to other antigens [6].

False Positives and Negatives: Balancing specificity and sensitivity remains a persistent challenge when building in silico models for B-cell epitope prediction. Older methods like BepiPred-3 [29] and ABCpred [61] exhibit high false-positive rates, with Matthews correlation coefficient (MCC) values as low as 0.31 to 0.32. Even newer tools such as EpiDope and BepFAMN [80] struggle: BepFAMN achieved an AUC of 0.78 for linear epitopes but still required F1-score optimization to mitigate errors. Moreover, the lack of a common benchmark and a standardized evaluation strategy for linear epitopes contributes to a reproducibility crisis in B-cell epitope prediction research. For conformational epitopes, a 2023 benchmark [6] found that most methods perform no better than random surface residue patches.
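For readers less familiar with the metrics quoted above, the following minimal sketch computes MCC and the F1-score from per-residue confusion counts. The function names are illustrative. MCC is (TP·TN − FP·FN) divided by the square root of (TP+FP)(TP+FN)(TN+FP)(TN+FN), and F1 is 2·TP / (2·TP + FP + FN). Returning 0.0 when a denominator vanishes is a common convention adopted here, not something prescribed by the reviewed tools.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary per-residue labels (1 = epitope)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; 0.0 if any marginal count is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if there are no positives."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy labels (assumed) for an eight-residue stretch.
truth = [1, 1, 0, 0, 0, 0, 1, 0]
preds = [1, 0, 0, 1, 0, 0, 1, 0]
tp, fp, tn, fn = confusion_counts(truth, preds)
```

Because MCC uses all four cells of the confusion matrix, it is less flattering than accuracy on the heavily imbalanced residue-level labels typical of epitope data, which is one reason it is widely reported in this literature.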

Performance Plateau: Despite advances in DL, performance gains in prediction tools have largely stagnated. For example, for linear B-cell epitopes, consensus methods combining tools like LBtope and SVMTriP showed only marginal improvements over individual models [81]. Similarly, for conformational epitopes, state-of-the-art methods like DiscoTope-3.0 and BEpro achieved MCCs of 0.52–0.55, far below clinical utility thresholds [6]. More recently, the graph neural network-based techniques EpiGraph [82] and GraphBepi [83] improved predictions by integrating structural and sequence data, but still faced limitations with respect to generalizability.

Low Reproducibility and Consensus: In the field of B-cell epitope prediction, reproducibility issues arise from inconsistent benchmarking and overfitting to small datasets. In [6], the authors found that combining nine conformational epitope predictors yielded consensus results only marginally better than random. Similarly, linear epitope tools like ABCpred [61] and COBEpro [63] showed significant performance drops when tested on independent datasets [81]. Feature-engineering approaches such as recursive feature elimination can partially mitigate these issues, but they come with a significant computational burden [78].
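A consensus of several predictors and its comparison with a random baseline can be sketched as follows. This is a simplified illustration, not the protocol of [6]: the baseline here is the overall fraction of epitope residues (the expected precision of picking residues uniformly at random), whereas [6] compared against random surface patches. The tool outputs and labels are made up.

```python
def consensus_vote(predictions, min_fraction=0.5):
    """Call a residue an epitope if at least min_fraction of the tools agree."""
    n_tools = len(predictions)
    return [
        1 if sum(column) / n_tools >= min_fraction else 0
        for column in zip(*predictions)
    ]


def precision(y_true, y_pred):
    """Fraction of predicted epitope residues that are annotated epitopes."""
    predicted = sum(y_pred)
    hits = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    return hits / predicted if predicted else 0.0


def random_precision(y_true):
    """Expected precision of a predictor that picks residues at random."""
    return sum(y_true) / len(y_true)


# Hypothetical binary outputs of three tools on the same eight residues.
tool_a = [1, 1, 0, 0, 1, 0, 0, 0]
tool_b = [1, 0, 0, 1, 1, 0, 0, 0]
tool_c = [0, 1, 0, 0, 1, 1, 0, 0]
truth = [1, 1, 0, 0, 0, 0, 0, 1]

consensus = consensus_vote([tool_a, tool_b, tool_c])
gain_over_random = precision(truth, consensus) - random_precision(truth)
```

Reporting the gain over such a baseline, rather than the raw score alone, makes it easier to see when a consensus is adding little beyond chance.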

Practical solutions

Several recent approaches point toward a more unified framework for B-cell epitope prediction, that is, one that does not treat linear and conformational epitopes as fundamentally separate problems. CALIBER, for instance, uses the same BiLSTM-based architecture for both tasks, trained independently, and achieves consistent results across epitope types. The model takes ESM-2 embeddings as input, showing that pretrained representations can support both types of prediction within a shared setup. BepiPred-3.0 corroborates this direction by using ESM-2 embeddings to capture long-range sequence context that reflects structural relationships, enabling accurate predictions even without explicit 3D input. These examples suggest that unified models, adapted at the training level, could simplify development and make prediction tools more versatile in practice.

Beyond these examples, large language models are increasingly being used to generate dense, contextual embeddings that encode subtle biochemical and structural signals from raw protein sequences. Their influence extends across epitope prediction tasks, including linear epitope models such as LBCE-XGB and LBCE-BERT. In parallel, structure-based methods continue to evolve by incorporating new forms of input. DiscoTope-3.0, for example, uses representations derived from inverse folding models and predicted structures, removing the need for crystallographic data while maintaining performance on conformational tasks. Combined with techniques such as positive-unlabeled (PU) learning to address gaps in structural labeling, these approaches extend the role of structural reasoning in epitope prediction without relying on experimentally determined inputs.

Finally, several studies point to the importance of refining the data itself. EpitopeTransfer, for instance, highlights the value of incorporating evolutionary relationships into model training, showing that phylogeny-aware fine-tuning can improve prediction for underrepresented organisms and enable better generalization with limited data.

Together, these developments reflect a broader shift toward more adaptable, data-efficient, and biologically informed models. By integrating advances in sequence representation, structural approximation, and context-aware training, along with emerging quantum computing-based solutions [84], the field is moving toward predictive frameworks that are not only more accurate, but also better aligned with the diversity and complexity of real-world immunological data.

In the future, it will be imperative to develop transfer learning frameworks that leverage large-scale unlabeled protein data through protein language models, which can substantially improve model generalization on small, specialized epitope datasets. Additionally, establishing unified and standardized benchmarks would enable rigorous, like-for-like comparison of diverse linear and conformational epitope prediction algorithms, fostering more transparent and reproducible progress in the field.

Future of generative AI in epitope characterization

While the majority of computational efforts have focused on predicting a single, dominant B-cell epitope from a static structure, a paradigm shift is underway toward generating the complete ensemble of conformational states that constitute an epitope’s functional profile. This shift is powered by generative machine learning models, such as Normalizing Flows [85] and Boltzmann Generators [86], which have demonstrated remarkable success in modeling the complex landscapes of small molecules and are now poised to transform epitope analysis. These models learn an invertible, differentiable transformation that maps a simple probability distribution to the complex, high-dimensional equilibrium distribution of a system; in this case, that distribution is the structural ensemble of an antigen’s surface [86].
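The core mechanism of a normalizing flow, an invertible map with a tractable density, can be shown in one dimension. The sketch below uses a single affine transformation of a standard normal variable, which is far simpler than the deep, learned flows of [85, 86] and is meant only to illustrate the change-of-variables rule: if x = scale · z + shift with z drawn from a base density, then log p(x) = log p_base(z) − log|scale|. The class and function names are hypothetical.

```python
import math
import random


def standard_normal_logpdf(z):
    """Log-density of the base distribution N(0, 1)."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))


class AffineFlow:
    """One invertible layer: x = scale * z + shift."""

    def __init__(self, scale, shift):
        if scale == 0:
            raise ValueError("scale must be non-zero for the map to be invertible")
        self.scale = scale
        self.shift = shift

    def forward(self, z):
        """Map a base sample to data space."""
        return self.scale * z + self.shift

    def inverse(self, x):
        """Map a data point back to base space."""
        return (x - self.shift) / self.scale

    def log_prob(self, x):
        """Change of variables: base log-density minus log|dx/dz|."""
        return standard_normal_logpdf(self.inverse(x)) - math.log(abs(self.scale))

    def sample(self, rng):
        """Draw from the base distribution and push the draw through the map."""
        return self.forward(rng.gauss(0.0, 1.0))


flow = AffineFlow(scale=2.0, shift=1.0)
rng = random.Random(0)
draws = [flow.sample(rng) for _ in range(5)]
log_density_at_shift = flow.log_prob(1.0)
```

Practical flows stack many such invertible layers with learned, input-dependent parameters, but the same bookkeeping of the Jacobian term is what keeps exact likelihoods and exact sampling available together.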

Mapping epitope flexibility and cryptic sites: A primary application of these models in epitope prediction is the rigorous sampling of an antigen’s conformational landscape. Unlike traditional molecular dynamics, which can become trapped in local energy minima, Boltzmann Generators can efficiently sample rare events and transitions. This capability allows for the identification of cryptic epitopes, i.e., transiently exposed binding sites that are absent from a single crystal structure but are critical for immune recognition and the design of broadly neutralizing antibodies. By learning the “shape” of the entire free energy surface, these models can generate the plausible conformations an epitope region can adopt, providing a dynamic view that is closer to biological reality [87].
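One ingredient of the Boltzmann Generator approach [86] is reweighting: samples drawn from the generative model q(x) are assigned weights proportional to exp(−u(x)) / q(x), where u is the energy in units of kT, so that averages reflect the equilibrium distribution. The toy sketch below illustrates this on a one-dimensional double-well energy, with an off-center Gaussian standing in for a trained flow. The energy function, the proposal parameters, and the function names are all assumptions chosen for illustration.

```python
import math
import random


def double_well_energy(x):
    """Toy reduced energy u(x) with symmetric minima at x = -1 and x = +1."""
    return (x * x - 1.0) ** 2


def gaussian_logpdf(x, mean, std):
    """Log-density of the proposal N(mean, std^2), a stand-in for a flow."""
    z = (x - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)


def right_well_population(n_samples=20000, mean=0.5, std=1.5, seed=0):
    """Estimate the equilibrium probability of x > 0 by importance reweighting."""
    rng = random.Random(seed)
    xs = [rng.gauss(mean, std) for _ in range(n_samples)]
    # Log-weights: Boltzmann factor divided by the proposal density.
    log_w = [-double_well_energy(x) - gaussian_logpdf(x, mean, std) for x in xs]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_w)
    weights = [math.exp(lw - top) for lw in log_w]
    total = sum(weights)
    # Self-normalized estimate of the population of the right-hand well.
    return sum(w for w, x in zip(weights, xs) if x > 0) / total


population = right_well_population()
```

Because the energy is symmetric, the reweighted population of the right-hand well should come out close to one half even though the proposal is deliberately shifted toward positive x; in an antigen setting the analogous quantity would be the equilibrium population of a rarely visited, epitope-exposing state.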

Antibody-aware epitope generation: Moving beyond antigen-agnostic flexibility, these generative frameworks can be conditioned on specific antibody paratopes. A model could be trained to generate only the conformational states of an antigen that are structurally and physico-chemically complementary to a given antibody’s binding site. This “antibody-aware” generation would directly address the challenge of predicting conformational epitopes, which are defined by the interaction with the antibody itself. It thus reframes the problem from “find the epitope on this structure” to “generate all antigen conformations that this antibody can bind” [88].

Accelerating the design of epitope-specific antibodies: The logical extension of this approach is the co-design of antibodies and epitopes. A generative model that understands the joint probability distribution of paratope and epitope conformations could be used to design novel antibody sequences that target a specific, desired epitope conformation. This is analogous to how Normalizing Flows are used for de novo small molecule design, but applied to the protein-protein interaction problem of antibody-antigen binding. This directly connects the challenge of epitope prediction to the ultimate goal of rational antibody design [89].

Conclusion

In conclusion, innovations in B-cell epitope prediction must overcome entrenched challenges such as data scarcity and limited algorithmic transparency to gain trust and facilitate widespread adoption. Collaborative efforts to standardize benchmarks and expand pathogen coverage are therefore critical for advancing the field. Ultimately, such advancements are indispensable for a more effective and proactive global response to infectious diseases and for the development of immunotherapies and vaccines.

Acknowledgements

Not applicable.

Author contributions

R.M. and F.C. initiated the study. F.G., A.S., R.M., F.C. collected, curated and aligned the literature. S.G., H.B., R.M. devised the challenges, possible solutions, future directions sections. F.C., A.S., S.G., R.M. created the figures and F.C. and R.M. created the tables. All authors wrote and proof-read the manuscript.

Funding

Not applicable.

Data availability

No new data was generated for this review.

Declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

All authors provide their consent for the publication.

Competing interests

Raghvendra Mall is a Scientific Advisor at Eternal U.A.E. and a Section Editor for the Medical Bioinformatics section of the Journal of Translational Medicine.

Footnotes

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Contributor Information

Filippo Castiglione, Email: filippo.castiglione@cnr.it.

Raghvendra Mall, Email: ramall@hbku.edu.qa.

References

1. Murphy K, Weaver C. Janeway’s Immunobiology. 9th ed. New York: Garland Science; 2017.
2. Owen JA, Punt J, Stranford SA, Jones PP. Kuby Immunology. New York: W.H. Freeman; 2013.
3. Potocnakova L, Bhide M, Pulzova LB. An introduction to b-cell epitope mapping and in silico epitope prediction. J Immunol Res. 2016;2016:6760830. 10.1155/2016/6760830.
4. Smyth M, Martin J. X-ray crystallography. Mol Pathol. 2000;53:8–14.
5. Grewal S, Hegde N, Sk Y. Integrating machine learning to advance epitope mapping. Front Immunol. 2024;15:1463931.
6. Cia G, Pucci F, Rooman M. Critical review of conformational b-cell epitope prediction methods. Brief Bioinform. 2023;24:bbac567. 10.1093/bib/bbac567.
7. Sparkman OD, Price P. Mass spectrometry desk reference. 2006.
8. Vita R, et al. The immune epitope database (iedb): 2018 update. Nucleic Acids Res. 2019;47:D339–43.
9. Elbasir A, et al. Deepcrystal: a deep learning framework for sequence-based protein crystallization prediction. Bioinformatics. 2019;35:2216–25.
10. Khurana S, et al. Deepsol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34:2605–13.
11. Liu T, Shi K, Li W. Deep learning methods improve linear b-cell epitope prediction. Bio Data Min. 2020;13:1–17.
12. Cai Y, et al. Artificial intelligence applied in neoantigen identification facilitates personalized cancer immunotherapy. Front Oncol. 2023;9:1054231.
13. Li Y, Wu X, Fang D, Luo Y. Informing immunotherapy with multi-omics driven machine learning. NPJ Digit Med. 2024;7:67.
14. Jani SP, Mall RP-A. Machine learning framework to identified personalized treatments for acute myeloid leukemia. 2024.
15. Ferdous S, Kelm S, Baker TS, Shi J, Martin ACR. B-cell epitopes: discontinuity and conformational analysis. Mol Immunol. 2019;114:643–50. 10.1016/j.molimm.2019.09.014.
16. Abramson J, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature. 2024;630:493–500.
17. Liang S, Zheng D, Yao B, Zhang C. Epces and epsvr: prediction of b-cell antigenic epitopes on protein surfaces with conformational information. 2020.
18. Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.
19. Brown T, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901.
20. Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30.
21. Brixi G, et al. Genome modeling and design across all domains of life with evo 2. bioRxiv. 2025;2025–02.
22. Cui H, et al. Scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nat Methods. 2024;21:1470–80.
23. Mall R, Singh A, Patel CN, Guirimand G, Castiglione F. Vish-pred: an ensemble of fine-tuned esm models for protein toxicity prediction. Briefings In Bioinf. 2024;25:bbae270.
24. Mall R, Kaushik R, Martinez ZA, Thomson MW, Castiglione F. Benchmarking protein language models for protein crystallization. Sci Rep. 2025;15:2381.
25. Liu Y, Liu Y, Wang S, Zhu X. Lbce-xgb: a xgboost model for predicting linear b-cell epitopes based on bert embeddings. Interdiscip Sci: Comput Life Sci. 2023;15:293–305.
26. Liu F, Yuan C, Chen H, Yang F. Prediction of linear b-cell epitopes based on protein sequence features and bert embeddings. Sci Rep. 2024;14:2464.
27. Campelo F, et al. Phylogeny-aware linear b-cell epitope predictor detects targets associated with immune response to orthopoxviruses. Briefings Bioinf. 2024;25:bbae527.
28. Zhou C, et al. Seppa 3.0—enhanced spatial epitope prediction enabling glycoprotein antigens. Nucleic Acids Res. 2019;47:W388–94.
29. Clifford JN, et al. Bepipred-3.0: improved b-cell epitope prediction using protein language models. Protein Sci. 2022;31:e4497.
30. Høie MH, et al. Discotope-3.0: improved b-cell epitope prediction using inverse folding latent representations. Front Immunol. 2024;15:1322712.
31. Sun J, et al. Seppa: a computational server for spatial epitope prediction of protein antigens. Nucleic Acids Res. 2009;37:W612–16.
32. Haste Andersen P, Nielsen M, Lund O. Prediction of residues in discontinuous b-cell epitopes using protein 3d structures. Protein Sci. 2006;15:2558–67.
33. Schlessinger A, Ofran Y, Yachdav G, Rost B. Epitome: database of structure-inferred antigenic epitopes. Nucleic Acids Res. 2006;34:D777–80.
34. Qi T, et al. Seppa 2.0—more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Res. 2014;42:W59–63.
35. Zhang W, et al. Prediction of conformational b-cell epitopes from 3d structures by random forests with a distance-based feature. BMC Bioinf. 2011;12:1–10.
36. Liang S, et al. Epsvr and epmeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinf. 2010;11:1–6.
37. Kringelum JV, Lundegaard C, Lund O, Nielsen M. Reliable b cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol. 2012;8:e1002829.
38. Jespersen MC, Peters B, Nielsen M, Marcatili P. Bepipred-2.0: improving sequence-based b-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 2017;45:W24–29.
39. Singh H, Ansari HR, Raghava GP. Improved method for linear b-cell epitope prediction using antigen’s primary sequence. PLoS One. 2013;8:e62216.
40. Steinegger M, Söding J. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–28.
41. Israeli S, Louzoun Y. Single-residue linear and conformational b cell epitopes prediction using random and esm-2 based projections. Briefings Bioinf. 2024;25:bbae084.
42. Schneider C, Raybould MI, Deane CM. Sabdab in the age of biotherapeutics: updates including sabdab-nano, the nanobody structure tracker. Nucleic Acids Res. 2022;50:D1368–72.
43. Lee S-Y, Lee A, Chen J, MacKinnon R. Structure of the kvap voltage-dependent k+ channel and its dependence on the lipid membrane. Proc Natl Acad Sci. 2005;102:15441–46.
44. Park H, et al. Data-driven enhancement of cubic phase stability in mixed-cation perovskites. Mach Learn: Sci Technol. 2021;2:025030.
45. Mall R, Suykens JA. Sparse reductions for fixed-size least squares support vector machines on large scale data. 2013.
46. Mall R, et al. Four types of toxic people: characterizing online users’ toxicity over time. 2020.
47. Mendes M, et al. Iedb-3d 2.0: structural data analysis within the immune epitope database. Protein Sci. 2023;32:e4605.
48. Negi SS, Schein CH, Braun W. The updated structural database of allergenic proteins (sdap 2.0) provides 3d models for allergens and incorporated bioinformatics tools. J Allergy And Clin Immunol: Global. 2023;2:100162.
49. Korber B. HIV molecular immunology 2002. DIANE Publishing; 2002.
50. Koşaloğlu-Yalçín Z, et al. The cancer epitope database and analysis resource (cedar). Nucleic Acids Res. 2023;51:D845–52.
51. Vita R, et al. The immune epitope database (iedb): 2018 update. Nucleic Acids Res. 2019;47:D339–43.
52. Ong E, Wong MU, He Y. Identification of new features from known bacterial protective vaccine antigens enhances rational vaccine design. Front Immunol. 2017;8:1382.
53. Huang J, Honda W. Ced: a conformational epitope database. BMC Immunol. 2006;7:1–8.
54. Saha S, Bhasin M, Raghava GP. Bcipep: a database of b-cell epitopes. BMC Genomics. 2005;6:1–7.
55. Toseland CP, et al. Antijen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res. 2005;1:1–12.
56. Parker J, Guo D, Hodges R. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and x-ray-derived accessible sites. Biochemistry. 1986;25:5425–32.
57. Emini EA, Hughes JV, Perlow D, Boger J. Induction of hepatitis a virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol. 1985;55:836–39.
58. Kolaskar AS, Tongaonkar PC. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990;276:172–74.
59. Saha S, Raghava GPS. Bcepred: prediction of continuous b-cell epitopes in antigenic sequences using physico-chemical properties. 2004.
60. Larsen JEP, Lund O, Nielsen M. Improved method for predicting linear b-cell epitopes. Immunome Res. 2006;2:1–7.
61. Saha S, Raghava GPS. Prediction of continuous b-cell epitopes in an antigen using recurrent neural network. Proteins: Struct, Function, Bioinf. 2006;65:40–48.
62. Høie MH, et al. Netsurfp-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res. 2022;50:W510–15.
63. Sweredoski MJ, Baldi P. Cobepro: a novel system for predicting continuous b-cell epitopes. Protein Eng, Des Selection. 2009;22:113–20.
64. Rubinstein ND, Mayrose I, Martz E, Pupko T. Epitopia: a web-server for predicting b-cell epitopes. BMC Bioinf. 2009;10:1–6.
65. Ansari HR, Raghava GP. Identification of conformational b-cell epitopes in an antigen from its primary sequence. Immunome Res. 2010;6:1–9.
66. Liang S, Zheng D, Yao B, Zhang C. Epces and epsvr: prediction of b-cell antigenic epitopes on protein surfaces with conformational information. Immunoinformatics. 2020;289–97.
67. Tubiana J, Schneidman-Duhovny D, Wolfson HJ. Scannet: an interpretable geometric deep learning model for structure-based protein binding site prediction. Nat Methods. 2022;19:730–39.
68. Collatz M, et al. Epidope: a deep neural network for linear b-cell epitope prediction. Bioinformatics. 2021;37:448–55.
69. Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38:2102–10.
70. Bi Y, et al. An interpretable prediction model for identifying N(7)-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids. 2020;22:362–72.
71. Yuan X. Beetle: a framework for linear b-cell epitope prediction and classification. 2023.
72. Henikoff S, Henikoff JG. Performance evaluation of amino acid substitution matrices. Proteins: Struct, Function, Bioinf. 1993;17:49–61.
73. Leite LP, de Campos TE, Lobo FP, Campelo F. Epitopetransfer: a phylogeny-aware transfer learning framework for taxon-specific linear b-cell epitope prediction. bioRxiv. 2025;2025–04.
74. Rappuoli R. Reverse vaccinology. Curr Opin Microbiol. 2000;3:445–50.
75. Ponomarenko J, et al. Ellipro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinf. 2008;9:1–8.
76. Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving b-cell epitope prediction and its application to global antibody-antigen docking. Bioinformatics. 2014;30:2288–94.
77. Ivanisenko NV, et al. Sema 2.0: web-platform for b-cell conformational epitopes prediction using artificial intelligence. Nucleic Acids Res. 2024;52:W533–39.
78. Hu R-S, Gu K, Ehsan M, Abbas Raza SH, Wang C-R. Transformer-based deep learning enables improved b-cell epitope prediction in parasitic pathogens: a proof-of-concept study on fasciola hepatica. PLoS Neglected Trop Dis. 2025;19:e0012985.
79. Kumar N, Bajiya N, Patiyal S, Raghava GP. Multi-perspectives and challenges in identifying b-cell epitopes. Protein Sci. 2023;32:e4785.
80. La Marca AF, Lopes RDS, Lotufo ADP, Bartholomeu DC, Minussi CR. Bepfamn: a method for linear b-cell epitope predictions based on fuzzy-artmap artificial neural network. Sensors. 2022;22:4027.
81. Galanis KA, et al. Linear b-cell epitope prediction for in silico vaccine design: a performance review of methods available via command-line interface. Int J Mol Sci. 2021;22:3210.
82. Choi S, Kim D. B cell epitope prediction by capturing spatial clustering property of the epitopes using graph attention network. Sci Rep. 2024;14:27496.
83. Zeng Y, et al. Identifying b-cell epitopes using alphafold2 predicted structures and pretrained language model. Bioinformatics. 2023;39:btad187.
84. Hwang C-C, Hong Y-A. A study on b-cell epitope prediction based on qsvm and vqc. arXiv preprint arXiv:2504.11846. 2025.
85. Papamakarios G, Nalisnick E, Rezende DJ, Mohamed S, Lakshminarayanan B. Normalizing flows for probabilistic modeling and inference. J Mach Learn Res. 2021;22:1–64.
86. Noé F, Olsson S, Köhler J, Wu H. Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science. 2019;365:eaaw1147.
87. Petsalaki E, Russell RB. Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin In Biotechnol. 2008;19:344–50.
88. Ardizzone L, et al. Conditional invertible neural networks for diverse image-to-image translation. 2020.
89. Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Adv Neural Inf Process Syst. 2019;32.

Associated Data

This section collects any data citations, data availability statements, or supplementary materials included in this article.

Data Availability Statement

No new data was generated for this review.

[CR1] 1.Murphy K, Weaver C. Janeway’s Immunobiology. 9th. Garland Science, New York; 2017.

[CR2] 2.Owen JA, Punt J, Stranford SA, Jones PP. Kuby Immunology. New York: W.H. Freeman; 2013.

[CR3] 3.Potocnakova L, Bhide M, Pulzova LB. An introduction to b-cell epitope mapping and in silico epitope prediction. J Immunol Res. 2016;2016:6760830. 10.1155/2016/6760830. [DOI] [PMC free article] [PubMed]

[CR4] 4.Smyth M, Martin J. X-ray crystallography. Mol Pathol. 2000;53:8–14. [DOI] [PMC free article] [PubMed]

[CR5] 5.Grewal S, Hegde N, Sk Y. Integrating machine learning to advance epitope mapping. Front. Immunol. 2024;15:1463931. [DOI] [PMC free article] [PubMed]

[CR6] 6.Cia G, Pucci F, Rooman M. Critical review of conformational b-cell epitope prediction methods. Brief Bioinform. 2023;24:bbac567. 10.1093/bib/bbac567. [DOI] [PubMed]

[CR7] 7.Sparkman OD, Price P. Mass spectrometry desk reference. 2006.

[CR8] 8.Vita R, et al. The immune epitope database (iedb): 2018 update. Nucleic Acids Res. 2019;8:D339–43. [DOI] [PMC free article] [PubMed]

[CR9] 9.Elbasir A, et al. Deepcrystal: a deep learning framework for sequence-based protein crystallization prediction. Bioinformatics. 2019;35:2216–25. [DOI] [PubMed]

[CR10] 10.Khurana S, et al. Deepsol: a deep learning framework for sequence-based protein solubility prediction. Bioinformatics. 2018;34:2605–13. [DOI] [PMC free article] [PubMed]

[CR11] 11.Liu T, Shi K, Li W. Deep learning methods improve linear b-cell epitope prediction. Bio Data Min. 2020;13:1–17. [DOI] [PMC free article] [PubMed]

[CR12] 12.Cai Y, et al. Artificial intelligence applied in neoantigen identification facilitates personalized cancer immunotherapy. Front Oncol. 2023;9:1054231. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR13] 13.Li Y, Wu X, Fang D, Luo Y. Informing immunotherapy with multi-omics driven machine learning. NPJ Digit Med. 2024;7:67. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR14] 14.Jani SP, Mall RP-A. Machine learning framework to identified personalized treatments for acute myeloid leukemia. 2024.

[CR15] 15.Ferdous S, Kelm S, Baker TS, Shi J, Martin ACR. B-cell epitopes: discontinuity and conformational analysis. Mol Immunol. 2019;114:643–50. 10.1016/j.molimm.2019.09.014. [DOI] [PubMed] [Google Scholar]

[CR16] 16.Abramson J, et al. Accurate structure prediction of biomolecular interactions with alphafold 3. Nature. 2024;630:493–500. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR17] 17.Liang S, Zheng D, Yao B, Zhang C. Epces and epsvr: prediction of b-cell antigenic epitopes on protein surfaces with conformational information. 2020. [DOI] [PubMed]

[CR18] 18.Vaswani A, et al. Attention is all you need. Adv Neural Inf Process Syst. 2017;30.

[CR19] 19.Brown T, et al. Language models are few-shot learners. Adv Neural Inf Process Syst. 2020;33:1877–901. [Google Scholar]

[CR20] 20.Lin Z, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science. 2023;379:1123–30. [DOI] [PubMed] [Google Scholar]

[CR21] 21.Brixi G, et al. Genome modeling and design across all domains of life with evo 2. bioRxiv. 2025;2025–02.

[CR22] 22.Cui H, et al. Scgpt: toward building a foundation model for single-cell multi-omics using generative ai. Nat Methods. 2024;21:1470–80. [DOI] [PubMed] [Google Scholar]

[CR23] 23.Mall R, Singh A, Patel CN, Guirimand G, Castiglione F. Vish-pred: an ensemble of fine-tuned esm models for protein toxicity prediction. Briefings In Bioinf. 2024;25:bbae270. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR24] 24.Mall R, Kaushik R, Martinez ZA, Thomson MW, Castiglione F. Benchmarking protein language models for protein crystallization. Sci Rep. 2025;15:2381. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR25] 25.Liu Y, Liu Y, Wang S, Zhu X. Lbce-xgb: a xgboost model for predicting linear b-cell epitopes based on bert embeddings. Interdiscip Sci: Comput Life Sci. 2023;15:293–305. [DOI] [PubMed] [Google Scholar]

[CR26] 26.Liu F, Yuan C, Chen H, Yang F. Prediction of linear b-cell epitopes based on protein sequence features and bert embeddings. Sci Rep. 2024;14:2464. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR27] 27.Campelo F, et al. Phylogeny-aware linear b-cell epitope predictor detects targets associated with immune response to orthopoxviruses. Briefings Bioinf. 2024;25:bbae527. [DOI] [PMC free article] [PubMed] [Google Scholar]

[CR28] 28.Zhou C, et al. Seppa 3.0—enhanced spatial epitope prediction enabling glycoprotein antigens. Nucleic Acids Res. 2019;47:W388–94.

[CR29] 29.Clifford JN, et al. Bepipred-3.0: improved b-cell epitope prediction using protein language models. Protein Sci. 2022;31:e4497.

[CR30] 30.Høie MH, et al. Discotope-3.0: improved b-cell epitope prediction using inverse folding latent representations. Front Immunol. 2024;15:1322712.

[CR31] 31.Sun J, et al. Seppa: a computational server for spatial epitope prediction of protein antigens. Nucleic Acids Res. 2009;37:W612–16.

[CR32] 32.Haste Andersen P, Nielsen M, Lund O. Prediction of residues in discontinuous b-cell epitopes using protein 3d structures. Protein Sci. 2006;15:2558–67.

[CR33] 33.Schlessinger A, Ofran Y, Yachdav G, Rost B. Epitome: database of structure-inferred antigenic epitopes. Nucleic Acids Res. 2006;34:D777–80.

[CR34] 34.Qi T, et al. Seppa 2.0—more refined server to predict spatial epitope considering species of immune host and subcellular localization of protein antigen. Nucleic Acids Res. 2014;42:W59–63.

[CR35] 35.Zhang W, et al. Prediction of conformational b-cell epitopes from 3d structures by random forests with a distance-based feature. BMC Bioinf. 2011;12:1–10.

[CR36] 36.Liang S, et al. Epsvr and epmeta: prediction of antigenic epitopes using support vector regression and multiple server results. BMC Bioinf. 2010;11:1–6.

[CR37] 37.Kringelum JV, Lundegaard C, Lund O, Nielsen M. Reliable b cell epitope predictions: impacts of method development and improved benchmarking. PLoS Comput Biol. 2012;8:e1002829.

[CR38] 38.Jespersen MC, Peters B, Nielsen M, Marcatili P. Bepipred-2.0: improving sequence-based b-cell epitope prediction using conformational epitopes. Nucleic Acids Res. 2017;45:W24–29.

[CR39] 39.Singh H, Ansari HR, Raghava GP. Improved method for linear b-cell epitope prediction using antigen’s primary sequence. PLoS One. 2013;8:e62216.

[CR40] 40.Steinegger M, Söding J. Mmseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026–28.

[CR41] 41.Israeli S, Louzoun Y. Single-residue linear and conformational b cell epitopes prediction using random and esm-2 based projections. Briefings Bioinf. 2024;25:bbae084.

[CR42] 42.Schneider C, Raybould MI, Deane CM. Sabdab in the age of biotherapeutics: updates including sabdab-nano, the nanobody structure tracker. Nucleic Acids Res. 2022;50:D1368–72.

[CR43] 43.Lee S-Y, Lee A, Chen J, MacKinnon R. Structure of the kvap voltage-dependent k+ channel and its dependence on the lipid membrane. Proc Natl Acad Sci USA. 2005;102:15441–46.

[CR44] 44.Park H, et al. Data-driven enhancement of cubic phase stability in mixed-cation perovskites. Mach Learn: Sci Technol. 2021;2:025030.

[CR45] 45.Mall R, Suykens JA. Sparse reductions for fixed-size least squares support vector machines on large scale data. 2013.

[CR46] 46.Mall R, et al. Four types of toxic people: characterizing online users’ toxicity over time. 2020.

[CR47] 47.Mendes M, et al. Iedb-3d 2.0: structural data analysis within the immune epitope database. Protein Sci. 2023;32:e4605.

[CR48] 48.Negi SS, Schein CH, Braun W. The updated structural database of allergenic proteins (sdap 2.0) provides 3d models for allergens and incorporated bioinformatics tools. J Allergy And Clin Immunol: Global. 2023;2:100162.

[CR49] 49.Korber B. HIV molecular immunology 2002. DIANE Publishing; 2002.

[CR50] 50.Koşaloğlu-Yalçín Z, et al. The cancer epitope database and analysis resource (cedar). Nucleic Acids Res. 2023;51:D845–52.

[CR51] 51.Vita R, et al. The immune epitope database (iedb): 2018 update. Nucleic Acids Res. 2019;47:D339–43.

[CR52] 52.Ong E, Wong MU, He Y. Identification of new features from known bacterial protective vaccine antigens enhances rational vaccine design. Front Immunol. 2017;8:1382.

[CR53] 53.Huang J, Honda W. Ced: a conformational epitope database. BMC Immunol. 2006;7:1–8.

[CR54] 54.Saha S, Bhasin M, Raghava GP. Bcipep: a database of b-cell epitopes. BMC Genomics. 2005;6:1–7.

[CR55] 55.Toseland CP, et al. Antijen: a quantitative immunology database integrating functional, thermodynamic, kinetic, biophysical, and cellular data. Immunome Res. 2005;1:1–12.

[CR56] 56.Parker J, Guo D, Hodges R. New hydrophilicity scale derived from high-performance liquid chromatography peptide retention data: correlation of predicted surface residues with antigenicity and x-ray-derived accessible sites. Biochemistry. 1986;25:5425–32.

[CR57] 57.Emini EA, Hughes JV, Perlow D, Boger J. Induction of hepatitis a virus-neutralizing antibody by a virus-specific synthetic peptide. J Virol. 1985;55:836–39.

[CR58] 58.Kolaskar AS, Tongaonkar PC. A semi-empirical method for prediction of antigenic determinants on protein antigens. FEBS Lett. 1990;276:172–74.

[CR59] 59.Saha S, Raghava GPS. Bcepred: prediction of continuous b-cell epitopes in antigenic sequences using physico-chemical properties. 2004.

[CR60] 60.Larsen JEP, Lund O, Nielsen M. Improved method for predicting linear b-cell epitopes. Immunome Res. 2006;2:1–7.

[CR61] 61.Saha S, Raghava GPS. Prediction of continuous b-cell epitopes in an antigen using recurrent neural network. Proteins: Struct, Function, Bioinf. 2006;65:40–48.

[CR62] 62.Høie MH, et al. Netsurfp-3.0: accurate and fast prediction of protein structural features by protein language models and deep learning. Nucleic Acids Res. 2022;50:W510–15.

[CR63] 63.Sweredoski MJ, Baldi P. Cobepro: a novel system for predicting continuous b-cell epitopes. Protein Eng, Des Selection. 2009;22:113–20.

[CR64] 64.Rubinstein ND, Mayrose I, Martz E, Pupko T. Epitopia: a web-server for predicting b-cell epitopes. BMC Bioinf. 2009;10:1–6.

[CR65] 65.Ansari HR, Raghava GP. Identification of conformational b-cell epitopes in an antigen from its primary sequence. Immunome Res. 2010;6:1–9.

[CR66] 66.Liang S, Zheng D, Yao B, Zhang C. Epces and epsvr: prediction of b-cell antigenic epitopes on protein surfaces with conformational information. Immunoinformatics. 2020;289–97.

[CR67] 67.Tubiana J, Schneidman-Duhovny D, Wolfson HJ. Scannet: an interpretable geometric deep learning model for structure-based protein binding site prediction. Nat Methods. 2022;19:730–39.

[CR68] 68.Collatz M, et al. Epidope: a deep neural network for linear b-cell epitope prediction. Bioinformatics. 2021;37:448–55.

[CR69] 69.Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. Proteinbert: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38:2102–10.

[CR70] 70.Bi Y, et al. An interpretable prediction model for identifying N(7)-methylguanosine sites based on XGBoost and SHAP. Mol Ther Nucleic Acids. 2020;22:362–72.

[CR71] 71.Yuan X. Beetle: a framework for linear b-cell epitope prediction and classification. 2023.

[CR72] 72.Henikoff S, Henikoff JG. Performance evaluation of amino acid substitution matrices. Proteins: Struct, Function, Bioinf. 1993;17:49–61.

[CR73] 73.Leite LP, de Campos TE, Lobo FP, Campelo F. Epitopetransfer: a phylogeny-aware transfer learning framework for taxon-specific linear b-cell epitope prediction. bioRxiv. 2025;2025–04.

[CR74] 74.Rappuoli R. Reverse vaccinology. Curr Opin Microbiol. 2000;3:445–50.

[CR75] 75.Ponomarenko J, et al. Ellipro: a new structure-based tool for the prediction of antibody epitopes. BMC Bioinf. 2008;9:1–8.

[CR76] 76.Krawczyk K, Liu X, Baker T, Shi J, Deane CM. Improving b-cell epitope prediction and its application to global antibody-antigen docking. Bioinformatics. 2014;30:2288–94.

[CR77] 77.Ivanisenko NV, et al. Sema 2.0: web-platform for b-cell conformational epitopes prediction using artificial intelligence. Nucleic Acids Res. 2024;52:W533–39.

[CR78] 78.Hu R-S, Gu K, Ehsan M, Abbas Raza SH, Wang C-R. Transformer-based deep learning enables improved b-cell epitope prediction in parasitic pathogens: a proof-of-concept study on fasciola hepatica. PLoS Neglected Trop Dis. 2025;19:e0012985.

[CR79] 79.Kumar N, Bajiya N, Patiyal S, Raghava GP. Multi-perspectives and challenges in identifying b-cell epitopes. Protein Sci. 2023;32:e4785.

[CR80] 80.La Marca AF, Lopes RDS, Lotufo ADP, Bartholomeu DC, Minussi CR. Bepfamn: a method for linear b-cell epitope predictions based on fuzzy-artmap artificial neural network. Sensors. 2022;22:4027.

[CR81] 81.Galanis KA, et al. Linear b-cell epitope prediction for in silico vaccine design: a performance review of methods available via command-line interface. Int J Mol Sci. 2021;22:3210.

[CR82] 82.Choi S, Kim D. B cell epitope prediction by capturing spatial clustering property of the epitopes using graph attention network. Sci Rep. 2024;14:27496.

[CR83] 83.Zeng Y, et al. Identifying b-cell epitopes using alphafold2 predicted structures and pretrained language model. Bioinformatics. 2023;39:btad187.

[CR84] 84.Hwang C-C, Hong Y-A. A study on b-cell epitope prediction based on qsvm and vqc. arXiv preprint arXiv:2504.11846. 2025.

[CR85] 85.Papamakarios G, Nalisnick E, Rezende DJ, Mohamed S, Lakshminarayanan B. Normalizing flows for probabilistic modeling and inference. J Mach Learn Res. 2021;22:1–64.

[CR86] 86.Noé F, Olsson S, Köhler J, Wu H. Boltzmann generators: sampling equilibrium states of many-body systems with deep learning. Science. 2019;365:eaaw1147.

[CR87] 87.Petsalaki E, Russell RB. Peptide-mediated interactions in biological systems: new discoveries and applications. Curr Opin In Biotechnol. 2008;19:344–50.

[CR88] 88.Ardizzone L, et al. Conditional invertible neural networks for diverse image-to-image translation. 2020.

[CR89] 89.Ingraham J, Garg V, Barzilay R, Jaakkola T. Generative models for graph-based protein design. Adv Neural Inf Process Syst. 2019;32.
