Briefings in Bioinformatics. 2024 May 26;25(4):bbae245. doi: 10.1093/bib/bbae245

Accurate prediction of antibody function and structure using bio-inspired antibody language model

Hongtai Jing 1,2,3, Zhengtao Gao 4, Sheng Xu 5, Tao Shen 6,7, Zhangzhi Peng 8, Shwai He 9, Tao You 10, Shuang Ye 11,12, Wei Lin 13,14,15,16,17, Siqi Sun 18,19
PMCID: PMC11128484  PMID: 38797969

Abstract

In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset of 336 million unlabeled antibody sequences that are nonredundant at 40% identity, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold and OmegaFold on the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials. The BALMFold structure prediction server is freely available at https://beamlab-sh.com/models/BALMFold.

Keywords: language model, antibody, structure prediction, binding properties

Introduction

Antibodies are essential immune proteins produced by B cells in response to the invasion of foreign substances (known as antigens) such as bacteria, viruses and other pathogens. These specialized proteins recognize antigens with high specificity and trigger downstream immune reactions that destroy and eliminate them. Antibodies are therefore widely used in clinical medicine, e.g. to identify viral infection or to reactivate T-cell immunity in cancer treatment. High-throughput sequencing technologies targeting the antibody repertoire enable fast acquisition of massive antibody sequence data during antibody maturation, which has remarkably reshaped our comprehension of humoral immune responses [1]. However, the development of antibody-based diagnostics and therapeutics remains costly and time-consuming under current wet-lab protocols, which operate with insufficient structural information. Using computational methods to predict antibody structures and functions from their sequences could significantly reduce the number of trial-and-error rounds during antibody screening and characterization, and hence greatly improve the efficiency of therapeutic antibody development.

The human antibody’s structure is Y-shaped, comprising two identical heavy chains and two identical light chains. The high specificity of antigenic recognition is primarily governed by three complementarity-determining regions (CDRs) located in the variable fragment (Fv) at the tips of the antibody. Among these three CDRs, CDR 1 and CDR 2 have relatively low sequence diversity and limited structural conformations. In contrast, the third CDR of the heavy chain (CDR H3) is the most diverse portion and plays a crucial role in recognizing a wide range of antigens [2, 3]. Therefore, computational modeling of CDR H3 structures remains challenging and requires a profound understanding of the available CDR H3 sequences (in the billions) and the limited 3D structural data (only in the tens of thousands).

The advancements in deep learning and natural language processing techniques have led to a tremendous development of protein language models, which hold promise in decoding protein functional properties [4–10] and predicting protein structure [11, 12]. These models utilize self-supervised learning paradigms on extensive protein sequence datasets, enabling them to extract intrinsic interdependencies and evolutionary traits critical for precise protein structure and function prediction. For antibodies, previous methods [13, 14] straightforwardly train transformer-based language models using antibody sequences from the Observed Antibody Space (OAS) [15] dataset without aligning highly conserved residues to particular positions. Recent studies [16, 17] have advanced the antibody language model field by incorporating paired sequences and applying transfer learning to antibody hypervariable regions. When addressing antibodies, conventional methods often fail to detect the subtle variations in the highly conserved immunoglobulin fold structure. Utilizing antibody-specific characteristics from unlabeled sequences can produce more biologically relevant representations. These representations facilitate predicting both structure and function independently of homology and template searches. In addition, around 40% of the sequences in the OAS dataset are affected by sequencing artifacts [18], potentially impeding the language model’s ability to capture contextual information effectively. Moreover, given that different positions in antibodies exhibit varying evolutionary rates, it becomes crucial for language models to allocate more attention to learning the distribution of amino acids at those highly diverse positions.

Regarding antibody structure prediction, numerous methods have been developed based on deep learning frameworks [19–23]. AlphaFold2 [11] and AlphaFold2-Multimer [24] have notably demonstrated atomic-level accuracy for general protein structures. However, AlphaFold2 relies heavily on multiple sequence alignments (MSAs), and for antibodies it is challenging to find conserved regions or discern homologous templates. Precisely predicting the structures of CDR loops from individual antibody sequences therefore remains a significant challenge. Although OmegaFold [25] and ESMFold [26] have shown potential in predicting monomeric protein structures from individual sequences, they may not fully capture the characteristics inherent to antibody structures. IgFold [23], which leverages the pre-trained antibody language model AntiBERTy [14] and graph networks, offers faster prediction of antibody structures than AlphaFold2. However, IgFold was found to be suboptimal at predicting the conformations of nanobody CDR H3 loops [23], even with the inclusion of crystal structure templates as additional information.

In this article, we propose the Bio-inspired Antibody Language Model (BALM), which incorporates antibody-aware positional information into the position embedding and employs an adaptive mask strategy in masked language modeling (MLM) to capture the biological characteristics of antibodies. This specialized antibody language model holds the potential to decipher antibody repertoires and offer valuable guidance in therapeutic development efforts.

Methods

Overview of BALM

To comprehend antibody functional properties and structures, we propose an antibody-specific language model designed to efficiently capture the rich information and representation inherent in repertoires. The language model’s learned representation can provide biological properties essential for function and structure prediction in antibody engineering.

Largely adhering to the ESM-2 architecture [26] with 150 million parameters, we incorporated 30 transformer encoder blocks [27], each containing a multi-head self-attention layer with 20 attention heads and a feed-forward layer, with a hidden size of 640, layer normalization and residual connections. A classification token is prepended to each sequence, enabling the prediction of various tasks. In consideration of the maximum length of the variable heavy chain region of antibodies, input sequences are padded or truncated to a standardized length of 168. While the original ESM-2 architecture employs rotary position embedding (RoPE) [28] to supply token positional information, we substituted RoPE with our bio-inspired antibody positional embedding to account for the distinct functional regions of antibody sequences.
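As a rough sanity check on the stated model size, the arithmetic below counts only the large weight matrices implied by 30 encoder blocks with a hidden size of 640 and the 33-token vocabulary given under Training details. The feed-forward inner width of 4 × 640 is an assumption based on the usual transformer convention (it is not stated in this section), and biases, layer normalization and the output head are ignored.

```python
# Back-of-the-envelope parameter count for the encoder described above.
# Assumption (not stated in the text): feed-forward inner width = 4 * hidden.
HIDDEN = 640
LAYERS = 30
VOCAB = 33
FFN = 4 * HIDDEN

attention_weights = 4 * HIDDEN * HIDDEN    # query, key, value and output projections
feed_forward_weights = 2 * HIDDEN * FFN    # up-projection and down-projection
per_layer = attention_weights + feed_forward_weights
token_embedding = VOCAB * HIDDEN

total = LAYERS * per_layer + token_embedding
print(f"{total:,}")  # 147,477,120
```

Under these assumptions the count lands close to the quoted 150 million parameters, with the remainder attributable to the terms left out here.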

Bio-inspired antibody positional embedding

In the antibody-specific language model, each amino acid is regarded as a token and projected into a token embedding. The permutation-invariant self-attention mechanism does not inherently account for the order of input tokens. To address this, a positional embedding is used to encode the location of each token in the sequence, giving each position a unique representation. This positional embedding is then added to the token embedding before the result is fed into the encoder block.

Existing alternative approaches, however, do not incorporate antibody evolutionary information. To effectively harness sequential information, we propose a bio-inspired antibody position embedding that aims to (1) exploit evolutionary information in variable regions of antibody sequences, (2) reduce dependency on high-quality MSAs and (3) address the deficiency of a complete antibody corpus in the OAS dataset.

The ImMunoGeneTics (IMGT) system [29] is a standardized numbering method for annotating variable domains of immunoglobulins (IGs) and T cell receptors. The unique IMGT numbering relies on the highly conserved structural features of the variable region. Different regions are expected to contain a specific number of amino acids. Gaps are created in regions with fewer amino acids than expected, and when the CDR 3 contains more than 13 amino acids, additional positions are inserted between positions 111 and 112. Consequently, there are 128 positions for IG V-domain annotation in antibody sequences with no more than 13 amino acids in the CDR 3.

By specifically emphasizing the positions of CDRs and framework regions, we observe that BALM effectively investigates functionally important regions in a series of downstream tasks and structure prediction tasks. The unique numbering system enhances the prediction and understanding of sequences and structures across a diverse array of antibodies.
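To make the idea of gapped, region-aware position IDs concrete, here is a minimal sketch of turning IMGT-numbered residues into integer IDs for an embedding table. The input layout (a list of `((position, insertion_code), residue)` tuples), the function name `imgt_position_ids`, the use of 0 as a padding ID, and the choice to give inserted CDR 3 residues IDs after 128 are all assumptions made for illustration; the text does not specify how BALM encodes insertions or padding.

```python
MAX_LEN = 168          # padded/truncated length stated in the text
IMGT_POSITIONS = 128   # IMGT V-domain positions without CDR 3 insertions
PAD_ID = 0             # assumption: 0 is reserved for padding

def imgt_position_ids(numbered, max_len=MAX_LEN):
    """Map IMGT-numbered residues to integer position IDs.

    `numbered` is a list of ((position, insertion_code), residue) tuples in
    sequence order; a residue of '-' marks an unoccupied IMGT position.
    Assumption: residues with a non-blank insertion code receive IDs after
    128, in order of appearance.
    """
    residues, ids = [], []
    next_insertion_id = IMGT_POSITIONS + 1
    for (position, insertion_code), residue in numbered:
        if residue == "-":
            continue  # gap: this ID is skipped, leaving a jump in the ID sequence
        if insertion_code.strip():
            ids.append(next_insertion_id)
            next_insertion_id += 1
        else:
            ids.append(position)
        residues.append(residue)
    residues, ids = residues[:max_len], ids[:max_len]
    return residues, ids + [PAD_ID] * (max_len - len(ids))

# Hypothetical fragment around the CDR 3 insertion site (between 111 and 112).
numbered = [
    ((104, " "), "C"), ((105, " "), "A"), ((106, " "), "R"),
    ((107, " "), "-"), ((111, " "), "D"), ((111, "A"), "Y"),
    ((112, "A"), "G"), ((112, " "), "W"),
]
residues, ids = imgt_position_ids(numbered)
print(residues)   # ['C', 'A', 'R', 'D', 'Y', 'G', 'W']
print(ids[:7])    # [104, 105, 106, 111, 129, 130, 112]
```

Unlike a plain 1, 2, 3, ... index, the IDs jump wherever a region is shorter than its IMGT allotment, so the same ID always refers to the same structural position across antibodies.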

BALMFold architecture

Leveraging the potent capabilities of the antibody-specific model BALM, BALMFold demonstrates remarkable precision in predicting antibody structures, especially within the highly variable CDRs. The primary architecture of BALMFold comprises two elements: BALM, responsible for extracting information from the antibody sequence, and the folding block, inclusive of the BAformer and structure module, as illustrated in Fig. 1A.

Figure 1.

Overview of BALM architecture with downstream evaluation.  A, The arrows illustrate the information flow through various blocks within the BALM, which undergoes pre-training on 336M unlabeled antibody sequences clustered with 40% identity using LinClust. The pre-training phase involves MLM, wherein the masked amino acids are predicted. Subsequently, BALM undergoes evaluation across functional properties and structure prediction tasks. These tasks include predicting antigen-binding properties, such as antigen binding, antibody paratope, antibody affinity and antibody redundancy. Leveraging large-scale sequential antibody data, BALM achieves state-of-the-art performance in each task after fine-tuning. Additionally, utilizing structural information embeddings from pre-training, BALMFold introduces an end-to-end approach for predicting the full atomic structure of antibodies. BALMFold includes 30 transformer layers from BALM, four layers of BAformer and eight layers of the structure module. Benchmark results show that BALMFold outperforms alternative methods in accuracy and speed. B, This panel illustrates the preprocessing of input sequences for BALM. Positions in the antibody sequence are determined using the ANARCI tool and based on the IMGT numbering scheme, which provides biological positional information for each amino acid in the sequence. During pre-training with the MLM objective, residues are masked based on their distribution in each region, as opposed to conventional random masking strategies. Before being input into BALM, token embeddings and positional encodings are concatenated. C, This section displays the entropy distribution of amino acids in CDR H3 and FR4 from positions 105 to 128. In the left figure, variations in the circle’s radius represent different positions. The comparison of entropy between the Framework region and CDRs motivates the use of biological features to enhance the model’s capability. 
The amino acid distribution is derived from all heavy chain sequences of paired antibodies obtained from the OAS database.

Within the framework, BALM processes the antibody sequence to construct residue embeddings that encode the learned amino acid representations. Pair embeddings are generated by combining the single (per-residue) representations pairwise, thus encoding learned residue–residue interactions. These residue and pair embeddings are then passed to the BAformer, a four-layer module that refines features by exchanging information between the single and pair representations.

The BAformer is structured into two separate elements that individually update single and pair embeddings. The first component incorporates four self-attention and transition blocks, which together form a feed-forward network responsible for updating the single embedding by discerning global dependencies within antibody sequences. Following its update, the single embedding undergoes a pairwise product procedure that computes interactions between individual amino acids via an outer product operation. The second component merges the pair embedding with the computed pairwise features, after which the pair embedding is updated using four triangle update and transition blocks. Consequently, a single layer of the BAformer module corresponds functionally to four layers of the embedding update modules in AlphaFold2 [11], but delivers greater computational efficiency. The structure module, with shared weight parameters, subsequently employs the refined residue and pair embeddings to predict the 3D coordinates of protein backbones and side chains. It comprises eight invariant point attention (IPA) layers tasked with predicting the positions, orientations and angles of backbones and side chains. After each layer completes, the predicted positions are passed to the succeeding layer as its starting structure. Furthermore, at a global level, we recycle single and pair embeddings back to the BAformer three times. With the support of our pre-trained BALM, our model removes the need for exhaustive searches for sequence homologs and structural templates, significantly reducing runtime while sustaining excellent accuracy.
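The outer product step mentioned above can be illustrated in a few lines. This is only a sketch of the general operation: for every residue pair (i, j) it forms all channel-by-channel products of the two single embeddings. The function name `pairwise_outer_product` is hypothetical, and the learned projections that BAformer presumably applies before and after this step are omitted because the text does not describe them.

```python
def pairwise_outer_product(single):
    """Build pair features from per-residue (single) embeddings.

    `single` is a list of L vectors, each with c channels. The result is an
    L x L grid in which entry [i][j] holds the flattened c*c outer product
    of residue i's vector with residue j's vector.
    """
    pair = []
    for vec_i in single:
        row = []
        for vec_j in single:
            row.append([a * b for a in vec_i for b in vec_j])
        pair.append(row)
    return pair

# Two residues with two channels each (toy numbers).
single = [[1.0, 2.0], [3.0, 4.0]]
pair = pairwise_outer_product(single)
print(pair[0][1])  # [3.0, 4.0, 6.0, 8.0]
print(pair[1][0])  # [3.0, 6.0, 4.0, 8.0]
```

Note that the feature for (i, j) differs from that for (j, i), so the pair representation can carry directional information between residues.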

During the BALMFold training phase, we freeze the weights of our pre-trained antibody language model while minimizing the loss function. As indicated in Eq. (1), the ensemble training loss combines four components: frame aligned point error (FAPE) loss on all atoms denoted by $\mathcal{L}_{\mathrm{FAPE}}$, distogram loss $\mathcal{L}_{\mathrm{dist}}$, confidence loss $\mathcal{L}_{\mathrm{conf}}$ and structure violation loss $\mathcal{L}_{\mathrm{viol}}$. The primary component $\mathcal{L}_{\mathrm{FAPE}}$, introduced by AlphaFold2 [11], quantifies the discrepancies in inter-residue distances and orientations between predicted and actual atom coordinates following global alignment. Additionally, $\mathcal{L}_{\mathrm{dist}}$ signifies an averaged cross-entropy loss aimed at minimizing the divergence between the predicted and actual distogram derived from antibody structures, and $\mathcal{L}_{\mathrm{conf}}$ involves the predicted lDDT [30] extracted from the final residue representations of the structure module, intended to penalize instances where confidence and accuracy are misaligned. Finally, $\mathcal{L}_{\mathrm{viol}}$ includes violations of steric constraints, such as angles and bond lengths. To avoid potential structural clashes, the output of BALMFold optionally undergoes relaxation through the Amber relaxation module, which serves as an energy minimization step.

$$\mathcal{L} = \lambda_{\mathrm{FAPE}}\,\mathcal{L}_{\mathrm{FAPE}} + \lambda_{\mathrm{dist}}\,\mathcal{L}_{\mathrm{dist}} + \lambda_{\mathrm{conf}}\,\mathcal{L}_{\mathrm{conf}} + \lambda_{\mathrm{viol}}\,\mathcal{L}_{\mathrm{viol}} \qquad (1)$$

where each $\lambda$ is a scalar weight on the corresponding term.

Training details

Pre-training objective. The antibody vocabulary comprises 33 tokens, following the ESM-1b configuration [6]. The parameters of BALM are initialized from the ESM-2 protein language model checkpoint with 150M parameters, as released on Hugging Face. Our antibody-specific language model employs the MLM objective [31] to predict masked amino acids. The MLM loss is defined as

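$$\mathcal{L}_{\mathrm{MLM}} = \mathbb{E}_{x \sim X}\, \mathbb{E}_{M} \sum_{i \in M} -\log p\left(x_i \mid x_{\setminus M}\right) \qquad (2)$$

Here, $x$ is a sequence drawn from the training set $X$, $M$ is the set of masked positions and $x_{\setminus M}$ is the sequence with those positions masked.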

The pre-training objective minimizes the negative log-likelihood of the actual residue, given the masked sequence. We generally adopted the BERT setting [31] to process selected tokens. Based on a predefined probability distribution of amino acids, 80% of masked tokens were replaced by <mask>, 10% by random amino acids and the remaining 10% of masked amino acids were left unchanged.
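The 80%/10%/10% treatment of selected tokens can be sketched as follows. The function name `corrupt` and the example sequence are illustrative, and drawing the random replacement uniformly from the 20 standard amino acids is an assumption; the text indicates that a predefined amino acid distribution is involved but does not give it.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK_TOKEN = "<mask>"

def corrupt(tokens, selected, rng):
    """Apply BERT-style 80/10/10 corruption to the selected positions.

    Assumption: random replacements are drawn uniformly from the 20
    standard amino acids.
    """
    corrupted = list(tokens)
    for i in selected:
        draw = rng.random()
        if draw < 0.8:
            corrupted[i] = MASK_TOKEN               # 80%: replace with <mask>
        elif draw < 0.9:
            corrupted[i] = rng.choice(AMINO_ACIDS)  # 10%: random amino acid
        # remaining 10%: keep the original residue (it still contributes to the loss)
    return corrupted

rng = random.Random(0)
tokens = list("QVQLVQSGAEVKKPGA")
selected = [2, 5, 9]
print(corrupt(tokens, selected, rng))
```

Positions outside `selected` are never altered; the model is scored only on the selected positions, including those left unchanged.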

Pre-trained language models typically mask a certain percentage of tokens. The MLM performance depends on the proportion of masked tokens during the pre-training process. An appropriate mask ratio is crucial for ensuring the MLM effectively learns meaningful contextual representation from the corpus. Amino acid variability in antibody sequences differs between framework regions and CDRs, as illustrated in Fig. 1C. Predicting a conserved amino acid that has been masked is more straightforward than predicting a non-conserved one. Consequently, a constant masking probability distribution might hinder the extraction of CDR mutation knowledge from the antibody sequence. If masked positions are uniformly sampled, the relatively low loss on conserved locations would reduce the overall loss during the pre-training process. Despite lower MLM loss in conserved regions, this does not necessarily mean the model effectively captures essential antibody properties. Thus, masking different positions with varying probabilities is required.

Positions with a greater variety of categories in Fig. 1C should have higher masking probabilities than those with fewer categories. To align the mask probability distribution with the training corpus, we proposed a bio-inspired entropy mask that masks residues based on the entropy of amino acid distribution at each position. The mask ratio is rescaled to maintain a 15% ensemble mask ratio. The entropy is calculated as

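$$H_j = -\sum_{a \in \mathcal{A}} p_j(a) \log p_j(a) \qquad (3)$$

where $p_j(a)$ is the frequency of amino acid $a$ at position $j$ and $\mathcal{A}$ is the set of amino acids.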

The entropy mask strategy enables the model to focus on learning patterns in both conserved and variable regions, enhancing its ability to capture the diverse properties of antibody sequences. In highly conserved sites with low computed entropy, the mask ratio may be extremely low, approaching zero. Considering the vital biological evolutionary information within these conserved regions, we raised the mask probability to 10% for every position whose computed mask probability fell below 10%. This adjustment ensures that the model can learn the broad biological properties of antibodies through amino acid restoration. We also set the mask probability to 20% for insertions in CDR 3 not included in Fig. 1C. As a result of these biologically motivated adjustments, the ensemble mask rate ranges from 17% to 20%, varying according to the length of sequences (the complete masking probability for each position is presented in Supplementary Figure 4).
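The entropy-based masking schedule described above can be sketched as below. This is an illustration under stated assumptions rather than the authors’ implementation: mask probabilities are taken to be directly proportional to per-position Shannon entropy (natural logarithm), rescaled to a 15% mean, then floored at 10%. The function names and the three toy alignment columns are hypothetical.

```python
import math
from collections import Counter

def position_entropy(column):
    """Shannon entropy (natural log) of the residues observed at one position."""
    counts = Counter(column)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def entropy_mask_probs(columns, target=0.15, floor=0.10):
    """Per-position mask probabilities proportional to entropy.

    Probabilities are rescaled so that their mean equals `target`; any
    position below `floor` is then raised to `floor` (and capped at 1.0).
    Assumes at least one column is not fully conserved.
    """
    entropies = [position_entropy(col) for col in columns]
    mean_entropy = sum(entropies) / len(entropies)
    scaled = [target * h / mean_entropy for h in entropies]
    return [min(1.0, max(p, floor)) for p in scaled]

# Toy alignment columns: fully conserved, two residues, four residues.
columns = ["CCCC", "ARAR", "DGYW"]
probs = entropy_mask_probs(columns)
print([round(p, 2) for p in probs])       # [0.1, 0.15, 0.3]
print(round(sum(probs) / len(probs), 3))  # 0.183
```

In this toy case the floor lifts the mean mask rate from 15% to about 18%, which mirrors why the overall rate reported in the text ends up above the nominal 15%.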

Evolutionary velocity of antibody language model

The vector field describing how an immune repertoire responds to a specific antigen is regarded as the evolutionary velocity of the antibodies. Based on pseudo log-likelihoods [32], the evo-velocity [33] score between sequence $x^{(a)}$ and sequence $x^{(b)}$ is computed as

$$v_{ab} = \frac{1}{|M|} \sum_{i \in M} \left[ \log p\left(x^{(b)}_i \mid x^{(a)}_{\setminus i}\right) - \log p\left(x^{(a)}_i \mid x^{(b)}_{\setminus i}\right) \right] \qquad (4)$$

Here, $M$ represents the set of positions at which the two sequences differ after pairwise sequence alignment, and $x_{\setminus i}$ indicates the representation of a sequence excluding the residue at position $i$.
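The score can be sketched in code, assuming the evo-velocity form of [33] in which log-likelihood differences are averaged over the positions where the two aligned sequences differ. The callable `log_prob` is a hypothetical stand-in for the language model, and `toy_log_prob` is a deliberately trivial placeholder so that the example runs without a trained model.

```python
import math

def evo_velocity(seq_a, seq_b, log_prob):
    """Score for the step seq_a -> seq_b over the positions where they differ.

    `log_prob(residue, position, context)` stands in for the language model:
    the log-probability of `residue` at `position` given `context` with that
    position hidden. Sequences must already be aligned to equal length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    differing = [i for i, (a, b) in enumerate(zip(seq_a, seq_b)) if a != b]
    if not differing:
        return 0.0
    total = sum(
        log_prob(seq_b[i], i, seq_a) - log_prob(seq_a[i], i, seq_b)
        for i in differing
    )
    return total / len(differing)

def toy_log_prob(residue, position, context):
    # Toy stand-in: prefers 'A' at every position and ignores the context.
    return math.log(0.5) if residue == "A" else math.log(0.5 / 19)

score = evo_velocity("QVG", "QAG", toy_log_prob)
print(round(score, 3))  # 2.944
```

A positive score means the model regards the residues of the second sequence as more likely than those of the first at the differing positions, which is what defines the arrow direction between neighboring sequences.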

Results

Leveraging biological features to improve antibody function and structure predictions. We trained a bio-inspired, antibody-specific language model, BALM, with 150 million parameters on 336 million unlabeled sequences (Fig. 1A). BALM utilizes a transformer-based self-attention mechanism and incorporates a novel antibody positional encoding method (Fig. 1B). Residues were masked according to the entropy distribution of their positions, and BALM was trained to predict the masked residues (Fig. 1C). All antibody sequences were sourced from the OAS [15] dataset and clustered using LinClust [34] at 40% sequence identity. BALM effectively captures meaningful contextual embeddings of antibody biological features to infer binding function. Leveraging the learned representations of the pre-trained language model, we developed BALMFold as an end-to-end, atomic-level structure prediction algorithm that operates on single sequences. To address the challenges of limited homology and the scarcity of structural templates for antibodies, BALMFold employs BALM in conjunction with antibody domain knowledge and a training dataset comprising 2371 paired and 805 single-chain antibody structures with resolution better than 3 Å from SAbDab [35] (before 1 July 2021).

To optimally leverage the characteristics inherent in antibodies, we pre-trained BALM with antibody domain knowledge. Unlike general protein sequences, antibody sequences are organized into distinct regions with different functions. The CDR H3 loop is a highly variable region that directly interacts with antigens and is a critical component of the antibody sequence, determining the specificity and diversity of the antibody repertoire. The frameworks represent the relatively conserved segments of variable domains surrounding the CDRs.

As shown by the fading color gradations from CDR H3 to FR 4 in Fig. 1C, CDR H3 is remarkably variable, as reflected in its many-colored segments. In addition, the right side of Fig. 1C shows that the entropy of CDR H3 (Entropy = 37.1) is about seven times that of FR 4 (Entropy = 5.4). In contrast to conventional absolute positional encoding, BALM introduces different gaps into distinct antibody sequences based on functional regions. The position ID for each residue is assigned using the antigen receptor numbering and receptor classification (ANARCI) tool [36], following the IMGT [29] numbering scheme. To evaluate how well antibody functional properties are encoded in the language model representation, we conducted experiments on four downstream tasks related to antigen-binding properties. These tasks encompass antigen binding [37], paratope [13], redundancy prediction [15] during immune response, and binding affinity [38–40] between wild-type and mutant sequences.

The language model addresses the challenge of limited training samples in antibody 3D structure prediction. For rapidly evolving proteins such as antibody domains, MSA-based and template-based approaches have proved unsuitable and time-consuming. The lack of evolutionary information from sequences may impair the performance of evolution-based methods such as AlphaFold2 [11] and RoseTTAFold [12]. To capture the inherent evolutionary information within antibody sequences, we leveraged distinct biological features across various antibody domains using single sequences as input, rather than relying on explicit homologous sequences. Integrating the potent language model BALM, BALMFold comprises two additional modules: BAformer and the structure module. BAformer updates single and pair representations based on BALM’s meaningful representations, while the IPA operates in 3D space to generate relative rotations and translations. The structure module produces full-atom 3D coordinate predictions. We compared BALMFold with several recent methods [11, 23–26] in antibody structure prediction on a comparable nonredundant benchmark of 197 paired antibodies and 71 nanobodies, in alignment with IgFold [23].

BALM learns biological representation from unlabeled antibody sequences. Uncovering inherent patterns and properties is crucial for comprehending the specificity and function of antibodies. Recent evidence suggests that language models possess a potent capability for capturing semantic information encoded within input amino acid sequences, thereby facilitating the prediction of structure and function [8, 13, 26]. To examine the representation learned by BALM during pre-training, we projected 60 000 antibody sequences from the OAS dataset, selected randomly but with an equal number of sequences representing each species. We utilized the uniform manifold approximation and projection (UMAP) algorithm [41] to map the 640-dimensional final layer embeddings of these sequences into a two-dimensional space (Fig. 2A and B). Despite the absence of additional biological information besides antibody numbering, the learned representations effectively separate species and V-gene families, outperforming both ESM-2 and AntiBERTy. In particular, BALM demonstrates superior capacity in differentiating human and mouse sequences compared with these two baseline models. Regarding variable gene families, the model might encounter challenges in their differentiation due to the insufficiency of sequence data available for each V-gene family. Despite this, BALM’s overall embedding is still reasonably proficient at grouping V-gene families.

Figure 2.

Figure 2

Representation of BALM encoding rich biological insights.  A, UMAP representation of species. A selection of 60 000 sequences from six species was evenly extracted from the OAS dataset. The final hidden layer of BALM was projected into a two-dimensional space using UMAP. The points are labeled according to diverse species, and the sequence embeddings from BALM are grouped by species. The delineation between various species is more pronounced compared with two other benchmark baselines. B, UMAP representation of V-gene family. Using the same dataset of 60 000 sequences, the points are labeled according to distinct V-gene families. The representation of BALM reveals clustering within the V-gene families. C, UMAP representation of evolutionary velocity. The arrows denote the direction of evolutionary velocity. In the top figures, points are annotated based on the Levenshtein distance from corresponding germlines to sequences. In the lower figures, HIV and SARS-CoV-2 patients exhibiting different V-gene family counts are shown. Specifically, donor RU3 and N152 are HIV patients predominantly exhibiting one and two V-gene families, respectively. Subject Q is a SARS-CoV-2 patient with multiple V-gene families. D, t-SNE mapping of amino acid biochemical properties. The last hidden layer embeddings of both pre-trained (lower) and untrained (upper) BALM are projected into a two-dimensional plane using t-SNE. Each point represents an amino acid labeled with its biochemical properties. For the pre-trained BALM, residues cluster according to properties such as charge, aromaticity and aliphatic nature. In contrast, the no-pretrain embedding space shows an irregular distribution instead of fine-grained discrepancies in biophysical attributes.

Antibodies possess amino acid composition and distribution that share some similarities with common proteins, yet they exhibit unique biochemical properties owing to their specialized function in the immune system. The presence of charged, hydrophobic, aromatic and polar residues contributes to the overall structure, stability and function of antibodies, particularly in the CDRs crucial for antigen recognition. The 640-dimensional final hidden layer embedding of BALM is projected into two dimensions using t-SNE [42]. When compared with random weights, pre-trained BALM reveals distinct clustering patterns of antibody residues based on different biochemical properties (Fig. 2D).

BALM learns antibody mutation trajectories from germline. Pharmaceutical development necessitates comprehensive insights into mutation trajectories and immune responses. Diverse repertoires arise through the process of somatic recombination, containing various stages of affinity maturation with distinct trajectories. During B lymphocyte development, VDJ gene segments are randomly selected to create a functional VDJ exon. Different V gene segments, belonging to distinct families, encode the variable region of the heavy chain, resulting in a vast diversity of antigen-binding sites. In the adaptive immune system, V-gene segments contribute to varying levels of affinity maturation and specificity for foreign substances.

Taking inspiration from the protein evolution analysis in [33], the observation of locally optimal traversal in the antibody landscape can provide an intuitive explanation for mutation strategies. Especially when dealing with multiple positional mutations beyond germline, language models learn broader high-dimensional representations of antibody sequences rather than individual changes. To construct a visual representation of mutation trajectories derived from germline sequences, we collated three complete immune repertoires from individuals with the IGHG isotype. The repertoires of RU3 [44] and N152 [45] consist of 77 271 and 102 213 HIV-specific unique sequences, respectively, while that of subject-Q [46] comprises 60 026 unique sequences targeted against severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). We projected the last hidden layer of the language model for these sequences into two dimensions using UMAP (Fig. 2C); each node in the graph represents one antibody sequence. The sequences are then connected with nearby sequences using the $k$-nearest neighbors method. The shade of nodes depicts the Levenshtein distance between sequences and germline sequences, while the direction of arrows illustrates the affinity maturation trajectories of immune repertoires in different V-gene groups.
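For reference, the Levenshtein (edit) distance used to shade the nodes can be computed with the standard dynamic-programming recurrence, sketched here in plain Python. The example sequences are made up for illustration and are not taken from the repertoires above.

```python
def levenshtein(a, b):
    """Minimum number of single-residue insertions, deletions and
    substitutions needed to turn sequence `a` into sequence `b`."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]

germline = "CARDYW"   # hypothetical germline fragment
mutated = "CAKDW"     # one substitution (R->K) and one deletion (Y)
print(levenshtein(germline, mutated))  # 2
```

Larger distances correspond to sequences that have accumulated more changes relative to their germline.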

The language model uncovers the manifold features of affinity maturation for individual V-gene segments. Leveraging the inherent properties of antibodies, the language model effectively learns about variable regions and the distinct contributions of various V-gene segments to antibody responses. Despite the expectation of mutations favoring higher likelihood sequences in immune responses, we observe a directionality of evolution toward sequences that are closer to germline sequences for each cluster. The trajectory of sequences belonging to the same V-gene group reveals the affinity maturation of antibodies. The direction of evolutionary velocity indicates that the pre-trained language model tends to predict missing residues closer to low edit distance from germline rather than highly mutated sequences. Over time, as antibodies diverge from the original germline sequence, repertoires contain a greater number of sequences that are close to the germline during the affinity maturation process (see Supplementary Figure 5).

BALM learns binding properties from large-scale antibody sequences. The downstream functional tasks include predicting and characterizing multiple facets of antibody function. Four specific tasks were evaluated for BALM, addressing essential questions related to antibodies: (1) antigen-binding capacity [37, 43]; (2) binding site locations [13]; (3) redundancy in immune responses to antigens [15]; and (4) antigen affinity [38–40]. Rapid and precise prediction of specificity and affinity to target molecules is essential for developing novel therapies and understanding the immune system. Evaluation metrics comprise F1 score, Matthews’ correlation coefficient (MCC), area under the receiver operating characteristic curve (AUC), average precision scores with recall (APR) and Pearson’s correlation coefficient.
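Two of the classification metrics listed above, the F1 score and MCC, follow directly from confusion-matrix counts. The sketch below uses the standard definitions; the label vectors are invented for illustration and do not come from any of the benchmark datasets.

```python
import math

def f1_and_mcc(y_true, y_pred):
    """Return (F1, MCC) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0
    return f1, mcc

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
f1, mcc = f1_and_mcc(y_true, y_pred)
print(f1, mcc)  # 0.75 0.5
```

MCC uses all four confusion-matrix cells, which makes it a useful complement to F1 when the binding and non-binding classes are imbalanced.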

Optimizing therapeutic antibodies necessitates determining their binding specificity with target antigens. The prediction of antigen binding is a binary sequence-level classification task that involves interactions with binding targets, where the CDR H3 region plays a pivotal role. To investigate the interaction between the target antigen human epidermal growth factor receptor 2 (HER2) and the clinically approved wild-type antibody trastuzumab, we compiled a dataset of antibody-expressing sequences that replace the wild-type trastuzumab sequence with variant heavy chain CDR H3 fragments, as obtained from [37, 43]. The antigen-binding dataset, derived from one germline sequence, comprises 21 612 sequences and follows a 70%/15%/15% split. Despite the high similarity in amino acid composition per position between binding and non-binding sequences, BALM achieves an APR of 92.9 and outperforms other language models across nearly all metrics. Compared with BALM without pre-training, which starts with randomly initialized weights, BALM leverages inherent antibody features and delivers results comparable with other pre-trained models (Table 1), showcasing its robust predictive capabilities in specificity.
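As an illustration of the 70%/15%/15% partition, the sketch below splits 21 612 indices into three disjoint sets. The random shuffling, the seed and the rounding rule (remainder assigned to the test set) are assumptions; the text does not specify how the split was drawn.

```python
import random

def split_indices(n, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/validation/test lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)   # assumed: a seeded random split
    n_train = int(n * fractions[0])
    n_valid = int(n * fractions[1])
    train = indices[:n_train]
    valid = indices[n_train:n_train + n_valid]
    test = indices[n_train + n_valid:]     # remainder goes to the test set
    return train, valid, test

train, valid, test = split_indices(21612)
sizes = (len(train), len(valid), len(test))
```

Because all sequences derive from one germline and differ only in CDR H3, a random split of this kind places highly similar sequences in all three sets.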

Table 1.

Comparison of language models in predicting antibody binding characteristics. The mean values and the standard deviations for four assessment metrics are presented, with the standard deviation computed after three repetitions under identical configurations, excluding random seed variations. ESM-2 (containing 150M parameters) and ESM-1b (with 650M parameters) are included for comparison. The performance metrics of EATLM on antigen binding are derived from [43], and average values of ProtBERT and AntiBERTa for paratope prediction are sourced from [13]. ‘BALM w/o PT’ denotes a model that uses the architecture of BALM but does not use the weights learned from pre-training on the large dataset of antibody sequences. Despite not being explicitly trained on antibody sequences and residue classification tasks, BALM achieves state-of-the-art performance in all evaluation metrics across the three tasks, except for the F1 score in the binding prediction task.

Task Model F1 MCC AUC APR
Binding: BALM w/o PT, AntiBERTy, AbLang-H, EATLM (three values only), ESM-2, ESM-1b, BALM
Paratope: BALM w/o PT, ProtBERT, AntiBERTa, AntiBERTy, AbLang-H, ESM-2, ESM-1b, BALM
Redundancy: BALM w/o PT, AntiBERTy, AbLang-H, ESM-2, ESM-1b, BALM
(The numerical entries, reported as mean and standard deviation, were rendered as inline images and are not available in this text version.)

Paratopes, the antibody binding sites that interact with antigens, are critical for understanding the binding mechanism and optimizing antibody binding properties in structural immunology. Predicting paratopes is a token-level binary classification task, in which a binding probability is predicted for each position. BALM demonstrates superior performance compared with existing methods and outperforms AntiBERTa by 2 points on APR. Comparing the accuracy across various regions, BALM consistently outperforms the two baseline models on all CDRs, as depicted in Fig. 3C. Interestingly, even without pre-training, BALM with evolutionary position embedding achieves superior performance compared with the protein language model ProtBERT [8].
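Because paratope prediction assigns a probability to every position, ranking-based metrics such as average precision apply naturally at the residue level. The sketch below uses one common definition of average precision (the mean of the precision values at the rank of each true positive) on made-up per-residue scores; whether the APR values reported here use exactly this estimator is an assumption.

```python
def average_precision(labels, scores):
    """Mean precision at the rank of each positive, ranking by descending score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    hits = 0
    total = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            total += hits / rank   # precision at this recall level
    return total / n_pos

# Made-up example: 1 marks a paratope residue, scores are predicted probabilities.
labels = [0, 1, 1, 0, 0, 1]
scores = [0.1, 0.9, 0.4, 0.6, 0.2, 0.8]
ap = average_precision(labels, scores)
```

Here the three positives are ranked first, second and fourth, giving precisions of 1/1, 2/2 and 3/4 and an average of 11/12.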

Figure 3.


Efficient enhancement of antibody engineering by BALM. A, Self-attention map for the CDR H3 of the heavy chain of the SARS-CoV-2 therapeutic antibody Tixagevimab, derived from one attention head in the last hidden layer of BALM. The horizontal and vertical axes represent the residues in the CDR H3 and the positions according to IMGT numbering, respectively. More intense cell coloration indicates higher attention weights and stronger contextual correlations between the associated residues. B, A crystallographic structure (PDB: 8D8R) depicting the interactions mediated by hydrogen and disulfide bonds between the two cysteine residues at positions 109 and 112A in Tixagevimab’s heavy chain, the pair highlighted in A. C, Boxplot of paratope prediction accuracy across CDRs and the framework region. Each boxplot consists of a box that encloses the interquartile range, with a line at the median accuracy, and outliers depicted as individual points. ‘Fr’ represents the framework region in sequences. D, Average precision recall gain from pre-training across five selected methods compared with the baseline BALM without pre-training. The top line represents the value for BALM without pre-training. ‘PT’ stands for pre-training. E, Comparison of affinity maturation prediction, assessed through Pearson’s r, with other language models. Each error bar represents the standard deviation. AntiBERTy with 26M parameters was fine-tuned for comparison.

Antibody redundancy, characterized by the capacity of multiple antibodies to recognize and bind to the same antigen, is a significant facet of our immune response. During somatic hypermutation, antibody genes can undergo a plethora of mutations, engendering a broad range of antibodies with varying antigen specificities. This variability enhances the immune system’s ability to identify and eliminate pathogens. High redundancy in neutralizing antibody responses represents a robust mechanism to withstand mutations, underscoring redundancy’s importance in enabling the immune system to efficiently neutralize a diverse array of antigens [47]. After applying the preprocessing procedures, we derived a final dataset comprising 31 718 sequences labeled as high or low. BALM achieved an impressive APR score of 76.1, outpacing the second-best model by a margin of 10 points (Table 1). Even without pre-training, BALM with randomly assigned weights demonstrated comparable performance with ESM-2. In addition, the augmentation in APR scores attributed to pre-training intensifies as tasks become more tightly linked with antibodies (Fig. 3D).

The binding affinity of an antibody is a multifaceted attribute, influenced by the antibody’s variable regions, external features of the epitope and environmental determinants. Elevated affinity is indicative of enhanced antigen neutralization and initiation of downstream immune responses. The binding free-energy change (ΔΔG) quantifies the stability alterations resulting from mutations in protein sequences, comparing wild-type and mutant sequences. This value reflects how a mutation changes the free-energy difference between the bound and unbound states of the complex. Our binding affinity dataset, an antibody subset derived from the structural kinetic and energetic database of mutant protein interactions (SKEMPI) V2.0 [38], encompasses experimentally determined binding free-energy changes upon mutation for protein–protein interactions. We computed Pearson’s r between the experimental and predicted ΔΔG values using our model and other benchmark models. As depicted in Fig. 3E, BALM’s performance surpasses that of ESM-2 and AntiBERTy in terms of Pearson’s r. Notably, without pre-training, BALM achieved an average score of 72.2, outpacing the scores of ESM-2 by 5.6 and pre-trained AntiBERTy by 11.7. Compared with BERT-based models of similar parameter count, such as ESM-2, BALM equipped with antibody-specific positional embedding achieves a superior Pearson’s r.
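For concreteness, Pearson’s r between experimental and predicted free-energy changes can be computed as in the following minimal sketch. The values are invented for illustration and are not drawn from SKEMPI.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical free-energy changes upon mutation (e.g. in kcal/mol).
experimental = [0.5, 1.2, -0.3, 2.0]
predicted = [0.4, 1.0, 0.1, 1.7]
r = pearson_r(experimental, predicted)
```

Pearson’s r measures linear agreement only, so a model can score highly while being systematically offset or rescaled relative to the experimental values.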

BALM encodes antibody structural representations. The relative paucity of experimentally validated antibody structures (approximately a thousand) presents a substantial impediment to the precise prediction of antibody structures. BALM distills patterns from extensive sequence datasets to counterbalance this structural information deficit. The multi-head self-attention mechanisms integral to the pre-trained antibody language model are adept at capturing the interdependencies among positions within the antibody molecule. As a practical demonstration of BALM’s proficiency in learning pairwise interaction patterns, we use Tixagevimab, a recently approved SARS-CoV-2 therapeutic monoclonal antibody, as the example input. Figure 3A shows one attention head of BALM’s last hidden layer, with high attention scores serving as indicators of significant contextual associations among the residues. The heatmap of CDR H3 reveals a pair of cysteine residues at positions 109 and 112A, distinguishable by their darkest hue. Correspondingly, the crystal structure presented in Fig. 3B shows that these cysteine residues are linked via two interactions, a hydrogen bond and a covalent disulfide bond. The contacts inferred from the heatmap offer an intuitive way of implicitly defining antibody structure. Significantly, BALM is able to discern this structural information from the sequence of Tixagevimab’s heavy chain alone, making it valuable for understanding the three-dimensional structure of the antibody.
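Reading a candidate contact off an attention map amounts to locating its largest off-diagonal entry. The sketch below shows this on a toy matrix; the attention values, the position labels and the symmetrization step are illustrative assumptions, not output from or a procedure of BALM.

```python
import numpy as np

def strongest_pair(attention: np.ndarray, labels: list) -> tuple:
    """Return the labels of the residue pair with the highest symmetrized attention."""
    sym = (attention + attention.T) / 2.0   # attention is not symmetric in general
    np.fill_diagonal(sym, -np.inf)          # ignore self-attention
    i, j = np.unravel_index(np.argmax(sym), sym.shape)
    i, j = sorted((int(i), int(j)))
    return labels[i], labels[j]

# Hypothetical IMGT position labels for a short stretch of CDR H3.
positions = ["108", "109", "110", "111", "112A", "112"]

# Toy attention map: uniform background with one strongly attended pair.
attn = np.full((6, 6), 0.1)
attn[1, 4] = 0.9
attn[4, 1] = 0.7

pair = strongest_pair(attn, positions)
```

In practice the map would come from a specific head and layer of the model, and high attention is suggestive of a contact rather than proof of one.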

BALMFold accurately and efficiently predicts structures on an antibody benchmark from single sequences. Although AlphaFold2 has attained significant accuracy in predicting protein structures, reliable antibody structure prediction remains difficult owing to high variability and the scarcity of informative MSAs. For predicting antibody structures from sequences, the language model is a critical element for extracting sequence embeddings. As shown above, BALM captures the inherent biological characteristics of antibodies. Derived from BALM, we introduce BALMFold, a novel tool for the precise prediction of full-atom antibody structures, with a particular focus on CDR loops. BALMFold was evaluated on the same benchmark dataset as IgFold, comprising 197 paired antibodies and 71 nanobodies from SAbDab [35] spanning July 2021 to September 2022. Note that no similarity cutoff is applied between the training set and the benchmark dataset; the benchmark structures were made public after the training cutoff date in July 2021. BALMFold was compared with a series of deep learning-based structure prediction methods [11, 23–26].

We evaluated root-mean-square deviation (RMSD) values relative to experimentally determined structures. Following optimal superimposition onto the corresponding chain of the experimental structure, the RMSD values were calculated separately for each region according to Chothia numbering. The calculation covered all backbone heavy atoms in the CDR loops and frameworks. In the radar graphs of Fig. 4A and B, smaller enclosed regions correspond to lower RMSD values, indicating superior performance. BALMFold achieves lower average RMSD values than all other approaches across all CDRs and frameworks for both paired and single-chain antibodies. In addition, the orientational coordinate distance (OCD) [48] is employed to assess the relative orientation of the heavy and light chains in paired antibodies (Table 2). Across the 197 paired antibodies, BALMFold achieved the lowest average RMSD on the CDR H3 loop, 3.05 Å (Table 2). For the additional set of 71 nanobodies, BALMFold achieved an average of 3.69 Å on the CDR 3 loop, whereas the other four models were at 4.00 Å or above (Table 3). Without explicitly considering inter-chain interactions, our method surpasses AlphaFold2-Multimer on CDR H3 (3.05 Å versus 3.56 Å average RMSD; Table 2). Unlike AlphaFold2, BALMFold avoids the laborious process of searching for MSAs and templates, using the pre-trained BALM to extract structural and evolutionary features directly from sequences. Compared with methods that use a structure module similar to that of AlphaFold2, our language model shows strong antibody structure prediction from antibody sequences alone. Balancing high-quality predictions for both nanobodies and paired antibodies is a common challenge among existing structure prediction approaches. While IgFold and the multimer version of AlphaFold2 perform well on paired antibodies, they falter with nanobodies.
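The superimpose-then-measure procedure can be sketched with the Kabsch algorithm, which finds the rigid rotation minimizing the RMSD between two coordinate sets. The function names, the choice of fitting on all atoms of the chain and the toy coordinates below are assumptions for illustration; this is not the evaluation code used for the benchmark.

```python
import numpy as np

def superpose(pred: np.ndarray, ref: np.ndarray, fit_idx=None) -> np.ndarray:
    """Rigidly transform pred (N, 3) to best fit ref (N, 3) on the atoms in fit_idx."""
    if fit_idx is None:
        fit_idx = np.arange(len(pred))       # assumed: fit on the whole chain
    p_center = pred[fit_idx].mean(axis=0)
    q_center = ref[fit_idx].mean(axis=0)
    p = pred[fit_idx] - p_center
    q = ref[fit_idx] - q_center
    u, _, vt = np.linalg.svd(p.T @ q)        # SVD of the 3x3 covariance matrix
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T  # proper rotation, no reflection
    return (pred - p_center) @ rot.T + q_center

def region_rmsd(aligned: np.ndarray, ref: np.ndarray, region_idx) -> float:
    """RMSD over the atoms of one region after the global superposition."""
    diff = aligned[region_idx] - ref[region_idx]
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy check: a rotated and translated copy of the reference should give RMSD ~ 0.
rng = np.random.default_rng(0)
ref = rng.normal(size=(10, 3))
angle = np.deg2rad(30.0)
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle),  np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
pred = ref @ rot_z.T + np.array([1.0, 2.0, 3.0])
aligned = superpose(pred, ref)
loop_rmsd = region_rmsd(aligned, ref, [2, 3, 4])
```

Fitting on the whole chain and then measuring a loop separately means that a loop RMSD reflects both the loop's internal conformation and its placement relative to the rest of the chain.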

Figure 4.


BALMFold accurately predicts antibody structure from sequences. A, Mean RMSD performance on the IgFold benchmark for 197 paired antibodies, considering eight regions. The label AlphaFold2-M refers to the multimer variant of AlphaFold2. Higher structure prediction accuracy corresponds to smaller enclosed areas in the figure, reflecting lower RMSD values. B, Mean RMSD performance on the IgFold benchmark for 71 nanobodies, spanning four regions. C, Comparative analysis of average structure prediction runtimes. D, Head-to-head comparison of RMSD values for the CDR H3 loop between BALMFold and OmegaFold for nanobodies. Targets falling into the upper-left triangle are those for which BALMFold has a lower RMSD than the alternative method. E, Visualization of nanobody native experimental structure (gray) and predictions by BALMFold (green), IgFold (red) and OmegaFold (orange) for target 7DJY and target 7Z1B, respectively. The CDR H3 loops are highlighted. F, Head-to-head comparison of RMSD values for the CDR H3 loop between BALMFold and IgFold for paired antibodies. G, Visualization of paired antibody native experimental structure (gray) and predictions by BALMFold (green), IgFold (red) and AlphaFold2-M (blue) for target 7S5Q and target 7S8H, respectively. H, Illustration of the correlation between pLDDT and lDDT for the Cα atom, with Pearson’s r and two-sided t-test P-value provided. The shaded region corresponds to the 95% confidence interval. I, Scatter plots with regression lines between pLDDT and atomic RMSD values.

Table 2.

Comparison of structure prediction performance of methods with average RMSD scores on paired antibody benchmark

Method OCD CDR H1 CDR H2 CDR H3 FR H CDR L1 CDR L2 CDR L3 FR L
BALMFold 3.31 0.77 0.60 3.05 0.38 0.59 0.37 0.94 0.35
AlphaFold2-M 4.18 0.95 0.74 3.56 0.69 0.84 0.51 1.59 0.66
IgFold 3.84 0.91 0.82 3.42 0.50 0.79 0.51 1.22 0.47
ESMFold - 0.94 0.94 4.41 0.50 1.01 0.53 1.59 0.45
OmegaFold - 0.89 0.72 3.55 0.47 0.73 0.41 1.17 0.41

Table 3.

Comparison of structure prediction performance of methods with average RMSD scores on nanobody benchmark

Method CDR 1 CDR 2 CDR 3 FR
BALMFold 1.50 0.83 3.69 0.53
AlphaFold2 1.61 0.88 4.00 0.57
IgFold 1.80 1.10 4.31 0.63
ESMFold 1.73 0.87 4.78 0.58
OmegaFold 1.79 0.88 4.05 0.57

Computational efficiency is paramount for wide-scale practical applications in the design of antibody therapeutics. On average, BALMFold can predict antibody structures from amino acid sequences within 5 s. In contrast to AlphaFold2 and AlphaFold2-Multimer, which heavily lean on the costly procedure of searching for MSAs and templates, BALMFold employs BALM to extract evolutionary signals, circumventing the need for additional MSAs or templates. This provides for swift and precise antibody structure prediction. For a runtime comparison, the models were run on a single NVIDIA V100 GPU. As shown in Fig. 4C, BALMFold’s inference speed significantly outstrips that of both AlphaFold2 and AlphaFold2-Multimer, is approximately six times faster than IgFold and is comparable with ESMFold.

Accurate prediction on CDR H3 loop. The variability inherent to the structure of the CDR H3 loop imparts antibodies with a profound capacity to recognize and bind to an expansive range of antigens. Nevertheless, this characteristic also complicates the accurate prediction of the CDR H3 loop, presenting a significant challenge in the field.

While five of the CDR loops conform to relatively predictable canonical folds, our contributions extend to all CDR loops with a particular emphasis on the CDR H3 loop. Utilizing the pre-trained BALM, our approach predicted the CDR H3 loop within 2Å RMSD for 70 out of 197 targets in the paired antibodies benchmark, reflecting a 49% improvement over the next best-performing method. Notably, BALMFold achieved a minimum RMSD of 0.34Å on target 7UEN. Furthermore, BALMFold’s predictions consistently demonstrate sub-1Å RMSD on the CDR 3 loop of the light chain. Within the single-chain antibody benchmark, BALMFold achieves the lowest RMSD among the five models on approximately 35% of targets. BALMFold also performed well on hard targets, producing an RMSD of 2.4Å on 7M1H, while alternative models surpassed 5Å.

As expected, increasing the length of the CDR 3 loop resulted in greater conformational diversity, thus compounding the prediction difficulty. We observed that the prediction RMSD rises with the length of the CDR 3, with paired antibodies displaying heightened sensitivity to CDR 3 length compared with single-chain antibodies (Fig. 5). We performed direct comparisons of BALMFold with two competing models, OmegaFold and IgFold, across two domain datasets (Fig. 4D and F). In these scatter plots, points in the upper triangle correspond to targets on which BALMFold has the lower RMSD. As depicted in Fig. 4D, BALMFold surpassed OmegaFold by achieving lower average RMSD values and standard deviation. Against IgFold, BALMFold achieved an average RMSD lower by 0.37Å, with a P-value of 0.07 in a two-sided t-test on paired antibodies (Fig. 4F).

Figure 5.


Scatter plot of RMSD versus CDR H3 length for paired antibodies and nanobodies. Each point corresponds to a single antibody structure predicted by BALMFold. The shaded area represents the 95% confidence interval of the trend.

To assess the contribution of the language model to structure prediction, we conducted an ablation study by substituting BALM with ESM-2, each model possessing 150M parameters. In this experiment, we maintained identical structure modules and training hyperparameters. In the CDR H3 regions, BALMFold equipped with BALM outperforms the ESM-2 variant, with RMSD improvements of 0.45Å and 0.14Å for paired antibodies and nanobodies, respectively (see Supplementary Tables 6 and 7).

We conducted case studies on two domains to evaluate the accuracy of CDR loop prediction (Fig. 4E and G). For the single-chain instance 7DJY, BALMFold vastly outperformed IgFold (0.76Å versus 5.84Å). IgFold’s prediction deviated substantially from the native conformation, whereas BALMFold recovered the loop conformation that IgFold mispredicted. In the case of paired antibody target 7N3H, BALMFold achieved a much lower prediction RMSD (0.43Å) than AlphaFold2-M (3.08Å).

To estimate the predictive reliability of BALMFold, we computed the local distance difference test (lDDT) for each residue’s Cα atom. For the highly diverse CDR 3, we observed a notable correlation between the predicted lDDT (pLDDT) and the actual lDDT (Fig. 4H), with Pearson’s r of 0.68 for nanobodies and 0.67 for paired antibodies. Regression plots for CDR 1 and CDR 2 are provided in Supplementary Figure 6. To relate the confidence in predicted residue–residue distances to structural precision, we further explored the correlation between pLDDT and RMSD values (Fig. 4I). We observed a clear negative correlation, with Pearson’s r values of -0.56 for nanobodies and -0.58 for paired antibodies. The plots for other CDRs are depicted in Supplementary Figure 7. These observations help in identifying regions where predictions are likely to be unreliable and in assessing the overall quality of the predicted antibody structures.
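A per-residue lDDT on Cα atoms compares inter-residue distances in the predicted and experimental structures without any superposition. The sketch below follows the commonly used definition (15 Å inclusion radius in the reference; tolerance thresholds of 0.5, 1, 2 and 4 Å); these parameters and the toy coordinates are assumptions, as the exact settings are not stated here.

```python
import numpy as np

def lddt_ca(pred: np.ndarray, ref: np.ndarray, cutoff: float = 15.0,
            thresholds=(0.5, 1.0, 2.0, 4.0)) -> np.ndarray:
    """Per-residue lDDT in [0, 1] from (N, 3) arrays of C-alpha coordinates."""
    d_ref = np.linalg.norm(ref[:, None] - ref[None, :], axis=-1)
    d_pred = np.linalg.norm(pred[:, None] - pred[None, :], axis=-1)
    # Score only pairs that are close in the reference, excluding self-pairs.
    mask = (d_ref < cutoff) & ~np.eye(len(ref), dtype=bool)
    err = np.abs(d_ref - d_pred)
    # For each pair, the fraction of thresholds within which its distance is preserved.
    preserved = np.mean([err < t for t in thresholds], axis=0)
    counts = mask.sum(axis=1)
    # Residues with no neighbors inside the cutoff receive a score of 0.
    return (preserved * mask).sum(axis=1) / np.maximum(counts, 1)

# Toy example: three residues on a line; the prediction displaces the third by 1.3 Å.
ref = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [7.6, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [3.8, 0.0, 0.0], [8.9, 0.0, 0.0]])
scores = lddt_ca(pred, ref)
```

In the toy case the two distances involving the displaced residue are off by 1.3 Å, so each passes only the 2 Å and 4 Å thresholds, which lowers the score of that residue the most.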

Discussion and conclusion

Language models have demonstrated their capacity to learn inherent biological information from sequences, enabling them to predict functional properties and structures of antibodies [6, 13, 14, 26]. Anfinsen’s dogma posits that the native conformation of a protein is exclusively determined by its amino acid sequence [49]. Antibody-specific tasks often lack effective MSAs and templates for CDRs. Moreover, 40% of antibody repertoires in the OAS dataset are missing the first 15 amino acids [18]. To thoroughly consider the unique CDRs and framework regions of antibody sequences, we propose BALM, trained on 336M antibody sequences, which has proven effective in predicting various antigen-binding properties. We conducted an extensive analysis to uncover the evolutionary trajectories of antibody sequences in response to specific diseases. When combined with a language model, several machine learning approaches have demonstrated their success in predicting protein structures [11, 23–26, 50]. Capitalizing on the meaningful representations of language models, BALMFold is developed to predict structures directly from antibody sequences. Exploiting the biological features of antibodies, BALMFold outperforms IgFold, OmegaFold, AlphaFold2 and ESMFold on a benchmark that contains 268 antibodies. By eliminating the search process for MSAs, BALMFold predicts structures more rapidly than MSA-based and template-based approaches. Accurate and fast prediction of antibody structures is crucial for understanding the interactions between antibodies and their target antigens, even without explicit evolutionary information.

While BALM has achieved impressive results across various tasks, antibody structure prediction with atomic accuracy remains an unsolved challenge, particularly for CDR 3. Incorporating inter-chain interactions into the model could potentially enhance the prediction accuracy of paired antibodies. Drawing inspiration from existing numbering schemes, specific antibody numbering schemes for language models could be developed in the future. Our model size was limited by computational resources; a higher-capacity language model could potentially yield better performance. By incorporating bio-inspired antibody positional embedding and antibody features, analogous generative models hold potential for designing therapeutics in antibody discovery [51]. A biologically motivated approach can effectively extract representations of antibody sequences to advance antibody research and therapeutic development.

Key Points

  • We pre-trained BALM by effectively utilizing the distinctive biological properties inherent in antibody sequences.

  • Biological representations generated by BALM suggest the evolutionary direction of antibodies upon exposure to antigens.

  • We thoroughly evaluated BALM’s performance on various tasks, including antigen-binding prediction, paratope prediction, redundancy prediction during maturation and binding affinity prediction, which demonstrated state-of-the-art performance in each of these domains.

  • Leveraging single sequences as inputs, BALMFold outperforms state-of-the-art approaches in terms of accuracy and efficiency when predicting antibody structures.

Supplementary Material

BALM_SI_bib_final_bbae245

Author Biographies

Hongtai Jing is a PhD candidate at Fudan University.

Zhengtao Gao is a research assistant at Fudan University.

Sheng Xu is a research assistant at Shanghai AI Laboratory.

Tao Shen is a senior AI researcher at Zelixir Biotech.

Zhangzhi Peng is a research assistant at Fudan University.

Shwai He is a research assistant at Fudan University.

Tao You is a research assistant at Fudan University.

Shuang Ye is a professor at Fudan University.

Wei Lin is a professor at Fudan University.

Siqi Sun is a professor at Fudan University.

Contributor Information

Hongtai Jing, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China.

Zhengtao Gao, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China.

Sheng Xu, Shanghai AI Laboratory, Shanghai 200232, China.

Tao Shen, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China; Zelixir Biotech, Shanghai 201206, China.

Zhangzhi Peng, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China.

Shwai He, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China.

Tao You, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China.

Shuang Ye, Department of Gynecologic Oncology, Fudan University Shanghai Cancer Center, Shanghai 200032, China; Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China.

Wei Lin, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China; Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China; MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China; Shanghai AI Laboratory, Shanghai 200232, China; School of Mathematical Sciences and Shanghai Center for Mathematical Sciences, Fudan University, Shanghai 200433, China.

Siqi Sun, Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China; Shanghai AI Laboratory, Shanghai 200232, China.

Funding

This work is supported by Shanghai Artificial Intelligence Laboratory. W.L. is supported by National Natural Science Foundation of China (No. 11925103), Science and Technology Commission of Shanghai Municipality (Grant No. 22JC1402500, 22JC1401402, 20JC1413400, 2021SHZDZX0103, 21511100200 and 22dz1200502) and Innovation Program of Shanghai Municipal Education Commission (Grant No. 2023ZKZD04). S.S. is supported by funds from the Focus Project of AI for Science of Comprehensive Prosperity Plan for Disciplines of Fudan University, Netmind.AI, and Protagolabs Inc. S.Y. is supported by National Natural Science Foundation of China (No. 82373419).

Data availability

The datasets utilized in this project are publicly available. The antibody sequences used for the pre-training of the language model were sourced from the OAS database https://opig.stats.ox.ac.uk/webapps/oas/. The paratope prediction dataset was downloaded from https://github.com/alchemab/antiberta and the antigen-binding prediction dataset was obtained from https://zenodo.org/record/7340488. The affinity prediction task was conducted using SKEMPI 2.0 https://life.bsc.es/pid/skempi2. Antibody structures for the training process were retrieved from the SAbDab database https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/. The target protein data bank (PDB) ids for the antibody structure prediction benchmark were obtained from the Zenodo database https://doi.org/10.5281/zenodo.7677723, with the corresponding experimentally determined structures fetched from the SAbDab dataset.

Code availability

The BALMFold structure prediction server can be reached at https://beamlab-sh.com/models/BALMFold. The inference code for BALM is made available at https://github.com/BEAM-Labs/BALM. The pre-trained weights for BALM can also be obtained from the aforementioned GitHub repository. The pre-training code for the model is derived from HuggingFace https://huggingface.co/facebook/esm2_t30_150M_UR50D. For IMGT numbering, the ANARCI tool is available at https://opig.stats.ox.ac.uk/webapps/newsabdab/sabpred/anarci/.

References

  • 1. Georgiou G, Ippolito GC, Beausang J, et al.  The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol  2014;32(2):158–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 2. Bashford-Rogers RJM, Laura Bergamaschi EF, McKinney DCP, et al.  Analysis of the b cell receptor repertoire in six immune-mediated diseases. Nature  2019;574(7776):122–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 3. Marks C, Deane CM. Antibody h3 structure prediction. Comput Struct Biotechnol J  2017;15:222–31. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 4. Rao R, Bhattacharya N, Thomas N, et al.  Evaluating protein transfer learning with tape. Adv Neural Inf Process Syst  2019;32:9689–701. [PMC free article] [PubMed] [Google Scholar]
  • 5. Madani A, Krause B, Greene ER, et al.  Large language models generate functional protein sequences across diverse families. Nat Biotechnol  2023;41:1099–106. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 6. Rives A, Meier J, Sercu T, et al.  Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci  2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 7. Meier J, Rao R, Verkuil R, et al.  Language models enable zero-shot prediction of the effects of mutations on protein function. Adv Neural Inf Process Syst  2021;34:29287–303. [Google Scholar]
  • 8. Elnaggar A, Heinzinger M, Dallago C, et al.  Prottrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell  2021;44(10):7112–27. [DOI] [PubMed] [Google Scholar]
  • 9. Bao W, Yujian G, Chen B, Huiping Y. Golgi_df: Golgi proteins classification with deep forest. Front Neurosci  2023;17:1197824. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 10. Bao W, Cui Q, Chen B, et al.  Phage_unir_lgbm: phage virion proteins classification with unirep features and lightgbm model. Comput Math Methods Med  2022;2022:1–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 11. Jumper JM, Evans R, Pritzel A, et al.  Highly accurate protein structure prediction with alphafold. Nature  2021;596:583–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 12. Baek M, DiMaio F, Anishchenko I, et al.  Accurate prediction of protein structures and interactions using a three-track neural network. Science  2021;373(6557):871–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 13. Leem J, Mitchell LS, Farmery JHR, et al.  Deciphering the language of antibodies using self-supervised learning. Patterns  2022;3(7):100513. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 14. Ruffolo JA, Gray JJ, Sulam J. Deciphering antibody affinity maturation with language models and weakly supervised learning. NeurIPS Workshop on Machine Learningin Structural Biology. Preprint at arXiv. https://doi.org/10.48550/arXiv.2112.07782 2021. [Google Scholar]
  • 15. Olsen TH, Boyles F, Deane CM. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci  2022;31(1):141–6. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 16. Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns 2024;5(5):100967. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 17. Singh R, Im C, Sorenson T, et al.  Learning the language of antibody hypervariability. bioRxiv  2023;2023. [Google Scholar]
  • 18. Olsen TH, Moal IH, Deane CM. Ablang: an antibody language model for completing antibody sequences. Bioinform Adv  2022;2(1):vbac046. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 19. Schritt D, Li S, Rozewicki J, et al.  Repertoire builder: high-throughput structural modeling of b and t cell receptors. Mol Syst Des Eng  2019;4(4):761–8. [Google Scholar]
  • 20. Leem J, Dunbar J, Georges G, et al.  Abodybuilder: automated antibody structure prediction with data-driven accuracy estimation. MAbs  2016;8(7):1259–68. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 21. Ruffolo JA, Sulam J, Gray JJ. Antibody structure prediction using interpretable deep learning. Patterns  2022;3(2):100406. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 22. Abanades B, Georges G, Bujotzek A, Deane CM. Ablooper: fast accurate antibody cdr loop structure prediction with accuracy estimation. Bioinformatics  2022;38(7):1877–80. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 23. Ruffolo JA, Chu L-S, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun  2023;14(1):2389. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 24. Evans R, O’Neill M, Pritzel A, et al.  Protein complex prediction with alphafold-multimer. bioRxiv  2021;2021–10. [Google Scholar]
  • 25. Ruidong W, Ding F, Wang R, et al.  High-resolution de novo structure prediction from primary sequence. bioRxiv  2022;2022–07. [Google Scholar]
  • 26. Lin Z, Akin H, Rao R, et al.  Evolutionary-scale prediction of atomic-level protein structure with a language model. Science  2023;379(6637):1123–30. [DOI] [PubMed] [Google Scholar]
  • 27. Vaswani A, Shazeer N, Parmar N, et al.  Attention is all you need. Adv Neural Inf Process Syst  2017;30. [Google Scholar]
  • 28. Su J, Lu Y, Pan S, et al.  Roformer: enhanced transformer with rotary position embedding. Neurocomputing 2024;568:127063. [Google Scholar]
  • 29. Lefranc M-P, Pommié C, Ruiz M, et al.  Imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains. Dev Comp Immunol  2003;27(1):55–77. [DOI] [PubMed] [Google Scholar]
  • 30. Mariani V, Biasini M, Barbato A, Schwede T. Lddt: a local superposition-free score for comparing protein structures and models using distance difference tests. Bioinformatics  2013;29(21):2722–8. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 31. Devlin J, Chang M-W, Lee K, Toutanova K. Bert: pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT 2019;4171–86. [Google Scholar]
  • 32. Hsu C, Nisonoff H, Fannjiang C, Listgarten J. Combining evolutionary and assay-labelled data for protein fitness prediction. bioRxiv  2021;2021. [DOI] [PubMed] [Google Scholar]
  • 33. Hie BL, Yang KK, Kim PS. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Syst  2022;13(4):274–85. [DOI] [PubMed] [Google Scholar]
  • 34. Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun  2018;9(1):2542. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 35. Schneider C, Raybould MIJ, Deane CM. Sabdab in the age of biotherapeutics: updates including sabdab-nano, the nanobody structure tracker. Nucleic Acids Res  2022;50(D1):D1368–72. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 36. Dunbar J, Deane CM. Anarci: antigen receptor numbering and receptor classification. Bioinformatics  2016;32(2):298–300. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 37. Mason DM, Friedensohn S, Weber CR, et al.  Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat Biomed Eng  2021;5(6):600–12. [DOI] [PubMed] [Google Scholar]
  • 38. Jankauskaitė J, Jiménez-García B, Dapkūnas J, et al.  Skempi 2.0: an updated benchmark of changes in protein–protein binding energy, kinetics and thermodynamics upon mutation. Bioinformatics  2019;35(3):462–9. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 39. Xiong P, Zhang C, Zheng W, Zhang Y. Bindprofx: assessing mutation-induced binding affinity change by protein interface profiles with pseudo-counts. J Mol Biol  2017;429(3):426–34. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 40. Zhang N, Chen Y, Haoyu L, et al.  Mutabind2: predicting the impacts of single and multiple mutations on protein-protein interactions. Iscience  2020;23(3):100939. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 41. McInnes L, Healy J, Melville J. Umap: uniform manifold approximation and projection for dimension reduction. Journal of Open Source Software 2018;3(29):861. [Google Scholar]
  • 42. Van der Maaten L, Hinton G. Visualizing data using t-sne. J Mach Learn Res  2008;9(11). [Google Scholar]
  • 43. Wang D, Ye F, Zhou H. On pre-training language model for antibody. In: The Eleventh International Conference on Learning Representations  2023.
  • 44. Zhou T, Zhu J, Xueling W, et al.  Multidonor analysis reveals structural elements, genetic determinants, and maturation pathway for hiv-1 neutralization by vrc01-class antibodies. Immunity  2013;39(2):245–58. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 45. Soto C, Gilad Ofek M, Joyce G, et al.  Developmental pathway of the mper-directed hiv-1-neutralizing antibody 10e8. PloS One  2016;11(6):e0157409. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 46. Kim SI, Noh J, Kim S, et al.  Stereotypic neutralizing vh antibodies against sars-cov-2 spike protein receptor binding domain in patients with covid-19 and healthy individuals. Sci Transl Med  2021;13(578):eabd6990. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 47. Barabasi A-L, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nat Rev Genet  2004;5(2):101–13. [DOI] [PubMed] [Google Scholar]
  • 48. Marze NA, Lyskov S, Gray JJ. Improved prediction of antibody vl–vh orientation. Protein Eng Des Sel  2016;29(10):409–18. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 49. Anfinsen CB, Haber E, Sela M, White FH Jr. The kinetics of formation of native ribonuclease during oxidation of the reduced polypeptide chain. Proc Natl Acad Sci  1961;47(9):1309–14. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 50. Chowdhury R, Bouatta N, Biswas S, et al.  Single-sequence protein structure prediction using a language model and deep learning. Nat Biotechnol  2022;40(11):1617–23. [DOI] [PMC free article] [PubMed] [Google Scholar]
  • 51. Shin J-E, Riesselman AJ, Kollasch AW, et al.  Protein design and variant prediction using autoregressive generative models. Nat Commun  2021;12(1):2403. [DOI] [PMC free article] [PubMed] [Google Scholar]

Associated Data


Supplementary Materials

BALM_SI_bib_final_bbae245

Data Availability Statement

The datasets used in this project are publicly available. The antibody sequences used to pre-train the language model were sourced from the OAS database (https://opig.stats.ox.ac.uk/webapps/oas/). The paratope prediction dataset was downloaded from GitHub (https://github.com/alchemab/antiberta), and the antigen-binding prediction dataset was obtained from Zenodo (https://zenodo.org/record/7340488). The affinity prediction task was conducted using SKEMPI 2.0 (https://life.bsc.es/pid/skempi2). Antibody structures for training were retrieved from the SAbDab database (https://opig.stats.ox.ac.uk/webapps/newsabdab/sabdab/). The target Protein Data Bank (PDB) IDs for the antibody structure prediction benchmark were obtained from Zenodo (https://doi.org/10.5281/zenodo.7677723), and the corresponding experimentally determined structures were fetched from SAbDab.


Articles from Briefings in Bioinformatics are provided here courtesy of Oxford University Press
