Abstract
Antibodies are versatile proteins with both the capacity to bind a broad range of targets and a proven track record as some of the most successful therapeutics. However, the development of novel antibody therapeutics is a lengthy and costly process. It is challenging to predict the functional and biophysical properties of antibodies from their amino acid sequence alone, requiring numerous experiments for full characterization. Machine learning, specifically deep representation learning, has emerged as a family of methods that can complement wet lab approaches and accelerate the overall discovery and engineering process. Here, we review advances in antibody sequence representation learning, and how this has improved antibody structure prediction and facilitated antibody optimization. We discuss challenges in the development and implementation of such models, such as the lack of publicly available, well-curated antibody function data and highlight opportunities for improvement. These and future advances in machine learning for antibody sequences have the potential to increase the success rate in developing new therapeutics, resulting in broader access to transformative medicines and improved patient outcomes.
B cells provide immune protection through the production of antibodies, which are soluble proteins that can bind almost any target molecule with high specificity. These proteins begin as B-cell receptors (BCRs) on the surfaces of B cells. Antigen recognition then activates the B cell to produce and secrete its BCR as antibodies (Fig. 1A). In humans, each BCR is made of two heavy-light chain pairs. At the distal end of each chain are the variable domains, VH and VL, which recognize the antigen. The remainder of the antibody, or the constant region, is relatively conserved. Each variable domain has three polymorphic complementarity determining regions (CDRs) that largely form the BCR's binding site, the paratope (Fig. 1B). Much of a BCR's diversity is focused in the third CDR of the heavy chain (CDRH3). Figure 1C shows the sequence and structural alignment of certolizumab (pink), atezolizumab (blue), and ipilimumab (orange), all of which bind different antigens. The structures of the three antibodies align well, apart from the CDRH3; this points to the pivotal role of CDRH3 in antigen recognition.
Figure 1.
Overview of B-cell receptor ([BCR] antibody) structure. (A) BCRs on the surfaces of B cells are secreted as antibodies upon antigen recognition and subsequent B-cell activation. (B) BCRs and antibodies are composed of two heavy-light chain dimers. Within each chain are the CDRs: CDRH1 (yellow), CDRH2 (orange), CDRH3 (pink), CDRL1 (black), CDRL2 (gray), and CDRL3 (green). In most antibodies, the ensemble of the six CDRs largely forms the binding site, or “paratope.” (C) Despite the conserved fold, certolizumab (pink), atezolizumab (blue), and ipilimumab (orange) bind different antigens, and this is reflected by variations in their CDRH3 sequence and structure. The heatmap shows the pairwise root-mean-square deviations (measured in angstroms) between CDRH3 loop structures of the three antibodies.
To cover the huge breadth of antigens, B cells rely on several diversification mechanisms (Georgiou et al. 2014). First, B cells undergo V(D)J recombination to produce their BCRs. From a vast pool of somatically encoded gene segments, one variable (V), one joining (J), and one diversity (D) gene segment are randomly shuffled and joined. Heavy chains recombine V, D, and J segments, while light chains only recombine V and J. The assembly of heavy and light chains introduces a further level of diversity, with some heavy chains being capable of forming multiple different heavy-light chain pairings (Jaffe et al. 2022). Once a B cell is activated following antigen recognition, it triggers somatic hypermutation, which introduces a series of mutations to the BCR sequence to improve selectivity and affinity.
Together, this leads to a comprehensive BCR repertoire with an estimated diversity of ∼10^15 variants (Rees 2020). Understanding an individual's BCR repertoire has proven to be highly valuable for gaining new insights into disease biology (Bashford-Rogers et al. 2019; Galson et al. 2020; Park et al. 2022; Wang et al. 2022a; Yu et al. 2022), building diagnostic tools (Konishi et al. 2019; Zaslavsky et al. 2022), and discovering novel therapeutics (Krawczyk et al. 2019, 2021). Collecting data at the scale of the BCR repertoire has been facilitated by the evolution of modern next-generation sequencing (NGS) platforms and software, which has increased throughput and quality, while reducing the costs of sequencing (Chaudhary and Wesemann 2018).
The most common approach to sequencing the BCR repertoire is “bulk sequencing” of the RNA encoding the BCR heavy chain; it is now possible to recover tens to hundreds of thousands of BCR heavy chain sequences per sample. A related approach is to computationally reconstruct BCR reads from transcriptomics data sets, using methods such as TRUST4 (Song et al. 2021). Both approaches lose the heavy-light chain pairing information needed to construct the full BCR. Single-cell sequencing can provide the sequences of the paired BCR heavy and light chains, but it comes at the cost of throughput. For a more extensive overview of sequencing approaches, we refer the reader to Zheng et al. (2022). There are now several community efforts to collect the growing resource of publicly available BCR repertoire data from bulk and single-cell methods (Corrie et al. 2018; Kovaltsuk et al. 2018). For example, the Observed Antibody Space (OAS) has a catalog of 2.4 billion unpaired BCR sequences and 1.5 million paired BCRs, while the iReceptor database contains 232 million unpaired BCRs.
Acquiring function data at the scale of the BCR repertoire is difficult due to the costly nature of in vitro experimentation. However, the large amount of publicly available data and the underlying biological complexity suggest that antibody sequences are a model system for predictive analyses using machine learning techniques. Indeed, machine learning has been applied to a wide range of tasks related to modeling antibody structure and function (Fig. 2), such as humanization (Prihoda et al. 2022), thermostability (Harmalkar et al. 2023; Nijkamp et al. 2023), and binding site (paratope) prediction (Liberis et al. 2018; Ambrosetti et al. 2020; Del Vecchio et al. 2021; Leem et al. 2022). For all these studies, the key ingredient behind their success is being able to represent antibody sequences in a numeric format that can act as input “features” for machine learning models. However, learning representations of antibody sequences is itself a major challenge; recent advancements in deep learning for natural language processing (NLP) have inspired a new generation of models.
Figure 2.
Applications enabled by machine learning (ML)-enriched antibody sequence representations. ML has been deployed across many parts of the antibody discovery and engineering process. Each of these areas benefits from having a strong foundational model that can represent antibody sequences in a numeric format.
In this work, we focus on the development and application of antibody sequence representation models, and comment on their impact in therapeutic antibody discovery. First, we provide an overview of historical work on learning antibody sequence representations before the widespread adoption of NLP methods. We then describe the cross-pollination of NLP onto general protein representation learning, and how that has inspired training antibody-specific models. Next, we discuss the application of antibody-specific models in two areas: antibody structure prediction and antibody engineering. Finally, we provide a perspective on some of the challenges in the development of such models and provide our thoughts on the future use of antibody language models (ALMs) for antibody discovery.
EVOLUTION OF ANTIBODY SEQUENCE REPRESENTATION MODELS
Antibodies are an atypical subset of sequences in the general protein space. Owing to somatic hypermutation, antibodies can exhibit a great deal of length and sequence variation. This diversity is especially present within the CDRs at the antigen-binding interface. This is the converse of the general protein case, where protein–protein interaction interfaces are generally well conserved and other regions are not as strictly conserved (Esmaielbeiki et al. 2016). These fundamental differences suggest that specialized models may be required; indeed, there is evidence in both bioinformatics (Sippl 1990; Imrie et al. 2018) and NLP (Chalkidis et al. 2020; Xue et al. 2020; Gu et al. 2021) to suggest that domain-specific models can outperform general models in domain-specific tasks.
There is an extensive history of antibody sequence modeling approaches, and we notionally divide this into three “eras” (Fig. 3A). The first era we call the “pre-deep learning era,” which comprises invariant, context-free representations such as Hidden Markov Models (HMMs) (Dunbar and Deane 2016), position-specific substitution matrix (PSSM) probabilities (Wong et al. 2019), physicochemical vectors such as Atchley factors (Townsend et al. 2016), and vectors of k-mer frequencies (Greiff et al. 2017; Weber et al. 2022). Many of these methods were introduced very early in the history of bioinformatics tools; for example, HMMs have been used since the 1990s to predict membrane protein topology (Bystroff and Krogh 2008). However, they were only deployed for antibody-specific tasks in the late 2000s, as the volume of antibody sequence and structural data grew.
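As a concrete illustration of a context-free representation, the sketch below turns a sequence into a fixed-length vector of k-mer frequencies. It is a minimal example under our own assumptions, not the procedure of any cited study: the function name `kmer_frequencies`, the choice of k = 2, and the CDRH3-like fragment are all hypothetical.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def kmer_frequencies(sequence: str, k: int = 2) -> list[float]:
    """Return a fixed-length vector of normalised k-mer frequencies.

    The vector has 20**k entries, one per possible k-mer, in
    lexicographic order of AMINO_ACIDS.
    """
    vocabulary = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    # Count every overlapping window of length k in the sequence.
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    return [counts[kmer] / total if total else 0.0 for kmer in vocabulary]


# Hypothetical CDRH3-like fragment, used only for illustration.
vector = kmer_frequencies("ARDYYGSSYFDY", k=2)
print(len(vector))  # 400
```

Because each k-mer is counted regardless of where it occurs, the two "DY" windows in this fragment contribute identically to the vector; this position-independence is what "context-free" means here.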
Figure 3.
Evolution of antibody sequence representation learning models. (A) We separate the evolution of representation learning approaches into three eras. Some approaches, despite being introduced in an “earlier” era, may still have been used at a later time point. For example, the position-specific substitution matrix (PSSM) is a classic representation from bioinformatics that was still being used in 2020, despite the use of convolutional neural networks in 2018. (B) Taxonomy of transformer architectures for building antibody language models (ALMs). (NLP) Natural language processing, (BCR) B-cell receptor.
The second era of approaches is hallmarked by machine learning models from computer vision, such as convolutional neural networks (Liberis et al. 2018; Konishi et al. 2019; Mason et al. 2021), variational autoencoders (Friedensohn et al. 2020), and generative adversarial networks (Amimeur et al. 2020; Lim et al. 2022). In the present era, inspiration has been drawn from NLP, applying skip-gram models such as word2vec (Chen et al. 2020; Ostrovsky-Berman et al. 2021), recurrent neural networks such as long short-term memory networks (Wollacott et al. 2019; Saka et al. 2021), or gated recurrent unit networks (Akbar et al. 2021). Remarkably, within the last 2 years alone, there has been an explosion of transformer neural networks for antibody sequence representation learning. We show a breakdown of these approaches according to the class of model architecture and the pretraining mechanism (Fig. 3B; Ruffolo et al. 2021; Bachas et al. 2022; Gao et al. 2022, 2023; Leem et al. 2022; Melnyk et al. 2022; Olsen et al. 2022; Prihoda et al. 2022; Shuai et al. 2023; Chen et al. 2023; Chu and Wei 2023; Nijkamp et al. 2023).
The transformer model (Vaswani et al. 2017) and its variants, such as bidirectional encoder representations from transformers (BERTs) (Devlin et al. 2018) and generative pretrained transformer 2 (GPT-2) (Radford et al. 2019), have radically shifted the paradigm in NLP. In brief, transformers use an attention mechanism to learn a contextualized representation, or embedding, for each word in an input sentence. These word embeddings can then be used for a wide range of applications, such as text classification and text generation.
Transformers are conceived in three different forms: an encoder–decoder model (e.g., for text summarization), an encoder-only form (e.g., for sentiment analysis), or as a decoder-only model (e.g., for sequence generation). These transformer-based “language models” (LMs) are first pretrained on a self-supervised task, such as masked language modeling (MLM) for BERT variants (Fig. 4A), or causal language modeling (CLM) for GPT-2 variants (Fig. 4B). In a second step, the pretrained model is specialized to predict an outcome of interest (Devlin et al. 2018; Raffel et al. 2019). This can be done by using, for example, a single neural network layer that uses the pretrained model's embeddings as its features (Fig. 4C).
Figure 4.
Pretraining and fine-tuning regime for language models. (A) Masked language modeling for proteins and antibodies involves randomly perturbing a subset of residues and reconstructing the perturbed positions. (B) Causal language modeling is a next residue prediction task where the model is given information only up to the position it needs to predict. (C) Two-step training regime for paratope prediction using transformer-based antibody language models (ALMs). First, a transformer model is pretrained on a self-supervised task, such as masked language modeling (MLM), with a very large antibody sequence data set. Next, in the transfer learning step, the pretrained transformer's outputs act as the input for a second neural network with a lower volume of labeled data. (BCR) B-cell receptor.
The main advantage of LMs and the two-step transfer learning approach is that they harness big volumes of unlabeled text data to develop the initial comprehension of the language. The pretrained LM is also generalizable and can be adapted for classification or regression tasks. To learn latent patterns in huge text corpora, transformer models are often set up with millions of learnable parameters; for instance, BERT-base contains 110 million parameters (Devlin et al. 2018). By scaling up the number of parameters, in some cases to many billions of parameters, a transformer model can capture highly complex, nuanced patterns in a language. These large models can also thrive in “few-shot” learning scenarios (i.e., cases where there are shallow amounts of “labeled” data). All these factors have contributed to the huge popularity of transformers and their variants in building LMs (Lin et al. 2022).
The rapid evolution of transformer-based LMs prompted adaptations of transformers for building protein LMs (PLMs), starting with evolutionary scale modeling (ESM) (Rives et al. 2021), ProtTrans (Elnaggar et al. 2022), and a multitude of PLMs since. Much like the LMs developed for NLP applications, PLMs learn protein sequence representations from large corpora of unlabeled protein sequences. Furthermore, PLMs have been designed with the aim of having a single model that can generalize across a mixture of downstream tasks for any protein family.
Sequence representations from PLMs have been applied to various types of antibody property prediction. Embeddings from ESM models have been used as input to regression and classification models predicting antibody solubility (Feng et al. 2022) and thermostability (Zainchkovskyy et al. 2022; Harmalkar et al. 2023). Estimated sequence likelihoods from ESM, UniRep, and ProGen2 models were tested for rank correlations with measures of thermostability and expression quality with mixed results (Harmalkar et al. 2023; Nijkamp et al. 2023).
Motivated by the differences between antibodies and other proteins, various ALMs have been developed in parallel using transformers. Since the first ALM described by Prihoda et al. (2022), 14 different transformer-based ALMs have been described in the literature in total, which is by far the most for a single protein family (Fig. 3B). The key difference between PLMs and ALMs is their underlying pretraining corpus. PLMs are typically pretrained on nonredundant splits of UniRef, covering a broad spectrum of protein families. In contrast, ALMs are exclusively pretrained on nonredundant collections of BCR repertoires, such as a clustered, nonredundant subset of the OAS database.
The most common strategy for constructing ALMs is using encoder-only models that are derivatives of BERT (Devlin et al. 2018); 10 of the 14 ALMs in the literature are based on this architecture (Fig. 3B). The underlying bidirectional attention mechanism of these encoder-only architectures makes them ideal for learning representations of sequences. For example, models like AntiBERTa are pretrained on millions of BCR sequences from the OAS database using MLM. In this task, 15% of the amino acids in the BCR heavy or light chain are randomly perturbed, and the model is tasked with reconstructing the correct amino acids in their places (Fig. 4A). AntiBERTa then leverages its self-attention mechanism to understand the sequence context before and after the masked positions to predict the correct amino acids (Leem et al. 2022). Through pretraining, AntiBERTa learns to extract latent information from antibody sequences, and express this information in its output embeddings.
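The perturbation step of MLM can be sketched in a few lines. The example below is illustrative only and is not the AntiBERTa implementation: it assumes the 80/10/10 scheme of the original BERT paper (mask token, random residue, unchanged) for the selected 15% of positions, uses a hypothetical single-character `MASK` symbol, and runs on a made-up heavy-chain fragment.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "#"  # hypothetical single-character stand-in for a [MASK] token


def perturb_for_mlm(sequence, rate=0.15, seed=0):
    """Select ~rate of the positions and perturb them.

    Returns (inputs, targets), where targets maps each selected position
    to the original residue the model would be asked to reconstruct.
    """
    rng = random.Random(seed)
    n_selected = max(1, round(rate * len(sequence)))
    positions = sorted(rng.sample(range(len(sequence)), n_selected))
    tokens = list(sequence)
    targets = {}
    for pos in positions:
        targets[pos] = sequence[pos]
        draw = rng.random()
        if draw < 0.8:
            tokens[pos] = MASK                     # replace with the mask token
        elif draw < 0.9:
            tokens[pos] = rng.choice(AMINO_ACIDS)  # replace with a random residue
        # otherwise the residue is left unchanged but is still a target
    return "".join(tokens), targets


# Hypothetical heavy-chain fragment, used only for illustration.
sequence = "EVQLVESGGGLVQPGGSLRLSCAAS"
inputs, targets = perturb_for_mlm(sequence)
print(inputs)
print(targets)
```

During pretraining, the loss would be computed only at the positions listed in `targets`; all other positions serve purely as context.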
Similar to how LMs provide embeddings for each word in a sentence, ALMs compute embeddings for each amino acid in the broader amino acid sequence. In contrast to previous representation learning approaches for antibody sequences, ALM embeddings leverage sequence context. In other words, the representation for a single amino acid at a specific position will vary depending on the sequence upstream or downstream of that position.
Ultimately, these contextualized embeddings provide a valuable starting point for downstream modeling tasks and can yield highly accurate models (Fig. 4C). For instance, AntiBERTa has been used for predicting the paratopes of antibody sequences with high accuracy. Although there are only hundreds of antibody sequences with known paratope information, the bulk of the pattern recognition is handled during pretraining on millions of sequences. Effectively, pretraining makes paratope prediction a much more amenable task.
The remaining four ALMs are generative. ProGen2-OAS, IgLM, pAbT5, and xTrimoPGLM contain decoder layers, making them more suited for synthesizing novel antibody sequences in silico. For example, ProGen2-OAS is pretrained using a CLM objective, where amino acids are autoregressively predicted on a position-by-position basis, from the amino to carboxyl terminus (Fig. 4B). Like the encoder-only architectures described above, generative models also learn internal representations during CLM pretraining. However, due to their autoregressive nature, decoders use a unidirectional attention mechanism attending only to the portion of the sequence preceding each residue. As such, their representations can be less performant than bidirectional encoders in downstream prediction tasks (Devlin et al. 2018; Elnaggar et al. 2022). To circumvent these limitations, xTrimoPGLM-Ab has recently been proposed to leverage bidirectional attention while using a decoder architecture. This is done by first pretraining with MLM, then using an alternation between MLM and CLM (Chen et al. 2023). The representation capacities of xTrimoPGLM-Ab have only been tested on a limited number of tasks, although it has shown promising results.
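The difference between bidirectional and unidirectional attention comes down to which positions each residue is allowed to "see." The sketch below builds both kinds of attention mask for a short, hypothetical sequence; the helper names are our own, and the causal mask follows the common convention in which a position may attend to itself and to everything before it.

```python
def attention_mask(length, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(length)]
            for i in range(length)]


def visible_context(sequence, position, causal):
    """Show which residues are visible from `position`; hidden ones become '.'."""
    mask = attention_mask(len(sequence), causal)
    return "".join(res if mask[position][j] else "."
                   for j, res in enumerate(sequence))


seq = "CARDYW"  # hypothetical fragment, used only for illustration
print(visible_context(seq, 2, causal=False))  # CARDYW
print(visible_context(seq, 2, causal=True))   # CAR...
```

Under the causal mask, the representation of the third residue cannot draw on anything toward the carboxyl terminus, which is one intuition for why decoder embeddings can underperform encoder embeddings on prediction tasks.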
MLM is the dominant pretraining strategy among ALMs, although there can be variations. For instance, Gao et al. (2023) employ higher masking rates than the traditional 15%, while Gao et al. (2022) only mask the CDR residues as opposed to the entire heavy chain. Generative models such as IgLM and xTrimoPGLM-Ab rely on span predictions (Fig. 4B), while pAbT5 uses machine translation (Chu and Wei 2023). Currently, it is unclear which pretraining objective is superior, as each of these ALMs has been evaluated on separate bespoke tasks. For instance, AntiBERTy has been trained using MLM and used for antibody structure prediction and engineering. On the other hand, IgLM has only been tested for its capability of generating new variant antibodies. This makes it challenging to speculate how IgLM will perform on structure prediction.
STRUCTURE PREDICTION DRIVEN BY ALMs
Antibody structure prediction is a subproblem within the field of protein structure prediction. Accurate structure predictions are a precursor to many subsequent analyses, such as antibody–antigen docking and optimization (Hummer et al. 2022). Accurate antibody structure predictions can shed light on the residues that are important for binding the target (paratope). Structure predictions can also highlight where the antibody may bind on the target (epitope) to deconvolute the antibody's mechanism of action. Finally, structural models can pinpoint liabilities to the development and manufacturing process. Most structure prediction tools can predict the overall fold of the antibody structure with high accuracy, although the CDRs can pose a significant obstacle for current models. In particular, the CDRH3 is often poorly modeled. This is expected, as the CDRH3 is the most polymorphic in terms of sequence length, sequence diversity, and conformational plasticity (Fig. 1C; Fernández-Quintero et al. 2023a).
A fundamental ingredient of modern protein structure prediction tools such as AlphaFold2 and RoseTTAFold is the input multiple sequence alignment (MSA) (Baek et al. 2021; Jumper et al. 2021). Sites of sequence covariation in the MSA encode contacts in three-dimensional space, and thus act as constraints for structure prediction (Kuhlman and Bradley 2019). However, as antibody sequences are hypervariable, especially in the CDRs, it is impractical to build a sufficiently deep MSA for high-resolution antibody models. This has prompted the development of antibody structure prediction tools that leverage components in AlphaFold2 but avoid MSA inputs. ALMs offer a unique solution for this problem as they can represent salient features of the antibody sequence without necessitating homologous antibody sequences in their training sets. Furthermore, self-attention matrices from ALMs can allude to potential structural contacts within the antibody (Leem et al. 2022; Prihoda et al. 2022).
To our knowledge, four antibody structure prediction tools use ALMs: AbBERT-HMPN (Gao et al. 2022), IgFold (Ruffolo et al. 2023), xTrimoAbFold (Wang et al. 2022b), and xTrimoPGLM-AbFold (Chen et al. 2023). AbBERT-HMPN uses the AbBERT ALM to generate embeddings that act as features for a structure prediction graph neural network. IgFold extracts antibody sequence embeddings and attention matrices from the AntiBERTy ALM as node and edge features for a graph transformer. This effectively replaces the MSA of AlphaFold2. In a more explicit substitution of the MSA, xTrimoAbFold uses the embeddings and attention matrices from an ALM as direct inputs to AlphaFold2's Evoformer and structure modules. xTrimoPGLM-AbFold is conceptually similar to xTrimoAbFold, although it uses fewer Evoformer and structure module blocks. All four approaches report excellent accuracies, although IgFold, xTrimoAbFold, and xTrimoPGLM-AbFold have the added advantage of being able to predict structures in seconds. This makes it feasible to use these tools for B-cell repertoire data sets, allowing users to structurally cluster leads and identify functionally convergent clones (Raybould et al. 2021; Robinson et al. 2021).
Ignoring ALMs and PLMs altogether are ImmuneBuilder and EquiFold (Lee et al. 2022; Abanades et al. 2023). Like xTrimoAbFold, ImmuneBuilder depends on AlphaFold2's structure module for coordinate generation. Instead of using an ALM input, ImmuneBuilder uses a one-hot encoding of the antibody sequence. On the other hand, EquiFold uses a bespoke antibody sequence representation for predicting the structure using the Equiformer model (Liao and Smidt 2022). Despite having simpler input features, both models can generate exceptional predictions across all antibody regions. However, structure prediction tools that leverage ALMs produce the most accurate results, with xTrimoPGLM-AbFold reporting the lowest root-mean-square deviations (RMSDs).
One of the challenges in evaluating the state of antibody structure prediction is the lack of a blinded, community-accepted benchmark, along with an accepted comparison metric, akin to critical assessment of protein structure prediction (CASP) (Moult et al. 1995). As a result, it is difficult to determine the precise level of value that ALMs add to structure prediction accuracy. It is also worth noting that publicly available antibody structures cover a biased proportion of the total antibody sequence space. For example, there is an overwhelming coverage of SARS-CoV-2-binding antibodies, but a paucity of antibodies binding targets related to other diseases (Dunbar et al. 2014). Training on structural data alone, as in the case of ImmuneBuilder and EquiFold, may yield models that only perform well on a subset of antibodies (e.g., antibodies with shorter CDRH3 lengths) or antibodies that are more rigid and crystallizable. ALM-based models may be more robust to these scenarios as they would have been exposed to a broader spectrum of antibody sequences through their pretraining procedure.
ANTIBODY ENGINEERING
The “fitness” of an antibody is a highly nuanced term, encompassing a wide range of properties, such as thermostability and biochemical activity (Dallago et al. 2021). An antibody engineer must consider not only antibody thermostability and developability but also the target that the antibody binds, the affinity at which the target is bound, and the specificity of binding to a particular region within a target. Further still, function can also refer to downstream effects of the antibody, such as the antibody's agonistic or antagonistic properties (Schardt et al. 2022), its capacity to produce an immune response (referred to as “immunogenicity”), and off-target toxicity (Fernández-Quintero et al. 2023b). Successful modulation of these properties, and others, contributes to making a safe and efficacious therapeutic antibody. To date, ALMs have only been used to engineer the variable domain sequence; antibody function can also be affected by the constant region, which has been outside the remit of ALM-based techniques.
As with general proteins, the landscape for each of these properties is hilly, and combining mutations may have nonadditive epistatic effects. Moreover, optimizing for a single property in isolation may be detrimental to another. Computational methods can reduce the risks of antibody engineering by helping protein engineers distil a set of advantageous mutations from the wider combinatoric space. Typical pipelines feature a model that predicts a single aspect of antibody function, such as thermostability, from which mutations are then proposed. For therapeutic applications, ALM embeddings have mostly been used to predict binding affinity, safety, or developability.
Binding Affinity Optimization Using ALMs
Affinity maturation models typically harness sequence representations as the input for a regression model on a quantitative measure of affinity, such as KD. Both PLMs and ALMs have been shown to be adaptable to this purpose (Hie et al. 2022a; Li et al. 2023). These models have reported success in training target-specific regression models when provided with training data of 10^3–10^5 binders and nonbinders to a target antigen from high-throughput assays. For instance, the AlphaSeq assay uses a yeast mating system to obtain hundreds of thousands of antibody–antigen interactions for fine-tuning ALMs (Engelhart et al. 2022; Li et al. 2023). The ACE assay from Bachas et al. (2022) is another route to obtaining high-throughput data sets that can fuel ALMs.
One study that demonstrates how ALMs can tie in with computational optimization algorithms is the work by Li et al. (2023). Using the AlphaSeq data set from Engelhart et al. (2022), a pretrained BERT transformer is fine-tuned to predict affinity and carve a mutational landscape to prioritize mutations. The proposed design variants achieve nearly 30-fold higher affinity than a baseline approach using PSSMs. Another striking aspect of the model is that despite being only trained on sequences with at most three mutations in the CDRs, the optimization model can generate variants with over 20 mutations. However, the implications of such a mutated antibody sequence with respect to immunogenicity and safety were not assessed. While this is a promising technique, it may be impractical for many, as the model required ∼10^4 antibodies with known binding affinities. Furthermore, the target is a peptide; antigen-binding dynamics will likely be very different for larger proteins with conformational epitopes.
Although a generalized regression model that can predict binding to any arbitrary target is of great interest, at this time it is beyond reach, given the publicly available training data. To date, most antibody-binding data sets are either narrow and deep (i.e., many antibody variants for a single antigen) (Mason et al. 2021; Engelhart et al. 2022) or wide and shallow (i.e., antibodies against many different targets, but with very few examples per target) (Dunbar et al. 2014; Hie et al. 2022a).
An alternative approach to predicting affinity is to rank sequences as opposed to regressing a specific value. For instance, Ruffolo et al. (2023) ordered ALM embeddings along an affinity maturation pathway using the evolutionary velocity technique (Hie et al. 2022b). In brief, an ALM is first used to calculate the likelihoods of various antibody sequence mutants, and sequences are sorted by difference in likelihood from a parent sequence.
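The ranking idea can be sketched as follows. This is not the evolutionary velocity implementation: `toy_log_likelihood` is a hypothetical, position-independent stand-in for the sequence likelihood an ALM would supply, and its residue frequencies are invented for illustration. Only the sorting logic reflects the description above.

```python
import math


def rank_by_likelihood_gain(parent, variants, log_likelihood):
    """Sort variants by log-likelihood difference from the parent, best first."""
    parent_score = log_likelihood(parent)
    scored = [(log_likelihood(v) - parent_score, v) for v in variants]
    return sorted(scored, reverse=True)


# Toy stand-in for an ALM: invented residue frequencies with a flat
# default of 0.05 for every residue not listed.
TOY_FREQ = {"Y": 0.3, "G": 0.2, "S": 0.2, "D": 0.1}


def toy_log_likelihood(sequence):
    return sum(math.log(TOY_FREQ.get(res, 0.05)) for res in sequence)


parent = "ARDYW"                        # hypothetical parent fragment
variants = ["AKDYW", "ARGYW", "ARDYY"]  # hypothetical single mutants
ranking = rank_by_likelihood_gain(parent, variants, toy_log_likelihood)
for gain, variant in ranking:
    print(f"{variant}  {gain:+.2f}")
```

In practice, `log_likelihood` would be replaced by a call to a pretrained model, and a higher likelihood is a proxy for evolutionary plausibility rather than a direct measurement of affinity.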
Another possible strategy for affinity maturation that has not yet been extensively explored is leveraging paratope predictions. ALMs can predict paratopes with good accuracy (Leem et al. 2022; Wang et al. 2023), and are comparable in performance to structure-based paratope prediction tools. By identifying sites that are most likely to affect antigen binding, this can help prioritize positions to manipulate for affinity maturation. While this would not necessarily provide guidance on the impact of mutations (i.e., does a provisional change increase or decrease affinity), it could be coupled with methods such as phage display to design new variants.
Safety
In terms of safety, ALMs have been promising for antibody “humanization.” This involves engineering antibodies extracted from immunized animals to resemble a human antibody sequence and reduce immunogenic risk (Prihoda et al. 2022). The ability of ALMs to make residue-level humanness predictions, alongside a conditional probability distribution for each amino acid at a given position, makes them useful for guiding humanization campaigns.
The Sapiens ALM from Prihoda et al. (2022) can propose mutations that match decisions that would have been undertaken by experimental scientists. On the other hand, IgLM takes a slightly different approach; instead of point mutations, it proposes a contiguous “span” of mutations to facilitate humanization (Shuai et al. 2023). As IgLM is a generative model, it is possible to tune the model to explore a larger set of possible mutations, as opposed to taking the most probable mutation. While both models can increase humanness scores, the scores themselves are weakly correlated with experimentally determined immunogenicity (Prihoda et al. 2022); moreover, there are only ∼200 antibody sequences with experimental immunogenicity data in the public domain (Marks et al. 2021). Thus, while these two approaches demonstrate the proof-of-principle that ALMs can help create safer therapeutics, more predictive safety metrics, standardized assays for robust validation, and the data to support creating such metrics, are critical (Ducret et al. 2022).
Developability
When manufacturing an antibody molecule for therapeutic use, it is imperative that it retains favorable biophysical characteristics, such as high thermostability and low aggregation propensity. One solution is to predict sequence liabilities (e.g., solvent-accessible deamidation motifs) within the variable domains using structural models (Leem et al. 2016; Raybould et al. 2019). Many of the factors that influence the shelf-life of a molecule can be outside of the antibody sequence itself, such as formulation (Fernández-Quintero et al. 2023b). Nevertheless, modifications of the amino acid sequence can influence developability, and several studies have attempted using ALMs for enhancing developability.
Harmalkar et al. (2023) used ESM-1b and AntiBERTy to predict thermostability. The authors found that AntiBERTy had poorer zero-shot performance than ESM-1b, although after fine-tuning, the ALM was superior and had stronger out-of-distribution performance. When taken forward for optimization, ESM-1b had better agreement with experimental data than structure-based methods or AntiBERTy. However, a caveat of this work is that it features only 20 experimental data points. Large-scale developability data sets are rare in the public domain. To our knowledge, the largest such set has data for only ∼400 antibody sequences (Shehata et al. 2019), while the next largest contains ∼140 antibodies (Jain et al. 2017). Neither provides sufficient volume for deep learning models to learn the patterns underpinning thermostability. We expect that, with more data, the outlook for ALMs in supporting antibody developability will be more optimistic.
Multiparameter Optimization
Optimizing an antibody sequence should ideally be done within a multiparameter optimization (MPO) framework to directly model the trade-offs of optimizing one property over another. Recently, a Bayesian approach has been published for antibody MPO (Khan et al. 2023), although it does not use an ALM and is outside the scope of this review.
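One simple way to expose such trade-offs, shown here purely as an illustration and not as the method of Khan et al. (2023), is to keep only the candidates that are not outperformed on every property by another candidate (the Pareto front). In the sketch below, the variant names and the two scores per variant are made-up values, higher is assumed to be better for both properties, and `dominates` and `pareto_front` are hypothetical helper names.

```python
def dominates(a, b):
    """True if score tuple a is at least as good as b on every property
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(candidates):
    """Return the sorted names of candidates that no other candidate dominates."""
    return sorted(
        name
        for name, scores in candidates.items()
        if not any(
            dominates(other_scores, scores)
            for other, other_scores in candidates.items()
            if other != name
        )
    )

# Toy scores (assumption): (binding affinity score, thermostability score).
variants = {
    "v1": (0.9, 0.2),
    "v2": (0.6, 0.6),
    "v3": (0.5, 0.5),  # worse than v2 on both properties
    "v4": (0.2, 0.9),
}

front = pareto_front(variants)
```

Here `v3` is excluded because `v2` is better on both properties, while `v1`, `v2`, and `v4` each represent a different compromise that a downstream decision would still have to weigh.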
Two studies have attempted to harmonize the various facets of antibody fitness into one global metric for optimization. Bachas et al. (2022) calculated the pseudo–log likelihood of an antibody sequence from their ALM, which they referred to as “naturalness.” They showed that naturalness is broadly correlated with immunogenicity, developability, and expression titer, but not binding affinity. Nijkamp et al. (2023) used a similar concept, calculating the perplexity of an antibody sequence from the ProGen2-OAS ALM. Perplexity was tested for its correlation with thermostability and binding affinity but was found to be weakly correlated with both. This may be explained by the fact that representations from decoders are generally less performant for sequence classification tasks than those from encoder-based models (Elnaggar et al. 2022). Computing a single metric that can capture the multitude of antibody properties remains an unsolved challenge. However, such a metric would be a boon for antibody engineering, as it would simplify optimization routines and lead to more interpretable outcomes.
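The two scores discussed above can be written down compactly. The sketch below assumes the standard definitions: the pseudo-log-likelihood is the sum of the log-probabilities of each residue when it alone is masked, and perplexity is the exponential of the mean negative log-probability under a left-to-right model. The callables `masked_prob` and `next_prob` are hypothetical stand-ins for ALM queries, the uniform toy models are for illustration only, and the exact normalization used by Bachas et al. (2022) or Nijkamp et al. (2023) may differ.

```python
import math

def pseudo_log_likelihood(sequence, masked_prob):
    """Sum over positions of log p(residue_i | all other residues),
    masking one position at a time (masked language model setting)."""
    return sum(math.log(masked_prob(sequence, i)) for i in range(len(sequence)))

def perplexity(sequence, next_prob):
    """exp of the mean negative log-probability of each residue given
    the residues before it (autoregressive model setting)."""
    nll = -sum(
        math.log(next_prob(sequence[:i], sequence[i]))
        for i in range(len(sequence))
    )
    return math.exp(nll / len(sequence))

# Toy stand-ins (assumption): a model with no preference among 20 amino acids.
def uniform_masked_prob(sequence, position):
    return 1.0 / 20

def uniform_next_prob(prefix, residue):
    return 1.0 / 20

fragment = "EVQLVESGG"
pll = pseudo_log_likelihood(fragment, uniform_masked_prob)
ppl = perplexity(fragment, uniform_next_prob)
```

A model that is indifferent among the 20 amino acids gives a perplexity of 20 regardless of sequence length, which is a useful reference point: lower values indicate that the model finds the sequence more predictable.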
CHALLENGES AND OPPORTUNITIES
While ALMs have been successful in many applications, their predictive power can be variable; in some applications, there is a clear gap in the volume of labeled data, which could give a “false negative” view on the utility of ALMs. Two key challenges remain in the pathway toward a general-purpose ALM that can truly revolutionize antibody drug discovery: creating large-scale data sets for machine learning, and technical standardization.
Creating Large-Scale Data Sets for Machine Learning
Data is the bedrock for successful machine learning. Ideal data sets for testing ALMs should have as many of the following characteristics as possible: low measurement noise, broad antibody sequence and target coverage, high throughput, and translational relevance. To our knowledge, no publicly available data set satisfies all these requirements.
Antibody data sets are typically “narrow and deep” in nature. For example, the recently published data set of antibody-binding data to SARS-CoV-2 peptides contains over 100,000 antibody sequences (Engelhart et al. 2022). It can be useful for screening and designing new antibodies against the SARS-CoV-2 virus (Li et al. 2023). However, it is difficult to predict how well an ALM fine-tuned on this data set will generalize to other antigens, especially nonviral targets. Another narrow and deep data set is the publicly available set of more than 30,000 trastuzumab variants that were screened using yeast display (Mason et al. 2021). While this data set is valuable for training and evaluating ALMs that predict HER2 binding, all the antibody sequence variation is focused in the CDRH3 and CDRL3. This raises the question of whether an ALM trained on it would generalize to variants outside these regions. Here, we believe that assays like ACE, AlphaSeq, and MIPSA are particularly promising, as they can sample thousands of antibody–antigen interactions (Younger et al. 2017; Bachas et al. 2022; Credle et al. 2022). Data sets can also be “wide and shallow”; SAbDab is a prime example, which contains antibody-binding affinity data for a diverse set of antibodies and antigens. However, for most antigens, there are only one or two cognate antibodies.
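The distinction between “narrow and deep” and “wide and shallow” can be made quantitative by counting how many distinct antibodies are recorded per antigen. The following sketch is an illustration under simple assumptions: the records are synthetic (antibody, antigen) pairs rather than entries from any of the data sets named above, and `dataset_shape` is a hypothetical helper.

```python
from collections import defaultdict
from statistics import median

def dataset_shape(pairs):
    """Summarize (antibody_id, antigen_id) records by antigen coverage."""
    by_antigen = defaultdict(set)
    for antibody, antigen in pairs:
        by_antigen[antigen].add(antibody)
    counts = sorted(len(antibodies) for antibodies in by_antigen.values())
    return {
        "n_antigens": len(counts),
        "median_antibodies_per_antigen": median(counts),
        "max_antibodies_per_antigen": counts[-1],
    }

# Synthetic examples (assumption): 1,000 antibodies against one antigen,
# versus 1,000 antibodies spread over 500 antigens, two per antigen.
narrow_deep = [(f"ab{i}", "antigen_A") for i in range(1000)]
wide_shallow = [(f"ab{i}", f"antigen_{i // 2}") for i in range(1000)]

narrow_summary = dataset_shape(narrow_deep)
wide_summary = dataset_shape(wide_shallow)
```

A data set with few antigens and a large median count is narrow and deep, whereas many antigens with a median of one or two antibodies is wide and shallow. Reporting such a summary alongside a benchmark would make it easier to judge what kind of generalization a model has actually been tested on.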
We believe that benchmark data sets in deep learning research, such as GLUE for NLP or ImageNet for computer vision, provide an ideal template for future work in developing ALMs (Russakovsky et al. 2015; Wang et al. 2018). Both GLUE and ImageNet have thousands of examples for various subtasks, giving researchers access to large data sets for training and evaluating models, and both span many different tasks, giving a more holistic view of the power of those models.
SAbDab follows this premise, as it can act as a benchmark for binding affinity prediction, paratope prediction, or structure prediction, but at low volumes for each separate task (Dunbar et al. 2014). In addition, SAbDab is highly skewed toward antiviral antibodies, particularly SARS-CoV-2 binders, which may hamper the translational impact of models. The recently published ATUE benchmark is another excellent attempt at creating a community benchmark for gauging model performance (Wang et al. 2023). So far in ATUE, there are four different tasks: classifying trastuzumab variants for antigen-binding, paratope prediction, B-cell-type prediction, and ranking SARS-CoV-2 binders from collections of BCR repertoires. Apart from paratope prediction, each data set contains over 10,000 training examples, making it a good initial platform for evaluating models. The most impactful data sets will only come about through closer partnerships between experimental and dry-lab scientists, as well as interorganizational collaborations, especially on core questions in drug discovery.
Technical Standardization
Given the relatively recent development of ALMs, it is unsurprising that the literature on best practices for building these models is not yet extensive. Indeed, many of the hyperparameter choices for setting up an ALM (e.g., type of transformer architecture, number of layers) are based on successful recipes in NLP. For instance, large LMs in NLP can feature many self-attention heads, but the justification for having so many self-attention heads for antibody sequences has yet to be established. In fact, the correct design choices for many aspects of ALMs remain unresolved. For example, some ALMs are heavy- or light-chain-specific (e.g., AbLang-VH and AbLang-VL; Olsen et al. 2022), while others accept either chain (e.g., AntiBERTa; Leem et al. 2022). Only xTrimoPGLM-Ab, to our knowledge, can accept both chains simultaneously as its input (Chen et al. 2023).
The closest attempt at investigating “good standards” for building ALMs has been the work by Gao et al. (2023). Here, the authors ran several ablations to determine how certain hyperparameter choices can affect reconstruction of masked sequences. Their work provides a template for some of the considerations that should be made in building ALMs. However, this study did not consider the impact of pretraining data size, nor how the changes in these pretraining regimes affect downstream fine-tuning performance. This is where studies in PLM design, primarily those investigating the impact of scaling up models, can provide a blueprint for how we should consider designing new ALMs in the future (Elnaggar et al. 2023; Lin et al. 2023; Nijkamp et al. 2023).
Another initiative that can facilitate standardization is agreeing on a set of software libraries and packages for development. This should enable more robust comparisons between models and reduce engineering burden. Most ALMs use PyTorch and the HuggingFace transformers library (Leem et al. 2022; Nijkamp et al. 2023; Ruffolo et al. 2023; Shuai et al. 2023), although some use FAIRSeq (Olsen et al. 2022; Prihoda et al. 2022).
CONCLUDING REMARKS
In this work, we describe the evolution and application of ALMs for representation learning of antibody sequences. Following a long history of using context-free methods, transformer-based ALMs have changed the paradigm for antibody sequence representation learning in the span of 2 years. They are now a “Swiss-army knife” that can be used across a suite of bioinformatics challenges in therapeutic antibody discovery, including antibody structure prediction and antibody engineering. A striking feature of ALMs is their sheer generalizability: a prime example is the AntiBERTy model, which has been used for structure prediction and for engineering several different antibody properties. Considering that AntiBERTy is one of many encoder-only models, other ALMs such as AntiBERTa, AbLang, and Sapiens can equally be customized for these applications.
We anticipate further adoption of ALMs, and an increasing volume of experimental evidence that validates the use of these models. The field is still rapidly evolving; xTrimoPGLM-Ab has recently broken the one billion parameter barrier, and we expect further scaling up of ALMs. Furthermore, we anticipate tighter integration between ALMs and other types of deep learning architectures, such as diffusion models (Luo et al. 2022). Taking inspiration from the broader NLP domain, we also believe that ALMs will become multimodal, whether through integration with structural data, imaging data, or other types of phenotypic readouts that can map the relationship between an antibody's sequence and its function.
ACKNOWLEDGMENTS
We would like to thank the wider team at Alchemab Therapeutics for their thoughts and discussion points to prepare the manuscript.
Footnotes
Editors: Peter K. Koo, Christian Dallago, Ananthan Nambiar, and Kevin K. Yang
Additional Perspectives on Machine Learning for Protein Science and Engineering available at www.cshperspectives.org
REFERENCES
- Abanades B, Wong WK, Boyles F, Georges G, Bujotzek A, Deane CM. 2023. Immunebuilder: deep-learning models for predicting the structures of immune proteins. Commun Biol 6: 575. 10.1038/s42003-023-04927-7
- Akbar R, Robert PA, Pavlović M, Jeliazkov JR, Snapkov I, Slabodkin A, Weber CR, Scheffer L, Miho E, Haff IH, et al. 2021. A compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding. Cell Rep 34: 108856. 10.1016/j.celrep.2021.108856
- Ambrosetti F, Olsen TH, Olimpieri PP, Jiménez-García B, Milanetti E, Marcatilli P, Bonvin AMJJ. 2020. proABC-2: prediction of antibody contacts v2 and its application to information-driven docking. Bioinformatics 36: 5107–5108. 10.1093/bioinformatics/btaa644
- Amimeur T, Shaver JM, Ketchem RR, Taylor JA, Clark RH, Smith J, Van Citters D, Siska CC, Smidt P, Sprague M, et al. 2020. Designing feature-controlled humanoid antibody discovery libraries using generative adversarial networks. bioRxiv 10.1101/2020.04.12.024844
- Bachas S, Rakocevic G, Spencer D, Sastry AV, Haile R, Sutton JM, Kasun G, Stachyra A, Gutierrez JM, Yassine E, et al. 2022. Antibody optimization enabled by artificial intelligence predictions of binding affinity and naturalness. bioRxiv 10.1101/2022.08.16.504181
- Baek M, DiMaio F, Anishchenko I, Dauparas J, Ovchinnikov S, Lee GR, Wang J, Cong Q, Kinch LN, Schaeffer RD, et al. 2021. Accurate prediction of protein structures and interactions using a three-track neural network. Science 373: 871–876. 10.1126/science.abj8754
- Bashford-Rogers RJM, Bergamaschi L, McKinney EF, Pombal DC, Mescia F, Lee JC, Thomas DC, Flint SM, Kellam P, Jayne DRW, et al. 2019. Analysis of the B cell receptor repertoire in six immune-mediated diseases. Nature 574: 122–126. 10.1038/s41586-019-1595-3
- Bystroff C, Krogh A. 2008. Hidden Markov models for protein structure prediction. Methods Mol Biol 413: 173–198. 10.1007/978-1-59745-574-9_7
- Chalkidis I, Fergadiotis M, Malakasiotis P, Aletras N, Androutsopoulos I. 2020. LEGAL-BERT: the muppets straight out of law school. arXiv 10.48550/arXiv.2010.02559
- Chaudhary N, Wesemann DR. 2018. Analyzing immunoglobulin repertoires. Front Immunol 9: 462. 10.3389/fimmu.2018.00462
- Chen X, Dougherty T, Hong C, Schibler R, Zhao YC, Sadeghi R, Matasci N, Wu YC, Kerman I. 2020. Predicting antibody developability from sequence using machine learning. bioRxiv 10.1101/2020.06.18.159798
- Chen B, Cheng X, Geng Y, Li S, Zeng X, Wang B, Gong J, Liu C, Zeng A, Dong Y, et al. 2023. xTrimoPGLM: unified 100B-scale pre-trained transformer for deciphering the language of protein. bioRxiv 10.1101/2023.07.05.547496
- Chu SKS, Wei KY. 2023. Conditional generation of paired antibody chain sequences through encoder-decoder language model. arXiv 10.48550/arXiv.2301.02748
- Corrie BD, Marthandan N, Zimonja B, Jaglale J, Zhou Y, Barr E, Knoetze N, Breden FMW, Christley S, Scott JK, et al. 2018. Ireceptor: a platform for querying and analyzing antibody/B-cell and T-cell receptor repertoire data across federated repositories. Immunol Rev 284: 24–41. 10.1111/imr.12666
- Credle JJ, Gunn J, Sangkhapreecha P, Monaco DR, Zheng XA, Tsai HJ, Wilbon A, Morgenlander WR, Rastegar A, Dong Y, et al. 2022. Unbiased discovery of autoantibodies associated with severe COVID-19 via genome-scale self-assembled DNA-barcoded protein libraries. Nat Biomed Eng 6: 992–1003. 10.1038/s41551-022-00925-y
- Dallago C, Mou J, Johnston KE, Wittmann BJ, Bhattacharya N, Goldman S, Madani A, Yang KK. 2021. FLIP: benchmark tasks in fitness landscape inference for proteins. bioRxiv 10.1101/2021.11.09.467890
- Del Vecchio A, Deac A, Liò P, Veličković P. 2021. Neural message passing for joint paratope-epitope prediction. arXiv 10.48550/arXiv.2106.00757
- Devlin J, Chang MW, Lee K, Toutanova K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv 10.48550/arXiv.1810.04805
- Ducret A, Ackaert C, Bessa J, Bunce C, Hickling T, Jawa V, Kroenke MA, Lamberth K, Manin A, Penny HL, et al. 2022. Assay format diversity in pre-clinical immunogenicity risk assessment: toward a possible harmonization of antigenicity assays. MAbs 14: 1993522. 10.1080/19420862.2021.1993522
- Dunbar J, Deane CM. 2016. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics 32: 298–300. 10.1093/bioinformatics/btv552
- Dunbar J, Krawczyk K, Leem J, Baker T, Fuchs A, Georges G, Shi J, Deane CM. 2014. SAbdab: the structural antibody database. Nucleic Acids Res 42: D1140–D1146. 10.1093/nar/gkt1043
- Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. 2022. Prottrans: toward understanding the language of life through self-supervised learning. IEEE Trans Pattern Anal Mach Intell 44: 7112–7127. 10.1109/TPAMI.2021.3095381
- Elnaggar A, Essam H, Salah-Eldin W, Moustafa W, Elkerdawy M, Rochereau C, Rost B. 2023. Ankh: optimized protein language model unlocks general-purpose modelling. arXiv 10.48550/arXiv.2301.06568
- Engelhart E, Emerson R, Shing L, Lennartz C, Guion D, Kelley M, Lin C, Lopez R, Younger D, Walsh ME. 2022. A dataset comprised of binding interactions for 104,972 antibodies against a SARS-CoV-2 peptide. Sci Data 9: 653. 10.1038/s41597-022-01779-4
- Esmaielbeiki R, Krawczyk K, Knapp B, Nebel JC, Deane CM. 2016. Progress and challenges in predicting protein interfaces. Brief Bioinform 17: 117–131. 10.1093/bib/bbv027
- Feng J, Jiang M, Shih J, Chai Q. 2022. Antibody apparent solubility prediction from sequence by transfer learning. iScience 25: 105173. 10.1016/j.isci.2022.105173
- Fernández-Quintero ML, Kokot J, Waibl F, Fischer ALM, Quoika PK, Deane CM, Liedl KR. 2023a. Challenges in antibody structure prediction. MAbs 15: 2175319. 10.1080/19420862.2023.2175319
- Fernández-Quintero ML, Ljungars A, Waibl F, Greiff V, Andersen JT, Gjølberg TT, Jenkins TP, Voldborg BG, Grav LM, Kumar S, et al. 2023b. Assessing developability early in the discovery process for novel biologics. MAbs 15: 2171248. 10.1080/19420862.2023.2171248
- Friedensohn S, Neumeier D, Khan TA, Csepregi L, Parola C, de Vries ARG, Erlach L, Mason DM, Reddy ST. 2020. Convergent selection in antibody repertoires is revealed by deep learning. bioRxiv 10.1101/2020.02.25.965673
- Galson JD, Schaetzle S, Bashford-Rogers RJM, Raybould MIJ, Kovaltsuk A, Kilpatrick GJ, Minter R, Finch DK, Dias J, James LK, et al. 2020. Deep sequencing of B cell receptor repertoires from COVID-19 patients reveals strong convergent immune signatures. Front Immunol 11: 605170. 10.3389/fimmu.2020.605170
- Gao K, Wu L, Zhu J, Peng T, Xia Y, He L, Xie S, Qin T, Liu H, He K, et al. 2022. Incorporating pre-training paradigm for antibody sequence-structure co-design. arXiv 10.48550/arXiv.2211.0840
- Gao X, Cao C, Lai L. 2023. Pre-training with a rational approach for antibody. bioRxiv 10.1101/2023.01.19.524683
- Georgiou G, Ippolito GC, Beausang J, Busse CE, Wardemann H, Quake SR. 2014. The promise and challenge of high-throughput sequencing of the antibody repertoire. Nat Biotechnol 32: 158–168. 10.1038/nbt.2782
- Greiff V, Weber CR, Palme J, Bodenhofer U, Miho E, Menzel U, Reddy ST. 2017. Learning the high-dimensional immunogenomic features that predict public and private antibody repertoires. J Immunol 199: 2985–2997. 10.4049/jimmunol.1700594
- Gu Y, Tinn R, Cheng H, Lucas M, Usuyama N, Liu X, Naumann T, Gao J, Poon H. 2021. Domain-specific language model pretraining for biomedical natural language processing. Acm Trans Comput Healthc 3: 1–23. 10.1145/3458754
- Harmalkar A, Rao R, Xie YR, Honer J, Deisting W, Anlahr J, Hoenig A, Czwikla J, Sienz-Widmann E, Rau D, et al. 2023. Toward generalizable prediction of antibody thermostability using machine learning on sequence and structure features. MAbs 15: 2163584. 10.1080/19420862.2022.2163584
- Hie BL, Shanker VR, Xu D, Bruun TUJ, Weidenbacher PA, Tang S, Kim PS. 2022a. Efficient evolution of human antibodies from general protein language models. Nat Biotechnol 10.1038/s41587-023-01763-2
- Hie BL, Yang KK, Kim PS. 2022b. Evolutionary velocity with protein language models predicts evolutionary dynamics of diverse proteins. Cell Syst 13: 274–285.e6. 10.1016/j.cels.2022.01.003
- Hummer AM, Abanades B, Deane CM. 2022. Advances in computational structure-based antibody design. Curr Opin Struc Biol 74: 102379. 10.1016/j.sbi.2022.102379
- Imrie F, Bradley AR, van der Schaar M, Deane CM. 2018. Protein family-specific models using deep neural networks and transfer learning improve virtual screening and highlight the need for more data. J Chem Inf Model 58: 2319–2330. 10.1021/acs.jcim.8b00350
- Jaffe DB, Shahi P, Adams BA, Chrisman AM, Finnegan PM, Raman N, Royall AE, Tsai F, Vollbrecht T, Reyes DS, et al. 2022. Functional antibodies exhibit light chain coherence. Nature 611: 352–357. 10.1038/s41586-022-05371-z
- Jain T, Sun T, Durand S, Hall A, Houston NR, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y, et al. 2017. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci 114: 944–949. 10.1073/pnas.1616408114
- Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. 2021. Highly accurate protein structure prediction with AlphaFold. Nature 596: 583–589. 10.1038/s41586-021-03819-2
- Khan A, Cowen-Rivers AI, Grosnit A, Deik DGX, Robert PA, Greiff V, Smorodina E, Rawat P, Akbar R, Dreczkowski K, et al. 2023. Toward real-world automated antibody design with combinatorial Bayesian optimization. Cell Rep Methods 3: 100374. 10.1016/j.crmeth.2022.100374
- Konishi H, Komura D, Katoh H, Atsumi S, Koda H, Yamamoto A, Seto Y, Fukayama M, Yamaguchi R, Imoto S, et al. 2019. Capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of B-cell receptors using supervised machine learning. BMC Bioinformatics 20: 267. 10.1186/s12859-019-2853-y
- Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. 2018. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J Immunol 201: 2502–2509. 10.4049/jimmunol.1800708
- Krawczyk K, Raybould MIJ, Kovaltsuk A, Deane CM. 2019. Looking for therapeutic antibodies in next-generation sequencing repositories. MAbs 11: 1197–1205. 10.1080/19420862.2019.1633884
- Krawczyk K, Buchanan A, Marcatili P. 2021. Data mining patented antibody sequences. MAbs 13: 1892366. 10.1080/19420862.2021.1892366
- Kuhlman B, Bradley P. 2019. Advances in protein structure prediction and design. Nat Rev Mol Cell Bio 20: 681–697. 10.1038/s41580-019-0163-x
- Lee JH, Yadollahpour P, Watkins A, Frey NC, Leaver-Fay A, Ra S, Cho K, Gligorijević V, Regev A, Bonneau R. 2022. Equifold: protein structure prediction with a novel coarse-grained structure representation. bioRxiv 10.1101/2022.10.07.511322
- Leem J, Dunbar J, Georges G, Shi J, Deane CM. 2016. ABodyBuilder: automated antibody structure prediction with data–driven accuracy estimation. MAbs 8: 1259–1268. 10.1080/19420862.2016.1205773
- Leem J, Mitchell LS, Farmery JHR, Barton J, Galson JD. 2022. Deciphering the language of antibodies using self-supervised learning. Patterns 3: 100513. 10.1016/j.patter.2022.100513
- Li L, Gupta E, Spaeth J, Shing L, Jaimes R, Engelhart E, Lopez R, Caceres RS, Bepler T, Walsh ME. 2023. Machine learning optimization of candidate antibodies yields highly diverse sub-nanomolar affinity antibody libraries. Nat Commun 14: 3454. 10.1038/s41467-023-39022-2
- Liao YL, Smidt T. 2022. Equiformer: equivariant graph attention transformer for 3D atomistic graphs. arXiv 10.48550/arXiv.2206.11990
- Liberis E, Veličković P, Sormanni P, Vendruscolo M, Liò P. 2018. Parapred: antibody paratope prediction using convolutional and recurrent neural networks. Bioinformatics 34: 2944–2950. 10.1093/bioinformatics/bty305
- Lim YW, Adler AS, Johnson DS. 2022. Predicting antibody binders and generating synthetic antibodies using deep learning. MAbs 14: 2069075. 10.1080/19420862.2022.2069075
- Lin T, Wang Y, Liu X, Qiu X. 2022. A survey of transformers. AI Open 3: 111–132. 10.1016/j.aiopen.2022.10.001
- Lin Z, Akin H, Rao R, Hie B, Zhu Z, Lu W, Smetanin N, Verkuil R, Kabeli O, Shmueli Y, et al. 2023. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379: 1123–1130. 10.1126/science.ade2574
- Luo S, Su Y, Peng X, Wang S, Peng J, Ma J. 2022. Antigen-specific antibody design and optimization with diffusion-based generative models for protein structures. bioRxiv 10.1101/2022.07.10.499510
- Marks C, Hummer AM, Chin M, Deane CM. 2021. Humanization of antibodies using a machine learning approach on large-scale repertoire data. Bioinformatics 37: 4041–4047. 10.1093/bioinformatics/btab434
- Mason DM, Friedensohn S, Weber CR, Jordi C, Wagner B, Meng SM, Ehling RA, Bonati L, Dahinden J, Gainza P, et al. 2021. Optimization of therapeutic antibodies by predicting antigen specificity from antibody sequence via deep learning. Nat Biomed Eng 5: 600–612. 10.1038/s41551-021-00699-9
- Melnyk I, Chenthamarakshan V, Chen PY, Das P, Dhurandhar A, Padhi I, Das D. 2022. Reprogramming large pretrained language models for antibody sequence infilling. arXiv 10.48550/arXiv.2210.07144
- Moult J, Pedersen JT, Judson R, Fidelis K. 1995. A large-scale experiment to assess protein structure prediction methods. Proteins 23: ii–iv. 10.1002/prot.340230303
- Nijkamp E, Ruffolo J, Weinstein EN, Naik N, Madani A. 2023. ProGen2: exploring the boundaries of protein language models. Cell Syst 14: 968–978.e3. 10.1016/j.cels.2023.10.002
- Olsen TH, Moal IH, Deane CM. 2022. Ablang: an antibody language model for completing antibody sequences. Bioinform Adv 2: vbac046. 10.1093/bioadv/vbac046
- Ostrovsky-Berman M, Frankel B, Polak P, Yaari G. 2021. Immune2vec: embedding B/T cell receptor sequences in ℝN using natural language processing. Front Immunol 12: 680687. 10.3389/fimmu.2021.680687
- Park JC, Noh J, Jang S, Kim KH, Choi H, Lee D, Kim J, Chung J, Lee DY, Lee Y, et al. 2022. Association of B cell profile and receptor repertoire with the progression of Alzheimer's disease. Cell Rep 40: 111391. 10.1016/j.celrep.2022.111391
- Prihoda D, Maamary J, Waight A, Juan V, Fayadat-Dilman L, Svozil D, Bitton DA. 2022. Biophi: a platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. MAbs 14: 2020203. 10.1080/19420862.2021.2020203
- Radford A, Wu J, Child R, Luan D, Amodei D, Sutskever I. 2019. Language models are unsupervised multitask learners. Open AI 1: 9. https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
- Raffel C, Shazeer N, Roberts A, Lee K, Narang S, Matena M, Zhou Y, Li W, Liu PJ. 2019. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv 10.48550/arXiv.1910.10683
- Raybould MIJ, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. 2019. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci 116: 4025–4030. 10.1073/pnas.1810576116
- Raybould MIJ, Marks C, Kovaltsuk A, Lewis AP, Shi J, Deane CM. 2021. Public baseline and shared response structures support the theory of antibody repertoire functional commonality. PLoS Comput Biol 17: e1008781. 10.1371/journal.pcbi.1008781
- Rees AR. 2020. Understanding the human antibody repertoire. MAbs 12: 1729683. 10.1080/19420862.2020.1729683
- Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, et al. 2021. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci 118: e2016239118. 10.1073/pnas.2016239118
- Robinson SA, Raybould MIJ, Schneider C, Wong WK, Marks C, Deane CM. 2021. Epitope profiling using computational structural modelling demonstrated on coronavirus-binding antibodies. PLoS Comput Biol 17: e1009675. 10.1371/journal.pcbi.1009675
- Ruffolo JA, Gray JJ, Sulam J. 2021. Deciphering antibody affinity maturation with language models and weakly supervised learning. arXiv 10.48550/arXiv.2112.07782
- Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. 2023. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun 14: 2389. 10.1038/s41467-023-38063-x
- Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. 2015. Imagenet large scale visual recognition challenge. Int J Comput Vision 115: 211–252. 10.1007/s11263-015-0816-y
- Saka K, Kakuzaki T, Metsugi S, Kashiwagi D, Yoshida K, Wada M, Tsunoda H, Teramoto R. 2021. Antibody design using LSTM based deep generative model from phage display library for affinity maturation. Sci Rep 11: 5852. 10.1038/s41598-021-85274-7
- Schardt JS, Jhajj HS, O'Meara RL, Lwo TS, Smith MD, Tessier PM. 2022. Agonist antibody discovery: experimental, computational, and rational engineering approaches. Drug Discov Today 27: 31–48. 10.1016/j.drudis.2021.09.008
- Shehata L, Maurer DP, Wec AZ, Lilov A, Champney E, Sun T, Archambault K, Burnina I, Lynaugh H, Zhi X, et al. 2019. Affinity maturation enhances antibody specificity but compromises conformational stability. Cell Rep 28: 3300–3308.e4. 10.1016/j.celrep.2019.08.056
- Shuai RW, Ruffolo JA, Gray JJ. 2023. IgLM: infilling language modelling for antibody sequence design. Cell Syst 14: 979–989.e4. 10.1016/j.cels.2023.10.001
- Sippl MJ. 1990. Calculation of conformational ensembles from potentials of mean force. J Mol Biol 213: 859–883. 10.1016/S0022-2836(05)80269-4
- Song L, Cohen D, Ouyang Z, Cao Y, Hu X, Liu XS. 2021. TRUST4: immune repertoire reconstruction from bulk and single-cell RNA-seq data. Nat Methods 18: 627–630. 10.1038/s41592-021-01142-2 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Townsend CL, Laffy JMJ, Wu YCB, O'Hare JS, Martin V, Kipling D, Fraternali F, Dunn-Walters DK. 2016. Significant differences in physicochemical properties of human immunoglobulin kappa and lambda CDR3 regions. Front Immunol 7: 388. 10.3389/fimmu.2016.00388 [DOI] [PMC free article] [PubMed] [Google Scholar]
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser L, Polosukhin I. 2017. Attention is all you need. arXiv 10.48550/arXiv.1706.03762 [DOI] [Google Scholar]
- Wang A, Singh A, Michael J, Hill F, Levy O, Bowman SR. 2018. GLUE: a multi-task benchmark and analysis platform for natural language understanding. arXiv 10.48550/arXiv.1804.07461 [DOI] [Google Scholar]
- Wang P, Luo M, Zhou W, Jin X, Xu Z, Yan S, Li Y, Xu C, Cheng R, Huang Y, et al. 2022a. Global characterization of peripheral B cells in Parkinson's disease by single-cell RNA and BCR sequencing. Front Immunol 13: 814239. 10.3389/fimmu.2022.814239
- Wang Y, Gong X, Li S, Yang B, Sun Y, Shi C, Wang Y, Yang C, Li H, Song L. 2022b. xTrimoABFold: de novo antibody structure prediction without MSA. arXiv 10.48550/arXiv.2212.00735
- Wang D, Ye F, Zhou H. 2023. On pre-trained language models for antibody. bioRxiv 10.1101/2023.01.29.525793
- Weber CR, Rubio T, Wang L, Zhang W, Robert PA, Akbar R, Snapkov I, Wu J, Kuijjer ML, Tarazona S, et al. 2022. Reference-based comparison of adaptive immune receptor repertoires. Cell Rep Methods 2: 100269. 10.1016/j.crmeth.2022.100269
- Wollacott AM, Xue C, Qin Q, Hua J, Bohnuud T, Viswanathan K, Kolachalama VB. 2019. Quantifying the nativeness of antibody sequences using long short-term memory networks. Protein Eng Des Sel 32: 347–354. 10.1093/protein/gzz031
- Wong WK, Georges G, Ros F, Kelm S, Lewis AP, Taddese B, Leem J, Deane CM. 2019. SCALOP: sequence-based antibody canonical loop structure annotation. Bioinformatics 35: 1774–1776. 10.1093/bioinformatics/bty877
- Xue L, Constant N, Roberts A, Kale M, Al-Rfou R, Siddhant A, Barua A, Raffel C. 2020. mT5: a massively multilingual pre-trained text-to-text transformer. arXiv 10.48550/arXiv.2010.11934
- Younger D, Berger S, Baker D, Klavins E. 2017. High-throughput characterization of protein–protein interactions by reprogramming yeast mating. Proc Natl Acad Sci 114: 12166–12171. 10.1073/pnas.1705867114
- Yu K, Ravoor A, Malats N, Pineda S, Sirota M. 2022. A pan-cancer analysis of tumor-infiltrating B cell repertoires. Front Immunol 12: 790119. 10.3389/fimmu.2021.790119
- Zainchkovskyy Y, Ferkinghoff-Borg J, Bennett A, Egebjerg T, Lorenzen N, Greisen PJ, Hauberg S, Stahlhut C. 2022. Probabilistic thermal stability prediction through sparsity promoting transformer representation. arXiv 10.48550/arXiv.2211.05698
- Zaslavsky ME, Craig E, Michuda JK, Ram-Mohan N, Lee JY, Nguyen KD, Hoh RA, Pham TD, Parsons ES, Macwana SR, et al. 2022. Disease diagnostics using machine learning of immune receptors. bioRxiv 10.1101/2022.04.26.489314
- Zheng B, Yang Y, Chen L, Wu M, Zhou S. 2022. B-cell receptor repertoire sequencing: deeper digging into the mechanisms and clinical aspects of immune-mediated diseases. iScience 25: 105002. 10.1016/j.isci.2022.105002




