Abstract
The development of stable antibodies formed by compatible heavy (H) and light (L) chain pairs is crucial in both in vivo maturation of antibody-producing cells and ex vivo designs of therapeutic antibodies. We present ImmunoMatch, a machine-learning framework trained on paired H and L sequences from human B cells to identify molecular features underlying chain compatibility. ImmunoMatch distinguishes cognate from random H–L pairs and captures differences associated with κ and λ light chains, reflecting B cell selection mechanisms in the bone marrow. We apply ImmunoMatch to reconstruct paired antibodies from spatial VDJ sequencing data and study the refinement of H–L pairing across B cell maturation stages in health and disease. We find further that ImmunoMatch is sensitive to sequence differences at the H–L interface. These insights provide a computational lens into the broader biological principles governing antibody assembly and stability.
Subject terms: Machine learning, Software, Adaptive immunity, Lymphocytes
ImmunoMatch is a machine-learning framework that learns from heavy and light immunoglobulin chains to predict cognate sequence pairings.
Main
The immune system produces an extraordinarily diverse repertoire of antibodies to combat a wide range of immune challenges from invading pathogens as well as endogenous, aberrantly expressed antigens in contexts such as cancer. The human antibody repertoire encompasses over 10^12 distinct antibodies1–4. Such diversity is achieved mainly by two processes. First, a random recombination of genetic fragments in the immunoglobulin gene locus occurs independently for the two types of antibody chains, the heavy (H) and light (L) chains5–9, which assemble to form an antibody molecule. These recombination processes already generate 2 × 10^6 different combinations of H–L chains10, the sequence diversity of which is further potentiated by the imprecise joining of these gene fragments11–14. Second, mutations accumulated in the antibody variable region substantially increase repertoire diversity15–18. Investigation into the nature, dynamics and regulation of the antibody response in vivo19–21, as well as the engineering of highly specific antibodies22–24, are both active fields of modern biomedical research.
Studies of antibodies have gradually recognized the importance of a wide variety of ‘developability’ factors beyond the binding affinity of the antibody to its antigen25–28. One such issue pertains to thermostability, which is crucial in ensuring that the H and L chains can assemble to constitute a manufacturable, functional antibody therapeutic29–31. To identify stable H–L pairs using high-throughput sequencing methods, researchers generated separate sequencing libraries of H and L chains, manually paired expanded H and L clonotypes, and expressed and tested these antibodies in vitro32,33; this has been, for instance, applied to identify broadly neutralizing antibodies against HIV-1 (refs. 34,35). Recent single-cell methods produce paired H–L sequences from thousands of antibody-producing cells36–39, which can circumvent problems in the traditional approaches, yet these methods are costly and still fall short of the true repertoire diversity40. The study of how stable antibody H–L pairs are generated is also relevant in basic biology: B cells, precursors to the antibody-secreting plasma cells, express B cell receptors (BCRs) comprising mainly the membrane-bound version of antibody molecules41,42. During its development in the bone marrow, a B cell undergoes checks to ensure its BCR can be stably assembled and thus can sustain cellular signals to maintain viability43–46, while autoreactive B cells are removed via cell death47–50. This poses an inherent challenge in characterizing cases where stable, non-autoreactive H–L pairs fail to be formed in vivo. Such knowledge is important to understand the interplay between the formation of a functional BCR and the development of the antibody response, versus the possible autoimmunity arising from defects of this process51,52. Therefore, deciphering the molecular rules governing the pairing of H and L chains will benefit both basic B cell immunology and antibody discovery.
The issue of whether H–L antibody pairing specificity is predictable has been under debate for the past few decades. Seminal antibody structural analyses highlighted the interaction between the H and L chains at the antigen-binding site, consisting mainly of hypervariable regions in each chain known as the complementarity determining regions (CDRs) (Fig. 1a)53,54. Contact between the H and L chains is crucial to maintain antibody stability, as well as the orientation of the antigen-binding site55–58. The H–L interface is formed by contacts in the CDRs as well as the antibody framework region (FWR). Molecular biology experiments have identified FWR mutations that alter H–L interaction geometries and consequently abolish antigen binding55,57. Furthermore, mouse models coexpressing engineered H and L chains often further edit the L chains by introducing mutations which enhance the viability of B cells59–61. Computationally, earlier analyses focused on observations of nonrandom, over-represented H–L chain partners; however, statistical power to identify such associations was limited by the small number of paired H–L sequences and structures available for this type of analysis62–64. The development of single-cell sequencing methods, where single B cells are isolated, followed by extraction and sequencing of their H and L chain transcripts36–38, has provided new insights into this problem. For instance, statistical analyses in DeKosky et al.65 suggested that H–L pairing preference was random; however, others argued this could be due to idiosyncrasies in the experimental protocol38. More recently, a growing body of evidence suggested coherence between H and L chain choices in B cells. A study by Jaffe et al.39 using newer single-cell methods found that H chains in mature, antigen-experienced B cells tended to use more restricted L chain partners than their naive counterparts.
Furthermore, comparisons between artificial intelligence (AI) models trained on paired antibody H–L sequences posited that such models trained on paired data outperformed those trained on unpaired chains, in terms of learning biologically interpretable sequence embeddings and predicting antigen specificity66; moreover, given the H chain sequence, a stable L chain partner can be generated de novo67. Taken together, these findings suggest the existence of a set of molecular rules underlying the specificity of antibody H–L pairs. Tools for exploring these rules will hold the promise to understand and design better, more stable antibody pairs, facilitating the development of antibody therapeutics.
Fig. 1. ImmunoMatch for predicting cognate antibody chain pairing.
a, Illustration of the antibody H–L interface (Protein Data Bank 6zlr), and its diversity potentiated by genetic recombinations in vivo. Inset on the right illustrates the H–L interface, with amino acid side chains (in sticks) to highlight positions in direct contact with the partner chains. CDR, complementarity determining region. b, Schematic of model training using curated positive and pseudo-negative examples from single-cell antibody repertoire datasets. See main text for further details. c, Proportion of the total surface area of the interface formed between the VH and the VL, contributed by individual CDR loops and the FWR, for n = 3,781 human antibody structures70. The box-and-whisker plots depict distribution medians, lower and upper quartiles, and 1.5 × interquartile range. d–f, Accuracy of models trained solely on H and L chain V and J gene usage (d), one-hot-encoded CDRH3 and CDRL3 sequences (e) and full-length VH and VL sequences (f). Each data point represented a separate fold from the tenfold validation strategy employed during training (Methods), tested on the withheld test set (n = 23,388). The box-and-whisker plots depict distribution medians, lower and upper quartiles, and 1.5 × interquartile range. The final ImmunoMatch model is indicated in f. g, ROC of the final ImmunoMatch model, calculated on the withheld test set (AUC = 0.75) and an external test set constituted by n = 3 donors unseen during training (AUC = 0.66).
Here, we present ImmunoMatch, a suite of fine-tuned AI models for the classification of cognate antibody chain pairs. Based on an antibody-specific language model (AntiBERTa2; ref. 68), ImmunoMatch was fine-tuned on full-length H and L chain variable domain sequences extracted from paired antibody repertoire data from healthy donors. This model outperformed baseline models using either CDR sequences or immunoglobulin gene usage as inputs. We found that further optimization to generate ImmunoMatch variants specific to antibody L chain types improved classification performance. We applied ImmunoMatch to study B cell development through the lens of optimizing H–L pairing, and identified chain pairing refinement as a hallmark of B cell maturation in both health and disease. We also validated ImmunoMatch in its ability to recover partner chains in therapeutic antibodies, and highlighted its ability to pinpoint important sequence patterns driving these predictions. Our results underscore the complexity in H–L chain pairing, and highlight the importance of chain pairing in understanding B cell development and engineering stable, functional antibodies.
Results
Machine-learning models for identifying cognate antibody chain pairing
We posited that machine-learning methods could allow us to test two competing hypotheses, namely that antibody H and L chain pairing preference can be predicted from sequence information, or that such pairing is random. Framing this as a binary classification task to distinguish cognate, observed H–L pairs from randomly generated pairs, we curated paired H–L sequences, sampled from single-cell antibody repertoire datasets where their cell origin was barcoded as short nucleotide strings40 (Fig. 1b). The coexistence of an H and an L chain with the same cell barcode was taken as evidence for paired chains, constituting our positive training examples. Owing to the removal of nonviable H–L pairs by natural selection69, it was not possible to obtain negative training examples. We instead used a random shuffling strategy to generate ‘pseudo-negative’ examples, exchanging the light-chain partners between the observed, positive pairs. This procedure also guaranteed a balanced dataset with equal numbers of positive and pseudo-negative examples. Using three separate datasets38,39,65 covering six donors, in total we curated 233,880 H–L pairs for training and testing, after balancing sample sizes over each donor to avoid bias due to the immunological background of individual donors.
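The shuffling strategy described above can be illustrated with a minimal sketch. The function name, the retry limit and the rule of rejecting any shuffle that recreates an observed pair are assumptions made for illustration; the exact procedure used for ImmunoMatch is given in the Methods and may differ.

```python
import random


def make_pseudo_negatives(pairs, seed=0, max_tries=100):
    """Exchange light-chain partners between observed (heavy, light) pairs.

    Returns one pseudo-negative per positive pair, so that positives and
    pseudo-negatives together form a balanced dataset.
    Assumption: a shuffle is rejected if it recreates any observed pair.
    """
    rng = random.Random(seed)
    observed = set(pairs)
    heavies = [h for h, _ in pairs]
    lights = [l for _, l in pairs]
    for _ in range(max_tries):
        shuffled = lights[:]
        rng.shuffle(shuffled)
        candidate = list(zip(heavies, shuffled))
        # Accept only if no heavy chain kept a light chain it was observed with.
        if not any(pair in observed for pair in candidate):
            return candidate
    raise RuntimeError("no valid shuffle found; light chains may be too redundant")


# Toy identifiers stand in for real VH and VL sequences.
positives = [("H1", "L1"), ("H2", "L2"), ("H3", "L3"), ("H4", "L4"), ("H5", "L5")]
pseudo_negatives = make_pseudo_negatives(positives, seed=1)
labelled = [(h, l, 1) for h, l in positives] + [(h, l, 0) for h, l in pseudo_negatives]
```

Because every heavy chain appears once with its observed partner and once with a shuffled partner, a classifier cannot separate the two classes from the heavy or light chain alone and has to use features of the combination.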
We tested the contribution of different input features in combination with multiple machine-learning strategies. Analyzing antibody structural data70,71, we observed that both the antibody FWR and the CDR3 contributed substantially to the interface between the variable heavy (VH) and variable light (VL) domains (Fig. 1c). Indeed, logistic regression and XGBoost models built solely on V and J gene usage achieved accuracies of 0.50 and 0.52, respectively, indicating limited predictive capability for heavy–light pairing preferences (Fig. 1d). To improve predictive performance, we next explored using CDR3 sequences as predictive features, in view of their substantial contribution to the VH–VL interface (Fig. 1c) and their high sequence diversity. We used a one-hot encoding approach and trained a convolutional neural network (CNN) with the CDR3 fragments of the H and L chains, leveraging the ability of CNNs to capture local patterns within the data. Although the CNN model demonstrated moderate performance, attempts to further improve the model by changing the optimizer, incorporating additional convolutional layers or adopting the ResNet72 architecture yielded minimal improvement (Fig. 1e). Taken together, these results highlighted the inherent limitations of only considering CDR3 or gene usage, potentially due to the lack of information on specific framework residues that participate in the VH–VL interface (Fig. 1c).
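For readers unfamiliar with one-hot encoding of sequence fragments, the following sketch shows how a CDRH3–CDRL3 pair could be turned into a fixed-size binary matrix suitable as CNN input. The padded lengths (30 and 20 residues), the stacking of the two fragments and the function names are illustrative assumptions, not the settings used for the baseline in Fig. 1e.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot(seq, max_len):
    """Encode a sequence as a max_len x 20 binary matrix.

    Positions beyond the sequence length, and any non-standard residue,
    are left as all-zero rows (zero padding).
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, aa in enumerate(seq[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


def encode_pair(cdrh3, cdrl3, max_h=30, max_l=20):
    """Stack the CDRH3 and CDRL3 encodings into one (max_h + max_l) x 20 matrix."""
    return one_hot(cdrh3, max_h) + one_hot(cdrl3, max_l)


# Toy fragments, not taken from the training data.
features = encode_pair("ARDY", "QQSY")
```

Such an encoding carries no information about the framework residues outside the CDR3, which is consistent with the limitation noted above.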
We therefore explored strategies to incorporate full-length VH and VL sequences in prediction, by capitalizing on recent advances in protein language models73, as well as those specifically trained using antibody sequences68. The transformer architecture, as employed in language models, excels at capturing long-range amino acid interactions74,75. Here, we compared an antibody-specific language model (AntiBERTa2; ref. 68) against a generic protein language model (ESM-2 (ref. 73), 150M parameters) (Fig. 1f). We observed that fine-tuning ESM-2 increased its classification performance substantially, to a level comparable to that of AntiBERTa2 before fine-tuning. The superior performance of AntiBERTa2 suggested that antibody-specific characteristics learned during pretraining were informative for our task (Fig. 1f). Following these investigations of different machine-learning architectures (Fig. 1d–f), we selected the fine-tuned model based on AntiBERTa2 as the final model to classify antibody cognate H–L chain pairing. This model, ImmunoMatch, achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.75 on the withheld test set (Fig. 1g). To further validate the generalizability of ImmunoMatch, we curated data from n = 3 donors unseen during both pretraining and fine-tuning from Jaffe et al.39. On this external evaluation dataset, ImmunoMatch achieved an AUC-ROC of 0.66 (Fig. 1g). Details on other performance metrics of ImmunoMatch can be found in Table 1. We further confirmed that model performance was not impacted by immunoglobulin V and J gene usage (Extended Data Fig. 1a), suggesting that ImmunoMatch has learnt features beyond gene usage in H–L pairing prediction. We also validated that the combination of donors used here for training is informative for the model to learn pairing preferences (Supplementary Note 1, Supplementary Figs. 1–3 and Supplementary Table 2).
Altogether, this suggests that ImmunoMatch can distinguish cognate antibody VH–VL pairing from randomly paired chains.
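The AUC-ROC values reported above can be read as the probability that a randomly chosen cognate pair receives a higher pairing score than a randomly chosen pseudo-negative pair. The sketch below computes the metric directly from this definition; the function name and the toy scores are illustrative and are not ImmunoMatch outputs.

```python
def roc_auc(scores, labels):
    """AUC-ROC as the fraction of (positive, negative) score pairs ranked correctly.

    Ties between a positive and a negative score count as half a correct ranking.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required to compute AUC-ROC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: three of the four positive/negative comparisons are ranked correctly.
toy_auc = roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0])  # 0.75
```

A value of 0.5 corresponds to random ranking, which is the expectation if H–L pairing carried no sequence signal.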
Table 1.
Performance metrics of ImmunoMatch and its variants
| | ImmunoMatch | ImmunoMatch-κ | ImmunoMatch-λ |
|---|---|---|---|
| F1 score | 0.677 | 0.839 | 0.796 |
| Accuracy | 0.666 | 0.817 | 0.764 |
| MCC | 0.333 | 0.660 | 0.556 |
| Precision | 0.655 | 0.747 | 0.700 |
| Recall | 0.701 | 0.958 | 0.922 |
| AUC-ROC | 0.753 | 0.885 | 0.831 |
MCC, Matthews correlation coefficient.
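The metrics in Table 1 are related to one another. As a consistency check, the sketch below recovers accuracy, F1 score and MCC from the tabulated precision and recall alone, under the assumption that each test set contains equal numbers of positive and pseudo-negative pairs, as the shuffling strategy implies. The function name is illustrative, and small deviations from the table are expected because the inputs are rounded to three decimals.

```python
import math


def metrics_from_precision_recall(precision, recall):
    """Derive accuracy, F1 and MCC from precision and recall.

    Assumption: positives and negatives each make up half of the test set,
    so confusion-matrix entries are expressed as fractions of the whole set.
    """
    tp = 0.5 * recall
    fn = 0.5 - tp
    fp = tp * (1.0 - precision) / precision  # from precision = tp / (tp + fp)
    tn = 0.5 - fp
    accuracy = tp + tn
    f1 = 2.0 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "f1": f1, "mcc": mcc}


# Precision and recall as listed in Table 1.
table_1 = {
    "ImmunoMatch": metrics_from_precision_recall(0.655, 0.701),
    "ImmunoMatch-kappa": metrics_from_precision_recall(0.747, 0.958),
    "ImmunoMatch-lambda": metrics_from_precision_recall(0.700, 0.922),
}
```

The derived values agree with the tabulated accuracy, F1 score and MCC to within rounding, which also illustrates that the high recall of the light-chain-specific models is accompanied by a sizeable false-positive rate.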
Extended Data Fig. 1. Prediction accuracy on withheld sequences, grouped by the frequency of V and J gene usage in the H and L chains.
For each chain we grouped sequences by their gene usage into five bins. Results are shown for (a) ImmunoMatch (n = 23,388), (b) ImmunoMatch-κ (n = 21,598) and (c) ImmunoMatch-λ (n = 21,598) models separately.
ImmunoMatch performance could be further optimized via a light-chain-specific training strategy
We investigated whether the performance of ImmunoMatch can be further improved. Human antibodies use either one of the two types of light chains, κ and λ, which are encoded by distinct DNA fragments located on separate chromosomes in the human genome5. The VL domains encoded by κ and λ genes are substantially different, as evidenced by pairwise sequence comparison of n = 3,832 antibody structures70 (Fig. 2a): on average κ VL domains share 47.8% sequence identity with λ VL domains, lower than the average identity of 69.1% within κ light chains, and 61.8% within λ. We therefore hypothesized that training separate models on VH–Vκ and VH–Vλ sequence pairs could learn pairing patterns specific to either type of light chains. Two specialized models, ImmunoMatch-κ and ImmunoMatch-λ (Fig. 2b), were fine-tuned using the same workflow as the original ImmunoMatch, with the sole exception that the models were only exposed to a light chain of one specific type during training.
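The within-type and between-type identity comparison can be sketched as follows. This is a minimal illustration that assumes the VL sequences have already been aligned to a common numbering scheme (equal length, gaps written as "-") and that identity is computed over columns where at least one sequence has a residue; the alignment procedure and identity definition actually used for Fig. 2a may differ, and the toy sequences are placeholders.

```python
from itertools import combinations, product


def identity(a, b):
    """Fraction of identical residues between two pre-aligned sequences.

    Columns where both sequences have a gap are ignored; a residue facing
    a gap counts as a mismatch.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def mean_within(seqs):
    """Average identity over all unordered pairs within one group."""
    values = [identity(a, b) for a, b in combinations(seqs, 2)]
    return sum(values) / len(values)


def mean_between(group_a, group_b):
    """Average identity over all pairs drawn from two different groups."""
    values = [identity(a, b) for a, b in product(group_a, group_b)]
    return sum(values) / len(values)


# Placeholder aligned fragments, not real kappa or lambda VL sequences.
kappa_like = ["AAAA", "AAAC", "AACC"]
lambda_like = ["CCCC", "CCCA"]
within_kappa = mean_within(kappa_like)
between_types = mean_between(kappa_like, lambda_like)
```

A between-type average that is clearly lower than both within-type averages, as reported above for κ and λ, is the pattern that motivates training separate models.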
Fig. 2. An L chain-specific training strategy of ImmunoMatch was consistent with the in vivo mechanism of B cell development.
a, Sequence identity comparison between VL sequences taken from human antibody structures (n = 3,832) utilizing the κ and λ light chains. The averaged sequence identity within κ sequences, within λ sequences, and between κ and λ, are given on the plot. b, Strategy to extract H–κ and H–λ pairs from publicly available datasets to train separate ImmunoMatch-κ and ImmunoMatch-λ models. c, Pairing scores of H–κ pairs calculated by ImmunoMatch and ImmunoMatch-κ. d, Pairing scores of H–λ pairs calculated by ImmunoMatch and ImmunoMatch-λ. e, Accuracy of ImmunoMatch-κ and ImmunoMatch-λ on withheld test sets comprised solely of H–κ and H–λ paired sequences. Each data point corresponded to a separate training fold in the cross-validation framework, tested on a withheld test set (n = 21,598). The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. f, Schematic to illustrate the formation of the BCR in vivo.
We investigated the change in pairing scores between the original ImmunoMatch model and the light-chain-specific models, for paired VH–Vκ and VH–Vλ sequences accordingly (Fig. 2c,d). We noticed that there were some antibodies about which the original ImmunoMatch model was ambivalent (pairing scores peaked at around 0.5) for both H–κ and H–λ pairs. These peaks shifted toward pairing scores of 1 when ImmunoMatch-κ (Fig. 2c) and ImmunoMatch-λ (Fig. 2d) were used for prediction. Therefore, ImmunoMatch-κ and ImmunoMatch-λ are more specialized in capturing signals embedded in the sequences that are informative of H–L pairing. The performance of these variants of ImmunoMatch was further evaluated using separate datasets of antibodies with κ and λ light chains, which were withheld from fine-tuning. The distributions of pairing scores for their withheld test sets are shown in Extended Data Fig. 2, and a detailed summary of performance metrics of these models can be found in Table 1. ImmunoMatch-κ achieved high accuracy (0.817) on κ datasets, while ImmunoMatch-λ performed comparably well on λ datasets, with an accuracy of 0.764 (Fig. 2e and Extended Data Fig. 1b,c). Both represent substantial improvements over the original ImmunoMatch (Table 1).
Extended Data Fig. 2. Distribution of pairing scores from withheld test sequences for the ImmunoMatch models.

Data are shown for the ImmunoMatch (top, n = 23,388), ImmunoMatch-κ (bottom left, n = 21,598) and ImmunoMatch-λ (bottom right, n = 21,598) models. The positive (orange) and pseudo-negative (blue) pairs are visualized with separate density curves.
We also examined the generalizability of ImmunoMatch-κ and ImmunoMatch-λ, by testing them on H–L pairs of light-chain types different from their respective training sets. When ImmunoMatch-κ was tested on λ datasets, we observed that this model could still achieve an accuracy above 0.5, albeit with performance decreased in comparison to the κ test set (Fig. 2e). Of note, the performance of ImmunoMatch-λ on κ datasets remained largely unaffected by the differing distributions of the light-chain types between training and testing data (Fig. 2e). This suggests that ImmunoMatch-λ is more generalizable in learning pairing rules for antibodies with κ and λ light chains. We further investigated the reason behind this by analyzing their confusion matrices, and found that ImmunoMatch-λ made fewer false-negative predictions on H–κ sequences, compared to applying ImmunoMatch-κ on H–λ pairs (Extended Data Fig. 3). This generalizability, caused by fewer false-negative predictions, may be linked to the process of B cell development in vivo (Fig. 2f). Initially, the heavy chains undergo gene rearrangement, followed by the formation of H–κ pairs. They are then subject to central tolerance69, which either removes B cells expressing unstable and autoreactive pairs of heavy and light chains by signaling them to undergo cell death, or instructs them to rearrange the λ gene locus to generate an H–λ pair9,76–78. These H–λ pairs are also subject to positive selection of B cells that express a stable H–L chain pair46,79,80 and negative selection of those which react to self-antigens (Fig. 2f)76. If the H chain is able to pair with κ, the B cell will proceed in maturation, even though this H chain can theoretically form a pair with λ as well. ImmunoMatch-κ thus has difficulty in distinguishing between true-negative and false-negative H–λ pairs; by contrast, an observed H–λ pair implies that the H chain would have failed to pair with κ.
Therefore, ImmunoMatch-λ is still able to capture the signals embedded in the sequence of negative H–κ examples, leading to a low number of false-negative cases. In comparison to H–κ, H–λ pairs represent a more homogeneous dataset encompassing H–L pairing features, leading to different performance of the two models (Discussion).
Fig. 3. ImmunoMatch facilitates pairing of heavy and light immunoglobulin chains in spatial transcriptomics data.
a, Distribution of ImmunoMatch pairing scores in single-cell, H–L paired sequences (n = 1,326) from n = 2 breast tumors analyzed in Engblom et al.81. b, Comparison between ImmunoMatch pairing score versus the ‘repair’ score from Engblom et al., for n = 112 H–L pairs reconstructed from spatial VDJ sequencing generated from 10x Visium profiling of breast tumor sections analyzed in Engblom et al. Each data point corresponds to one H–L pair, and data points are grouped by whether the same CDRH3 and CDRL3 have been observed together with the same cell barcode in the single-cell library generated on the same sample. The decision boundaries of ImmunoMatch and the ‘repair’ method from Engblom et al. are indicated by dashed lines. c, Examples of H–L pairs analyzed by the ‘repair’ method of Engblom et al. and ImmunoMatch. The expression of VH and VL sequences in each H–L pair (row) across the analyzed tissue section is shown and overlaid on top of one another. Each dot corresponds to one spot in the 10x Visium slide. The tissue section hematoxylin and eosin (H&E)-stained image is shown for reference.
ImmunoMatch facilitates pairing of heavy and light immunoglobulin chains in spatial transcriptomics data
The analysis above demonstrated that using ImmunoMatch-κ and ImmunoMatch-λ on H–κ and H–λ pairs, respectively, is more accurate in H–L pairing prediction than using the original ImmunoMatch model (Table 1). Using this approach, ImmunoMatch can score and predict whether H and L chains given by the user form a cognate pair. This makes ImmunoMatch useful for comparing different single-cell BCR library preparation methods in their fidelity to generate well-resolved paired sequences (Supplementary Note 2, Supplementary Table 3 and Supplementary Fig. 4); however, there remain data types that do not yet have single-cell resolution, and therefore necessitate H–L pairing annotation. A good example of this is spatial transcriptomics, where spatial VDJ methods have been described81 (amplifying VDJ sequences from 10x Genomics Visium slides for long-read sequencing). Spatial methods are increasingly adopted in cancer and tissue immunology studies, but here the Visium protocol precludes direct identification of paired H–L chains, which will be necessary for downstream analysis such as antibody production to investigate their specificity.
In the original spatial VDJ manuscript, the authors proposed to predict H–L pairs by examining, in the spatial data, the colocalization of heavy and light-chain clones within the same tissue section81. Here we believe that ImmunoMatch can complement this method, going beyond comparing the transcript counts of the clones, to consider the complementarity of the full-length VH and VL sequences. We analyzed two breast tumor samples presented by Engblom et al.81, which have been profiled using a multiregion Visium-based protocol and single-cell BCR sequencing. Applying ImmunoMatch-κ and ImmunoMatch-λ on their respective light-chain types, we first validated the performance of our models to correctly identify paired H–L chains in the single-cell libraries from these tumor samples (Fig. 3a). We then applied ImmunoMatch on the spatial VDJ data, and observed that ImmunoMatch pairing score successfully identified H–L pairs from the spatial data which overlapped with the single-cell data (Fig. 3b), especially when used in conjunction with the colocalization-based method (‘repair’) presented by Engblom et al.81. We further examined the H–L pair predictions on the tumor slides, and observed that ImmunoMatch can complement the Engblom et al. ‘repair’ method to predict H–L pairs from intratumoral B cells (Fig. 3c and Extended Data Fig. 4). Since ImmunoMatch directly considers the full-length VH and VL sequences, it addresses the pairing problem from an orthogonal perspective compared to identifying colocalized H and L chain transcripts. We believe ImmunoMatch therefore could be potentially used to facilitate the identification of cognate H–L pairs for antibody discovery applications in tissue immunology.
Extended Data Fig. 4. Additional examples of H–L pairs analyzed by Engblom et al.’s ‘repair’ method and ImmunoMatch.
The expression of VH and VL sequences in each H–L pair (row) across the analyzed tissue sections is shown and overlaid on top of one another. Each dot corresponds to one spot in the 10x Visium slide. The tissue section hematoxylin and eosin (H&E)-stained images are shown for reference.
Refinement of immunoglobulin chain pairing is a hallmark of B cell maturation
We next asked whether chain pairing likelihood would vary across stages of B cell development. The classical theory of B cell maturation posits that upon activation, naive B cells enter the germinal center (GC) to edit and optimize their BCRs to specifically bind their cognate antigens, with the successful binders subsequently exiting the GC and differentiating into memory B cells82–84. We collected paired H–L sequences from naive, GC and memory B cells from published studies39,85, and scored the H–κ and H–λ sequences with ImmunoMatch-κ and ImmunoMatch-λ, respectively. Comparing the pairing scores from these ImmunoMatch models between the B cell subsets, we observed that memory B cells have substantially higher pairing scores than their naive counterparts, with the distribution of GC B cells positioned between the two (Fig. 4a). We further defined clonally related naive and memory B cells based on CDRH3 sequence similarity, and identified examples of clonal expansions where pairing score increased as the clonotype diversified from the germline origin (Fig. 4b). We propose this continuum of chain pairing likelihood to be a feature of B cell maturation: as BCRs undergo class-switch recombination and somatic hypermutation, H–L chain pairing is refined together with these processes, both integral in B cell maturation17. To test this hypothesis, we utilized a single-cell RNA sequencing (scRNA-seq) dataset of B cells sampled from the tonsil85, and compared the H–L pairing score inferred using ImmunoMatch-κ and ImmunoMatch-λ for sequences of different heavy chain isotypes and mutational levels. Of note, pairing score increases as the H chain switches away from IgM and IgD isotypes to IgG and IgA (Fig. 4c). Moreover, pairing scores display an inverse relationship with H chain sequence identity to germline (Fig. 4c), independent of B cell subtypes (Extended Data Fig. 5).
ImmunoMatch pairing scores therefore embed information about B cell maturation, and highlight the increase in H–L pairing specificity as a feature of BCRs undergoing maturation.
Fig. 4. ImmunoMatch revealed a continuum of BCR chain pairing likelihood across B cell development stages.
The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. a, Pairing scores of BCRs from naive B cells (n = 3 donors), GC B cells (n = 6 donors) and memory B cells (n = 3 donors). Each data point represents the average score per donor, calculated separately for H–κ (panel ‘IGK’, scored using the ImmunoMatch-κ model) and H–λ (‘IGL’, scored using ImmunoMatch-λ) pairs. b, Example clonotype tree from Jaffe et al.39 data with ImmunoMatch pairing scores mapped to individual observations as colored dots in the tree leaves. The germline configurations of VH and VL sequences are illustrated. c, ImmunoMatch pairing scores for paired VH and VL sequences from tonsil B cells (n = 10,264) in the King et al. dataset85, grouped by their germline identity as a proxy of somatic hypermutation status (top), or by the heavy chain isotype to illustrate class-switch recombination status (bottom). As a control, analogous annotation with ImmunoMatch was performed for each cell, retaining the same observed VH sequence, but paired with a randomly reshuffled L chain partner. d, Boxplots (right) depicting pairing scores of BCRs from leukemia and lymphoma samples (n = 123) curated from the literature (Methods). Data were organized by cancer subtypes and ordered by their corresponding B cell development stages when oncogenesis was thought to occur, according to published reviews86,87 (schematic on the left). ALL, acute lymphoblastic leukemia; CLL, chronic lymphocytic leukemia; DLBCL, diffuse large B cell lymphoma; GCB, germinal center B cell; ABC, activated B cell.
Extended Data Fig. 5. Relationship between somatic hypermutation level and ImmunoMatch pairing score in the King et al. tonsil dataset.
In each panel, somatic hypermutation level (horizontal axis, represented as sequence identity to germline in the heavy chain) is plotted against ImmunoMatch pairing score (vertical axis). Data from the King et al. tonsil single-cell BCR repertoire dataset (n = 10,264 cells) are shown. This relationship was statistically tested (P < 2 × 10^−16), using a mixed-effect model to control for cell type (as a random effect). Each data point corresponds to a single B cell (that is, one paired H–L chain). Data for different cell types are visualized in separate panels. To simplify the visualization, we grouped together multiple cell type labels corresponding to germinal center (GC) B cells (original labels in King et al.: ‘LZ GC’, ‘GC’, ‘DZ GC’ and ‘FCRL2/3high GC’), (pre-)plasmablasts (original labels ‘prePB’ and ‘Plasmablast’) and memory B cells (original labels ‘MBC’ and ‘MBC FCRL4+’).
We further investigated whether a similar trend in the pairing scores can be found in BCRs isolated from diseases arising from B cell development. We collected n = 123 paired sequences from leukemia and lymphoma samples collated from the literature and publicly available databases of cancer cell lines, and mapped these samples to the different B cell developmental stages from which these cancers were thought to initiate86,87. Applying ImmunoMatch-κ and ImmunoMatch-λ, we observed a continuum of H–L pairing scores for these sequences (Fig. 4d). Specifically, leukemia originating from pre-B cells in the bone marrow displayed a notably low pairing likelihood, reflecting their immature origin88,89. In contrast, in agreement with the need of a functional BCR for B cell activation and antigen interactions in these cancers90,91, lymphoma samples typically displayed high pairing scores. These analyses suggest that ImmunoMatch models can be used to annotate immunoglobulin chain pairing, and that the refinement of chain pairing preference is a hallmark of B cell maturation in both health and disease.
ImmunoMatch is sensitive to sequence differences in therapeutic antibodies
We finally investigated whether our ImmunoMatch models can be applied in an antibody discovery context. Specifically, we simulated an antibody triaging application, where ImmunoMatch-κ and ImmunoMatch-λ were used to score a random library of germline recombinations of H chain V and J gene segments against the cognate L chain partner (Fig. 5a). We performed this experiment on n = 625 therapeutic antibodies, for which we generated random VH domain sequences, while preserving the observed CDRH3 fragment, for scoring against their cognate VL domains. We first verified that the ImmunoMatch pairing score was an effective discriminant (P < 2 × 10^−16, Wilcoxon rank-sum test) of the observed H–L pairs in the therapeutic antibodies versus the random pairs (Fig. 5b). We hypothesized that the best random H–L pair from the ImmunoMatch models should resemble the wild-type sequence. Fig. 5c compares the best random VH match against the observed VH in terms of their sequence identities and ImmunoMatch pairing score difference. In line with our hypothesis, the higher the sequence identity, the more likely ImmunoMatch models were to output a similar pairing score. However, we noted cases where the pairing score difference was substantial, despite the sequences sharing ≥80% identity. We identified such cases and mapped the amino acid positions where the randomly generated VH differed from the observed sequence (Fig. 5d). In these cases, the sequence differences typically reside in the CDRH1 and CDRH2 regions, and also at positions in the framework facing the VL domain (Fig. 5d). Given the importance of these positions in the VH–VL interface, this highlights the sensitivity of the ImmunoMatch models to structurally important positions to infer VH–VL chain pairing.
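The triaging experiment described above follows a simple pattern: enumerate V–J recombinations carrying the observed CDRH3, score each candidate against the cognate VL, and compare the top-ranked candidate with the observed VH. The sketch below illustrates this logic only. The function names, the plain string concatenation used for grafting, the placeholder segments and the `toy_score` function are all hypothetical; in practice the scoring callable would wrap ImmunoMatch-κ or ImmunoMatch-λ, whose programming interface is not described here.

```python
from itertools import product


def graft_library(v_segments, j_segments, cdrh3):
    """Build one candidate VH per V-J combination, keeping the observed CDRH3.

    Assumption: grafting is modelled as V + CDRH3 + J concatenation.
    """
    return [v + cdrh3 + j for v, j in product(v_segments, j_segments)]


def triage(observed_vh, vl, v_segments, j_segments, cdrh3, score_pair):
    """Return the best-scoring random VH and its absolute score difference to the observed VH.

    score_pair is any callable (vh, vl) -> float, standing in for a pairing model.
    """
    library = graft_library(v_segments, j_segments, cdrh3)
    best_score, best_vh = max((score_pair(vh, vl), vh) for vh in library)
    observed_score = score_pair(observed_vh, vl)
    return best_vh, abs(observed_score - best_score)


def toy_score(vh, vl):
    """Stand-in scorer with no biological meaning: fraction of matching positions."""
    return sum(a == b for a, b in zip(vh, vl)) / len(vl)


# Placeholder strings, not real germline segments or antibody sequences.
best_vh, score_difference = triage(
    observed_vh="CCCRDGG",
    vl="CCCRDTT",
    v_segments=["AAA", "CCC"],
    j_segments=["GG", "TT"],
    cdrh3="RD",
    score_pair=toy_score,
)
```

Binning the resulting score differences by sequence identity between the best random VH and the observed VH gives the kind of summary shown in Fig. 5c.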
Fig. 5. ImmunoMatch is sensitive to VH–VL interface positions in therapeutic antibodies.
a, Outline of experiments performed on paired H–L chains from n = 625 therapeutic antibodies. For each antibody, we grafted the observed CDRH3 sequence onto random recombinations of H chain germline V and J fragments. Together with the observed L chain, these random H chains were subjected to scoring by ImmunoMatch-κ or ImmunoMatch-λ (depending on L chain type). The random VH sequence from the best H–L match according to the model was compared against the observed VH sequence of the antibody. b, Comparison of ImmunoMatch pairing scores between therapeutic antibodies (n = 625) and the pairs formed between the VL from these antibodies and randomly generated VHs (‘random pairs’). The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. c, Absolute difference of the pairing scores between the observed (wild-type; WT) VH for therapeutic antibodies considered in this experiment (n = 625), and the randomly generated VH ranked best in H–L pairing using ImmunoMatch-κ and ImmunoMatch-λ. A value of 0 means the pairing scores of the random VH and the WT are identical. The antibodies were binned (horizontal axis) by the sequence identity between the random VH and the WT. The box-and-whisker plots depict distribution medians, lower and upper quartiles and 1.5 × interquartile range. d, Examples of therapeutic antibodies where a large pairing score difference was observed between the random H–L pair and the true pair, despite high sequence identity between the VH sequences. Amino acids highlighted in sticks depict mismatched positions between the random VH and the observed VH used in the therapeutic antibodies.
Discussion
In this work, we identified an under-explored issue in antibody developability, namely antibody H–L chain pairing, and trained predictive models which specifically address this problem. We found that this was a tractable problem provided the right model architecture was sought, and the resulting modeling framework, ImmunoMatch, demonstrated notable improvement over the baseline and revealed insights into B cell development. ImmunoMatch has several direct, practical uses: it can be applied to facilitate the pairing of H and L chains in spatial transcriptomics data. Additionally, it can be used to assess the fidelity of newly generated single-cell antibody repertoire datasets. Given the intensive development of new methods for capturing high-quality paired immunoglobulin sequences from single cells at an increasing throughput40,81,92–94, a reliable method to assess the validity of the resulting datasets is crucial. Although the ImmunoMatch models were trained on healthy repertoires, we have successfully applied the method to sequences sampled from a variety of diseases, including both hematological cancers and B cells within solid tumors. The increased application of single-cell technologies to study these conditions will provide more data to further support the applicability of ImmunoMatch in these contexts. The ability of ImmunoMatch to identify cognate H–L pairs opens up possibilities to investigate how different positions on either chain contribute to pairing specificity and antibody stability, which could facilitate the design of new therapeutic antibodies. More broadly, the ImmunoMatch models can be part of a comprehensive assessment of antibody developability. A growing number of computational predictors have been developed to predict antibody solubility, immunogenicity and other characteristics based on their sequences25,26,31,95,96; here, ImmunoMatch can complement the array of existing predictors to streamline computational antibody design.
One could couple ImmunoMatch predictions with a sequence sampling method to generate new H–L pairs with optimized pairing properties. Going forward, a unified AI model with a training objective combining these antibody developability measures will be valuable in harmonizing the optimization process of candidate antibody designs. Methods that combine predictions from pools of expert models already exist and have been applied in the evolution of proteins97. As antibody development is essentially an engineering process aiming to reach the Pareto front where all desirable characteristics of the antibody are optimized, a combined training objective will facilitate reaching this front by minimizing costly iterative processes looking to balance between different developability issues98,99. This is also interesting from a basic B cell biology perspective, as the development of a viable antibody repertoire in vivo also follows a similar paradigm, balancing between antigen specificity, cross-reactivity, self tolerance and stably assembled BCRs capable of sustaining B cell viability69. AI approaches which predict these different aspects of BCR biology can be potentially combined to simulate B cell maturation, opening up new avenues in investigating the fundamental pathological mechanisms in diseases where B cell maturation is disrupted.
An important distinction of ImmunoMatch is that its design principles are grounded in basic B cell biology. We reasoned that separate κ and λ models should produce superior performance compared to the original ImmunoMatch model, as supported by the biological process that generates antibody L chains in the bone marrow. It is surprising that the ImmunoMatch-λ model could accurately predict both H–κ and H–λ pairing; we identified that the reason for this was the reduced number of false negative predictions in the ImmunoMatch-λ model when tested on H–κ pairs. This mirrors the in vivo mechanism of B cell development, where the κ locus is first rearranged to produce the light chain, followed by λ if the H–κ pair was eliminated during central tolerance. This secondary, ‘rescue’ role of λ light chains during light-chain development100,101 implies that many H–λ pairs are results of failed κ rearrangements, potentially enabling ImmunoMatch-λ, but not ImmunoMatch-κ, to learn generalizable rules of chain pairing. Compared to the observed H–λ pairs, H–κ pairs represent a more heterogeneous data source, comprising both H chains which could only specifically pair with κ light chains, and those which were indifferent to the type of light-chain partner, but were coupled with a stable κ sequence at the first attempt and did not have to undergo λ rearrangement. On the other hand, H–λ is less heterogeneous, potentially allowing fine-tuning to learn generalizable rules of H–L chain pairing. Additionally, we designed our randomization experiments on therapeutic antibodies to simulate the assembly of the B cell receptor in vivo, screening for successful H–L pairs from libraries of randomly paired antibody chains, and observed a full spectrum of pairing likelihoods given the specifically chosen H and L chains. A number of computational methods aim to simulate antibody repertoires for benchmarking applications, or for learning the statistics of germline gene usage from repertoire data102–104.
Our results suggest that simulation without constraining for paired H–L gene usage could substantially oversample dysfunctional sequences incapable of contributing to stable H–L pairs, which would call into question the reliability of these approaches in generating useful benchmarking single-cell, paired-chain datasets resembling real-life antibody repertoires. In general, most approaches to interrogate the functional antibody repertoire have been characterizing these molecules at the nucleic acid level, which is known to poorly correlate with protein-level measurements105,106. Going forward, mass spectrometry-based antibody repertoire sequencing would allow for direct assessment of observed antibody H–L protein chains107–110 to sample well-expressed, stable H–L protein complexes, which is more relevant to our problem.
Predicting H–L chain pairing is highly complex, most notably evidenced by the performance ceiling of the original ImmunoMatch model trained on a mix of H–κ and H–λ sequences. We have explored possible causes of this limitation and found that this heterogeneity of light-chain type is part of the reason, as evidenced by the improvement in performance for the two ImmunoMatch variants specific to either light-chain type. However, there are several additional reasons which impose an implicit ceiling to the performance of ImmunoMatch: first, due to the limitation in sampling negative examples for training, we devised a strategy to generate pseudo-negative H–L pairs. These examples might actually pair in vivo, but the incomplete sampling of the repertoire prevented us from observing these pairs. This is reflected in the false positive instances observed after prediction, and our results demonstrate that ImmunoMatch (and the training data used to train the models) have mitigated this issue. Another possible reason is an in-built promiscuity in the number of possible partners of H and L chains. In our positive training examples we observed cases where one H chain could have as many as 1,000 distinct L chain partners. Similarly, earlier experimental investigations expressing different combinations of H and L chains often find promiscuous H chains which can be expressed with high yield together with a number of distinct L chains34,35,111. While we know that antibody germline gene usage is highly biased toward a small handful of genes112, we do not know whether the stability of H–L chain pairing contributes to skew this distribution. 
ImmunoMatch can be used to score specific versus promiscuous H–L pairs (identify the number of distinct H/L partners with which a given sequence can pair) to further investigate the sequence and molecular determinants of those examples which are ambivalent toward chain partners, and whether this promiscuity gives such chains a biological advantage to be more represented in the repertoire.
Methods
Data curation
Data acquisition
Paired heavy and light-chain sequences were collected from six datasets across three publications, all derived from healthy human donors (Supplementary Table 1). Data from Rajan et al.38 and DeKosky et al.65 were downloaded following their associated publications, and the data from Jaffe et al.39 were obtained from the Observed Antibody Space database113. The collected sequences represent diverse B cell populations, including naive, memory and mixed B cells. The use of multiple donors served to minimize potential biases in pairing preferences that could arise from donors’ disease states or immunological background. These paired H–L sequences were annotated using IgBLAST114 to identify germline gene usage and delineate the CDR3 regions for downstream analysis. These collected sequences were subject to the pipeline described below, involving clustering, sampling, and pseudo-negative generation, before training and testing machine-learning models.
Clustering
Clustering was employed to reduce over-representation of specific clonotypes in the training dataset and to prevent data leakage between the training and testing datasets. We extracted the paired CDRH3 and CDRL3 sequences (the most variable antibody segments, which define clonotypes115), concatenated them per pair and clustered the concatenated sequences at a 90% sequence identity threshold using the CD-HIT algorithm116. An equal number of heavy and light-chain pairs was sampled from each dataset to avoid biases stemming from individual data sources, with details summarized in Supplementary Table 1.
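As an illustration of how the clustering input could be prepared, the sketch below concatenates CDRH3 and CDRL3 per pair into FASTA records and assembles a CD-HIT command line at a 90% identity threshold. Only the 90% threshold and the use of CD-HIT come from the text; the function names, file names and the word-size setting (-n 5) are assumptions made for illustration, and the command is built but not executed.

```python
def cdr3_fasta(pairs):
    """Build FASTA text with one record per H-L pair.

    pairs: iterable of (pair_id, cdrh3, cdrl3) amino acid strings.
    Each record holds CDRH3 directly followed by CDRL3.
    """
    lines = []
    for pair_id, cdrh3, cdrl3 in pairs:
        lines.append(f">{pair_id}")
        lines.append(cdrh3 + cdrl3)
    return "\n".join(lines) + "\n"


def cd_hit_command(fasta_path, out_prefix, identity=0.9):
    """Return the argument list for a CD-HIT run (not executed here)."""
    # -c sets the sequence identity threshold; -n 5 is the word size
    # that CD-HIT documents for thresholds between 0.7 and 1.0.
    return ["cd-hit", "-i", fasta_path, "-o", out_prefix,
            "-c", str(identity), "-n", "5"]
```

The returned list could be handed to subprocess.run on a system where CD-HIT is installed; the resulting cluster file would then be used to pick one representative pair per cluster.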
Pseudo-negative data simulation
Nonviable heavy and light-chain pairs are naturally eliminated during B cell development69, implying a lack of negative labels to train machine-learning models. To address this, we simulated ‘pseudo-negatives’ by randomly selecting two H–L pairs and exchanging their light-chain partners. Given the variability in CDRL3 length, this randomization was further constrained so that light chains were swapped only between H–L pairs with identical CDRL3 lengths. This procedure ensured that paired and pseudo-unpaired instances contributed equivalent CDR3 length features, eliminating signals on H–L chain pairing originating from CDR3 length differences. Newly generated pseudo-negative pairs which were also present in the collected positive instances were rejected in the process, and another positive H–L pair would be sampled to exchange their L chain partners. A true H–L pair would be removed from the pool of positive examples if within 250 trials it failed to exchange the L chain with other pairs, to form a pseudo-negative example fulfilling the aforementioned criteria.
Following these steps, the dataset comprised 233,880 sequences with equal numbers of positive and negative cases. The pre-processed data were split into training (90%) and testing (10%) subsets for model training and evaluation.
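The pseudo-negative procedure can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: each positive pair receives a light chain drawn from another pair of the same CDRL3 length (a one-directional swap rather than a reciprocal exchange), and the function name, data layout and fixed random seed are hypothetical.

```python
import random


def make_pseudo_negatives(pairs, max_trials=250, seed=0):
    """Swap light chains between observed pairs of equal CDRL3 length.

    pairs: list of (heavy, light, cdrl3) tuples for observed (positive) pairs.
    Returns (kept_positives, pseudo_negatives), where pseudo_negatives is a
    list of (heavy, light) tuples.
    """
    rng = random.Random(seed)
    observed = {(h, l) for h, l, _ in pairs}

    # Group light chains by CDRL3 length so every swap preserves that length.
    lights_by_len = {}
    for _, l, cdrl3 in pairs:
        lights_by_len.setdefault(len(cdrl3), []).append(l)

    kept, negatives = [], []
    for h, l, cdrl3 in pairs:
        donors = lights_by_len[len(cdrl3)]
        for _ in range(max_trials):
            candidate = (h, rng.choice(donors))
            # Reject swaps that recreate any observed positive pair.
            if candidate not in observed:
                kept.append((h, l, cdrl3))
                negatives.append(candidate)
                break
        # If no valid swap is found within max_trials, the positive is dropped.
    return kept, negatives
```

Dropping positives that cannot be swapped keeps the positive and pseudo-negative sets the same size, matching the balanced dataset described above.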
Baseline models
All baseline models were trained using a K-fold cross-validation approach (K = 10).
Gene usage-based approach
V and J gene usage was one-hot encoded and used as input for logistic regression and tree-based XGBoost models. Logistic regression was fitted using the LogisticRegressionCV function in scikit-learn (v.1.5.0). XGBoost models were fitted using the XGBoost python package (v.2.1.0), with the function XGBClassifier and the parameter ‘tree_method = hist’.
CDR3-based approach
We leveraged the CNN architecture to capture local patterns in CDRH3 and CDRL3 sequences. Briefly, CDRH3 and CDRL3 amino acid sequences were separately one-hot encoded as a two-dimensional n × L matrix, where n is the length of the CDR3 fragment and L represents the 20 amino acids. The two matrices were passed separately through convolutional layers followed by concatenating outputs from the two streams for passing through to a multilayer perceptron. Various hyperparameters were tested in training the CNN, including the number of convolutional layers, choice of optimizers and the incorporation of a residual network (ResNet)72. Further details are in Supplementary Methods.
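The one-hot encoding described above can be illustrated with a minimal sketch. The alphabet ordering, the function name and the handling of non-standard residues (an all-zero row) are assumptions; the text specifies only an n × L matrix over the 20 amino acids.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed column order
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_cdr3(seq):
    """Encode a CDR3 amino acid string as an n x 20 list-of-lists matrix.

    Row i has a single 1 in the column of residue i. Residues outside the
    20 standard amino acids yield an all-zero row (an assumption).
    """
    matrix = []
    for aa in seq:
        row = [0] * len(AMINO_ACIDS)
        idx = AA_INDEX.get(aa)
        if idx is not None:
            row[idx] = 1
        matrix.append(row)
    return matrix
```

In the two-stream design, CDRH3 and CDRL3 would each be encoded this way and fed to separate convolutional branches before their outputs are concatenated.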
Language model-based approach
Entire VH and VL paired sequences from the curated training set were passed into pretrained language models (ESM-2 (150M parameters)73 and AntiBERTa2 (ref. 68)) appended with a classification head for binary classification (paired (positive) versus unpaired (pseudo-negative)). Pretrained weights of these models were obtained via HuggingFace. We compared the classification results between the fine-tuned models (trained for n = 3 epochs with learning rate 2 × 10⁻⁵ and weight decay 0.01, using the Trainer class in HuggingFace) and the pretrained models (that is, with weights frozen during fine-tuning). Supplementary Methods provide further details.
The final ImmunoMatch model
We compared the aforementioned model setups to identify the model with the highest accuracy in the withheld test set. The final model was selected as the fine-tuned AntiBERTa2 model. We call this fine-tuned instance ImmunoMatch. We noted that the withheld test sequences were sampled from the same donors exposed to ImmunoMatch during fine-tuning. We thus curated an external test set, comprising sequences from donors unseen in both pretraining and fine-tuning stages. We used paired H–L chains from donors 1, 2 and 4 from Jaffe et al.39 as this external test set (donor 3 was included in training; Supplementary Table 1). Data were downloaded from the associated figshare repository (10.25452/figshare.plus.20338177) of the Jaffe et al. publication. Annotated VH and VL sequences were obtained from the ‘filtered_contig_annotations.csv’ files processed from respective sequencing libraries.
ImmunoMatch-κ and ImmunoMatch-λ models
We further produced two variants of ImmunoMatch, ImmunoMatch-κ and ImmunoMatch-λ, by curating antibody heavy chain sequences paired with κ and λ light chains, respectively, for fine-tuning AntiBERTa2. The previously collected paired antibody sequences were split into two groups according to their light-chain types, and separately subjected to the clustering and pseudo-negative generation steps. The sizes of the training sets for ImmunoMatch-κ and ImmunoMatch-λ were kept identical (n = 194,374). ImmunoMatch-κ and ImmunoMatch-λ were trained using the same procedure and parameters as the original ImmunoMatch model.
Applying ImmunoMatch models
For ImmunoMatch, ImmunoMatch-κ and ImmunoMatch-λ, the epoch with the minimized evaluation loss was loaded for application on external datasets using the HuggingFace ‘RoFormerForSequenceClassification.from_pretrained’ interface. A pairing score was derived by applying the softmax transformation on the output obtained by passing through an H–L sequence pair. To apply ImmunoMatch in biological case studies, the annotated sequences were first grouped by their light-chain types, and the ImmunoMatch-κ and ImmunoMatch-λ models were applied on the H–κ and H–λ subsets, respectively.
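To make the score derivation concrete, the sketch below applies a softmax to a pair of classifier logits and returns the probability of the 'paired' class. It is a minimal illustration: the function name is hypothetical, and treating index 1 as the positive (paired) class is an assumption that should be checked against the label mapping of the released checkpoints.

```python
import math


def pairing_score(logits, positive_index=1):
    """Softmax over classifier logits; return the 'paired' class probability.

    logits: sequence of raw model outputs, one per class.
    positive_index: position of the 'paired' class (assumed to be 1).
    """
    # Subtract the maximum for numerical stability before exponentiating.
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    return exps[positive_index] / sum(exps)
```

With two classes this reduces to a logistic function of the logit difference, so the score lies between 0 and 1 and equals 0.5 when the model is undecided.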
Validation datasets
Spatial VDJ sequencing dataset
We analyzed data generated from n = 2 breast tumors reported in Engblom et al.81, where the VDJ sequences of intratumoral B cells were analyzed by (1) the authors’ spatial VDJ sequencing protocol, and (2) single-cell VDJ sequencing data generated from the same tumor sections in parallel to (1). Both spatial and single-cell datasets were obtained from the Zenodo repository associated to the Engblom et al. manuscript (10.5281/zenodo.7961605). For the spatial data, the complete VDJ amino acid sequences for the heavy and light chains were extracted from the .vdjca files provided by the authors, using the exportAlignments function in MiXCR (v.3.0.3). The complete sequences were merged with the authors’ clone annotations and results from the Engblom et al. ‘repair’ method described in their manuscript81 based on identical CDRH3 and CDRL3 amino acid sequences. In total for the spatial data, we considered n = 112 H–L pairs with complete, functional amino acid sequences for both chains. The spatial locations of H–L pairs were visualized by using the 10x Genomics Space Ranger output and Seurat objects provided by the authors. For the single-cell data, we obtained the Cell Ranger output provided by the authors in the Zenodo repository, and analyzed a total of n = 1,326 pairs of complete, functional H–L sequences. Overlap between the spatial and single-cell data was determined by identifying identical CDRH3 and CDRL3 amino acid sequences.
Healthy B cell single-cell datasets generated using different library preparation protocols
We collected single-cell BCR pairs from healthy individuals sampled using three different single-cell library preparation methods: (1) SMART-seq2 data from Lindeman et al.117; (2) 10x Genomics example data provided by the manufacturer (https://www.10xgenomics.com/datasets/human-b-cells-from-a-healthy-donor-1-k-cells-2-standard-6-0-0); and (3) Parse Bioscience example data from the manufacturer (https://www.parsebiosciences.com/datasets/bcr-sequencing-of-1-million-healthy-and-diseased-samples-in-a-single-experiment/). The Parse Bioscience dataset contains B cells from both n = 12 healthy donors and n = 12 patients with autoimmune diseases; here only the data corresponding to the healthy controls were considered. For each dataset, nonfunctional and incomplete sequences were removed. The H and L sequences were grouped by unique cell identifiers and all possible combinations of H–L pairs with the same cell identifier were considered for ImmunoMatch scoring. Supplementary Note 2 provides further details.
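The per-cell enumeration of candidate H–L pairs can be sketched as below. The input layout (cell identifier, chain label, sequence) and the function name are assumptions for illustration; real contig annotation files carry many more columns.

```python
from collections import defaultdict
from itertools import product


def candidate_pairs(contigs):
    """Enumerate every H-L combination that shares a cell identifier.

    contigs: iterable of (cell_id, chain, sequence), with chain 'H' or 'L'.
    Returns a list of (cell_id, heavy_sequence, light_sequence) tuples.
    """
    heavy, light = defaultdict(list), defaultdict(list)
    for cell_id, chain, seq in contigs:
        (heavy if chain == "H" else light)[cell_id].append(seq)

    pairs = []
    for cell_id, h_seqs in heavy.items():
        # Cells without any light chain contribute no pairs.
        for h, l in product(h_seqs, light.get(cell_id, [])):
            pairs.append((cell_id, h, l))
    return pairs
```

Each returned tuple would then be scored with the model matching its light-chain type, so that cells carrying more than one heavy or light contig yield several scored candidates.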
Naive, germinal center and memory B cell datasets
We collected the following repertoire datasets which comprised paired heavy and light-chain sequences traceable to the cell of origin. First, for naive and memory B cells from healthy individuals, we used data from donors 1, 2 and 4 in Jaffe et al.39. Here, only sequencing libraries containing purely sorted naive or memory (unswitched or switched) B cells were considered. In total we considered n = 711,372 paired sequences from the three donors in the study. Second, we curated data from a single-cell study of tonsil B cells by King et al.85. Quality filtered, annotated sequences were downloaded from EMBL-EBI ArrayExpress (accession code E-MTAB-9003) and overlapped with cell-type annotation available for the same dataset based on scRNA-seq85 (ArrayExpress accession code E-MTAB-9005). We used the King et al. dataset for two analyses: first, we extracted sequences corresponding to GC B cells by considering the associated cell cluster labels (‘GC’, ‘LZ GC’, ‘DZ GC’ and ‘FCRL2/3high GC’), for comparisons with the naive and memory B cell data from Jaffe et al. This amounted to n = 1,823 paired sequences. Second, we used the entire King et al. dataset to examine the differences in ImmunoMatch pairing scores across different heavy chain isotypes and germline identity levels. For this we considered n = 10,782 cells with paired VH and VL sequences, and generated pseudo-negative pairs as control using the identical procedure described in the section ‘Pseudo-negative data simulation’. Clonotype clustering analysis was performed on the Jaffe et al. dataset using the DefineClones.py function in the Change-O (v.1.3.0)118 package. Clonotype trees were constructed using the maximum parsimony method ‘dnapars’ in PHYLIP119, accessed through BrepPhylo120.
B cell leukemia and lymphoma datasets
We compiled paired heavy and light chains from B cell leukemia and lymphoma samples, from two sources: (1) searches on the GenBank database, which retrieved n = 92 paired heavy and light chains obtained from leukemia and lymphoma samples reported in published works121–127, and (2) leukemia/lymphoma cell lines from the Cancer Cell Line Encyclopaedia (CCLE) project128. For (1), sequences were annotated using IgBLAST114 (v.1.19.0) with germline sequences obtained from IMGT (accessed on 22 November 2024). For (2), paired-end bulk RNA sequencing FASTQ files were obtained from the Sequence Read Archive (project accession PRJNA523380), and heavy and light-chain sequences were assembled following Tan et al.129, using MiXCR (v.4.7.0) software130 run with default parameters. The following filtering steps were performed to ensure that paired sequences were from cancers of a B cell origin: we removed (1) cell lines with at least 100 reads mapped to T cell receptors; (2) cell lines with fewer than 100 reads mapped to immunoglobulins; and (3) cell lines without any heavy chain reads. We noted that in these cases, more than one distinct heavy and light-chain sequence can often be found129. We therefore implemented the following procedure to obtain a single H–L pair for each cancer cell line. First, we examined the ratio of the total number of reads mapped to the IGK locus to that mapped to the IGL locus; we only considered a specific light-chain type if at least 100 times more reads mapped to its locus than to the other light-chain locus. Second, only the heavy and light chains with the highest fraction of read support were retained. Following removal of out-of-frame sequences we analyzed n = 31 cancer cell lines from the CCLE project. In total we analyzed n = 123 paired heavy and light chains from B cell leukemia and lymphoma.
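The cell-line filtering and light-chain selection rules above can be expressed as a short sketch. The thresholds follow the text; the function names and the use of plain read counts as inputs are assumptions for illustration, and the handling of zero-read loci is a choice made here rather than one stated in the text.

```python
def is_b_cell_line(tcr_reads, ig_reads, heavy_reads):
    """Apply the three read-count filters described in the text."""
    if tcr_reads >= 100:   # (1) at least 100 T cell receptor reads
        return False
    if ig_reads < 100:     # (2) fewer than 100 immunoglobulin reads
        return False
    if heavy_reads == 0:   # (3) no heavy chain reads
        return False
    return True


def dominant_light_locus(igk_reads, igl_reads, min_fold=100):
    """Return 'IGK' or 'IGL' if one locus has at least min_fold times the
    reads of the other; otherwise return None."""
    if igk_reads > 0 and igk_reads >= min_fold * igl_reads:
        return "IGK"
    if igl_reads > 0 and igl_reads >= min_fold * igk_reads:
        return "IGL"
    return None


def top_chain(read_support):
    """Pick the sequence with the highest read support.

    read_support: dict mapping sequence -> number of supporting reads.
    """
    return max(read_support, key=read_support.get)
```

A cell line would contribute one H–L pair only if it passes is_b_cell_line and dominant_light_locus returns a locus; top_chain is then applied separately to the heavy chains and to the light chains of the selected locus.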
Antibody structure analysis
We analyzed n = 3,832 human antibody structures curated in VCAb70. VH–VL interface surface area was calculated by estimating the change in solvent-exposed surface area upon VH–VL complex formation, using POPSCOMP71. Definitions of VH and VL domains were taken from VCAb annotations. CDR and FWR regions were delineated using IMGT numbering obtained using ANARCI131. For comparisons between κ and λ light chains, sequence identities were computed using MMseqs2 (v.13.45111)132.
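For reference, the change in solvent-exposed surface area upon complex formation is commonly written as below, where SASA denotes solvent-accessible surface area computed for each isolated domain and for the assembled VH–VL complex. The notation is introduced here for illustration; whether the reported interface area corresponds to this total or to half of it (the per-domain convention) is not stated in the text.

```latex
\Delta\mathrm{SASA}
  = \mathrm{SASA}_{\mathrm{VH}}
  + \mathrm{SASA}_{\mathrm{VL}}
  - \mathrm{SASA}_{\mathrm{VH{:}VL}}
```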
Analysis of therapeutic antibodies
We downloaded therapeutic antibody annotations from TheraSAbDab133 (accessed 5 December 2024), and considered only those which were (1) monospecific (with only one unique H–L pair); (2) approved or under active development; and (3) either human or humanized. Sequences were numbered using ANARCI131. For each antibody, we kept the observed VL sequence as it was; for VH, we grafted the observed CDRH3 fragment onto all possible combinations of germline IGHV- and IGHJ-encoded amino acid sequences obtained from IMGT (5 December 2024). This constituted a library of random VH sequences while retaining the same CDRH3. This library was screened together with the observed VL using ImmunoMatch to obtain pairing scores, for comparison with the pairing score of the observed VH–VL pair. Sequence identities were computed using MMseqs2 (ref. 132; v.13.45111).
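The grafting and screening steps can be sketched as follows. This is an illustrative outline under stated assumptions: the germline segments are taken to be already trimmed at the CDRH3 boundaries, score_fn is a hypothetical stand-in for an ImmunoMatch scoring call, and the function names are not from the ImmunoMatch package.

```python
from itertools import product


def build_vh_library(cdrh3, v_segments, j_segments):
    """Graft one CDRH3 onto every V/J germline combination.

    v_segments: dict of V gene name -> amino acids preceding CDRH3.
    j_segments: dict of J gene name -> amino acids following CDRH3.
    Returns a dict mapping (v_name, j_name) -> grafted VH sequence.
    """
    return {
        (v_name, j_name): v_seq + cdrh3 + j_seq
        for (v_name, v_seq), (j_name, j_seq)
        in product(v_segments.items(), j_segments.items())
    }


def best_random_vh(library, vl, score_fn):
    """Return ((v_name, j_name), vh) with the highest score_fn(vh, vl)."""
    return max(library.items(), key=lambda item: score_fn(item[1], vl))
```

The best-scoring VH from the library can then be compared with the observed VH, both by sequence identity and by the absolute difference in pairing score.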
Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at 10.1038/s41592-025-02913-x.
Supplementary information
Supplementary Tables 1–4, Notes 1–2, Figs. 1–7 and Methods.
Source data
Statistical Source Data for Fig. 1.
Statistical Source Data for Fig. 2.
Statistical Source Data for Fig. 3.
Statistical Source Data for Fig. 4.
Statistical Source Data for Fig. 5.
Statistical Source Data for Extended Data Fig. 1.
Statistical Source Data for Extended Data Fig. 2.
Statistical Source Data for Extended Data Fig. 3.
Statistical Source Data for Extended Data Fig. 4.
Statistical Source Data for Extended Data Fig. 5.
Acknowledgements
We thank all members of the Fraternali group for comments and suggestions. This work was supported by the Biotechnology and Biological Sciences Research Council (https://bbsrc.ukri.org/, BB/T002212/1 to F.F., D.K.D.-W. and J.C.F.N.; BB/B000745/1 to F.F. and J.C.F.N.). D.G. was supported by a PhD scholarship from the China Scholarship Council (no. 202008440414). The funders had no role in study design, data collection and analysis, decision to publish or preparation of the article.
Extended data
Extended Data Fig. 3. Confusion matrices of ImmunoMatch-κ and ImmunoMatch-λ on H–κ and H–λ sequences.

Data for ImmunoMatch-κ and ImmunoMatch-λ models were organized by rows, tested on H–κ and H–λ pairs (columns). False negative (FN) predictions when models are tested on datasets of different L types are highlighted.
Author contributions
D.G. curated training, testing and validation datasets, implemented ImmunoMatch and performed model comparisons, supervised by F.F., J.C.F.N. and D.K.D.-W. J.C.F.N. and D.G. curated ImmunoMatch use cases and carried out computational analyses. F.F. conceived the project and acquired funding. J.C.F.N. and D.G. wrote the manuscript and the Methods section with critical input from F.F. All authors read, commented and approved the final manuscript.
Peer review
Peer review information
Nature Methods thanks Kenneth Hoehn who co-reviewed with Hunter Melton and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. Primary Handling Editor: Madhura Mukhopadhyay, in collaboration with the Nature Methods team. Peer reviewer reports are available.
Data availability
Publicly available antibody repertoire sequencing datasets were utilized for training and testing ImmunoMatch: Rajan et al.38 (SRP104286), DeKosky et al.65 (PRJNA315079, SRX709625 and SRX709626), King et al.85 (E-MTAB-9003), Lindeman et al.117 (Supplementary Table 5 in the paper), Jaffe et al.39 (authors’ data repository at 10.25452/figshare.plus.20338177) and Engblom et al.81 (authors’ data repository at 10.5281/zenodo.7961605). Curated sequences from leukemia and lymphoma samples, as well as therapeutic antibody sequences considered in this analysis, can be found in the project GitHub repository (https://github.com/Fraternalilab/ImmunoMatch). Source data are provided with this paper.
Code availability
Final checkpoints of ImmunoMatch, ImmunoMatch-κ and ImmunoMatch-λ are available on HuggingFace at https://huggingface.co/fraternalilab/immunomatch. Code to run ImmunoMatch to annotate sequences can be found on Google Colaboratory (https://colab.research.google.com/github/Fraternalilab/ImmunoMatch/blob/main/Run_ImmunoMatch.ipynb). A standalone Python package to apply ImmunoMatch is available on PyPI at https://pypi.org/project/ImmunoMatch/. Data and code to generate figures in this manuscript are available on GitHub (https://github.com/Fraternalilab/ImmunoMatch).
Competing interests
The authors declare no competing interests.
Contributor Information
Franca Fraternali, Email: f.fraternali@ucl.ac.uk.
Joseph C. F. Ng, Email: joseph.ng@ucl.ac.uk
Extended data is available for this paper at 10.1038/s41592-025-02913-x.
Supplementary information
The online version contains supplementary material available at 10.1038/s41592-025-02913-x.
References
- 1.Briney, B., Inderbitzin, A., Joyce, C. & Burton, D. R. Commonality despite exceptional diversity in the baseline human antibody repertoire. Nature566, 393–397 (2019). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Marks, C. & Deane, C. M. How repertoire data are changing antibody science. J. Biol. Chem.295, 9823–9837 (2020). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 3.Elhanati, Y. et al. Inferring processes underlying B-cell repertoire diversity. Phil. Trans. R. Soc. B Biol. Sci.370, 20140243 (2015). [DOI] [PMC free article] [PubMed] [Google Scholar]
- 4.Greiff, V., Miho, E., Menzel, U. & Reddy, S. T. Bioinformatic and statistical analysis of adaptive immune repertoires. Trends Immunol.36, 738–749 (2015). [DOI] [PubMed] [Google Scholar]
- 5. Chi, X., Li, Y. & Qiu, X. V(D)J recombination, somatic hypermutation and class switch recombination of immunoglobulins: mechanism and regulation. Immunology 160, 233–247 (2020).
- 6. Schatz, D. G. & Ji, Y. Recombination centres and the orchestration of V(D)J recombination. Nat. Rev. Immunol. 11, 251–263 (2011).
- 7. Alt, F. et al. Ordered rearrangement of immunoglobulin heavy chain variable region segments. EMBO J. 3, 1209–1219 (1984).
- 8. Tonegawa, S. Somatic generation of antibody diversity. Nature 302, 575–581 (1983).
- 9. van der Burg, M. et al. Ordered recombination of immunoglobulin light chain genes occurs at the IGK locus but seems less strict at the IGL locus. Blood 97, 1001–1008 (2001).
- 10. Alberts, B. et al. in Molecular Biology of the Cell 7th edn (W. W. Norton & Company, 2022).
- 11. Taccioli, G. E. et al. Impairment of V(D)J recombination in double-strand break repair mutants. Science 260, 207–210 (1993).
- 12. Boboila, C., Alt, F. W. & Schwer, B. in Advances in Immunology, Vol. 116 (ed. Alt, F. W.) 1–49 (Academic Press, 2012).
- 13. Rooney, S., Chaudhuri, J. & Alt, F. W. The role of the non-homologous end-joining pathway in lymphocyte development. Immunol. Rev. 200, 115–131 (2004).
- 14. Christie, S. M., Fijen, C. & Rothenberg, E. V(D)J recombination: recent insights in formation of the recombinase complex and recruitment of DNA repair machinery. Front. Cell Dev. Biol. https://doi.org/10.3389/fcell.2022.886718 (2022).
- 15. Li, Z., Woo, C. J., Iglesias-Ussel, M. D., Ronai, D. & Scharff, M. D. The generation of antibody diversity through somatic hypermutation and class switch recombination. Genes Dev. 18, 1–11 (2004).
- 16. Martin, A. et al. Activation-induced cytidine deaminase turns on somatic hypermutation in hybridomas. Nature 415, 802–806 (2002).
- 17. Nussenzweig, M. C. & Alt, F. W. Antibody diversity: one enzyme to rule them all. Nat. Med. 10, 1304–1305 (2004).
- 18. Qin, Y. & Meng, F.-L. Taming AID mutator activity in somatic hypermutation. Trends Biochem. Sci. 49, 622–632 (2024).
- 19. Krammer, F. The human antibody response to influenza A virus infection and vaccination. Nat. Rev. Immunol. 19, 383–397 (2019).
- 20. Laidlaw, B. J. & Ellebedy, A. H. The germinal centre B cell response to SARS-CoV-2. Nat. Rev. Immunol. 22, 7–18 (2022).
- 21. Kim, W. et al. Germinal centre-driven maturation of B cell response to mRNA vaccination. Nature 604, 141–145 (2022).
- 22. Baran, D. et al. Principles for computational design of binding antibodies. Proc. Natl Acad. Sci. USA 114, 10900–10905 (2017).
- 23. Carter, P. J. & Rajpal, A. Designing antibodies as therapeutics. Cell 185, 2789–2805 (2022).
- 24. Dudzic, P. et al. Large-scale data mining of four billion human antibody variable regions reveals convergence between therapeutic and natural antibodies that constrains search space for biologics drug discovery. mAbs 16, 2361928 (2024).
- 25. Raybould, M. I. J. et al. Five computational developability guidelines for therapeutic antibody profiling. Proc. Natl Acad. Sci. USA 116, 4025–4030 (2019).
- 26. Wolf Pérez, A.-M. et al. In vitro and in silico assessment of the developability of a designed monoclonal antibody library. mAbs 11, 388–400 (2019).
- 27. Negron, C., Fang, J., McPherson, M. J., Stine, W. B. & McCluskey, A. J. Separating clinical antibodies from repertoire antibodies, a path to in silico developability assessment. mAbs 14, 2080628 (2022).
- 28. Bashour, H. et al. Biophysical cartography of the native and human-engineered antibody landscapes quantifies the plasticity of antibody developability. Commun. Biol. 7, 1–25 (2024).
- 29. Haidar, J. N. et al. A universal combinatorial design of antibody framework to graft distinct CDR sequences: a bioinformatics approach. Proteins 80, 896–912 (2012).
- 30. Warszawski, S. et al. Optimizing antibody affinity and stability by the automated design of the variable light-heavy chain interfaces. PLoS Comput. Biol. 15, e1007207 (2019).
- 31. Wolf Pérez, A.-M., Lorenzen, N., Vendruscolo, M. & Sormanni, P. Assessment of therapeutic antibody developability by combinations of in vitro and in silico methods. Methods Mol. Biol. 2313, 57–113 (2022).
- 32. Wu, X. et al. Focused evolution of HIV-1 neutralizing antibodies revealed by structures and deep sequencing. Science 333, 1593–1602 (2011).
- 33. Zhu, J. et al. Somatic populations of PGT135-137 HIV-1-neutralizing antibodies identified by 454 pyrosequencing and bioinformatics. Front. Microbiol. 3, 315 (2012).
- 34. Zhu, J. et al. De novo identification of VRC01 class HIV-1-neutralizing antibodies by next-generation sequencing of B-cell transcripts. Proc. Natl Acad. Sci. USA 110, E4088–4097 (2013).
- 35. Zhu, J. et al. Mining the antibodyome for HIV-1-neutralizing antibodies with next-generation sequencing and phylogenetic pairing of heavy/light chains. Proc. Natl Acad. Sci. USA 110, 6470–6475 (2013).
- 36. DeKosky, B. J. et al. In-depth determination and analysis of the human paired heavy- and light-chain antibody repertoire. Nat. Med. 21, 86–91 (2015).
- 37. Devulapally, P. R. et al. Simple paired heavy- and light-chain antibody repertoire sequencing using endoplasmic reticulum microsomes. Genome Med. 10, 34 (2018).
- 38. Rajan, S. et al. Recombinant human B cell repertoires enable screening for rare, specific, and natively paired antibodies. Commun. Biol. 1, 5 (2018).
- 39. Jaffe, D. B. et al. Functional antibodies exhibit light chain coherence. Nature 611, 352–357 (2022).
- 40. Irac, S. E., Soon, M. S. F., Borcherding, N. & Tuong, Z. K. Single-cell immune repertoire analysis. Nat. Methods 21, 777–792 (2024).
- 41. Venkitaraman, A. R., Williams, G. T., Dariavach, P. & Neuberger, M. S. The B-cell antigen receptor of the five immunoglobulin classes. Nature 352, 777–781 (1991).
- 42. Schamel, W. W. & Reth, M. Monomeric and oligomeric complexes of the B cell antigen receptor. Immunity 13, 5–14 (2000).
- 43. Klinman, N. R. The ‘clonal selection hypothesis’ and current concepts of B cell tolerance. Immunity 5, 189–195 (1996).
- 44. Tze, L. E. et al. Basal immunoglobulin signaling actively maintains developmental stage in immature B cells. PLoS Biol. 3, e82 (2005).
- 45. Verkoczy, L. et al. Basal B cell receptor-directed phosphatidylinositol 3-kinase signaling turns off RAGs and promotes B cell-positive selection. J. Immunol. 178, 6332–6341 (2007).
- 46. Cancro, M. P. & Kearney, J. F. B cell positive selection: road map to the primary repertoire? J. Immunol. 173, 15–19 (2004).
- 47. Nemazee, D. A. & Bürki, K. Clonal deletion of B lymphocytes in a transgenic mouse bearing anti-MHC class I antibody genes. Nature 337, 562–566 (1989).
- 48. Nemazee, D. & Buerki, K. Clonal deletion of autoreactive B lymphocytes in bone marrow chimeras. Proc. Natl Acad. Sci. USA 86, 8039–8043 (1989).
- 49. Hartley, S. B. et al. Elimination from peripheral lymphoid tissues of self-reactive B lymphocytes recognizing membrane-bound antigens. Nature 353, 765–769 (1991).
- 50. Cyster, J. G., Hartley, S. B. & Goodnow, C. C. Competition for follicular niches excludes self-reactive cells from the recirculating B-cell repertoire. Nature 371, 389–395 (1994).
- 51. Meffre, E. The establishment of early B cell tolerance in humans: lessons from primary immunodeficiency diseases. Ann. NY Acad. Sci. 1246, 1–10 (2011).
- 52. Yurasov, S. et al. Defective B cell tolerance checkpoints in systemic lupus erythematosus. J. Exp. Med. 201, 703–711 (2005).
- 53. Chothia, C. & Lesk, A. M. Canonical structures for the hypervariable regions of immunoglobulins. J. Mol. Biol. 196, 901–917 (1987).
- 54. Chothia, C. et al. Conformations of immunoglobulin hypervariable regions. Nature 342, 877–883 (1989).
- 55. Sela-Culang, I., Kunik, V. & Ofran, Y. The structural basis of antibody-antigen recognition. Front. Immunol. 4, 302 (2013).
- 56. Sela-Culang, I., Alon, S. & Ofran, Y. A systematic comparison of free and bound antibodies reveals binding-related conformational changes. J. Immunol. 189, 4890–4899 (2012).
- 57. Koenig, P. et al. Mutational landscape of antibody variable domains reveals a switch modulating the interdomain conformational dynamics and antigen binding. Proc. Natl Acad. Sci. USA 114, E486–E495 (2017).
- 58. Fernández-Quintero, M. L. et al. Germline-dependent antibody paratope states and pairing specific VH-VL interface dynamics. Front. Immunol. 12, 675655 (2021).
- 59. Novobrantseva, T. et al. Stochastic pairing of Ig heavy and light chains frequently generates B cell antigen receptors that are subject to editing in vivo. Int. Immunol. 17, 343–350 (2005).
- 60. Andrews, S. F. et al. Global analysis of B cell selection using an immunoglobulin light chain-mediated model of autoreactivity. J. Exp. Med. 210, 125–142 (2012).
- 61. Florova, M. et al. Central tolerance shapes the neutralizing B cell repertoire against a persisting virus in its natural host. Proc. Natl Acad. Sci. USA 121, e2318657121 (2024).
- 62. Brezinschek, H. P., Foster, S. J., Dörner, T., Brezinschek, R. I. & Lipsky, P. E. Pairing of variable heavy and variable kappa chains in individual naive and memory B cells. J. Immunol. 160, 4762–4767 (1998).
- 63. de Wildt, R. M., Hoet, R. M., van Venrooij, W. J., Tomlinson, I. M. & Winter, G. Analysis of heavy and light chain pairings indicates that receptor editing shapes the human antibody repertoire. J. Mol. Biol. 285, 895–901 (1999).
- 64. Jayaram, N., Bhowmick, P. & Martin, A. C. R. Germline VH/VL pairing in antibodies. Protein Eng. Des. Sel. 25, 523–529 (2012).
- 65. DeKosky, B. J. et al. Large-scale sequence and structural comparisons of human naive and antigen-experienced antibody repertoires. Proc. Natl Acad. Sci. USA 113, E2636–E2645 (2016).
- 66. Burbach, S. M. & Briney, B. Improving antibody language models with native pairing. Patterns 5, 100967 (2024).
- 67. Turnbull, O. M., Oglic, D., Croasdale-Wood, R. & Deane, C. M. p-IgGen: a paired antibody generative language model. Bioinformatics 40, btae659 (2024).
- 68. Barton, J., Galson, J. D. & Leem, J. Enhancing antibody language models with structural information. Preprint at bioRxiv https://doi.org/10.1101/2023.12.12.569610 (2024).
- 69. Nemazee, D. Mechanisms of central tolerance for B cells. Nat. Rev. Immunol. 17, 281–294 (2017).
- 70. Guo, D., Ng, J. C.-F., Dunn-Walters, D. K. & Fraternali, F. VCAb: a web-tool for structure-guided exploration of antibodies. Bioinform. Adv. 4, vbae137 (2024).
- 71. Kleinjung, J. & Fraternali, F. POPSCOMP: an automated interaction analysis of biomolecular complexes. Nucleic Acids Res. 33, W342–W346 (2005).
- 72. He, K., Zhang, X., Ren, S. & Sun, J. Deep residual learning for image recognition. Preprint at https://doi.org/10.48550/arXiv.1512.03385 (2015).
- 73. Lin, Z. et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science 379, 1123–1130 (2023).
- 74. Vaswani, A. et al. Attention is all you need. Preprint at https://doi.org/10.48550/arXiv.1706.03762 (2017).
- 75. Sinha, K. et al. Masked language modeling and the distributional hypothesis: order word matters pre-training for little. Preprint at https://doi.org/10.48550/arXiv.2104.06644 (2021).
- 76. Collins, A. M. & Watson, C. T. Immunoglobulin light chain gene rearrangements, receptor editing and the development of a self-tolerant antibody repertoire. Front. Immunol. 9, 2249 (2018).
- 77. Hieter, P. A., Korsmeyer, S. J., Waldmann, T. A. & Leder, P. Human immunoglobulin κ light-chain genes are deleted or rearranged in λ-producing B cells. Nature 290, 368–372 (1981).
- 78. Bräuninger, A., Goossens, T., Rajewsky, K. & Küppers, R. Regulation of immunoglobulin light chain gene rearrangements during early B cell development in the human. Eur. J. Immunol. 31, 3631–3637 (2001).
- 79. Bannish, G., Fuentes-Pananá, E. M., Cambier, J. C., Pear, W. S. & Monroe, J. G. Ligand-independent signaling functions for the B lymphocyte antigen receptor and their role in positive selection during B lymphopoiesis. J. Exp. Med. 194, 1583–1596 (2001).
- 80. Louzoun, Y., Friedman, T., Luning Prak, E., Litwin, S. & Weigert, M. Analysis of B cell receptor production and rearrangement: part I. Light chain rearrangement. Semin. Immunol. 14, 169–190 (2002).
- 81. Engblom, C. et al. Spatial transcriptomics of B cell and T cell receptors reveals lymphocyte clonal dynamics. Science 382, eadf8486 (2023).
- 82. De Silva, N. S. & Klein, U. Dynamics of B cells in germinal centres. Nat. Rev. Immunol. 15, 137–148 (2015).
- 83. Young, C. & Brink, R. The unique biology of germinal center B cells. Immunity 54, 1652–1664 (2021).
- 84. Victora, G. D. & Nussenzweig, M. C. Germinal centers. Ann. Rev. Immunol. 40, 413–442 (2022).
- 85. King, H. W. et al. Single-cell analysis of human B cell maturation predicts how antibody class switching shapes selection dynamics. Sci. Immunol. 6, eabe6291 (2021).
- 86. Küppers, R. Mechanisms of B-cell lymphoma pathogenesis. Nat. Rev. Cancer 5, 251–262 (2005).
- 87. Shaffer, A. L. R., Young, R. M. & Staudt, L. M. Pathogenesis of human B cell lymphomas. Ann. Rev. Immunol. 30, 565–610 (2012).
- 88. le Viseur, C. et al. In childhood acute lymphoblastic leukemia, blasts at different stages of immunophenotypic maturation have stem cell properties. Cancer Cell 14, 47–58 (2008).
- 89. Moorman, A. V. The clinical relevance of chromosomal and genomic abnormalities in B-cell precursor acute lymphoblastic leukaemia. Blood Rev. 26, 123–135 (2012).
- 90. Davis, R. E. et al. Chronic active B-cell-receptor signalling in diffuse large B-cell lymphoma. Nature 463, 88–92 (2010).
- 91. Young, R. M. et al. Survival of human lymphoma cells requires B-cell receptor engagement by self-antigens. Proc. Natl Acad. Sci. USA 112, 13447–13454 (2015).
- 92. Attaf, N. et al. FB5P-seq: FACS-based 5-prime end single-cell RNA-seq for integrative analysis of transcriptome and antigen receptor repertoire in B and T cells. Front. Immunol. 11, 216 (2020).
- 93. Subas Satish, H. P. et al. NAb-seq: an accurate, rapid, and cost-effective method for antibody long-read sequencing in hybridoma cell lines and single B cells. mAbs 14, 2106621 (2022).
- 94. Setliff, I. et al. High-throughput mapping of B cell receptor sequences to antigen specificity. Cell 179, 1636–1646.e15 (2019).
- 95. Raybould, M. I. J., Turnbull, O. M., Suter, A., Guloglu, B. & Deane, C. M. Contextualising the developability risk of antibodies with lambda light chains using enhanced therapeutic antibody profiling. Commun. Biol. 7, 62 (2024).
- 96. Tennenhouse, A. et al. Computational optimization of antibody humanness and stability by systematic energy-based ranking. Nat. Biomed. Eng. 8, 30–44 (2024).
- 97. Emami, P., Perreault, A., Law, J., Biagioni, D. & John, P. S. Plug & play directed evolution of proteins with gradient-based discrete MCMC. Machine Learn. Sci. Technol. 4, 025014 (2023).
- 98. Makowski, E. K. et al. Co-optimization of therapeutic antibody affinity and specificity using machine learning models that generalize to novel mutational space. Nat. Commun. 13, 3788 (2022).
- 99. Tučs, A. et al. Extensive antibody search with whole spectrum black-box optimization. Sci. Rep. 14, 552 (2024).
- 100. Townsend, C. L. et al. Significant differences in physicochemical properties of human immunoglobulin kappa and lambda CDR3 regions. Front. Immunol. 7, 388 (2016).
- 101. Chailyan, A., Marcatili, P., Cirillo, D. & Tramontano, A. Structural repertoire of immunoglobulin λ light chains. Proteins 79, 1513–1524 (2011).
- 102. Ralph, D. K. & Matsen, F. A. Consistency of VDJ rearrangement and substitution parameters enables accurate B cell receptor sequence annotation. PLoS Comput. Biol. 12, e1004409 (2016).
- 103. Marcou, Q., Mora, T. & Walczak, A. M. High-throughput immune repertoire analysis with IGoR. Nat. Commun. 9, 561 (2018).
- 104. Weber, C. R. et al. immuneSIM: tunable multi-feature simulation of B- and T-cell receptor repertoires for immunoinformatics benchmarking. Bioinformatics 36, 3594–3596 (2020).
- 105. Le Quý, K. et al. Benchmarking and integrating human B-cell receptor genomic and antibody proteomic profiling. NPJ Syst. Biol. Appl. 10, 73 (2024).
- 106. Bonissone, S. R. et al. Serum proteomics expands on high-affinity antibodies in immunized rabbits than deep B-cell repertoire sequencing alone. Preprint at bioRxiv https://doi.org/10.1101/833871 (2020).
- 107. Cheung, W. C. et al. A proteomics approach for the identification and cloning of monoclonal antibodies from serum. Nat. Biotechnol. 30, 447–452 (2012).
- 108. Snapkov, I. et al. Progress and challenges in mass spectrometry-based analysis of antibody repertoires. Trends Biotechnol. 40, 463–481 (2022).
- 109. Chernigovskaya, M. et al. Systematic benchmarking of mass spectrometry-based antibody sequencing reveals methodological biases. Preprint at bioRxiv https://doi.org/10.1101/2024.11.11.622451 (2024).
- 110. Lavinder, J. J., Horton, A. P., Georgiou, G. & Ippolito, G. C. Next-generation sequencing and protein mass spectrometry for the comprehensive analysis of human cellular and serum antibody repertoires. Curr. Opin. Chem. Biol. 24, 112–120 (2015).
- 111. Edwards, B. M. et al. The remarkable flexibility of the human antibody repertoire; isolation of over one thousand different antibodies to a single protein, BLyS. J. Mol. Biol. 334, 103–118 (2003).
- 112. Yang, X. et al. Large-scale analysis of 2,152 Ig-seq datasets reveals key features of B cell biology and the antibody repertoire. Cell Rep. 35, 109110 (2021).
- 113. Olsen, T. H., Boyles, F. & Deane, C. M. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 31, 141–146 (2022).
- 114. Ye, J., Ma, N., Madden, T. L. & Ostell, J. M. IgBLAST: an immunoglobulin variable domain sequence analysis tool. Nucleic Acids Res. 41, W34–W40 (2013).
- 115. Yaari, G. & Kleinstein, S. H. Practical guidelines for B-cell receptor repertoire sequencing analysis. Genome Med. 7, 121 (2015).
- 116. Fu, L., Niu, B., Zhu, Z., Wu, S. & Li, W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics 28, 3150–3152 (2012).
- 117. Lindeman, I. et al. BraCeR: B-cell-receptor reconstruction and clonality inference from single-cell RNA-seq. Nat. Methods 15, 563–565 (2018).
- 118. Gupta, N. T. et al. Change-O: a toolkit for analyzing large-scale B cell immunoglobulin repertoire sequencing data. Bioinformatics 31, 3356–3358 (2015).
- 119. Felsenstein, J. PHYLIP - phylogeny inference package (version 3.2). Cladistics 5, 164–166 (1989).
- 120. Stewart, A. et al. Pandemic, epidemic, endemic: B cell repertoire analysis reveals unique anti-viral responses to SARS-CoV-2, ebola and respiratory syncytial virus. Front. Immunol. 13, 807104 (2022).
- 121. McCann, K. J., Johnson, P. W. M., Stevenson, F. K. & Ottensmeier, C. H. Universal N-glycosylation sites introduced into the B-cell receptor of follicular lymphoma by somatic mutation: a second tumorigenic event? Leukemia 20, 530–534 (2006).
- 122. Fais, F. et al. Immunoglobulin V region gene use and structure suggest antigen selection in AIDS-related primary effusion lymphomas. Leukemia 13, 1093–1099 (1999).
- 123. Terness, P. et al. Idiotypic vaccine for treatment of human B-cell lymphoma. Construction of IgG variable regions from single malignant B cells. Hum. Immunol. 56, 17–27 (1997).
- 124. Ebeling, S. B., Schutte, M. E. & Logtenberg, T. Molecular analysis of VH and VL regions expressed in IgG-bearing chronic lymphocytic leukemia (CLL): further evidence that CLL is a heterogeneous group of tumors. Blood 82, 1626–1631 (1993).
- 125. Fais, F. et al. CD1d is expressed on B-chronic lymphocytic leukemia cells and mediates α-galactosylceramide presentation to natural killer T lymphocytes. Int. J. Cancer 109, 402–411 (2004).
- 126. Messmer, B. T. et al. Multiple distinct sets of stereotyped antigen receptors indicate a role for antigen in promoting chronic lymphocytic leukemia. J. Exp. Med. 200, 519–525 (2004).
- 127. Colombo, M. et al. Intraclonal cell expansion and selection driven by B cell receptor in chronic lymphocytic leukemia. Mol. Med. 17, 834–839 (2011).
- 128. Barretina, J. et al. The Cancer Cell Line Encyclopedia enables predictive modelling of anticancer drug sensitivity. Nature 483, 603–607 (2012).
- 129. Tan, K.-T. et al. Profiling the B/T cell receptor repertoire of lymphocyte derived cell lines. BMC Cancer 18, 940 (2018).
- 130. Bolotin, D. A. et al. MiXCR: software for comprehensive adaptive immunity profiling. Nat. Methods 12, 380–381 (2015).
- 131. Dunbar, J. & Deane, C. M. ANARCI: antigen receptor numbering and receptor classification. Bioinformatics 32, 298–300 (2016).
- 132. Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat. Biotechnol. 35, 1026–1028 (2017).
- 133. Raybould, M. I. J. et al. Thera-SAbDab: the Therapeutic Structural Antibody Database. Nucleic Acids Res. 48, D383–D388 (2020).
Associated Data
Supplementary Materials
Supplementary Tables 1–4, Notes 1–2, Figs. 1–7 and Methods.
Statistical Source Data for Fig. 1.
Statistical Source Data for Fig. 2.
Statistical Source Data for Fig. 3.
Statistical Source Data for Fig. 4.
Statistical Source Data for Fig. 5.
Statistical Source Data for Extended Data Fig. 1.
Statistical Source Data for Extended Data Fig. 2.
Statistical Source Data for Extended Data Fig. 3.
Statistical Source Data for Extended Data Fig. 4.
Statistical Source Data for Extended Data Fig. 5.
Data Availability Statement
Publicly available antibody repertoire sequencing datasets were used to train and test ImmunoMatch: Rajan et al.38 (SRP104286), DeKosky et al.65 (PRJNA315079, SRX709625 and SRX709626), King et al.85 (E-MTAB-9003), Lindeman et al.117 (Supplementary Table 5 in the paper), Jaffe et al.39 (authors’ data repository at https://doi.org/10.25452/figshare.plus.20338177) and Engblom et al.81 (authors’ data repository at https://doi.org/10.5281/zenodo.7961605). Curated sequences from leukemia and lymphoma samples, as well as the therapeutic antibody sequences considered in this analysis, can be found in the project GitHub repository (https://github.com/Fraternalilab/ImmunoMatch). Source data are provided with this paper.
Final checkpoints of ImmunoMatch, ImmunoMatch-κ and ImmunoMatch-λ are available on HuggingFace at https://huggingface.co/fraternalilab/immunomatch. Code to run ImmunoMatch to annotate sequences is available as a Google Colaboratory notebook (https://colab.research.google.com/github/Fraternalilab/ImmunoMatch/blob/main/Run_ImmunoMatch.ipynb). A standalone Python package to apply ImmunoMatch is available on PyPI at https://pypi.org/project/ImmunoMatch/. Data and code to generate the figures in this manuscript are available on GitHub (https://github.com/Fraternalilab/ImmunoMatch).








