ABSTRACT
Variable heavy (VH) and variable light (VL) chain pairing is a critical determinant of antibody diversity, stability, and antigen-binding specificity. Identifying productive VH – VL combinations experimentally is labor-intensive and costly, motivating the development of computational methods that can more efficiently predict compatible heavy – light chain pairs. In this work, we present a comprehensive framework that includes a new benchmark dataset and three deep learning models, each trained with a different negative sampling strategy: random pairing, V-gene mismatching, and full V(D)J germline mismatching. Our dataset includes natural pairs and these three types of synthetic negatives to simulate increasingly realistic biological constraints. Furthermore, we present a lightweight yet highly effective BERT-based model that achieves over 90% accuracy in discriminating natural from synthetic VH – VL pairs. Through extensive evaluation, we demonstrate that V(D)J-informed negative sampling significantly improves model generalization and biological interpretability. By providing reproducible baselines and a biologically grounded benchmark, this work lays the foundation for future development of efficient computational tools in antibody engineering.
KEYWORDS: Antibody language models, antibody pairing, benchmark, deep learning, germline
Introduction
Antibodies, or immunoglobulins, are Y-shaped proteins produced by B cells that serve as key effectors of the adaptive immune response. By binding specifically and tightly to antigens, such as pathogens or foreign molecules, they facilitate immune recognition and clearance. This high specificity is conferred by the antibody’s variable regions, which are capable of recognizing an immense diversity of molecular targets.1
The molecular basis for this diversity lies in V(D)J recombination, a somatic DNA rearrangement process that occurs during B cell development.2–5 In this process, variable (V), diversity (D), and joining (J) gene segments, encoded in the germline genome, are assembled to form functional immunoglobulin genes, as shown in Figure 1. The light chain variable region is generated by recombining one V and one J segment, while the heavy chain involves the sequential joining of V, D, and J segments.4 Further diversity is introduced at the junctions of these segments through nucleotide addition and deletion mediated by terminal deoxynucleotidyl transferase and exonuclease activity.6,7 Together, these mechanisms generate a vast and diverse antibody repertoire, equipping the immune system with the ability to recognize a wide array of antigens.
Figure 1.

V(D)J recombination joins V, (D), and J gene segments to assemble the variable (Fv) regions, comprising VH and VL, that confer antigen specificity, while downstream constant (C) segments encode the Fc framework.
Functional antibodies are formed through the non-covalent pairing of variable heavy (VH) and variable light (VL) chains, which together make up the antigen-binding fragment variable (Fv) region.8 The structure and function of this region, critical for antigen recognition, are shaped by inter-chain interactions that influence binding specificity, affinity, and stability.9 While some VH – VL pairings are more frequently seen in natural repertoires, the conventional view is that heavy and light chains pair largely at random to form functional antibodies.10–12 However, recent structural and computational studies challenge this view, showing that VH – VL interface geometry is influenced by the choice of germline V and J gene segments.13–15 Moreover, experimental data demonstrate that specific V(D)J combinations are critical for productive antibody assembly,16 and that non-native pairings can disrupt structural compatibility and antigen binding.17,18
To exploit this pairing diversity for therapeutic engineering, experimental strategies such as VL-shuffling recombine heavy chains with alternative light chains to identify high-performing variants.9 While effective, these methods require extensive cloning, expression, and screening, making them labor-intensive and time-consuming. Computational approaches offer a promising alternative by predicting VH – VL pairing compatibility directly from sequence. A major challenge in this setting is the lack of publicly available datasets containing confirmed non-pairing VH/VL sequences, which makes supervised model training and evaluation difficult.
Recent advances in deep learning (DL) have transformed protein bioinformatics, exemplified by the success of AlphaFold in structure prediction,19,20 and by transformer-based language models such as ESM,21 ProtTrans,22 and ProteinBERT,23 which have achieved state-of-the-art performance in diverse sequence analysis tasks.24 Inspired by these developments, antibody-specific language models have emerged, including AntiBERTa,25 AbLang,26 BALM,27 IgBERT,28 and PARA.29 Some of these models were trained or fine-tuned on paired VH – VL sequences to better capture inter-chain dependencies. For example, BALM-paired30 was trained exclusively on naturally paired antibodies, while IgBERT and IgT528 were initially trained on unpaired sequences and subsequently fine-tuned using paired data from the Observed Antibody Space (OAS).31,32
Although these models have shown strong performance in tasks such as antibody sequence recovery, structure prediction, and expression level estimation, only a few have directly tackled VH – VL pairing as a predictive task. One such model is PARA,29 which approached pairing classification by contrasting native pairs against mismatches generated via similarity-based shuffling, achieving high AUC-ROC scores. Another relevant model is p-IgGen,33 a generative VH – VL model trained on paired antibodies. While p-IgGen was not explicitly designed for classification, its ability to assign higher likelihoods to true VH – VL pairs, compared to randomly paired alternatives, indicates that it captures signals relevant to pairing compatibility. In parallel, alternative strategies have been proposed outside the transformer paradigm. Humatch34 uses a convolutional neural network (CNN) trained on human antibody sequences annotated with V-germline labels. Although developed primarily for antibody humanization, Humatch includes a pairing classification component. ImmunoMatch35 builds upon the AntiBERTa2 language model, fine-tuned on VH – VL pairs derived from single human B cells to discriminate between naturally cognate and randomly mismatched heavy – light chain combinations. SynPair36 is a recent contrastive learning model that treats pairing as a dense-retrieval problem; it achieves state-of-the-art VH – VL pairing prediction, outperforming ImmunoMatch.
Despite growing interest in VH – VL pairing prediction, there is currently no standardized dataset or evaluation framework to guide model development or assess performance in a consistent and biologically meaningful way. Existing methods often rely on ad hoc mismatching strategies and lack rigorous comparisons, making it difficult to interpret results across studies. Recent work has shown that dataset composition – particularly the definition of negative examples – plays a critical role in shaping what models learn and how well they generalize. In the context of antibody – antigen binding, Ursu et al.37 demonstrated that negative sample selection not only influences predictive accuracy but also determines whether models recover biologically meaningful rules. These insights suggest that careful design of negative datasets is equally important for VH – VL pairing prediction, where ad hoc strategies risk introducing biases and limiting interpretability.
To address this gap, we introduce a dedicated benchmark dataset designed to test and compare deep learning models on VH – VL pairing prediction. We formulate the problem as a binary classification task and construct synthetic negative examples using three biologically motivated strategies: random recombination, V-gene mismatch, and full V(D)J germline mismatch. This approach yields structured and interpretable mismatched datasets that more closely reflect the biological constraints on antibody pairing.
Using this dataset, we train three deep learning models, each corresponding to a different negative sampling strategy, to serve as baselines for future benchmarking. These models are based on a simple yet effective architecture that combines IgBERT-derived embeddings with a multi-layer perceptron (MLP) classifier. By evaluating performance across diverse test splits (random, v-gene, and germlines), we provide a clear and reproducible framework for comparing pairing models under controlled conditions.
Our key contributions are as follows:
Benchmark Dataset: We release a benchmark dataset for VH – VL pairing classification, including positive examples from naturally paired antibodies and negative examples constructed via three complementary sampling strategies. Code for generating new mismatched samples is also provided.
Reference Models: We train and evaluate three IgBERT-based classifiers, each using a distinct negative sampling method, to serve as standardized baselines for future comparison.
Model Development: We present a lightweight yet effective DL framework that achieves 90% accuracy in distinguishing natural from synthetic VH – VL pairs. Notably, we tested the inter-chain predicted TM-score (ipTM) of AlphaFold3, a metric commonly used to assess interface quality, on this classification task and found that it cannot separate native from mismatched pairs, indicating that a dedicated method is needed.
Biological Insights: We assess how germline identity influences model performance, confirming that full V(D)J-based mismatching yields the most biologically discriminative features for accurate pairing prediction.
Developability Correlations: We compared different VH – VL pairing models on developability properties and observed preliminary correlations with experimental thermostability, suggesting that pairing predictions may hold potential utility in downstream antibody optimization and early-stage developability assessment.
These resources establish a much-needed foundation for systematic evaluation of antibody pairing models, enabling reproducibility, biological interpretability, and fair comparisons between emerging deep learning approaches.
Results
Building a large-scale paired antibody dataset with negative sampling
We created a large-scale dataset of antibody variable region pairs from the Observed Antibody Space (OAS) database, starting with 1,954,079 VH/VL pairs. After germline annotation (VJ for light chains and VDJ for heavy chains), pre-processing, and sequence clustering, we retained 1,357,155 high-quality native VH/VL pairs, consisting of 1,348,625 unique heavy chains and 595,539 unique light chains. In total, we identified 11,976 unique germline combinations. To facilitate evaluation under realistic biological scenarios, where test antibodies may originate from rare or unseen germline configurations, we partitioned the data based on germline origin into 716,325 training, 271,233 validation, and 369,597 test pairs.
To train models for predicting VH – VL compatibility, we augmented the dataset with mismatched (i.e., non-native) VH/VL pairs using two negative sampling strategies. In the first strategy, called the randomly paired dataset, we shuffled VH and VL sequences so that no synthetic pair matched any naturally observed one. The second strategy was inspired by the hypothesis of Jayaram et al.13 that germline origin influences VH – VL pairing. In this approach, called the germline-paired dataset, we generated synthetic pairs from germline combinations that were statistically unlikely based on the observed data. Specifically, we sampled germline combinations absent from the dataset, and for each selected combination, we independently sampled a VH and a VL sequence from the respective germline pools. To ensure diversity and avoid repeated or overrepresented pairs, we defined a probability distribution over germline pairs proportional to the product of available sequences for each germline. A smoothing parameter was applied to this distribution to reduce the skew toward high-frequency pairs.
We considered this procedure under two germline encoding schemes:
V-germline: using only the V-segment (e.g., H: VH1, L: KV1),
Full germline: using the full set of V, D, and J segments (e.g., H: VH1-VD2-VJ3, L: KV1-KJ2).
Figure 2 shows the distribution of germline co-occurrences in the training set and the corresponding synthetic pair distributions for the full germline and V-germline settings, respectively.
Figure 2.

Co-occurrence matrix for germline and V-germline datasets. The positive panels show the number of natural sequence pairs in the training split for each heavy (VDJ) and light (VJ) germline combination. The negative panels show the number of synthetic sequence pairs generated for each unobserved germline combination.
In all negative sampling strategies, the number of synthetic pairs was matched to the number of native (positive) pairs to maintain dataset balance. For further details on the negative sampling algorithm and dataset composition, see Materials and Methods.
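The germline-paired sampling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the released implementation: the function name, the assumed data layout (`((heavy_germline, VH), (light_germline, VL))` tuples), and the smoothing exponent `alpha` are all hypothetical.

```python
import random

def sample_negative_pairs(pairs, n_samples, alpha=0.5, rng=None):
    """Sample synthetic VH/VL pairs from germline combinations absent from
    the observed data, weighting each absent combination by a smoothed
    product of the available sequences per germline."""
    rng = rng or random.Random(0)
    h_pool, l_pool, observed = {}, {}, set()
    for (h_germ, vh), (l_germ, vl) in pairs:
        h_pool.setdefault(h_germ, []).append(vh)
        l_pool.setdefault(l_germ, []).append(vl)
        observed.add((h_germ, l_germ))
    # Candidate germline combinations never seen together in the data.
    absent = [(h, l) for h in h_pool for l in l_pool
              if (h, l) not in observed]
    # Weight proportional to (pool sizes product) ** alpha; alpha < 1
    # flattens the skew toward high-frequency germlines.
    weights = [(len(h_pool[h]) * len(l_pool[l])) ** alpha for h, l in absent]
    chosen = rng.choices(absent, weights=weights, k=n_samples)
    # Draw one VH and one VL independently from each selected pool.
    return [(rng.choice(h_pool[h]), rng.choice(l_pool[l])) for h, l in chosen]
```

With balanced datasets in mind, `n_samples` would be set to the number of native pairs in the corresponding split.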
Latent space visualization reveals separation in germline-based pairings
Our model is based on IgBERT embeddings and a classification head. For each of the three datasets (randomly paired sequences, germline paired sequences, and germline paired using V-only germline sequences), a sample of 4,096 elements containing both positive (paired) and negative (synthetically paired) instances is drawn. The encoder, IgBERT, processes these pairs to generate VH/VL pair embeddings of size 1,024, which are subsequently reduced to two-dimensional vectors using the t-SNE algorithm (Figure 3a–c), and in parallel with UMAP (Figure 3d–f). The t-SNE scatter plots reveal that randomly paired sequences overlap substantially with natively paired ones, indicating that models trained on the randomly paired dataset may struggle to discern meaningful patterns. In contrast, the germline paired datasets exhibit distinct clusters of paired and mismatched VH/VL pairs in the latent space projection, particularly in the V-only germline paired dataset, where the classes are well separated. UMAP further accentuates this trend, producing near-complete separation in the V-only case, while also revealing that full germline and random mismatching remain the most challenging settings, with substantial overlap between paired and mismatched sequences.
Figure 3.

Top row: t-SNE embeddings of IgBERT features reduced to 2D. Bottom row: UMAP embeddings of the same features. Light blue denotes paired sequences; magenta denotes mismatched sequences. Columns: (a,d) randomly paired dataset; (b,e) germline paired (V-only) dataset; (c,f) germline paired dataset.
Pairwise sequence similarity reveals divergence of synthetic pairings
To assess sequence-level coherence across different VH/VL pairing strategies, we analyzed four datasets: one consisting of native VH/VL pairs (referred to as naive), and three containing mismatched pairs generated using different synthetic strategies (random, germline, and germline-V). We define intra-similarity as the similarity among pairs within the same dataset, and inter-similarity as the similarity between mismatched VH/VL pairs and their corresponding naive VH/VL pair sharing the same heavy chain.
A common set of 1,000 heavy chains was sampled such that each heavy chain was present in all four datasets. From these, we extracted the paired VH and VL sequences and concatenated each pair into a single string. This yielded four sets of concatenated VH/VL sequences.
Intra-similarity was computed for each dataset by evaluating all pairwise combinations of sequences within the same set using the normalized Levenshtein similarity score:
sim(s1, s2) = 1 − Lev(s1, s2) / max(|s1|, |s2|)  (1)

where Lev denotes the Levenshtein (edit) distance and |s| the sequence length.
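The normalized Levenshtein similarity of Eq. 1 can be implemented with the standard library alone; a minimal sketch (function names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def normalized_similarity(a: str, b: str) -> float:
    """Eq. 1: 1 - Lev(a, b) / max(|a|, |b|)."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))
```

Intra-similarity then averages `normalized_similarity` over all pairwise combinations of the concatenated VH/VL strings within one dataset.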
The results of the intra-similarity analysis are shown in Table 1. Notably, the germline-V dataset exhibits the highest mean similarity, whereas germline displays the lowest, indicating differences in sequence homogeneity introduced by the pairing strategy.
Table 1.
Intra-similarity scores computed within each dataset. STD = standard deviation.
| Dataset | Mean | STD | Min | Max |
|---|---|---|---|---|
| naive | 0.62 | 0.08 | 0.42 | 0.98 |
| random | 0.62 | 0.08 | 0.43 | 0.97 |
| germline-V | 0.67 | 0.12 | 0.40 | 0.99 |
| germline | 0.59 | 0.09 | 0.40 | 0.99 |
Next, we calculated inter-similarity by comparing each mismatched pair to the naive pair sharing the same VH sequence, again using Eq. 1. Table 2 presents these results, showing the degree to which synthetic pairings diverge from native configurations. The germline-V dataset shows the greatest divergence from the naive sequences.
Table 2.
Inter-similarity scores between naive and mismatched VH/VL pairs. STD = standard deviation.
| Dataset | Mean | STD | Min | Max |
|---|---|---|---|---|
| random | 0.62 | 0.08 | 0.41 | 0.99 |
| germline-V | 0.53 | 0.06 | 0.39 | 0.83 |
| germline | 0.59 | 0.07 | 0.41 | 0.99 |
Germline overlap in randomly mismatched VH/VL pairs limits class separability
In this section, we investigate the VDJ/VJ germline combinations underlying the VH/VL sequence pairs in the randomly mismatched set derived from the training split of naive pairs. Specifically, we aim to quantify the probability that a given randomly paired VH/VL sequence originates from a VDJ/VJ germline combination that was never observed among the germline combinations of the original naive set.
To this end, we sample 10,000 VH/VL pairs from the randomly mismatched training set. For each VH and VL sequence, we extract the corresponding VDJ and VJ germlines, respectively, thereby obtaining a set of paired VDJ/VJ germline combinations associated with these mismatched VH/VL pairs. We then compare this set to the germline combinations observed in the naive training pairs. Following the approach described in Materials and Methods, we first compile all VDJ and VJ germline combinations used to generate the naive training set. For each observed VDJ/VJ pair, we count the number of naive VH/VL sequences derived from it. Finally, we assess the proportion of germline combinations in the randomly mismatched sample that were not present in the naive training set. We find that only 0.3% of the VDJ/VJ pairs in the mismatched sample correspond to entirely novel germline combinations (i.e., combinations that were never observed among the naive pairs).
This result indicates that the vast majority of germline combinations in the sampled mismatched pairs are already present in the training set. This likely explains the overlap observed in the embedding space shown in Figure 3a, where the random pairing class cannot be separated from the native pairs. In contrast, the other negative pairing strategies are explicitly constructed using germline information, leading to more distinct distributions. Altogether, these findings underscore the pivotal role of germline identity in determining VH/VL pair compatibility and emphasize its significance in devising effective negative pairing strategies.
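The novelty check described above reduces to a set lookup over observed germline combinations. A sketch under an assumed data layout of `(germline, sequence)` tuples per chain (names are illustrative):

```python
import random

def novel_combo_fraction(naive_pairs, n_samples=10_000, rng=None):
    """Randomly mismatch VH and VL chains and measure how often the
    resulting VDJ/VJ germline combination was never observed among the
    naive (natively paired) training pairs."""
    rng = rng or random.Random(0)
    observed = {(h_germ, l_germ)
                for (h_germ, _), (l_germ, _) in naive_pairs}
    heavies = [h for h, _ in naive_pairs]
    lights = [l for _, l in naive_pairs]
    novel = 0
    for _ in range(n_samples):
        # Independent draws emulate random VH/VL shuffling.
        (h_germ, _), (l_germ, _) = rng.choice(heavies), rng.choice(lights)
        novel += (h_germ, l_germ) not in observed
    return novel / n_samples
```

On the real data this fraction comes out at roughly 0.3%, i.e., random mismatching almost never produces an unseen germline combination.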
Model evaluation under diverse negative pairing strategies
Our model architecture consists of an IgBERT encoder followed by a multi-layer perceptron classification head. We trained three variants of this model, each using a different negative sampling strategy: random, germline, and germline-V. We evaluated each model across three hold-out dataset splits (random, v-gene, and germlines) using three classification metrics: Accuracy, F1 Score, and AUC-ROC (Figure 4). For full metric details – including Precision, Recall, Accuracy, F1, Matthews Correlation Coefficient (MCC), AUC-ROC, and AUC-PR – refer to Supplementary Table 1.
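As a rough sketch of the classification head (layer sizes, weight initialization, and function names here are hypothetical; the model attaches an MLP to the 1,024-dimensional IgBERT pair embedding), the forward pass reduces to:

```python
import numpy as np

def mlp_head(emb, w1, b1, w2, b2):
    """Forward pass of a small MLP over a VH/VL pair embedding:
    Linear -> ReLU -> Linear -> sigmoid pairing probability."""
    h = np.maximum(emb @ w1 + b1, 0.0)   # hidden layer with ReLU
    logit = float(h @ w2 + b2)           # single output unit
    return 1.0 / (1.0 + np.exp(-logit))  # probability of "natively paired"
```

Training then minimizes binary cross-entropy between this probability and the native/synthetic label.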
Figure 4.

Heatmaps showing the performance of three models (random, V germline, and VDJ germline) across three dataset splits (random, v-gene, and germlines). Each cell displays the score achieved by a given model on a given split, with color intensity reflecting relative performance.
The VDJ germline model consistently achieved high performance across all datasets and metrics, with values exceeding 0.9 for Accuracy, F1, and AUC score on the v-gene and germlines splits. This reflects strong generalization and robust predictive capacity when leveraging full V(D)J germline information. However, the model performs less effectively on the random dataset, which can be attributed to the lack of unseen VDJ/VJ combinations in the mismatched pairs.
The V germline model also performed well, particularly on the v-gene split, achieving near-perfect classification scores (F1 = 0.98 and AUC-ROC = 1.0). However, its performance decreased significantly on the random and germlines splits, suggesting reduced generalization when only partial germline information is used during training.
The Random model, which uses no germline-aware partitioning, exhibited the weakest performance overall. Interestingly, its performance improves on the VDJ and V germline datasets compared to the random split, likely because germline-based negatives are intrinsically easier to separate from native pairs than random mismatches are.
These results underscore the importance of germline-aware dataset construction. Models trained and evaluated using V(D)J-consistent partitions yield more biologically grounded and generalizable predictions. The performance drop of the Random model on controlled splits further emphasizes the risks of data leakage and inflated performance in germline-agnostic benchmarks.
VDJ influence on the final output
To assess whether the VDJ model outputs depend on the heavy-chain D gene, we analyzed the full dataset of naturally paired antibodies, where D identity is biologically meaningful and not confounded by heavy – light mismatches. For each sequence, we defined VJ = HV – HJ and constructed a VJ×D matrix. For every (HD, VJ) cell, we computed (i) the number of sequences and (ii) the mean model probability (Figure 5). To account for imbalance, counts are min – max normalized to [0,1] in the top panel. The middle panel reports mean probabilities on their original scale, while the bottom panel applies a within-VJ (column-wise) min – max normalization, rescaling each VJ column to [0,1] to highlight the relative contribution of D in that specific VJ context.
Figure 5.

Top: counts per (HD, VJ) cell (min–max normalized). Middle: mean predicted probability per (HD, VJ); vertical banding highlights a dominant VJ effect. Bottom: within-VJ normalized mean probability (0–1 per column), emphasizing the relative impact of D for each VJ.
Because D segments are short and heavily influenced by junctional diversity, D-gene assignments are inherently less reliable than V and J. As expected, the figure (middle panel) shows a dominant V – J effect (vertical banding), with subtle, context-dependent D contributions supported by the data (bottom panel). Overall, the model remains robust – it captures D-specific nuances without becoming overly sensitive to noisy D calls.
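The (HD, VJ) aggregation and the within-VJ min – max normalization used in the bottom panel can be sketched as follows; the input layout (`(hd_gene, vj_combo, model_prob)` records) and function name are assumptions for illustration:

```python
from collections import defaultdict

def vj_d_matrix(records):
    """records: iterable of (hd_gene, vj_combo, model_prob).
    Returns (means, norm): mean model probability per (HD, VJ) cell, and
    the same means min-max rescaled to [0, 1] within each VJ column."""
    sums = defaultdict(lambda: [0, 0.0])
    for hd, vj, p in records:
        sums[(hd, vj)][0] += 1
        sums[(hd, vj)][1] += p
    means = {k: s / n for k, (n, s) in sums.items()}
    norm = {}
    for vj in {k[1] for k in means}:
        # Column-wise normalization highlights the relative D effect
        # within one fixed VJ context.
        col = {k: v for k, v in means.items() if k[1] == vj}
        lo, hi = min(col.values()), max(col.values())
        for k, v in col.items():
            norm[k] = 0.0 if hi == lo else (v - lo) / (hi - lo)
    return means, norm
```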
Establishing a benchmark framework for antibody pairing prediction
To enable fair and interpretable comparisons between models for antibody chain-pairing prediction, we established a benchmark comprising three dataset splits (random, v-gene, and germlines) and defined two reference points: Topline, representing the best performance for each dataset, and Bottomline, representing the weakest. These references act as empirical performance bounds, offering a practical framework for evaluating new models in a controlled and biologically meaningful context. Any method falling below the Bottomline may be considered ineffective for this task, while approaches nearing or matching the Topline reflect optimal use of the available signal.
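Given per-model, per-split scores, the Topline and Bottomline references are simply the column-wise extrema; a minimal sketch (the function name is illustrative):

```python
def benchmark_bounds(scores):
    """scores: {model_name: {split_name: metric_value}}.
    Returns {split: (topline, bottomline)} -- the best and worst score
    achieved by any model on that split."""
    splits = {s for per_split in scores.values() for s in per_split}
    return {
        s: (max(m[s] for m in scores.values() if s in m),
            min(m[s] for m in scores.values() if s in m))
        for s in splits
    }
```

A new method can then be placed relative to these empirical bounds on each split.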
Using this framework, we evaluated three state-of-the-art models: p-IgGen, Humatch, and ImmunoMatch. SynPair was not analyzed as its code is currently unavailable. The results are shown in Figure 6.
Figure 6.

Benchmarking antibody pairing models across dataset splits.
p-IgGen performs relatively well on the random split, nearly matching Humatch in F1 score. Still, its performance declines on the more challenging v-gene and germlines splits, especially in terms of AUC-ROC and Accuracy. Humatch and ImmunoMatch, on the other hand, perform consistently across splits, achieving strong AUC-ROC scores, particularly on v-gene, but fall short of Topline performance in F1 Score and Accuracy, highlighting limitations in binary classification calibration.
Importantly, the Topline consistently dominates across all metrics and splits, underscoring the headroom available for future models. Meanwhile, Bottomline performs competitively in specific cases, especially on v-gene, reflecting the simplicity of some pairing signals in that split. This benchmark and evaluation framework supports robust, interpretable, and reproducible comparisons, providing clear targets for improvement in antibody pairing prediction. For detailed metrics, see Supplementary Table 1.
Assessing generalization and competitiveness on the PARA task
Our evaluation procedure involves three model variants, each tested across three distinct datasets using binary classification. However, due to the lack of a standardized benchmark for this task, fair comparison with existing state-of-the-art methods is challenging. To address this, we additionally implemented a separate evaluation inspired by the PARA framework, which frames VH – VL pairing as a ranking task based on sequence similarity. Specifically, we constructed test triplets (VH, VL+, VL−), where VL+ is the known binding partner and VL− is a mismatched chain with low sequence similarity, as described in Materials and Methods. The model must assign a higher pairing score to VL+ than to VL−. While this comparison is inherently indirect, as the dataset and classification weights used by PARA are not publicly available, we align with its evaluation design to provide a reasonable point of reference.
Figure 7 shows model performance across Accuracy, F1 Score, and AUC-ROC (see Supplementary Table 1 for detailed metrics). Our VDJ germline model achieves AUC-ROC scores that closely approach the reference value for PARA (0.82), indicating competitive performance in external benchmarks. The Random model also performs well, particularly in Accuracy and F1 Score, suggesting that PARA’s negative construction may align more closely with random mismatching than with biologically informed strategies. This hypothesis is further supported by the strong performance of ImmunoMatch, which is trained on randomly mismatched VH – VL pairs. Although the PARA benchmark does not constitute a definitive ground truth, these results reinforce the robustness of our method and highlight the value of germline-aware training for generalizable VH – VL pairing prediction.
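The PARA-style ranking evaluation reduces to a triplet accuracy over a generic scoring function; a sketch (the `score` interface is an assumption, standing in for any pairing model):

```python
def triplet_accuracy(score, triplets):
    """Fraction of triplets (vh, vl_pos, vl_neg) for which the model
    scores the true partner above the mismatched one."""
    correct = sum(score(vh, vl_pos) > score(vh, vl_neg)
                  for vh, vl_pos, vl_neg in triplets)
    return correct / len(triplets)
```

Any of the classifiers evaluated here can be plugged in by exposing its pairing probability as `score`.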
Figure 7.

Results on the PARA pairing task. The results for PARA, AntiBERTy, and AbLang (dashed lines) are sourced from the original PARA paper.
Germline-based models achieve strong performance on a 7.2M-sequence dataset
Dudzic et al. recently presented PairedAbNGS,15 a comprehensive dataset of natively paired heavy – light antibody chains compiled from 58 single-cell studies, totaling 7.2 million paired sequences. Alongside the resource, the authors analyzed germline pairing patterns and conserved inter-chain contacts. PairedAbNGS complements OAS by expanding the diversity of available paired data, enriching germline coverage and sequence variability, and providing a valuable resource for antibody engineering and machine learning.
Since all models were trained on the OAS corpus, we further assessed their performance on PairedAbNGS, used as an external benchmark. To ensure a fair evaluation, we removed all sequences overlapping with our training, validation, or test splits. The results, summarized in Figure 8, highlight clear performance differences across approaches. In particular, germline-based models (V and VDJ) consistently achieved the highest accuracy, closely followed by Humatch. These findings confirm the robustness of germline information as a key determinant of pairing compatibility, even when evaluated on a large, independent dataset. However, this dataset only tests whether models recognize true pairs; it contains no experimentally validated non-pairing sequences, so performance on the negative class cannot be assessed.
Figure 8.

Performance on the PairedAbNGS dataset. Horizontal bar plot showing the accuracy of the different models. Germline-based models outperform state-of-the-art and random models. Results are presented in terms of accuracy, given that the dataset exclusively comprises sequences from the paired class.
AlphaFold3 ipTM does not distinguish correct from incorrect VH/VL pairings
AlphaFold3’s ipTM score has previously been shown to correlate with the probability of binding,38 supporting its potential utility in interface evaluation.39 Based on this premise, we investigated whether ipTM could be used to distinguish correct from incorrect VH/VL pairings, as the ipTM metric reflects the predicted interface quality between two protein chains. Due to the limited number of submissions on the AF3 server, we evaluated 180 randomly selected antibody sequences under three conditions: correctly paired, random synthetic pairs, and germline-based synthetic pairs. In this analysis, we used a single random seed for each antibody pairing, since ipTM scores in VH – VL modeling are generally stable across different sequences (see Supplementary Figure S1). Mean ipTM values for the original, random, and germline groups were comparable, and statistical comparison using the Mann-Whitney U test revealed no significant differences between the original and control groups. This outcome may be due to the antibodies in the Protein Data Bank used for AlphaFold3’s training being mostly engineered or affinity-matured,38 and thus not representative of the entire antibody space. Moreover, germline pairing biases are connected to transcriptional and genomic factors13–15 that cannot be fully captured by interface quality metrics alone. These findings highlight the need for dedicated models that incorporate antibody-specific pairing determinants beyond general interface quality.
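For reference, the Mann-Whitney U statistic used in this comparison counts pairwise wins between the two ipTM samples (in practice one would use `scipy.stats.mannwhitneyu`, which also supplies the p-value); a minimal stdlib sketch:

```python
def mann_whitney_u(x, y):
    """U statistics for two samples: count of pairs where an element of x
    exceeds one of y, with half credit for ties. Returns (U_x, U_y)."""
    u = sum((a > b) + 0.5 * (a == b) for a in x for b in y)
    return u, len(x) * len(y) - u
```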
Early signals of thermostability in VH/VL pairing prediction models
Developability refers to a broad spectrum of biophysical and biochemical properties that critically influence an antibody’s manufacturability, safety, and clinical viability. Key attributes include aggregation propensity, solubility, viscosity, thermal and chemical stability, immunogenicity risk, expression yield, and pharmacokinetic behavior.40–42 Balancing these factors is crucial to ensure that an antibody candidate can be produced at scale, stored under stable conditions, and administered safely. Early identification and optimization of these properties during the discovery phase can help mitigate late-stage attrition, streamline development pipelines, and reduce overall costs. Computational approaches offer promising tools for the early assessment of developability-relevant features.43–50 For example, IgBERT demonstrated the ability to predict binding and expression properties.
To investigate whether the VH – VL pairing models capture signals related to antibody developability, we examined their relationship with two key properties: expression and thermostability. We used data from Jain et al.,40 which includes 137 antibodies with Fab melting temperatures measured by differential scanning fluorimetry and HEK expression titers (mg/L). For each antibody, we computed Pearson’s r and Spearman’s ρ between model-derived pairing scores and the experimental measurements. The analysis was performed across different negative sampling strategies (random, V germline, and VDJ germline) as well as competing models (p-IgGen, Humatch, and ImmunoMatch). The resulting correlations are summarized in Table 3.
Table 3.
Correlation between model-derived VH – VL pairing scores and experimental developability. Reported are Pearson’s r and Spearman’s ρ values for expression (left) and thermostability (right) across different negative sampling strategies and competing models. Bold values indicate statistically significant correlations.
| | Expression (Pearson) | Expression (Spearman) | Thermostability (Pearson) | Thermostability (Spearman) |
|---|---|---|---|---|
| Random | −0.06 | −0.13 | 0.06 | 0.04 |
| V germline | −0.11 | −0.14 | −0.15 | 0.06 |
| VDJ germline | 0.14 | 0.11 | 0.16 | 0.18 |
| p-IgGen | 0.17 | 0.15 | 0.02 | 0.04 |
| Humatch | −0.15 | 0.01 | 0.25 | 0.24 |
| ImmunoMatch | −0.11 | −0.12 | −0.03 | −0.02 |
The observed correlations with developability properties are modest. Humatch shows the strongest association with thermostability, while p-IgGen displays the clearest trend with expression. However, the effect sizes remain small, indicating that current AI models struggle to capture developability signals from VH – VL pairing alone. These findings should be interpreted as preliminary, as the analysis was conducted on a limited dataset of 137 antibodies.
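For reference, the two correlation statistics used in this analysis can be computed with a short, dependency-free sketch; in practice one would use `scipy.stats.pearsonr` and `spearmanr`, which also return p-values and properly average ranks over ties:

```python
def pearson(x, y):
    """Pearson's r: covariance of x and y divided by the product of
    their standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of the data.
    Note: this simple sketch does not average ranks over ties."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))
```

Spearman’s ρ equals 1.0 for any strictly increasing monotonic relationship (e.g., pairing scores versus melting temperatures that rise together), even when Pearson’s r does not.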
Discussion and conclusion
Antibody diversity arises from the stochastic processes of V(D)J recombination and somatic hypermutation, generating a vast array of variable heavy (VH) and light (VL) chains. Functional antibodies depend on the non-covalent pairing of VH and VL, which forms the antigen-binding site and influences key properties such as stability, expression, and specificity. Despite its significance in natural immunity and therapeutic design, predicting VH – VL pairing remains underexplored, with no widely accepted benchmarks or large-scale datasets that include experimentally confirmed non-pairings.
A key challenge in this area is the lack of biologically realistic negative examples, leading many prior studies to adopt ad hoc or implausible mismatching strategies. This undermines model generalization and interpretability, limiting the ability to compare methods fairly and assess their utility in real-world antibody engineering. Recent work on antibody-antigen binding has shown that the choice of negative examples is a key determinant of model generalization and biological interpretability.37 This highlights the need for biologically grounded negative sampling in VH – VL pairing prediction to build robust and interpretable models.
To address these limitations, we present a structured and reproducible benchmark for VH – VL pairing prediction, based on large-scale data from naturally paired human antibodies. We define three biologically motivated negative sampling strategies: random pairing, V-gene mismatching, and full V(D)J mismatching, each representing different levels of pairing difficulty. Through comprehensive evaluation, we demonstrate that full V(D)J mismatching provides the most informative negative set, enabling robust classification without introducing excessive noise or trivial separability.
We contribute three IgBERT-based reference models, each trained on a different negative sampling strategy, serving as baselines for future studies. Our benchmark enables reproducible comparisons against recent methods like PARA, p-IgGen, and ImmunoMatch, revealing that performance can vary significantly depending on the pairing challenge and data split used. This highlights the need for germline-aware model evaluation, especially when generalization to unseen germline combinations is crucial. Germline-based models also show strong performance on the large PairedAbNGS dataset.
Beyond benchmarking, our framework has broader implications for generative antibody design. The ability to construct realistic, non-pairable examples lays the groundwork for developing or refining generative loss functions that penalize biologically implausible VH – VL combinations. Additionally, the VDJ-based model shows an early, statistically significant signal of association with thermostability, suggesting a potential link to developability. While preliminary, this finding highlights a promising direction for future work, where VH – VL pairing models could contribute to early-stage developability screening.
Finally, while we focus on human antibodies, the germline-aware data partitioning strategy and flexible sampling procedures can be easily extended to other species, including murine or bovine repertoires. We believe this work facilitates the development of new models for accurate, efficient, and biologically grounded predictions of VH – VL pairings.
Materials and Methods
Dataset
Paired human antibody sequences were downloaded from the OAS database, producing an initial collection of 1,954,079 pairs. We removed truncated sequences. Each sequence was annotated with its corresponding germline segments, three for heavy chains (V, D, and J) and two for light chains (V and J), using the IMGT nomenclature. After filtering, 1,622,802 sequence pairs with complete germline information were retained, corresponding to 1,622,674 unique pairs (99.99% of the total), 1,604,717 unique heavy sequences, and 699,889 unique light sequences. Regarding germline diversity, we identified 7 unique heavy V segments, 7 heavy D segments, and 6 heavy J segments, resulting in 294 observed VDJ heavy germline combinations. In addition, 18 unique light V segments and 12 light J segments produced 76 unique VJ light germline combinations. Out of all possible heavy and light germline combinations (294 × 76 = 22,344), only 12,416 pairs were observed. A summary of the dataset is provided in Table 4.
Table 4.
Summary of the dataset.
| Statistic | Value |
|---|---|
| Rows | 1,622,802 |
| Unique pairs | 1,622,674 |
| Unique heavy sequences | 1,604,717 |
| Unique light sequences | 699,889 |
| Heavy germlines | 294 |
| Light germlines | 76 |
| Germline pairs | 12,416 |
| V heavy germlines | 7 |
| D heavy germlines | 7 |
| J heavy germlines | 6 |
| V light germlines | 18 |
| J light germlines | 12 |
To reduce dataset redundancy, we used Linclust,51 specifying a minimal sequence identity threshold. Almost all of the extracted clusters are singletons. We then selected the representative sequence of each cluster; the resulting dataset consists of these representative pairs, with a correspondingly reduced number of unique heavy sequences, unique light sequences, and observed germline combinations.
Germline-aware split
The dataset was partitioned into three folds for training, validation, and testing using a germline-aware splitting strategy rather than pure random assignment. Negative samples (“mismatched” sequence pairs) were generated based on germline origin to ensure that, in real-world applications, where test pairs may derive from germlines unseen during training, the model still correctly discriminates true from spurious antibody pairings. This approach also enables extending our model to nonhuman species, for example, mice, where the repertoire of V(D)J gene segments is substantially larger than in humans.
Let $D$ be a dataset of paired VH/VL sequences. Denote by $G_H$ and $G_L$ the sets of heavy (VDJ) and light (VJ) germline combinations that give rise to the heavy and light sequences in $D$, respectively. For each $g_h \in G_H$ (respectively, $g_l \in G_L$), define $D_{g_h}$ (respectively, $D_{g_l}$) as the subset of $D$ consisting of VH/VL pairs whose heavy (light) chain originates from $g_h$ ($g_l$). The objective is to partition $D$ into two subsets, $D_{\mathrm{train}}$ (training) and $D_{\mathrm{test}}$ (testing), such that:

$$G_H(D_{\mathrm{test}}) \setminus G_H(D_{\mathrm{train}}) \neq \emptyset \quad\text{and}\quad G_L(D_{\mathrm{test}}) \setminus G_L(D_{\mathrm{train}}) \neq \emptyset,$$

where $G_H(\cdot)$ and $G_L(\cdot)$ denote the heavy and light germline combinations occurring in a given subset. This ensures that some VH/VL pairs in $D_{\mathrm{test}}$ include heavy and light sequences with germline combinations not seen in $D_{\mathrm{train}}$.
The germline-aware splitting algorithm proceeds as follows (see Algorithm 1). It begins by sampling a subset $S_H \subseteq G_H$ of heavy germline combinations and adding to an initially empty test set $D_{\mathrm{test}}$ all VH/VL pairs from $\bigcup_{g_h \in S_H} D_{g_h}$. A similar step is then performed for the light germlines: a subset $S_L \subseteq G_L$ is sampled, and the corresponding VH/VL pairs are added to $D_{\mathrm{test}}$.

Once these pairs are removed from the original dataset $D$, a further subset is sampled from the remaining data and added to $D_{\mathrm{test}}$. The rest of the remaining VH/VL pairs are then added to $D_{\mathrm{train}}$. The resulting pair of disjoint sets $(D_{\mathrm{train}}, D_{\mathrm{test}})$ forms the final partition.
| Algorithm 1. Partitioning algorithm |
|---|
| |
The sampling of $S_H$ and $S_L$ can influence the number of pairs included in $D_{\mathrm{test}}$. If $S_H$ includes a heavy germline $g_h$ associated with a large number of VH/VL pairs, $D_{\mathrm{test}}$ can cover a large portion of the sequences in $D$, creating an unbalanced division. On the other hand, if $S_H$ contains only heavy germlines $g_h$ for which $D_{g_h}$ contains only a few pairs, their exclusive inclusion in $D_{\mathrm{test}}$ does not contribute enough to an evaluation of the model’s performance when presented with never-seen germlines. Similar reasoning also applies to light germlines.
In light of these observations, for each VDJ heavy germline combination in the dataset, the number of heavy sequences originating from it is counted. Then, a fraction of the combinations with a sequence count between the 20th and 80th quantiles is sampled and included in $S_H$. The same approach is applied to the light germline combinations. A graphical representation is provided in Figure 9. From the remaining sequences in $D$, some are sampled and added to $D_{\mathrm{test}}$, while the rest are placed in $D_{\mathrm{train}}$. The dataset split composition is shown in Table 5.
Figure 9.

(a) Distribution of heavy V(D)J germline combinations on a log–scale. (b) Distribution of light VJ germline combinations on a log–scale. Orange bars indicate germline combinations reserved exclusively for the test split.
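The splitting strategy can be sketched as follows. This is a simplified illustration that omits the quantile filtering of germline frequencies described in the text, and all names (`germline_aware_split`, `n_heavy`, `test_frac`) are ours, not from the released code:

```python
import random

def germline_aware_split(pairs, n_heavy=2, n_light=2, test_frac=0.1, seed=0):
    """Hold out all pairs whose heavy (or light) germline combination is
    sampled for the test set, then randomly split the remaining pairs.
    Each pair is ((heavy_seq, heavy_germline), (light_seq, light_germline)).
    Returns (train, test)."""
    rng = random.Random(seed)
    heavy_gs = sorted({h[1] for h, _ in pairs})
    light_gs = sorted({l[1] for _, l in pairs})
    held_heavy = set(rng.sample(heavy_gs, min(n_heavy, len(heavy_gs))))
    held_light = set(rng.sample(light_gs, min(n_light, len(light_gs))))
    test, rest = [], []
    for h, l in pairs:
        # any pair touching a held-out germline goes exclusively to the test set
        (test if h[1] in held_heavy or l[1] in held_light else rest).append((h, l))
    rng.shuffle(rest)
    k = int(len(rest) * test_frac)  # extra randomly sampled test pairs
    return rest[k:], test + rest[:k]
```

By construction, every held-out germline combination appears only in the test split, so evaluation on those pairs measures generalization to never-seen germlines.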
Table 5.
Summary of training, validation, and test splits.
| | Training | Validation | Test |
|---|---|---|---|
| Paired sequences | | | |
| Unique heavy | | | |
| Unique light | | | |
| Unique pairs | | | |
| Heavy germlines | | | |
| Light germlines | | | |
| Germline pairs | | | |
| V heavy germlines | | | |
| D heavy germlines | | | |
| J heavy germlines | | | |
| V light germlines | | | |
| J light germlines | | | |
Negative dataset preparation
Random pairing
The random pairing procedure constructs synthetic VH – VL combinations by repeatedly sampling one heavy chain and one light chain until a previously unobserved pair is obtained; this strategy serves as the baseline of our work. Formally, let $H$ and $L$ denote the sets of all unique heavy and light sequences, and let $P \subseteq H \times L$ be the set of real (positive) pairs. At each iteration, a candidate pair $(h, l)$ is drawn uniformly from $H \times L$; if $(h, l) \in P$, it is discarded and a new draw is performed. This sampling continues until $N$ distinct non-positive pairs have been collected, typically with $N = |P|$ to ensure that positives and negatives are balanced. Finally, the synthetic negatives are merged with $P$ to form the randomly paired dataset. Algorithm 2 outlines this process.
| Algorithm 2. Random pairing algorithm |
|---|
| ![]() |
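A minimal sketch of the random pairing procedure (function and variable names are illustrative; it assumes enough non-positive combinations exist for the loop to terminate):

```python
import random

def random_pairing(positives, n_negatives, seed=0):
    """Draw (heavy, light) pairs uniformly from H x L until n_negatives
    distinct pairs absent from the positive set have been collected.
    positives: list of (heavy_seq, light_seq) tuples."""
    rng = random.Random(seed)
    heavies = sorted({h for h, _ in positives})
    lights = sorted({l for _, l in positives})
    positive_set = set(positives)
    negatives = set()
    while len(negatives) < n_negatives:
        pair = (rng.choice(heavies), rng.choice(lights))
        if pair not in positive_set:
            negatives.add(pair)  # the set automatically keeps pairs distinct
    return sorted(negatives)
```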
The previous algorithm is repeated three times, with the set of positive pairs set in each execution to the training, validation, or test split, thus obtaining three randomly paired datasets. Results are shown in Table 6.
Table 6.
Summary of the splits for the random dataset.
| | Training | Validation | Test |
|---|---|---|---|
| Generated pairs | 716,325 | 271,233 | 369,597 |
| Unique heavy | 451,185 | 171,061 | 232,860 |
| Unique light | 212,387 | 87,773 | 115,172 |
| Unique pairs | 716,097 | 271,142 | 369,481 |
| Total size | 1,432,650 | 542,466 | 739,194 |
Germline pairing
Following the hypothesis in ref. 13, we investigated how germline origin constrains VH – VL pairing. Let $D$ denote a set of naive heavy – light sequence pairs, where each heavy chain derives from germline(s) $g_h$ and each light chain from $g_l$. We evaluated two encoding schemes for $g_h$ and $g_l$: V germlines only and full VDJ germlines.

Let $G_H$ be the set of all heavy germlines that generated the heavy sequences in $D$, and let $G_L$ be the analogous set for the light sequences. The set of all possible germline pair combinations is $G = G_H \times G_L$, where $\times$ indicates the Cartesian product. For a $(g_h, g_l) \in G$, denote by $D_{(g_h, g_l)}$ the subset of $D$ containing all the pairs $(h, l)$ such that $h$ is generated by $g_h$ and $l$ is generated by $g_l$. Similarly, denote by $H_{g_h}$ the set of heavy sequences from $D$ having germline $g_h$, and analogously $L_{g_l}$ for the light sequences. The following procedure works as long as there exists some $(g_h, g_l) \in G$ such that $|D_{(g_h, g_l)}|$ is zero. In the following, $G_0$ refers to the subset of $G$ such that for each $(g_h, g_l) \in G_0$, $|D_{(g_h, g_l)}|$ is zero.

Let $N$ be the number of synthetic pairs to generate. The algorithm starts by sampling $N$ germline pairs from $G_0$. Then, for each sampled $(g_h, g_l)$, a random heavy sequence is sampled from $H_{g_h}$ and a random light one from $L_{g_l}$. This process generates $N$ random sequence pairs whose germlines are never observed together in the dataset. This approach is shown in Algorithm 3.
| Algorithm 3. Germline pairing algorithm |
|---|
| ![]() |
The last gap to fill is specifying a probability distribution over the elements of $G_0$. The most trivial approach is to sample elements of $G_0$ uniformly, so that, if $N$ pairs are generated, each germline pair is picked $N / |G_0|$ times on average. However, this ignores the number of sequences associated with each $g_h$ and $g_l$. For instance, consider the extreme case where the heavy germline $g_h$ and the light germline $g_l$ each appear just once in the dataset, with their corresponding sequences $h$ and $l$ never observed paired. On average, $(g_h, g_l)$ is sampled $N / |G_0|$ times from $G_0$, implying that the VH/VL pair formed by $h$ and $l$ will be repeated the same number of times.

This example motivated weighting by the maximal number of possible combinations. Let $(g_h, g_l)$ be a germline pair from $G_0$; the maximal number of possible combinations is given by $c_{(g_h, g_l)} = |H_{g_h}| \cdot |L_{g_l}|$. Dividing each $c_{(g_h, g_l)}$ by $\sum_{(g'_h, g'_l) \in G_0} c_{(g'_h, g'_l)}$, the obtained quantities characterize a probability distribution over $G_0$.

While this approach solves the problem of repeated sequences, to avoid a skewed distribution we added smoothing: for a $(g_h, g_l) \in G_0$, its probability is given by

$$p_{(g_h, g_l)} = \frac{c_{(g_h, g_l)} + \alpha}{\sum_{(g'_h, g'_l) \in G_0} \left( c_{(g'_h, g'_l)} + \alpha \right)},$$

where $\alpha$ is a positive constant.
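Assuming the additive smoothing described above (sampling probability proportional to $c + \alpha$), the germline pairing strategy can be sketched as follows; `heavy_by_germline` and `light_by_germline` map each germline to its list of sequences, `observed_pairs` is the set of germline combinations that co-occur in the real data, and all names are illustrative:

```python
import random

def germline_pairing(heavy_by_germline, light_by_germline, observed_pairs,
                     n, alpha=1000, seed=0):
    """Sample n synthetic VH-VL pairs whose germline combination is never
    observed together in the real data, weighting each unseen germline pair
    by its (smoothed) number of possible sequence combinations."""
    rng = random.Random(seed)
    # G0: germline combinations with zero observed co-occurrences
    g0 = [(gh, gl) for gh in heavy_by_germline for gl in light_by_germline
          if (gh, gl) not in observed_pairs]
    # smoothed weights: c_(gh,gl) = |H_gh| * |L_gl|, plus the constant alpha
    weights = [len(heavy_by_germline[gh]) * len(light_by_germline[gl]) + alpha
               for gh, gl in g0]
    sampled = rng.choices(g0, weights=weights, k=n)
    return [(rng.choice(heavy_by_germline[gh]), rng.choice(light_by_germline[gl]))
            for gh, gl in sampled]
```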
As in the random pairing case, the generated germline-paired sequences are combined with the naturally paired ones to create a single dataset. In addition, it is necessary to specify whether the germline-paired dataset was generated using only the V germlines or all the germlines.

For each of the training, validation, and test splits, we generated one germline-paired dataset using only the V germlines and one using all the germlines, with $\alpha = 1000$. Summaries are given in Tables 7 and 8.
Table 7.
Summary of the synthetic pairs generated by the germline pairing strategy using all the germlines.
| | Training | Validation | Test |
|---|---|---|---|
| Generated pairs | 716,325 | 271,233 | 369,597 |
| Unique heavy | 172,284 | 79,278 | 109,264 |
| Unique light | 102,598 | 50,571 | 62,194 |
| Unique pairs | 702,818 | 268,702 | 366,160 |
| Total size | 1,432,650 | 542,466 | 739,194 |
Table 8.
Summary of the synthetic pairs generated by the germline pairing strategy using only the V germlines.
| | Training | Validation | Test |
|---|---|---|---|
| Generated pairs | 716,325 | 271,233 | 369,597 |
| Unique heavy | 259,312 | 68,424 | 78,250 |
| Unique light | 150 | 73 | 71 |
| Unique pairs | 511,619 | 154,522 | 208,502 |
| Total size | 1,432,650 | 542,466 | 739,194 |
Model
Architecture
Our classifier builds on the IgBERT model,28 which we augment with a lightweight feedforward classification head to distinguish naturally paired VH – VL sequences (from OAS) from synthetically generated mismatches. An overview of the model is shown in Figure 10. Each VH and VL sequence is tokenized at the residue level, concatenated with a special [SEP] token to denote the chain boundary, and passed through the (frozen) IgBERT encoder. The encoder outputs a 1,024-dimensional contextual embedding per residue, which we average across the sequence to obtain a single representation for the pair. This embedding is then passed through a two-layer classifier that outputs logits for binary classification (real vs. synthetic), which are converted into probabilities via a softmax layer. During training, only the classification head is optimized using Adam and cross-entropy loss, while IgBERT’s parameters remain fixed.
Figure 10.

Architecture of the model. The heavy (in red) and the light (in blue) sequences are tokenized by assigning a token for each residue.
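The classification head can be sketched framework-agnostically as follows. This is a minimal numpy version of masked mean pooling followed by a two-layer classifier; the actual model is an equivalent PyTorch module on top of the frozen IgBERT encoder, and the weight shapes here are illustrative:

```python
import numpy as np

def pairing_head(token_embeddings, attention_mask, w1, b1, w2, b2):
    """Masked mean pooling over per-residue embeddings from the (frozen)
    encoder, followed by a two-layer feedforward classifier that outputs
    softmax probabilities for the real vs. synthetic classes."""
    mask = attention_mask[:, None].astype(float)                 # (seq_len, 1)
    pooled = (token_embeddings * mask).sum(axis=0) / mask.sum()  # (dim,)
    hidden = np.maximum(pooled @ w1 + b1, 0.0)                   # ReLU layer
    logits = hidden @ w2 + b2                                    # (2,) real vs. synthetic
    exp = np.exp(logits - logits.max())                          # numerically stable softmax
    return exp / exp.sum()
```

Masking before pooling ensures that padding positions do not dilute the sequence representation; only the head's weights (`w1`, `b1`, `w2`, `b2`) are updated during training.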
Although alternative models such as PARA were considered, we selected IgBERT due to its deeper architecture (400 M parameters vs. 45 M in PARA) and stronger representational capacity. Given our study’s focus on analyzing the effect of different negative sampling strategies, we opted for a more complex encoder to isolate the impact of training data design rather than model capacity.
Training
The model was trained for one epoch using the Adam optimizer with 1,432,650 samples in the training set and 739,194 samples in the validation set, performing a total of 44,771 updates. The loss function used was cross-entropy loss, and the batch size was set to 32. The hyperparameters employed for training each of the three models are summarized in Table 9.
Table 9.
Optimal hyperparameters for the models trained on the respective datasets.
| Hyperparameter | Random pairing | Germline pairing, V-only | Germline pairing |
|---|---|---|---|
| Number of layers | 3 | 3 | 5 |
| Size of the layers | 1024 | 2048 | 2048 |
| Learning rate (LR) | | | |
| Weight decay | | | |
| LR scheduler | linear | None | linear |
| End factor | 0.05 | None | 0.1 |
| Steps | 1000 | None | 1000 |
| Adam $\beta_1$ | 0.85 | 0.90 | 0.95 |
| Adam $\beta_2$ | 0.960 | 0.999 | 0.960 |
The experiments were carried out using Python 3.11.9 and PyTorch 2.5.0.dev20240909+cu124, on a system equipped with an ARM Neoverse N1 251 CPU and an NVIDIA A100 GPU (CUDA 12.5).
Test procedure
To benchmark our models in a manner analogous to PARA, we construct evaluation triplets $(h, l^+, l^-)$ under the constraint that $l^+$ is known to bind $h$, while $l^-$ is a decoy light chain chosen based on its similarity to $l^+$. An overview of the test procedure is given in Figure 11. Each pair is fed independently through the model, yielding confidence scores for the “real” class. By labeling $(h, l^+)$ as positive and $(h, l^-)$ as negative, we accumulate (score, label) observations over all triplets. This procedure mirrors the PARA challenge of distinguishing true binders from close decoys, ensuring that performance reflects model robustness to varying degrees of chain similarity. Final metrics are then computed on this pooled set of predictions. Since the authors of the original paper did not publish the fine-tuned model or the dataset used, we compared our model with the results presented in the original paper.
Figure 11.

Overview of the pipeline to obtain the test measures. A triplet is split into two pairs, which the model processes independently. The scores reflecting the model’s confidence in the pairing between sequences are stored and tagged as positive if the sequences are known to bind, and as negative otherwise.
To build the set of triplets, a sample of light sequences is first collected. Then, the similarity defined in Equation 1 is computed for each possible pair of these sampled light sequences, and the mean ($\mu$) and standard deviation ($\sigma$) are derived. The function $d(\cdot, \cdot)$ in Equation 1 represents the Levenshtein distance, defined as the minimum number of insertions, deletions, and substitutions needed to transform one string into another.

Given a dataset of paired VH/VL sequences, for each pair $(h, l^+)$, a light sequence $l^-$ is searched such that $d(l^+, l^-)$ is less than a threshold defined by $\mu$ and $\sigma$. Two datasets are generated: one from the validation set and the other from the test set.
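For reference, the Levenshtein distance used in the decoy search can be computed with the standard dynamic-programming recurrence (a minimal sketch; an optimized library would be used at repertoire scale):

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution (or match)
        prev = curr
    return prev[-1]
```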
Supplementary Material
Acknowledgments
We would like to thank the University of Pisa Data Center for providing the necessary hardware resources for this project. We acknowledge that grammar checks were conducted using EditGPT.
Funding Statement
This work was supported by the European Union – Next Generation EU, under the Italian National Recovery and Resilience Plan (PNRR), Mission 4 Component 2 Investment 1.5, project ECS00000017 Tuscany Health Ecosystem – Spoke 6 [CUP I53C22000780001] and by the project “Hub multidisciplinare e interregionale di ricerca e sperimentazione clinica per il contrasto alle pandemie e all’antibioticoresistenza (PAN-HUB)” funded by the Italian Ministry of Health (POS 2014-2020), project ID: [T4-AN-07], [CUP: I53C22001300001]. Funding for the publication of this work was generously provided by Fondazione Toscana Life Sciences. G.M. was employed by TLS during the course of this work, and S.J. is a PhD student partially funded by TLS. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Nomenclature/Notation
- The following abbreviations are used in this manuscript:
- AF3
AlphaFold 3
- AUC-ROC
Area Under the Receiver Operating Characteristic Curve
- CDR
Complementarity-Determining Region
- DL
Deep Learning
- MCC
Matthews Correlation Coefficient
- Fc
Constant Fragment
- Fv
Variable Fragment
- ipTM
Interface predicted template modelling
- MLP
MultiLayer Perceptron
- OAS
Observed Antibody Space
- STD
Standard Deviation
- t-SNE
t-distributed Stochastic Neighbor Embedding
- UMAP
Uniform Manifold Approximation and Projection
- VDJ/V(D)J
Variable, Diversity, and Joining gene recombination process
- VH
Variable Heavy
- VL
Variable Light
Disclosure statement
No potential conflict of interest was reported by the author(s).
Author contributions
Conceptualization, G.M., and S.J.; methodology, E.D. and S.J.; software, E.D.; validation, E.D., and S.J.; formal analysis, E.D. and S.J; investigation, E.D. and S.J.; resources, A.M. and P.M.; data curation, E.D.; writing – original draft preparation, S.J. and E.D; writing – review and editing, A.M., P.M., G.M., E.D. and S.J.; visualization, S.J.; supervision, A.M., G.M., P.M. and S.J.; project administration, A.M.; funding acquisition, G.M. All authors have read and agreed to the published version of the manuscript.
Data availability statement
Data and code are available at https://github.com/darcoenr/thesis-2.
Supplementary material
Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2025.2570749
References
- 1.Joubbi S, Micheli A, Milazzo P, Maccari G, Ciano G, Cardamone D, Medini D.. Antibody design using deep learning: from sequence and structure design to affinity maturation. Briefings Bioinf. 2024;25(4). doi: 10.1093/bib/bbae307. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 2.Tonegawa S. Somatic generation of antibody diversity. Nature. 1983;302(5909):575–581. doi: 10.1038/302575a0. [DOI] [PubMed] [Google Scholar]
- 3.Alt FW, Oltz EM, Young F, Gorman J, Taccioli G, Chen J. VDJ recombination. Immunol Today. 1992;13(8):306–314. doi: 10.1016/0167-5699(92)90043-7. [DOI] [PubMed] [Google Scholar]
- 4.Jung D, Giallourakis C, Mostoslavsky R, Alt FW. Mechanism and control of V (D) J recombination at the immunoglobulin heavy chain locus. Annu Rev Immunol. 2006;24(1):541–570. doi: 10.1146/annurev.immunol.23.021704.115830. [DOI] [PubMed] [Google Scholar]
- 5.Schatz DG, Ji Y. Recombination centres and the orchestration of V (D) J recombination. Nat Rev Immunol. 2011;11(4):251–263. doi: 10.1038/nri2941. [DOI] [PubMed] [Google Scholar]
- 6.Schatz DG, Swanson PC. V (d) j recombination: mechanisms of initiation. Annu Rev Genet. 2011;45(1):167–202. doi: 10.1146/annurev-genet-110410-132552. [DOI] [PubMed] [Google Scholar]
- 7.Jung D, Alt FW. Unraveling V (D) J recombination: insights into gene regulation. Cell. 2004;116(2):299–311. doi: 10.1016/S0092-8674(04)00039-X. [DOI] [PubMed] [Google Scholar]
- 8.Guloglu B, Deane CM. Specific attributes of the VL domain influence both the structure and structural variability of CDR-H3 through steric effects. Front Immunol. 2023;14:1223802. doi: 10.3389/fimmu.2023.1223802. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 9.Christensen PA, Danielczyk A, Ravn P, Larsen M, Stahn R, Karsten U, Goletz S. Modifying antibody specificity by chain shuffling of vh/vl between antibodies with related specificities. Scand J Immunol. 2009;69(1):1–10. doi: 10.1111/j.1365-3083.2008.02164.x. [DOI] [PubMed] [Google Scholar]
- 10.De Wildt RM, Hoet RM, van Venrooij WJ, Tomlinson IM, Winter G. Analysis of heavy and light chain pairings indicates that receptor editing shapes the human antibody repertoire. J Mol Biol. 1999;285(3):895–901. doi: 10.1006/jmbi.1998.2396. [DOI] [PubMed] [Google Scholar]
- 11.Brezinschek H-P, Foster SJ, Dorner T, Brezinschek RI, Lipsky PE. Pairing of variable heavy and variable chains in individual naive and memory B cells. J Immunol. 1998;160(10):4762–4767. [PubMed] [Google Scholar]
- 12.Parren PW, Lugovskoy AA. Therapeutic antibody engineering: current and future advances driving the strongest growth area in the pharmaceutical industry. MAbs-Austin. 2013;5(2):175–177. doi: 10.4161/mabs.23654. [DOI] [Google Scholar]
- 13.Jayaram N, Bhowmick P, Martin AC. Germline VH/VL pairing in antibodies. Protein Eng Des Sel. 2012;25(10):523–530. doi: 10.1093/protein/gzs043. [DOI] [PubMed] [Google Scholar]
- 14.Madsen AV, Mejias-Gomez O, Pedersen LE, Preben Morth J, Kristensen P, Jenkins TP, Goletz S. Structural trends in antibody-antigen binding interfaces: a computational analysis of 1833 experimentally determined 3D structures. Comput Struct Biotechnol J. 2024;23:199–211. doi: 10.1016/j.csbj.2023.11.056. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 15.Dudzic P, Chomicz D, Bielska W, Jaszczyszyn I, Zieliski M, Janusz B, Wróbel S, Le Pannérer M-M, Philips A, Ponraj P, et al. Conserved heavy/light contacts and germline preferences revealed by a large-scale analysis of natively paired human antibody sequences and structural data. Commun Biol. 2025;8(1):1110. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 16.Seidler CA, Spanke VA, Gamper J, Bujotzek A, Georges G, Liedl KR. Data-driven analyses of human antibody variable domain germlines: pairings, sequences and structural features. MAbs-Austin. 2025;17(1):2507950. doi: 10.1080/19420862.2025.2507950. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 17.Robert R, Lefranc M-P, Ghochikyan A, Agadjanyan MG, Cribbs DH, Van Nostrand WE, Wark KL, Dolezal O. Restricted V gene usage and VH/VL pairing of mouse humoral response against the N-terminal immunodominant epitope of the amyloid peptide. Mol Immunol. 2010;48(1–3):59–72. doi: 10.1016/j.molimm.2010.09.012. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 18.Tanno H, McDaniel JR, Stevens CA, Voss WN, Li J, Durrett R, Lee J, Gollihar J, Tanno Y, Delidakis G, et al. A facile technology for the high-throughput sequencing of the paired VH: VL and TCR: TCR repertoires. Sci Adv. 2020;6(17):eaay9093. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 19.Jumper J, Evans R, Pritzel A, Green T, Figurnov M, Ronneberger O, Tunyasuvunakool K, Bates R, Žídek A, Potapenko A, et al. Highly accurate protein structure prediction with AlphaFold. Nature. 2021;596(7873):583–589. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 20.Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AW, Bridgland A, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–710. [DOI] [PubMed] [Google Scholar]
- 21.Rives A, Meier J, Sercu T, Goyal S, Lin Z, Liu J, Guo D, Ott M, Zitnick CL, Ma J, Fergus R. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proc Natl Acad Sci U S A. 2021;118(15):e2016239118. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 22.Elnaggar A, Heinzinger M, Dallago C, Rehawi G, Wang Y, Jones L, Gibbs T, Feher T, Angerer C, Steinegger M, et al. ProtTrans: towards cracking the language of life’s code through self-supervised learning. IEEE Trans Pattern Anal Mach Intell. 2021;44(10):7112–7127. doi: 10.1109/TPAMI.2021.3095381. [DOI] [PubMed] [Google Scholar]
- 23.Brandes N, Ofer D, Peleg Y, Rappoport N, Linial M. ProteinBERT: a universal deep-learning model of protein sequence and function. Bioinformatics. 2022;38(8):2102–2110. doi: 10.1093/bioinformatics/btac020. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 24.Wang L, Li X, Zhang H, Wang J, Jiang D, Xue Z, Wang Y. A comprehensive review of protein language models. 2025.
- 25.Leem J, Mitchell LS, Farmery JH, Barton J, Galson JD. Deciphering the language of antibodies using self-supervised learning. Patterns. 2022;3(7):100513. doi: 10.1016/j.patter.2022.100513. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 26.Olsen TH, Moal IH, Deane CM. AbLang: an antibody language model for completing antibody sequences. Bioinf Adv. 2022;2(1):vbac046. doi: 10.1093/bioadv/vbac046. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 27.Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Briefings Bioinf. 2024;25(4):bbae245. doi: 10.1093/bib/bbae245. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 28.Kenlay H, Dreyer FA, Kovaltsuk A, Miketa D, Pires D, Deane CM. Large scale paired antibody language models. arXiv preprint arXiv:2403.17889. 2024. [DOI] [PMC free article] [PubMed]
- 29.Gao X, Cao C, He C, Lai L. Pre-training with a rational approach for antibody sequence representation. Front Immunol. 2024;15:1468599. doi: 10.3389/fimmu.2024.1468599. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 30.Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns. 2024;5(5):100967. doi: 10.1016/j.patter.2024.100967. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 31.Olsen TH, Boyles F, Deane CM. Observed antibody space: a diverse database of cleaned, annotated, and translated unpaired and paired antibody sequences. Protein Sci. 2022;31(1):141–146. doi: 10.1002/pro.4205. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 32.Kovaltsuk A, Leem J, Kelm S, Snowden J, Deane CM, Krawczyk K. Observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires. J Immunol. 2018;201(8):2502–2509. doi: 10.4049/jimmunol.1800708. [DOI] [PubMed] [Google Scholar]
- 33.Turnbull OM, Oglic D, Croasdale-Wood R, Deane CM. p-IgGen: a paired antibody generative language model. Bioinformatics. 2024;40(11):btae659. doi: 10.1093/bioinformatics/btae659. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 34.Chinery L, Jeliazkov JR, Deane CM. Humatch-fast, gene-specific joint humanisation of antibody heavy and light chains. MAbs-Austin. 2024;16(1):2434121. doi: 10.1080/19420862.2024.2434121. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 35.Guo D, Dunn-Walters DK, Fraternali F, Ng JC. ImmunoMatch learns and predicts cognate pairing of heavy and light immunoglobulin chains. bioRxiv. 2025; 2025–02.
- 36.Turnbull OM, Deane C. Synpair: pairing unpaired antibody chains at billion-sequence scale with contrastive learning. ICML 2025 Generative AI and Biology (GenBio) Workshop; Vancouver, BC, Canada. [Google Scholar]
- 37.Ursu E, Minnegalieva A, Rawat P, Chernigovskaya M, Tacutu R, Kjetil Sandve G, Robert PA, Greiff V. Training data composition determines machine learning generalization and biological rule discovery. Nat Mach Intell. 2025;7(8):1206–1219. doi: 10.1038/s42256-025-01089-5. [DOI] [Google Scholar]
- 38.Bennett NR, Watson JL, Ragotte RJ, Borst AJ, See DL, Weidle C, Biswas R, Yu Y, Shrock EL, Ault R, et al. Atomically accurate de novo design of antibodies with RFdiffusion. bioRxiv. 2025; 2024–03.
- 39.Abramson J, Adler J, Dunger J, Evans R, Green T, Pritzel A, Ronneberger O, Willmore L, Ballard AJ, Bambrick J, et al. Accurate structure prediction of biomolecular interactions with AlphaFold 3. Nature. 2024;630(8016):493–500. doi: 10.1038/s41586-024-07487-w. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 40.Jain T, Sun T, Durand S, Hall A, Rewa Houston N, Nett JH, Sharkey B, Bobrowicz B, Caffry I, Yu Y, et al. Biophysical properties of the clinical-stage antibody landscape. Proc Natl Acad Sci U S A. 2017;114(5):944–949. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 41.Waight AB, Prihoda D, Shrestha R, Metcalf K, Bailly M, Ancona M, Widatalla T, Rollins Z, Cheng AC, Bitton DA, et al. A machine learning strategy for the identification of key in silico descriptors and prediction models for IgG monoclonal antibody developability properties. MAbs-Austin. 2023;15(1):2248671. doi: 10.1080/19420862.2023.2248671. [DOI] [PMC free article] [PubMed] [Google Scholar]
- 42.Raybould MI, Marks C, Krawczyk K, Taddese B, Nowak J, Lewis AP, Bujotzek A, Shi J, Deane CM. Five computational developability guidelines for therapeutic antibody profiling. Proc Natl Acad Sci U S A. 2019;116(10):4025-4030. doi: 10.1073/pnas.1810576116.
- 43.Prihoda D, Maamary J, Waight A, Juan V, Fayadat-Dilman L, Svozil D, Bitton DA. BioPhi: a platform for antibody design, humanization, and humanness evaluation based on natural antibody repertoires and deep learning. MAbs-Austin. 2022;14(1):2020203. doi: 10.1080/19420862.2021.2020203.
- 44.Hon J, Marusiak M, Martinek T, Kunka A, Zendulka J, Bednar D, Damborsky J. SoluProt: prediction of soluble protein expression in Escherichia coli. Bioinformatics. 2021;37(1):23-28. doi: 10.1093/bioinformatics/btaa1102.
- 45.Mitternacht S. FreeSASA: an open source C library for solvent accessible surface area calculations. F1000Research. 2016;5:189. doi: 10.12688/f1000research.7931.1.
- 46.Søndergaard CR, Olsson MH, Rostkowski M, Jensen JH. Improved treatment of ligands and coupling effects in empirical calculation and rationalization of pKa values. J Chem Theory Comput. 2011;7(7):2284-2295. doi: 10.1021/ct200133y.
- 47.Olsson MH. Protein electrostatics and pKa blind predictions; contribution from empirical predictions of internal ionizable residues. Proteins: Struct Func Bioinf. 2011;79(12):3333-3345. doi: 10.1002/prot.23113.
- 48.Akbar R, Bashour H, Rawat P, Robert PA, Smorodina E, Cotet T-S, Flem-Karlsen K, Frank R, Bhushan Mehta B, Vu MH, et al. Progress and challenges for the machine learning-based design of fit-for-purpose monoclonal antibodies. MAbs-Austin. 2022;14(1):2008790. doi: 10.1080/19420862.2021.2008790.
- 49.Reynisson B, Barra C, Kaabinejadian S, Hildebrand WH, Peters B, Nielsen M. Improved prediction of MHC II antigen presentation through integration and motif deconvolution of mass spectrometry MHC eluted ligand data. J Proteome Res. 2020;19(6):2304-2315. doi: 10.1021/acs.jproteome.9b00874.
- 50.Zhou Y, Xie S, Yang Y, Jiang L, Liu S, Li W, Bukari Abagna H, Ning L, Huang J. SSH2.0: a better tool for predicting the hydrophobic interaction risk of monoclonal antibody. Front Genet. 2022;13:842127. doi: 10.3389/fgene.2022.842127.
- 51.Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1). doi: 10.1038/s41467-018-04964-5.
Data Availability Statement
Data and code are available at https://github.com/darcoenr/thesis-2.